Comparing the geometric similarity of trajectories¶
Here we compare the geometric similarity of trajectories using the following path metrics:
the Hausdorff distance
the discrete Fréchet
Last executed: Feb 06, 2020 with MDAnalysis 0.20.2-dev0
Last updated: January 2020
Minimum version of MDAnalysis: 0.18.0
Packages required:
Optional packages for visualisation:
Note
The metrics and methods in the psa
path similarity analysis module are from ([SKTB15]). Please cite them when using the MDAnalysis.analysis.psa
module in published work.
[1]:
import MDAnalysis as mda
from MDAnalysis.tests.datafiles import (PSF, DCD, DCD2, GRO, XTC,
PSF_NAMD_GBIS, DCD_NAMD_GBIS,
PDB_small, CRD)
from MDAnalysis.analysis import psa
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Loading files¶
The test files we will be working with here feature adenylate kinase (AdK), a phosophotransferase enzyme. ([BDPW09])
[2]:
u1 = mda.Universe(PSF, DCD)
u2 = mda.Universe(PSF, DCD2)
u3 = mda.Universe(GRO, XTC)
u4 = mda.Universe(PSF_NAMD_GBIS, DCD_NAMD_GBIS)
u5 = mda.Universe(PDB_small, CRD, PDB_small,
CRD, PDB_small, CRD, PDB_small)
ref = mda.Universe(PDB_small)
labels = ['DCD', 'DCD2', 'XTC', 'NAMD', 'mixed']
The trajectories can have different lengths, as seen below.
[3]:
print(len(u1.trajectory), len(u2.trajectory), len(u3.trajectory))
98 102 10
Aligning trajectories¶
We set up the PSAnalysis with our list of Universes and labels. While path_select
determines which atoms to calculate the path similarities for, ref_select
determines which atoms to use to align each Universe to reference
.
[4]:
CORE_sel = 'name CA and (resid 1:29 or resid 60:121 or resid 160:214)'
ps = psa.PSAnalysis([u1, u2, u3, u4, u5],
labels=labels,
reference=ref,
ref_select=CORE_sel,
path_select='name CA')
Generating paths¶
For each Universe, we will generate a transition path containing each conformation from a trajectory.
First, we will do a mass-weighted alignment of each trajectory to the reference structure reference
, along the atoms ref_select
. To turn off the mass weighting, set weights=None
. If your trajectories are already aligned, you can skip the alignment with align=False
.
[5]:
ps.generate_paths(align=True, save=False, weights='mass')
Hausdorff method¶
Now we can compute the similarity of each path. The default metric is to use the Hausdorff method. [5] The Hausdorff distance between two conformation transition paths \(P\) and \(Q\) is:
\(\delta_h(P|Q)\) is the directed Hausdorff distance from \(P\) to \(Q\), and is defined as:
The directed Hausdorff distance of \(P\) to \(Q\) is the distance between the two points, \(p \in P\) and its structural nearest neighbour \(q \in Q\), for the point \(p\) where the distance is greatest. This is not commutative, i.e. the directed Hausdorff distance from \(Q\) to \(P\) is not the same. (See scipy.spatial.distance.directed_hausdorff for more information).
In MDAnalysis, the Hausdorff distance is the RMSD between a pair of conformations in \(P\) and \(Q\), where the one of the conformations in the pair has the least similar nearest neighbour.
[6]:
ps.run(metric='hausdorff')
/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/analysis/psa.py:1556: DeprecationWarning: `save_result` is deprecated!
`save_result` will be removed in release 1.0.0.
You can save the distance matrix :attr:`D` to a numpy file with ``np.save(filename, PSAnalysis.D)``.
self.save_result(filename=filename)
The distance matrix is saved in ps.D
.
[7]:
ps.D
[7]:
array([[ 0. , 1.33312648, 22.37206002, 2.04737477, 7.55204678],
[ 1.33312648, 0. , 22.3991666 , 2.07957562, 7.55032598],
[22.37206002, 22.3991666 , 0. , 22.42282661, 25.74534554],
[ 2.04737477, 2.07957562, 22.42282661, 0. , 7.67052252],
[ 7.55204678, 7.55032598, 25.74534554, 7.67052252, 0. ]])
Plotting¶
psa.PSAnalysis
provides two convenience methods for plotting this data. The first is to plot a heat-map dendrogram from clustering the trajectories based on their path similarity. You can use any clustering method supported by scipy.cluster.hierarchy.linkage; the default is ‘ward’.
[8]:
ps.plot(linkage='ward')
[8]:
(array([[ 0. , 1. , 1.33312648, 2. ],
[ 3. , 5. , 2.25503365, 3. ],
[ 4. , 6. , 9.20452463, 4. ],
[ 2. , 7. , 29.13448037, 5. ]]),
{'icoord': [[35.0, 35.0, 45.0, 45.0],
[25.0, 25.0, 40.0, 40.0],
[15.0, 15.0, 32.5, 32.5],
[5.0, 5.0, 23.75, 23.75]],
'dcoord': [[0.0, 1.3331264831939273, 1.3331264831939273, 0.0],
[0.0, 2.2550336453918844, 2.2550336453918844, 1.3331264831939273],
[0.0, 9.204524628710315, 9.204524628710315, 2.2550336453918844],
[0.0, 29.134480368642226, 29.134480368642226, 9.204524628710315]],
'ivl': ['2', '4', '3', '0', '1'],
'leaves': [2, 4, 3, 0, 1],
'color_list': ['g', 'g', 'b', 'b']},
array([[ 0. , 25.74534554, 22.42282661, 22.37206002, 22.3991666 ],
[25.74534554, 0. , 7.67052252, 7.55204678, 7.55032598],
[22.42282661, 7.67052252, 0. , 2.04737477, 2.07957562],
[22.37206002, 7.55204678, 2.04737477, 0. , 1.33312648],
[22.3991666 , 7.55032598, 2.07957562, 1.33312648, 0. ]]))
<Figure size 432x288 with 0 Axes>
The other is to plot a heatmap annotated with the distance values. Again, the trajectories are displayed in an arrangement that fits the clustering method.
Note
You will need to install the data visualisation library Seaborn for this function.
[9]:
ps.plot_annotated_heatmap(linkage='single')
[9]:
(array([[ 0. , 1. , 1.33312648, 2. ],
[ 3. , 5. , 2.04737477, 3. ],
[ 4. , 6. , 7.55032598, 4. ],
[ 2. , 7. , 22.37206002, 5. ]]),
{'icoord': [[35.0, 35.0, 45.0, 45.0],
[25.0, 25.0, 40.0, 40.0],
[15.0, 15.0, 32.5, 32.5],
[5.0, 5.0, 23.75, 23.75]],
'dcoord': [[0.0, 1.3331264831939273, 1.3331264831939273, 0.0],
[0.0, 2.047374774767044, 2.047374774767044, 1.3331264831939273],
[0.0, 7.550325981004795, 7.550325981004795, 2.047374774767044],
[0.0, 22.372060021101248, 22.372060021101248, 7.550325981004795]],
'ivl': ['2', '4', '3', '0', '1'],
'leaves': [2, 4, 3, 0, 1],
'color_list': ['g', 'g', 'b', 'b']},
array([[ 0. , 25.74534554, 22.42282661, 22.37206002, 22.3991666 ],
[25.74534554, 0. , 7.67052252, 7.55204678, 7.55032598],
[22.42282661, 7.67052252, 0. , 2.04737477, 2.07957562],
[22.37206002, 7.55204678, 2.04737477, 0. , 1.33312648],
[22.3991666 , 7.55032598, 2.07957562, 1.33312648, 0. ]]))
<Figure size 432x288 with 0 Axes>
Discrete Fréchet distances¶
The discrete Fréchet distance between two conformation transition paths \(P\) and \(Q\) is:
where \(C\) is a coupling in the set of all couplings \(\Gamma_{P,Q}\) between \(P\) and \(Q\). A coupling \(C(P,Q)\) is a sequence of pairs of conformations in \(P\) and \(Q\), where the first/last pairs are the first/last points of the respective paths, and for each successive pair, at least one point in \(P\) or \(Q\) must advance to the next frame.
The coupling distance \(\|C\|\) is the largest distance between a pair of points in such a sequence.
In MDAnalysis, the discrete Fréchet distance is the lowest possible RMSD between a conformation from \(P\) and a conformation from \(Q\), where the two frames are at similar points along the trajectory, and they are the least structurally similar in that particular coupling sequence. [6-9]
[10]:
ps.run(metric='discrete_frechet')
ps.D
/Users/lily/anaconda3/envs/mdanalysis/lib/python3.7/site-packages/MDAnalysis/analysis/psa.py:1556: DeprecationWarning: `save_result` is deprecated!
`save_result` will be removed in release 1.0.0.
You can save the distance matrix :attr:`D` to a numpy file with ``np.save(filename, PSAnalysis.D)``.
self.save_result(filename=filename)
[10]:
array([[ 0. , 1.33312649, 22.37205967, 2.04737475, 7.55204694],
[ 1.33312649, 0. , 22.39916723, 2.07957565, 7.55032613],
[22.37205967, 22.39916723, 0. , 22.42282569, 25.74534511],
[ 2.04737475, 2.07957565, 22.42282569, 0. , 7.67052241],
[ 7.55204694, 7.55032613, 25.74534511, 7.67052241, 0. ]])
Plotting¶
[11]:
ps.plot(linkage='ward')
[11]:
(array([[ 0. , 1. , 1.33312649, 2. ],
[ 3. , 5. , 2.25503365, 3. ],
[ 4. , 6. , 9.20452471, 4. ],
[ 2. , 7. , 29.13448001, 5. ]]),
{'icoord': [[35.0, 35.0, 45.0, 45.0],
[25.0, 25.0, 40.0, 40.0],
[15.0, 15.0, 32.5, 32.5],
[5.0, 5.0, 23.75, 23.75]],
'dcoord': [[0.0, 1.3331264917717554, 1.3331264917717554, 0.0],
[0.0, 2.2550336465704057, 2.2550336465704057, 1.3331264917717554],
[0.0, 9.204524708552725, 9.204524708552725, 2.2550336465704057],
[0.0, 29.134480014437507, 29.134480014437507, 9.204524708552725]],
'ivl': ['2', '4', '3', '0', '1'],
'leaves': [2, 4, 3, 0, 1],
'color_list': ['g', 'g', 'b', 'b']},
array([[ 0. , 25.74534511, 22.42282569, 22.37205967, 22.39916723],
[25.74534511, 0. , 7.67052241, 7.55204694, 7.55032613],
[22.42282569, 7.67052241, 0. , 2.04737475, 2.07957565],
[22.37205967, 7.55204694, 2.04737475, 0. , 1.33312649],
[22.39916723, 7.55032613, 2.07957565, 1.33312649, 0. ]]))
<Figure size 432x288 with 0 Axes>
[12]:
ps.plot_annotated_heatmap(linkage='single')
[12]:
(array([[ 0. , 1. , 1.33312649, 2. ],
[ 3. , 5. , 2.04737475, 3. ],
[ 4. , 6. , 7.55032613, 4. ],
[ 2. , 7. , 22.37205967, 5. ]]),
{'icoord': [[35.0, 35.0, 45.0, 45.0],
[25.0, 25.0, 40.0, 40.0],
[15.0, 15.0, 32.5, 32.5],
[5.0, 5.0, 23.75, 23.75]],
'dcoord': [[0.0, 1.3331264917717554, 1.3331264917717554, 0.0],
[0.0, 2.047374750604888, 2.047374750604888, 1.3331264917717554],
[0.0, 7.550326126269361, 7.550326126269361, 2.047374750604888],
[0.0, 22.37205966687729, 22.37205966687729, 7.550326126269361]],
'ivl': ['2', '4', '3', '0', '1'],
'leaves': [2, 4, 3, 0, 1],
'color_list': ['g', 'g', 'b', 'b']},
array([[ 0. , 25.74534511, 22.42282569, 22.37205967, 22.39916723],
[25.74534511, 0. , 7.67052241, 7.55204694, 7.55032613],
[22.42282569, 7.67052241, 0. , 2.04737475, 2.07957565],
[22.37205967, 7.55204694, 2.04737475, 0. , 1.33312649],
[22.39916723, 7.55032613, 2.07957565, 1.33312649, 0. ]]))
<Figure size 432x288 with 0 Axes>
References¶
[1] Oliver Beckstein, Elizabeth J. Denning, Juan R. Perilla, and Thomas B. Woolf. Zipping and Unzipping of Adenylate Kinase: Atomistic Insights into the Ensemble of Open↔Closed Transitions. Journal of Molecular Biology, 394(1):160–176, November 2009. 00107. URL: https://linkinghub.elsevier.com/retrieve/pii/S0022283609011164, doi:10.1016/j.jmb.2009.09.009.
[2] Richard J. Gowers, Max Linke, Jonathan Barnoud, Tyler J. E. Reddy, Manuel N. Melo, Sean L. Seyler, Jan Domański, David L. Dotson, Sébastien Buchoux, Ian M. Kenney, and Oliver Beckstein. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. Proceedings of the 15th Python in Science Conference, pages 98–105, 2016. 00152. URL: https://conference.scipy.org/proceedings/scipy2016/oliver_beckstein.html, doi:10.25080/Majora-629e541a-00e.
[3] Naveen Michaud-Agrawal, Elizabeth J. Denning, Thomas B. Woolf, and Oliver Beckstein. MDAnalysis: A toolkit for the analysis of molecular dynamics simulations. Journal of Computational Chemistry, 32(10):2319–2327, July 2011. 00778. URL: http://doi.wiley.com/10.1002/jcc.21787, doi:10.1002/jcc.21787.
[4] Sean L. Seyler, Avishek Kumar, M. F. Thorpe, and Oliver Beckstein. Path Similarity Analysis: A Method for Quantifying Macromolecular Pathways. PLOS Computational Biology, 11(10):e1004568, October 2015. URL: https://dx.plos.org/10.1371/journal.pcbi.1004568, doi:10.1371/journal.pcbi.1004568.