As an unsupervised technique for finding patterns within data, cluster analysis is a powerful method for determining structural populations in molecular dynamics (MD) simulations. A cluster is a collection of data points that are more similar to one another than to the rest of the data. In molecular simulation, clustering algorithms group similar objects into subgroups (i.e., clusters) by minimizing intra-cluster differences and maximizing inter-cluster differences, based on a measure of similarity or distance between objects. Clustering algorithms can be divided into partitional and hierarchical approaches. Partitional clustering divides the objects in a data set, such as the conformations from an MD trajectory, into non-overlapping clusters. Hierarchical clustering allows nested clusters, organized as a hierarchical tree such as a dendrogram. Both partitional and hierarchical clustering can follow either a bottom-up (agglomerative) or a top-down (divisive) approach. In a typical workflow, a protein is simulated to obtain a trajectory. Even for a small polypeptide chain, the conformation may change considerably over the course of the simulation. Alfa Chemistry performs cluster analysis on all the conformations of the polypeptide chain to determine which ones are more stable and how the conformations are distributed.
Figure 1. Cancer drugs by similarity analysis based on the fraction of perturbed genes in clusters. (Jun, Ma.; et al. 2019)
We use partitional techniques, which divide the data into a pre-specified number of clusters by optimizing a locally or globally defined criterion function; here we apply the squared-error criterion. We mainly use the K-means algorithm, in which K clusters are formed by assigning each data point to its closest centroid, after which the centroid of each cluster is recomputed. These two steps are repeated until cluster membership is stable. When applied to MD trajectory data, partitional clustering tends to produce blocky clusters of similar size.
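The assign-and-recompute loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the production workflow: the function names (`kmeans`, `squared_distance`, `mean_point`), the choice of random data points as initial centroids, and the toy 2D points are all assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # membership is stable, so stop
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
```

For real trajectory data the points would be conformations and the distance would typically be an RMSD rather than a plain Euclidean distance on 2D coordinates.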
In the agglomerative hierarchical clustering method, each object starts as a singleton cluster, and the two nearest clusters are merged repeatedly until all objects are grouped into one cluster. We use centroid-based methods, which define cluster proximity in terms of cluster centroids: either as the distance between the centroids of two clusters or, in the Ward variant, as the increase in the squared error that results when the two clusters are merged. At Alfa Chemistry, we use centroid-linkage and average-linkage for analyzing MD trajectory data. Because hierarchical methods have high computational and storage requirements, we mainly apply them to smaller data sets.
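As a rough sketch of the bottom-up merging procedure, the following plain-Python example implements average-linkage agglomeration, where the proximity of two clusters is the mean pairwise distance between their members. The function name `average_linkage`, the stopping rule (stop at a requested number of clusters rather than one), and the 1D toy data are assumptions for illustration. The brute-force search over all cluster pairs also shows why the method becomes expensive for large data sets.

```python
from itertools import combinations
import math

def average_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges), where clusters is a list of lists of
    point indices and merges records each merge as (a, b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # start from singletons

    def proximity(a, b):
        # Average linkage: mean pairwise distance between members.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest proximity value.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: proximity(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[a], clusters[b], proximity(clusters[a], clusters[b])))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Toy data: three points near 0 and two points near 5.
points = [(0.0,), (0.1,), (0.3,), (5.0,), (5.2,)]
clusters, merges = average_linkage(points, n_clusters=2)
```

The `merges` list holds the information a dendrogram would display: which clusters were joined and at what distance.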
Our scientists use the SSR/SST ratio, the quotient of the regression sum of squares (SSR, or between-cluster sum of squares) and the total sum of squares (SST), to determine the optimal number of clusters. The SSR is usually calculated from the error sum of squares (SSE, or within-cluster sum of squares) as SSR = SST - SSE.
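The ratio can be computed directly from the points and their cluster labels. The sketch below assumes Euclidean coordinates and uses illustrative helper names (`mean_point`, `sum_sq`, `ssr_sst_ratio`). SST is measured around the overall mean, SSE around each cluster's own centroid, and SSR is obtained as SST - SSE.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sum_sq(points, center):
    """Sum of squared distances from each point to center."""
    return sum(sum((x - m) ** 2 for x, m in zip(p, center)) for p in points)

def ssr_sst_ratio(points, labels):
    """SSR/SST for a clustering given as one label per point."""
    sst = sum_sq(points, mean_point(points))  # total sum of squares
    sse = 0.0                                 # within-cluster sum of squares
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        sse += sum_sq(members, mean_point(members))
    ssr = sst - sse                           # between-cluster sum of squares
    return ssr / sst

# Toy data: SST = 104, SSE = 4 for the two-cluster split, so SSR/SST = 100/104.
points = [(0.0,), (2.0,), (10.0,), (12.0,)]
ratio_two = ssr_sst_ratio(points, [0, 0, 1, 1])
ratio_one = ssr_sst_ratio(points, [0, 0, 0, 0])
```

Because the ratio can only grow as more clusters are added, it is usually read by looking for the cluster count beyond which the gain levels off, rather than by taking its maximum.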
We perform cluster analysis on the different conformations of a protein using the root-mean-square deviation (RMSD) as the distance metric, thereby dividing a large number of protein conformations into different categories.
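To make the distance metric concrete, here is a minimal sketch of a pairwise RMSD matrix between conformations. It assumes the frames have already been superimposed on a common reference, so no best-fit rotation is performed; the function names and the two-atom toy frames are illustrative only.

```python
import math

def rmsd(conf_a, conf_b):
    """RMSD between two conformations given as lists of (x, y, z) atoms.

    Assumption: both conformations are already aligned; no fitting is done.
    """
    if len(conf_a) != len(conf_b):
        raise ValueError("conformations must have the same number of atoms")
    sq = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(conf_a, conf_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(sq / len(conf_a))

def pairwise_rmsd(frames):
    """Symmetric matrix of RMSD values between all pairs of frames."""
    n = len(frames)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(frames[i], frames[j])
    return matrix

# Toy trajectory: three frames of a two-atom system.
frames = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
matrix = pairwise_rmsd(frames)
```

A matrix like this is the input that distance-based clustering algorithms operate on.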
DBSCAN: use the DBSCAN (density-based) clustering algorithm
minpoints: minimum number of points required to form a cluster
epsilon: cut-off distance between points for forming a cluster
sievetoframe: restore the sieved frames by comparing them with all frames in each cluster (not just the centroid)
RMSD: use the RMSD of the selected atoms as the distance metric
Sieve 10: 'sieving' reduces the computational cost by using only 'total/10' frames for the initial clustering. The sieved frames are then added to the clusters in an additional step
out (file): write the cluster number assigned to each frame over time to file
summary (file): write an overall summary of the clustering to file
info (file): write detailed clustering results (including the Davies-Bouldin index (DBI), the pseudo-F statistic (pSF), etc.) to file
cpopvtime (file) normframe: write the cluster populations over time to file, normalized by the number of frames
repout (prefix) repfmt pdb: write the cluster representatives to prefix.cX.fmt in PDB format
singlerepout (file) singlerepfmt netcdf: write all cluster representatives to a single file in NetCDF format
avgout (prefix) avgfmt restart: write the average structure of all frames in each cluster to prefix.cX.fmt in restart format
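To illustrate how the minpoints and epsilon options listed above interact, the following is a simplified plain-Python sketch of DBSCAN on a precomputed distance matrix. It is not the implementation used by any particular MD analysis package; the function name `dbscan`, the convention that a point counts as its own neighbor, and the 1D toy data are assumptions for the example.

```python
def dbscan(dist, epsilon, minpoints):
    """Label each point with a cluster index, or -1 for noise.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * n

    def neighbors(i):
        # Includes i itself, since dist[i][i] == 0.
        return [j for j in range(n) if dist[i][j] <= epsilon]

    cluster = -1
    for i in range(n):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < minpoints:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= minpoints:  # j is also core: expand through it
                queue.extend(nbrs)
    return labels

# Toy data: two dense groups and one isolated point.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist = [[abs(a - b) for b in values] for a in values]
labels = dbscan(dist, epsilon=0.15, minpoints=2)
```

Points that are not within epsilon of enough neighbors end up labeled as noise, which is why density-based clustering does not force every frame into a cluster.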
Cluster analysis can be used to compare the structural populations from multiple independently run simulations, and serves as a criterion for assessing convergence. However, different trajectories may contain completely different populations, and some conformations may exist in only one trajectory. To facilitate the comparison of clusters between trajectories, we perform cluster analysis on the combined trajectory from two (or more) simulations and then calculate the cluster populations for each of the original trajectories. Our cluster analysis process is as follows:
Our cluster analysis services significantly reduce costs, support further experiments, and accelerate the drug design process for customers worldwide. Our personalized, comprehensive services are designed to meet the demands of your research. If you are interested in our services, please don't hesitate to contact us. We look forward to working with you and contributing to your success!