Molecular Shape, Clustering & Decomposition¶
This page is a self-contained, textbook-style introduction to MolPy's descriptor operators: the shape tensors that summarize how a molecule or aggregate is spread out in space, the clustering tools that find those aggregates in the first place, and the PCA / k-means primitives for organizing large sets of descriptors. The canonical applications are polymer coil/globule analysis, micelle and aggregate detection, and reducing a trajectory to a handful of interpretable structural coordinates.
As with every compute operator, the per-particle reductions run in Rust
(molrs); MolPy supplies frames, the cluster assignment, and per-particle masses,
and returns the native result. A key structural point: the shape descriptors are
defined per cluster, so the usual workflow is find clusters → reduce each
cluster to a tensor.
Conventions used throughout
- Length is in Å; the gyration radius \(R_g\) is in Å, the gyration tensor in Ų.
- A cluster is a group of particles — a single molecule, or a physically
connected aggregate found by
Cluster. - Tensors are \(3\times3\); their eigenvalues are the principal values.
1. The gyration tensor measures molecular size¶
The size and shape of a group of particles are captured by the gyration tensor, the mean outer product of positions about the group's center \(\mathbf r_\mathrm{c}\):
Its trace is the squared radius of gyration, the most common single measure of coil size:
For a polymer chain \(R_g\) scales as \(N^\nu\) with the Flory exponent \(\nu\), so its dependence on chain length distinguishes a swollen coil (\(\nu\approx 3/5\)) from a collapsed globule (\(\nu = 1/3\)).
2. Shape anisotropy comes from the gyration-tensor eigenvalues¶
Diagonalize \(S\) to get principal values \(\lambda_1\le\lambda_2\le\lambda_3\). Their combinations are rotation-invariant shape descriptors:
\(\kappa^2 = 0\) for a sphere (or a perfectly symmetric arrangement) and \(\to 1\) for a rod. These are how you tell a spherical micelle from a worm-like one without ever looking at a picture.
3. The inertia tensor gives the principal axes¶
The mass-weighted cousin of the gyration tensor is the moment-of-inertia tensor
whose eigenvectors are the principal axes of the body and whose eigenvalues order them from the long axis (smallest \(I\)) to the short. Use it to define a molecular frame — for orientational analyses, or to align structures for averaging.
4. Computing shape descriptors¶
Shape operators consume a cluster assignment and a per-cluster center, so they chain after a neighbor list and a clustering step. Each particle's molecule (or a connected aggregate) is one cluster.
from molpy.compute import (
NeighborList, Cluster, CenterOfMass, RadiusOfGyration,
GyrationTensor, InertiaTensor,
)
nlist = NeighborList(cutoff=1.6)(frame)
clusters = Cluster(min_cluster_size=10)([frame], [nlist]) # one ClusterResult / frame
com = CenterOfMass(masses)([frame], clusters) # mass-weighted centers
rg = RadiusOfGyration(masses)([frame], clusters, com) # R_g per cluster
S = GyrationTensor()([frame], clusters, com) # 3×3 tensor per cluster
I = InertiaTensor(masses)([frame], clusters, com) # inertia tensor per cluster
Pass masses=None to fall back to unit mass (geometric, unweighted descriptors).
5. Finding aggregates with Cluster¶
Before you can describe an aggregate you have to find it. Cluster builds a
connectivity graph from a NeighborList and returns the connected components
larger than min_cluster_size — micelles, droplets, percolating networks. The
companion ClusterProperties reduces each cluster to its size, center, mass,
gyration tensor, and \(R_g\) in one call.
from molpy.compute import ClusterProperties
clusters = Cluster(min_cluster_size=20)([frame], [nlist])
props = ClusterProperties()([frame], clusters) # one dict of per-cluster properties / frame
The neighbor cutoff is the physical definition of "connected", so choose it from the first minimum of \(g(r)\) (see the structural guide).
6. PCA reduces a descriptor set to its dominant variations¶
A trajectory analyzed with the operators above yields a high-dimensional table —
one row of descriptors per configuration. Principal component analysis
re-expresses that table in the orthogonal directions of greatest variance (the
eigenvectors of the covariance matrix), so that the first two components usually
capture the dominant structural motion. Pca projects a list of DescriptorRow
vectors onto its two leading components:
from molpy.compute import Pca, DescriptorRow
rows = [DescriptorRow(r) for r in descriptor_matrix] # each r is a 1-D float array
pca = Pca()(rows) # 2-component projection
Scale or standardize the descriptors before PCA — otherwise a single large-magnitude column dominates the variance and the components are meaningless.
7. K-means groups configurations into states¶
Given the reduced coordinates, k-means partitions them into \(k\) clusters by iteratively assigning each point to its nearest centroid and recomputing the centroids (Lloyd's algorithm). It is the simplest way to turn a continuous PCA map into discrete structural states — folded vs unfolded, paired vs free.
\(k\) is a modelling choice, not an output: try several and check that the clusters are stable and physically interpretable.
8. Pitfalls checklist¶
- Mass weighting → \(R_g\) and the inertia tensor change with the mass
convention; pass real masses for physical principal axes,
Nonefor geometric. - Periodic images → a molecule split across a box boundary has a nonsense \(R_g\). Unwrap whole molecules (minimum-image relative to the cluster center) before computing shape.
- Cluster cutoff → too large merges distinct aggregates, too small fragments one; read it off \(g(r)\) and check the cluster-size distribution is stable.
- Unscaled features before PCA/k-means → standardize columns; otherwise units dominate the result.
- Reading too much into \(k\) → k-means always returns \(k\) clusters even when the data have none; validate against a held-out metric or the PCA scatter.
9. References¶
- M. Rubinstein, R. H. Colby, Polymer Physics, Oxford (2003) — radius of gyration and chain-size scaling.
- D. N. Theodorou, U. W. Suter, Macromolecules 18, 1206 (1985) — gyration tensor, asphericity, and relative shape anisotropy.
- I. T. Jolliffe, Principal Component Analysis, 2nd ed., Springer (2002).
- J. B. MacQueen, Proc. 5th Berkeley Symp. 1, 281 (1967) — k-means.
- V. Ramasubramani et al., Comput. Phys. Commun. 254, 107275 (2020) — the freud library, on which the cluster/shape kernels are modelled.
See also¶
- Structural Analysis — pick the cluster/neighbour cutoff from \(g(r)\).
- Bond-Orientational Order & Local Environment — per-particle order to cluster on.
- Compute overview — the Compute → Result pattern.
- API reference: Compute.