Architecture Overview
MolPy is a layered toolkit with explicit data flow and minimal magic. This page is the map that every extension guide assumes: which module owns what, how the three class hierarchies of the data model fit together, and where the boundaries between Python and the molrs Rust backend run. Read it once before touching anything under Extending MolPy.
Module responsibilities
Each package has one clear responsibility with minimal coupling to its siblings:
| Package | Purpose |
|---|---|
| `core` | Data structures: Entity, Link, Struct, Atomistic, Frame, Block, Box, ForceField |
| `parser` | Grammar-based parsing: SMILES, SMARTS, BigSMILES, G-BigSMILES, CGSmiles |
| `builder` | System assembly: chain builders, virtual sites, AmberTools integration |
| `reacter` | Reaction framework: template-based reactions with anchor and leaving-group selectors |
| `typifier` | Atom typing: OPLS-AA, GAFF, custom SMARTS/SMIRKS-based typifiers |
| `pack` | Packing workflows: Packmol integration, density targets |
| `io` | File I/O: readers/writers for molecular data, trajectories, and force-field formats |
| `compute` | Analysis operators over Frame/Block data |
| `engine` | MD abstractions: LAMMPS, CP2K, OpenMM input generation and execution |
| `wrapper` | Subprocess boundaries to external CLI tools (antechamber, packmol, …) |
| `adapter` | In-memory bridges to external object models (RDKit, OpenBabel, …) |
| `data` | Bundled package data: force-field XML files, parameter tables |
core depends on nothing above it; everything else builds on core. compute, io, and engine operate on the tabular layer (Frame/Block); parser, builder, reacter, and typifier operate on the graph layer (Atomistic). wrapper and adapter sit at the outer edge and never leak external types into core.
The graph layer: Entity, Link, Struct
The editable data model has three class hierarchies:
- Entity (node): dict-like base for atoms, beads, and particles, with identity-based hashing (`hash()` is `id()`). Two atoms with identical properties are still different atoms. Subclasses: `Atom`, `Bead`.
- Link (edge): holds an ordered tuple of `Entity` endpoints. Subclasses: `Bond`, `Angle`, `Dihedral`, `Improper`, `CGBond`.
- Struct (container): aggregates entities and links in `TypeBucket` collections and manages CRUD. Subclasses: `Atomistic`, `CoarseGrain`.
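Identity-based hashing can be sketched in a few lines. The `Entity` class below is a hypothetical stand-in written for illustration, not MolPy's actual implementation; it only assumes the behavior described above (dict-like, hashed by `id()`).

```python
class Entity(dict):
    """Hypothetical stand-in: a dict-like node that hashes by identity."""

    def __hash__(self) -> int:
        # Identity-based hashing: the hash is the object's id().
        return id(self)

    def __eq__(self, other: object) -> bool:
        # Equality follows identity, not content.
        return self is other

    def __ne__(self, other: object) -> bool:
        # dict defines its own content-based __ne__, so override it too.
        return self is not other


a = Entity(element="C", charge=0.0)
b = Entity(element="C", charge=0.0)

same_properties = dict(a) == dict(b)  # the stored properties are identical
same_entity = a == b                  # but the entities are distinct
distinct_in_set = len({a, b})         # both survive as separate set members
```

Because equality and hashing follow identity, such objects can serve as graph nodes in sets and dict keys even while their properties are edited.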
TypeBucket stores items by concrete type: registering Atom means bucket[Atom] returns all Atom instances, with subclasses included in parent queries. New entity or link types must be registered in the struct's __init__ — see Extending the Data Model.
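The lookup rule can be illustrated with a minimal sketch. This is an assumption about the behavior, not the real `TypeBucket`: the class body, the `add` method, and the `GhostAtom` subclass are all hypothetical and exist only to show "store by concrete type, include subclasses in parent queries".

```python
from collections import defaultdict


class Atom:
    pass


class GhostAtom(Atom):  # hypothetical subclass, for illustration only
    pass


class Bond:
    pass


class ToyTypeBucket:
    """Hypothetical bucket: items are stored under their concrete type."""

    def __init__(self):
        self._items = defaultdict(list)

    def add(self, item):
        # Store under the concrete type, never under a base class.
        self._items[type(item)].append(item)

    def __getitem__(self, cls):
        # A query for a parent type also collects every stored subclass.
        return [
            item
            for stored_type, items in self._items.items()
            if issubclass(stored_type, cls)
            for item in items
        ]


bucket = ToyTypeBucket()
bucket.add(Atom())
bucket.add(GhostAtom())
bucket.add(Bond())

all_atoms = bucket[Atom]         # Atom and GhostAtom instances
ghost_only = bucket[GhostAtom]   # only the subclass instance
```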
The tabular layer: Block and Frame run on molrs
Frame and Block are re-exports of the molrs Rust column store: molpy.core.frame.Frame is molrs.Frame. Columns are typed (float / int / bool / str) and exposed as zero-copy NumPy views; a column that cannot be represented in one of these types is rejected at write time (fail-fast). molcrafts-molrs is a hard runtime dependency: there is no pure-Python fallback.
The graph → arrays conversion is explicit: Atomistic.to_frame() delegates to the molrs world's native to_frame(). The box is a first-class attribute (frame.box), never metadata. The molrs Backend page covers how neighbor lists, RDF, and the analysis catalog surface from Rust.
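The following sketch illustrates the two ideas above, typed columns with fail-fast rejection and view semantics, using plain NumPy only. It does not call molrs or MolPy; `to_columns` and `ALLOWED_KINDS` are hypothetical names, and the real conversion happens in Rust.

```python
import numpy as np

# NumPy dtype kinds standing in for float / int / bool / str columns.
ALLOWED_KINDS = {"f", "i", "b", "U"}


def to_columns(atoms):
    """Hypothetical graph -> arrays step: one typed array per property."""
    columns = {}
    for key in atoms[0]:
        column = np.asarray([atom[key] for atom in atoms])
        if column.dtype.kind not in ALLOWED_KINDS:
            # Fail fast: refuse the column instead of storing Python objects.
            raise TypeError(f"column {key!r} has unsupported dtype {column.dtype}")
        columns[key] = column
    return columns


atoms = [
    {"element": "C", "charge": -0.18, "xyz": [0.0, 0.0, 0.0]},
    {"element": "H", "charge": 0.06, "xyz": [1.09, 0.0, 0.0]},
]
columns = to_columns(atoms)

# Basic slicing returns a view: writing through it changes the column.
x = columns["xyz"][:, 0]
x[0] = 9.0
shares = np.shares_memory(x, columns["xyz"])
```

A property that NumPy can only hold as `object` (for example a nested dict) raises `TypeError` in this sketch, which mirrors the rejection of non-representable columns.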
Force field: parameters apart, kernels in Rust
ForceField is an independent, queryable data structure — parameters are neither embedded in atoms nor derived implicitly. The model has three layers: Style (functional form), Type (parameter set for a type key), and Potential (evaluatable kernel). All energy/force kernels live in molrs (molrs-ff); the Python side exposes thin named Style subclasses and evaluation always goes through ff.to_potentials(). Adding a functional form therefore means a Rust kernel plus a Python style name plus export formatters — the exact recipe is in Extending the Force Field.
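The three-layer split can be pictured with a toy model. Everything below is hypothetical: `ToyBondType`, `ToyHarmonicBondStyle`, `def_type`, and `to_potential` are illustration names, the kernel is a Python closure rather than a molrs kernel, and the energy convention `k * (r - r0)**2` (no factor of 1/2) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyBondType:
    """Type layer: the parameter set for one type key."""
    key: tuple
    k: float
    r0: float


@dataclass
class ToyHarmonicBondStyle:
    """Style layer: names the functional form and owns its types."""
    name: str = "harmonic"
    types: dict = field(default_factory=dict)

    def def_type(self, key, k, r0):
        bond_type = ToyBondType(key=key, k=k, r0=r0)
        self.types[key] = bond_type
        return bond_type

    def to_potential(self):
        """Potential layer: freeze the parameters into an evaluatable kernel."""
        table = dict(self.types)

        def energy(key, r):
            t = table[key]
            # Assumed convention: E = k * (r - r0)^2
            return t.k * (r - t.r0) ** 2

        return energy


style = ToyHarmonicBondStyle()
style.def_type(("CT", "CT"), k=300.0, r0=1.5)
energy = style.to_potential()
e = energy(("CT", "CT"), 1.6)
```

The point of the split is that parameters stay queryable as data (`style.types`) while evaluation goes through a separate object built from them.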
Boundary translation: the formatter hierarchy
Canonical field names (charge, not q; mol_id, not mol) are used everywhere inside MolPy; format-specific names exist only at the I/O boundary. The translation machinery lives in core/fields.py:
FieldSpec — canonical field definition (key, dtype, shape, doc)
↓
FieldFormatter — data field mapping: {format_key: FieldSpec}
↓ canonicalize() / localize() on Block
ForceFieldFormatter(FieldFormatter) — adds param formatters: {StyleType: Callable}
Readers call canonicalize() at exit (format → canonical); writers call localize_frame() at entry (canonical → format, on a copy). Per-format subclasses live in their own I/O module, and __init_subclass__ isolates the registries per subclass. The full canonical-name catalog is in the Naming Conventions appendix; the extension recipe is in Adding an I/O Format.
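A minimal sketch of per-subclass registry isolation with `__init_subclass__` follows. It simplifies the mapping to plain strings (format key to canonical key) instead of `FieldSpec` objects, works on plain dicts instead of Block, and uses hypothetical class names (`ToyFieldFormatter`, `ToyLammpsFormatter`, `ToyPdbFormatter`) and a hypothetical `register` method.

```python
class ToyFieldFormatter:
    """Hypothetical base: maps format-specific keys to canonical keys."""

    fields: dict = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Give each subclass its own registry so entries never leak
        # between formats or back into the base class.
        cls.fields = dict(cls.fields)

    @classmethod
    def register(cls, format_key, canonical_key):
        cls.fields[format_key] = canonical_key

    @classmethod
    def canonicalize(cls, block):
        # format -> canonical, as a reader would do on exit
        return {cls.fields.get(key, key): value for key, value in block.items()}

    @classmethod
    def localize(cls, block):
        # canonical -> format, returning a new dict (the input is untouched)
        reverse = {canonical: fmt for fmt, canonical in cls.fields.items()}
        return {reverse.get(key, key): value for key, value in block.items()}


class ToyLammpsFormatter(ToyFieldFormatter):
    pass


class ToyPdbFormatter(ToyFieldFormatter):
    pass


ToyLammpsFormatter.register("q", "charge")
ToyLammpsFormatter.register("mol", "mol_id")

raw = {"q": [0.1, -0.1], "mol": [1, 1]}
canonical = ToyLammpsFormatter.canonicalize(raw)
localized = ToyLammpsFormatter.localize(canonical)
```

Registering `q` on the LAMMPS-style subclass leaves `ToyPdbFormatter.fields` and the base registry empty, which is the isolation property described above.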
The mutation contract
The core data-model API mutates in place and returns self (or the created entity) for chaining: def_atom, def_bond, get_topo, move, rotate, merge all modify the structure they are called on. .copy() is the explicit opt-in for an independent deep copy. Higher-level helpers in builder, reacter, and op follow the opposite convention: they must not mutate caller-owned structures unexpectedly — copy first, or build and return a new structure.
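The two conventions can be contrasted in a small sketch. `ToyStruct` and the `shifted` helper are hypothetical; only the method names `def_atom`, `move`, and `copy` are taken from the text, and their bodies here are stand-ins.

```python
import copy


class ToyStruct:
    """Hypothetical core-style container: methods mutate in place."""

    def __init__(self):
        self.atoms = []

    def def_atom(self, **props):
        atom = dict(props)
        self.atoms.append(atom)
        return atom  # returns the created entity

    def move(self, delta):
        for atom in self.atoms:
            atom["xyz"] = [c + d for c, d in zip(atom["xyz"], delta)]
        return self  # returns self so calls can be chained

    def copy(self):
        # Explicit opt-in for an independent deep copy.
        return copy.deepcopy(self)


def shifted(struct, delta):
    """Higher-level helper: copy first, never touch the caller's structure."""
    return struct.copy().move(delta)


original = ToyStruct()
original.def_atom(element="C", xyz=[0.0, 0.0, 0.0])

moved = shifted(original, [1.0, 0.0, 0.0])   # caller's structure is unchanged
chained = original.move([0.0, 2.0, 0.0])     # in place; chained is original
```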
Performance model of the build loop
The chain build loop (PolymerBuilder._build_from_graph) is designed so that per-connection bookkeeping is bounded by monomer size and live port count, not by chain length:
- Reacter copy semantics: `Reacter.run` copies its two inputs once each. With `record_intermediates=False`, the base `Reacter` never copies the merged assembly; `BondReactReacter` (which needs a pre-reaction snapshot for `fix bond/react` template generation) takes exactly one.
- Adjacency reuse: `TopologyDetector.detect_and_update_topology` builds an atom → neighbors adjacency map once per call (O(bonds)) and threads it through every neighbor query, so angle/dihedral/improper enumeration is O(degree) per query.
- Port registry: the build loop tracks live port atoms per monomer node in a registry remapped through each connection's entity map; the growing chain is never rescanned.
- Group map: monomer-to-structure membership uses a group-id map with smaller-into-larger union instead of per-edge identity scans.
- Vectorized placement: `Placer._apply_transform` applies `(coords - pivot) @ R.T + pivot + t` as one (N, 3) NumPy operation.
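Two of these ideas, adjacency reuse and vectorized placement, can be sketched independently of MolPy. The function names below are hypothetical and the code is not taken from `TopologyDetector` or `Placer`; it only assumes bonds given as index pairs and coordinates as an (N, 3) array.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np


def build_adjacency(bonds):
    """Build the atom -> neighbors map once, in O(bonds)."""
    adjacency = defaultdict(set)
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


def enumerate_angles(adjacency):
    """Reuse the map: each center only inspects its own neighbors."""
    angles = []
    for center, neighbors in adjacency.items():
        for i, k in combinations(sorted(neighbors), 2):
            angles.append((i, center, k))
    return angles


def apply_transform(coords, R, t, pivot):
    """Rotate about pivot, then translate, as one (N, 3) array operation."""
    return (coords - pivot) @ R.T + pivot + t


bonds = [(0, 1), (1, 2), (2, 3)]          # a linear four-atom chain
adjacency = build_adjacency(bonds)
angles = sorted(enumerate_angles(adjacency))

# 90-degree rotation about z, followed by a shift of +1 along z.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
coords = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
placed = apply_transform(coords, R, t=np.array([0.0, 0.0, 1.0]), pivot=np.zeros(3))
```

With row-vector coordinates, multiplying by `R.T` applies the rotation `R` to every point at once, so no Python-level loop over atoms is needed.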
What still scales with chain length per connection: the reacter's input copy of the accumulated structure and the merge itself. Each is O(chain), giving O(N²) total copying for a chain with degree of polymerization (DP) N. Eliminating that requires an in-place assembly mode and is currently out of scope. Counting-based performance tests live in tests/test_reacter/test_perf_copy_semantics.py and tests/test_builder/test_polymer_build_perf.py.
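The quadratic total can be checked with a simple counting model. This sketch is an idealization, not a measurement of MolPy: it assumes every monomer has the same number of atoms, ignores leaving groups, and counts only the input copies made per connection. `total_copied_atoms` is a hypothetical helper.

```python
def total_copied_atoms(n_monomers, atoms_per_monomer):
    """Count atoms copied when each connection copies both of its inputs."""
    copied = 0
    chain_size = atoms_per_monomer  # the chain starts as a single monomer
    for _ in range(n_monomers - 1):
        # One connection: copy the accumulated chain and the incoming monomer.
        copied += chain_size + atoms_per_monomer
        chain_size += atoms_per_monomer
    return copied


def closed_form(n_monomers, atoms_per_monomer):
    # Sum over k = 1 .. N-1 of (k + 1) * m  =  m * (N - 1) * (N + 2) / 2
    return atoms_per_monomer * (n_monomers - 1) * (n_monomers + 2) // 2


small = total_copied_atoms(100, 10)
large = total_copied_atoms(200, 10)
ratio = large / small  # approaches 4 when N doubles, i.e. quadratic growth
```

Under these assumptions, doubling the chain length roughly quadruples the number of copied atoms, which is the O(N²) behavior described above.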
Where extension happens
| I want to add… | Layer | Guide |
|---|---|---|
| an analysis operation | plug-in interface | Adding a Compute Operation |
| a file format | plug-in interface | Adding an I/O Format |
| an external tool integration | plug-in interface | Adding a Wrapper or Adapter |
| an entity/link/struct type | core internals — open an issue first | Extending the Data Model |
| an interaction style / kernel | core internals — open an issue first | Extending the Force Field |