Parsing Chemistry
"How do I tell MolPy what molecule I want?"
This is the first question every user has. In computational chemistry, we rarely want to build molecules atom-by-atom (e.g., atom.add(x=1, y=0, z=0)). Instead, we use line notations: compact strings that describe chemical structures.
MolPy works with modern string formats that go beyond plain small-molecule notation. It supports five notations, each serving a specific purpose in the workflow:
- SMILES: For defining exact, specific molecules ("I want Ethanol").
- SMARTS: For defining abstract patterns ("I want to find all alcohol groups").
- BigSMILES: For defining stochastic building blocks ("I want a Polystyrene monomer with reactive ends").
- G-BigSMILES: For defining polymer systems with distributions ("I want polystyrene systems drawn from this distribution").
- CGSmiles: For defining high-level topology ("I want a block copolymer with Graph A connected to Graph B").
This guide walks through four of these parsers (SMILES, SMARTS, BigSMILES, and CGSmiles), explaining why you would use each one and how it works.
Step 1: Parse SMILES for Exact Molecules
Use Case: You need to load a solvent, a ligand, or a specific small molecule.
SMILES (Simplified Molecular Input Line Entry System) is the industry-standard line notation for describing how atoms are connected. MolPy's parse_smiles function converts a SMILES string into an Intermediate Representation (IR), a lightweight graph that can later be converted into a full Atomistic object.
import molpy as mp
from molpy.parser.smiles import parse_smiles
# Example: Ethyl Acetate
# CC(=O)OCC
smiles_str = "CC(=O)OCC"
print(f"Parsing SMILES string: '{smiles_str}'")
ir = parse_smiles(smiles_str)
# The parser returns an IR object, which is a pure data container.
print(f"Resulting Object: {type(ir).__name__}")
print(f"Number of Atoms: {len(ir.atoms)}")
print(f"Number of Bonds: {len(ir.bonds)}")
# Let's peek at the first few atoms
for i, atom in enumerate(ir.atoms[:3]):
    print(f" Atom {i}: {atom.element}")
Parsing SMILES string: 'CC(=O)OCC'
Resulting Object: SmilesGraphIR
Number of Atoms: 6
Number of Bonds: 5
 Atom 0: C
 Atom 1: C
 Atom 2: O
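To make the atom and bond counts above less of a black box, here is a minimal sketch of how a SMILES string can be walked into a graph. It does not use MolPy and is not MolPy's implementation: `toy_parse_smiles` is a hypothetical helper that only handles a small subset of SMILES (organic-subset atoms, the bond symbols `-`, `=`, `#`, and parentheses for branches; no rings, charges, or aromatic atoms).

```python
import re

# Hypothetical toy tokenizer: two-letter halogens first, then one-letter atoms,
# bond symbols, and branch parentheses.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]|[()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def toy_parse_smiles(smiles):
    """Return (atoms, bonds) for a very small subset of SMILES."""
    atoms = []    # element symbols, indexed by position in the string
    bonds = []    # (i, j, order) triples
    prev = None   # index of the atom that the next atom will bond to
    stack = []    # branch points saved when "(" is seen
    order = 1     # bond order waiting to be applied to the next atom
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"Unsupported character at {pos}: {smiles[pos]!r}")
        token = match.group()
        pos = match.end()
        if token == "(":
            stack.append(prev)        # remember where the branch started
        elif token == ")":
            prev = stack.pop()        # return to the branch point
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, order))
            prev = index
            order = 1                 # default back to a single bond
    return atoms, bonds


atoms, bonds = toy_parse_smiles("CC(=O)OCC")
print(len(atoms), len(bonds))
print(atoms[:3])
print(bonds)
```

For ethyl acetate this yields six atoms and five bonds, with the carbonyl stored as an order-2 bond between atoms 1 and 2. A real parser also has to handle ring closures, aromaticity, charges, and implicit hydrogens.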
Why an Intermediate Representation (IR)?
You might wonder: why doesn't it just return an Atomistic object directly?
The IR is a lightweight, immutable blueprint. It allows MolPy to:
- Validate structure before creating heavy objects.
- Modify the definition (e.g., adding labels) without overhead.
- Convert to multiple target types (e.g., Atomistic, Molecule, Fragment).
In the future, MolPy may add a convenience wrapper around this step.
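The blueprint idea can be illustrated with a small sketch in plain Python. Everything here is hypothetical and independent of MolPy: `ToyIR`, `validate`, `to_adjacency`, and `to_element_counts` are illustrative names, and the real IR classes may be organized quite differently. The point is only that a frozen data container can be checked cheaply and then handed to several converters.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyIR:
    """Hypothetical stand-in for a parser IR: plain, immutable data."""

    atoms: tuple  # element symbols
    bonds: tuple  # (i, j, order) triples

    def validate(self):
        # Cheap structural checks, run before any heavy object is built.
        n = len(self.atoms)
        for i, j, order in self.bonds:
            if not (0 <= i < n and 0 <= j < n) or i == j:
                raise ValueError(f"Bond ({i}, {j}) references an invalid atom")
            if order not in (1, 2, 3):
                raise ValueError(f"Unsupported bond order: {order}")


def to_adjacency(ir):
    """One possible target: a neighbor list keyed by atom index."""
    adjacency = {i: [] for i in range(len(ir.atoms))}
    for i, j, _ in ir.bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    return adjacency


def to_element_counts(ir):
    """Another target built from the same blueprint: element counts."""
    return dict(Counter(ir.atoms))


ir = ToyIR(atoms=("C", "C", "O"), bonds=((0, 1, 1), (1, 2, 1)))
ir.validate()
adjacency = to_adjacency(ir)
counts = to_element_counts(ir)
print(adjacency)
print(counts)
```

Because the dataclass is frozen, assigning to `ir.atoms` raises an error, so every converter sees the same validated data.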
Step 2: Use SMARTS for Pattern Matching
Use Case: You want to identify specific functional groups to apply force field parameters or perform reactions.
SMARTS is a language for describing molecular patterns. It looks like SMILES but adds logical operators (AND, OR, NOT) on atom and bond properties. This is the engine behind the Typifier, which assigns force field parameters to your molecule.
MolPy has a dedicated SmartsParser that builds a query graph.
from molpy.parser.smarts import SmartsParser
# Initialize the parser
parser = SmartsParser()
# Let's define a pattern for a generic Alcohol group: -C-O-H
# [C;X4]: Carbon with 4 neighbors (sp3)
# [O;H1]: Oxygen with 1 Hydrogen
pattern_str = "[C;X4][O;H1]"
print(f"Parsing SMARTS pattern: '{pattern_str}'")
query = parser.parse_smarts(pattern_str)
print(f"Query Graph Size: {len(query.atoms)} atoms")
print(f"Query Constraints: {[a.expression for a in query.atoms]}")
Parsing SMARTS pattern: '[C;X4][O;H1]'
Query Graph Size: 2 atoms
Query Constraints: [AtomExpressionIR(op='weak_and', children=[AtomPrimitiveIR(type='symbol', value='C'), AtomPrimitiveIR(type='neighbor_count', value=4)]), AtomExpressionIR(op='weak_and', children=[AtomPrimitiveIR(type='symbol', value='O'), AtomPrimitiveIR(type='hydrogen_count', value=1)])]
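To show what such a query graph is for, the sketch below evaluates the same two constraints against a hand-written ethanol graph. It is plain Python, not MolPy's matcher: the dictionaries, `atom_matches`, and `find_matches` are hypothetical, and the primitive names simply mirror the `type` fields printed above. `X4` is treated as total connectivity (heavy-atom neighbors plus attached hydrogens), following the usual SMARTS definition.

```python
# Toy molecule: ethanol, CH3-CH2-OH. Each atom stores its element and its
# attached hydrogen count; bonds list heavy-atom connections only.
atoms = [
    {"symbol": "C", "hydrogens": 3},
    {"symbol": "C", "hydrogens": 2},
    {"symbol": "O", "hydrogens": 1},
]
bonds = [(0, 1), (1, 2)]


def neighbors(index):
    """Indices of heavy atoms bonded to the given atom."""
    return [j if i == index else i for i, j in bonds if index in (i, j)]


def atom_matches(index, primitives):
    """True if the atom satisfies every primitive (a logical AND)."""
    atom = atoms[index]
    for kind, value in primitives:
        if kind == "symbol":
            ok = atom["symbol"] == value
        elif kind == "neighbor_count":
            # X<n>: total connections, counting attached hydrogens
            ok = len(neighbors(index)) + atom["hydrogens"] == value
        elif kind == "hydrogen_count":
            ok = atom["hydrogens"] == value
        else:
            raise ValueError(f"Unknown primitive: {kind}")
        if not ok:
            return False
    return True


# [C;X4][O;H1] written as two AND-ed primitive lists
query = [
    [("symbol", "C"), ("neighbor_count", 4)],
    [("symbol", "O"), ("hydrogen_count", 1)],
]


def find_matches(query):
    """Find bonded atom pairs (i, j) matching a two-atom query."""
    first, second = query
    hits = []
    for i in range(len(atoms)):
        if not atom_matches(i, first):
            continue
        for j in neighbors(i):
            if atom_matches(j, second):
                hits.append((i, j))
    return hits


matches = find_matches(query)
print(matches)
```

Both carbons satisfy `[C;X4]`, but only the second carbon is bonded to an oxygen carrying one hydrogen, so the single match is the pair (1, 2).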
Step 3: Define Monomers with BigSMILES
Use Case: You are building a polymer and need to define the monomers.
SMILES is great for static molecules, but polymers are stochastic: they are made of repeating units that can connect in various ways. BigSMILES extends SMILES to handle this.
In MolPy, we use BigSMILES to define Monomer Templates. The key addition is the concept of Bonding Descriptors (or Ports), denoted by [<], [>], [$], etc.
A typical monomer definition wraps the structure in {...} and uses bonding descriptors to mark the connection points.
from molpy.parser.smiles import parse_bigsmiles, bigsmilesir_to_monomer
# Let's define a Polystyrene monomer.
# Structure: -CH2-CH(Benzene)-
# We need two connection points for the backbone: Head and Tail.
# Syntax:
# { Start of stochastic object
# [] Left terminal (empty here)
# [<] Bonding Descriptor 'Left' (Head)
# CC Backbone carbons
# (c...1) Phenyl ring side group
# [>] Bonding Descriptor 'Right' (Tail)
# [] Right terminal (empty here)
# } End
styrene_str = "{[][<]CC(c1ccccc1)[>][]}"
print(f"Parsing BigSMILES: '{styrene_str}'")
big_ir = parse_bigsmiles(styrene_str)
# Convert IR to a usable Monomer object
monomer = bigsmilesir_to_monomer(big_ir)
print(f"Parsed Monomer Name: {monomer.get('name', 'Unknown')}")
print(f"Total Atoms: {len(monomer.atoms)}")
# Verify the ports
ports = [a for a in monomer.atoms if a.get('port')]
print("Detected Connection Ports:")
for p in ports:
    print(f" - {p.get('name')} (Original ID: {p.get('id')})")
Parsing BigSMILES: '{[][<]CC(c1ccccc1)[>][]}'
Parsed Monomer Name: Unknown
Total Atoms: 8
Detected Connection Ports:
- None (Original ID: 12)
- None (Original ID: 13)
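The role of the bonding descriptors can be seen without any parser at all. The sketch below is a hypothetical, regex-based illustration (not MolPy code, and far from a full BigSMILES grammar): it pulls the descriptors out of the string, strips them to recover the plain repeat unit, and encodes the BigSMILES pairing rule that `[<]` bonds to `[>]` while `[$]` bonds to another `[$]`.

```python
import re

# Bracketed bonding descriptor [<], [>], or [$], with an optional integer
# index such as [<1]. Hypothetical helper pattern for illustration only.
DESCRIPTOR = re.compile(r"\[([<>$])(\d*)\]")


def find_descriptors(bigsmiles):
    """Return the descriptor symbols in order of appearance."""
    return [m.group(1) for m in DESCRIPTOR.finditer(bigsmiles)]


def strip_descriptors(bigsmiles):
    """Return the plain repeat-unit SMILES of a single-unit stochastic object."""
    inner = bigsmiles.strip("{}")
    inner = inner.replace("[]", "")   # drop the empty terminal descriptors
    return DESCRIPTOR.sub("", inner)


def compatible(a, b):
    """Pairing rule: '<' joins '>', and '$' joins '$'."""
    return {a, b} == {"<", ">"} or (a == "$" and b == "$")


styrene = "{[][<]CC(c1ccccc1)[>][]}"
descriptors = find_descriptors(styrene)
unit = strip_descriptors(styrene)
heavy_atoms = sum(ch in "Cc" for ch in unit)

print(descriptors)
print(unit)
print(heavy_atoms)
print(compatible("<", ">"), compatible("<", "<"))
```

The stripped unit is `CC(c1ccccc1)`, whose eight carbons agree with the "Total Atoms: 8" reported above. The head-to-tail rule is what lets one styrene unit's `[>]` connect to the next unit's `[<]`.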
Step 4: Describe Topology with CGSmiles
Use Case: You have your monomers (from BigSMILES) and now you want to define the polymer chain architecture (e.g., block copolymer, graft, star).
CGSmiles (Coarse-Grained SMILES) is the blueprint language. It refers to monomers by name (the [#Name] syntax) and describes how they are connected without listing any atoms.
Think of BigSMILES as defining the "LEGO bricks" and CGSmiles as the "Instructions" for assembling them.
from molpy.parser.smiles import parse_cgsmiles
# Example: A Graft Copolymer.
# We have a Backbone 'A' and a Graft 'B'.
# Syntax: {[#A]([#B])}
# This means: take node A, and attach node B to it as a branch.
cg_str = "{[#A]([#B])}"
print(f"Parsing CGSmiles Topology: '{cg_str}'")
cg_ir = parse_cgsmiles(cg_str)
# The result is a Topological Graph
print(f"Nodes (Monomers): {len(cg_ir.base_graph.nodes)}")
print(f"Edges (Connections): {len(cg_ir.base_graph.bonds)}")
for bond in cg_ir.base_graph.bonds:
    print(f" Connection: {bond.node_i.label} <--> {bond.node_j.label} (Order: {bond.order})")
Parsing CGSmiles Topology: '{[#A]([#B])}'
Nodes (Monomers): 2
Edges (Connections): 1
Connection: A <--> B (Order: 1)
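To connect the string to the nodes and edges printed above, here is a minimal sketch of reading a CGSmiles-like string into a topology graph. It is a hypothetical toy reader, not MolPy's `parse_cgsmiles`: it understands only `[#Name]` nodes and parentheses for branches, and ignores bond orders, repeat operators, and the other features of the full notation.

```python
import re

# A node is written as [#Name]; this toy pattern accepts letters, digits, "_".
NODE = re.compile(r"\[#([A-Za-z0-9_]+)\]")


def toy_parse_cgsmiles(text):
    """Return (nodes, edges) for a string such as '{[#A]([#B])}'."""
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("Expected a graph enclosed in {}")
    body = text[1:-1]
    nodes = []    # monomer labels, indexed by order of appearance
    edges = []    # (i, j) pairs of node indices
    prev = None   # node that the next node will attach to
    stack = []    # branch points saved on "("
    pos = 0
    while pos < len(body):
        char = body[pos]
        if char == "(":
            stack.append(prev)   # open a branch at the current node
            pos += 1
        elif char == ")":
            prev = stack.pop()   # close the branch, return to the backbone
            pos += 1
        else:
            match = NODE.match(body, pos)
            if match is None:
                raise ValueError(f"Unexpected text at position {pos}")
            nodes.append(match.group(1))
            index = len(nodes) - 1
            if prev is not None:
                edges.append((prev, index))
            prev = index
            pos = match.end()
    return nodes, edges


nodes, edges = toy_parse_cgsmiles("{[#A]([#B])}")
print(nodes, edges)

# A backbone that continues after the graft: B hangs off the first A.
graft_nodes, graft_edges = toy_parse_cgsmiles("{[#A]([#B])[#A]}")
print(graft_nodes, graft_edges)
```

In the second string the node after the closing parenthesis attaches to the first A rather than to B, which is what makes B a side branch instead of part of the backbone.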
Summary
You now understand the language stack of MolPy:
| Abstraction Level | Language | Purpose | Output |
|---|---|---|---|
| Atomic | SMILES | Exact small molecules | SmilesGraphIR |
| Pattern | SMARTS | Substructure searching | SmartsIR |
| Monomer | BigSMILES | Building blocks with ports | BigSmilesIR |
| Topology | CGSmiles | Global architecture | CGSmilesIR |
In the next guide, 02 Polymer Stepwise, we will combine BigSMILES and CGSmiles to actually build these complex systems.