Parser¶
Grammar-based parsing for chemical string notations. Convenience functions are available under mp.parser.*.
Quick reference¶
| Function | Input | Output | Use when |
|---|---|---|---|
| parse_molecule(s) | SMILES | Atomistic | One specific molecule |
| parse_mixture(s) | dot-separated SMILES | list[Atomistic] | Multi-component ([Li+].[F-]) |
| parse_monomer(s) | BigSMILES | Atomistic (with ports) | Repeat unit with <, >, or $ markers |
| parse_polymer(s) | BigSMILES | PolymerSpec | Multi-monomer specification |
| parse_smarts(s) | SMARTS | SmartsIR | Pattern matching / typification |
| parse_smiles(s) | SMILES | SmilesGraphIR | IR-level inspection |
| parse_bigsmiles(s) | BigSMILES | BigSmilesMoleculeIR | IR-level BigSMILES inspection |
| parse_cgsmiles(s) | CGSmiles | CGSmilesIR | Topology architecture graphs |
| parse_gbigsmiles(s) | GBigSMILES | GBigSmilesSystemIR | System specs with distributions |
Canonical example¶
import molpy as mp
mol = mp.parser.parse_molecule("CCO") # Atomistic
ions = mp.parser.parse_mixture("[Li+].[F-]") # [Atomistic, Atomistic]
monomer = mp.parser.parse_monomer("{[][<]CCO[>][]}") # Atomistic with ports
spec = mp.parser.parse_polymer("{[<]CC[>],[<]CC(C)[>]}") # PolymerSpec
Related¶
- smilesir_to_atomistic: SMILES IR → Atomistic
- bigsmilesir_to_monomer: BigSMILES IR → Atomistic
- bigsmilesir_to_polymerspec: BigSMILES IR → PolymerSpec
- Guide: Parsing Chemistry
Full API¶
Convenience layer¶
parser ¶
Unified parser API for SMILES, BigSMILES, GBigSMILES, CGSmiles, and SMARTS.
Convenience wrappers live here so downstream code can do:
from molpy.parser import parse_molecule, parse_polymer, parse_smarts
PolymerSegment dataclass ¶
PolymerSegment(monomers, composition_type=None, distribution_params=None, end_groups=list(), repeat_units_ir=list(), end_groups_ir=list())
Polymer segment specification.
PolymerSpec dataclass ¶
Complete polymer specification.
SmartsParser ¶
Bases: GrammarParserBase
Main parser for SMARTS patterns.
Usage
parser = SmartsParser()
ir = parser.parse_smarts("[#6]")
ir = parser.parse_smarts("c1ccccc1")
ir = parser.parse_smarts("[C,N,O]")
bigsmilesir_to_monomer ¶
Convert BigSmilesMoleculeIR to Atomistic structure (topology only).
Single responsibility: IR → Atomistic conversion only. Parsing should be done separately.
Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic structure with ports marked on atoms, NO positions |
Raises:
| Type | Description |
|---|---|
| ValueError | If IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead) |
Examples:
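The single-repeat-unit restriction can be checked on the raw string before choosing a converter. The following is a minimal sketch using plain string handling, not the MolPy grammar; `split_repeat_units` is a hypothetical helper, and it assumes repeat units are comma-separated inside the first `{...}` with any end groups following a semicolon.

```python
def split_repeat_units(bigsmiles: str) -> list[str]:
    """Return the comma-separated repeat units of the first stochastic object.

    Hypothetical helper for illustration only; it does not use the MolPy parser.
    """
    start = bigsmiles.index("{")
    end = bigsmiles.index("}", start)
    body = bigsmiles[start + 1 : end]
    # End groups, if present, follow a semicolon; keep only the repeat units.
    body = body.split(";", 1)[0]
    return [unit for unit in body.split(",") if unit]


single = split_repeat_units("{[<]CC[>]}")
multiple = split_repeat_units("{[<]CC[>],[<]CC(C)[>]}")
print(len(single), len(multiple))  # 1 2
```

A string with one repeat unit is a candidate for bigsmilesir_to_monomer; one with several would be expected to raise ValueError there and belongs with bigsmilesir_to_polymerspec.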
bigsmilesir_to_polymerspec ¶
Convert BigSmilesMoleculeIR to a complete polymer specification.
Single responsibility: IR → PolymerSpec conversion only. Parsing should be done separately.
Extracts monomers and analyzes polymer topology and composition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| PolymerSpec | PolymerSpec with segments, topology, and all monomers |
Examples:
parse_bigsmiles ¶
Parse a BigSMILES string into BigSmilesMoleculeIR.
This parser accepts BigSMILES syntax including stochastic objects, bonding descriptors, and repeat units. It does NOT accept GBigSMILES annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | BigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| BigSmilesMoleculeIR | BigSmilesMoleculeIR containing backbone and stochastic objects |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_cgsmiles ¶
Parse a CGSmiles string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | CGSmiles string (e.g., | required |
Returns:
| Type | Description |
|---|---|
| CGSmilesIR | CGSmilesIR with base graph and fragment definitions |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_gbigsmiles ¶
Parse a GBigSMILES string into GBigSmilesSystemIR.
This parser accepts GBigSMILES syntax including all BigSMILES features plus system size specifications and other generative annotations. Always returns GBigSmilesSystemIR, wrapping single molecules in a system structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | GBigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| GBigSmilesSystemIR | GBigSmilesSystemIR containing the parsed system |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_mixture ¶
Parse a (possibly dot-separated) SMILES string into a list of molecules.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | str | SMILES string, components separated by "." | required |
Returns:
| Type | Description |
|---|---|
| list[Atomistic] | List of Atomistic |
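To make the "possibly dot-separated" input concrete, here is a minimal sketch of how such a string breaks into per-component SMILES. `split_components` is a hypothetical helper, not part of MolPy, and it assumes no ring-closure digits are shared across a dot.

```python
def split_components(smiles: str) -> list[str]:
    """Split a dot-separated SMILES string into per-component strings.

    Hypothetical helper: plain string splitting, not the MolPy grammar.
    Assumes no ring-closure label is opened in one component and closed
    in another.
    """
    parts = smiles.split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"empty component in {smiles!r}")
    return parts


print(split_components("[Li+].[F-]"))  # ['[Li+]', '[F-]']
print(split_components("CCO"))         # ['CCO']
```

A string without dots yields a one-element list, which mirrors the list return type documented for parse_mixture.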
parse_molecule ¶
Parse a SMILES string and return a single Atomistic structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | str | SMILES string for a single molecule (no dots). | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic |
parse_monomer ¶
Parse a BigSMILES string and return the first monomer as Atomistic.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bigsmiles | str | BigSMILES string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Monomer | Atomistic | Atomistic |
parse_polymer ¶
Parse a BigSMILES string and return a PolymerSpec.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| bigsmiles | str | BigSMILES string. | required |
Returns:
| Type | Description |
|---|---|
| PolymerSpec | PolymerSpec |
parse_smarts ¶
Parse a SMARTS pattern string into SmartsIR.
This is a thin wrapper around SmartsParser().parse_smarts(pattern).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pattern | str | SMARTS string. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| Parsed | SmartsIR | SmartsIR |
parse_smiles ¶
Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.
This parser only accepts pure SMILES syntax. It will reject BigSMILES or GBigSMILES constructs.
For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of SmilesGraphIR, one for each disconnected component.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | SMILES string (may contain dots for mixtures) | required |
Returns:
| Type | Description |
|---|---|
| SmilesGraphIR \| list[SmilesGraphIR] | SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors or unclosed rings are detected |
Examples:
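Because the documented return type is either one graph or a list of graphs, callers may want to normalize the result before iterating. The sketch below shows one way to do that; `GraphStub` is a stand-in class used only so the snippet runs on its own, and `as_graph_list` is a hypothetical helper rather than a MolPy function.

```python
from dataclasses import dataclass, field


@dataclass
class GraphStub:
    """Stand-in for SmilesGraphIR; exists only to make the sketch runnable."""

    atoms: list = field(default_factory=list)


def as_graph_list(result):
    """Normalize a 'graph or list of graphs' result to a list of graphs."""
    if isinstance(result, list):
        return result
    return [result]


one = as_graph_list(GraphStub(["C", "C", "O"]))         # single molecule
two = as_graph_list([GraphStub(["C"]), GraphStub(["C"])])  # dot-separated input
print(len(one), len(two))  # 1 2
```

With this shape, downstream code can loop over components without branching on whether the input contained dots.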
smilesir_to_atomistic ¶
Convert SmilesGraphIR to Atomistic structure (topology only, no 3D coordinates).
Single responsibility: IR → Atomistic conversion only. Parsing should be done separately using parse_smiles().
This is a simple conversion function for pure SMILES (no BigSMILES features like ports or descriptors). For BigSMILES with ports, use bigsmilesir_to_monomer() instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | SmilesGraphIR | SmilesGraphIR from parse_smiles() | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic structure with atoms and bonds (no 3D coordinates, no ports) |
Examples:
SMARTS¶
smarts ¶
AtomExpressionIR dataclass ¶
Represents logical expressions combining atom primitives.
Operators
- 'and' (&): high-priority AND
- 'or' (,): OR
- 'weak_and' (;): low-priority AND
- 'not' (!): negation
Examples:
- AtomExpressionIR(op='and', children=[primitive1, primitive2])
- AtomExpressionIR(op='not', children=[primitive])
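The operator tree can be read as a small boolean expression over one atom. The sketch below evaluates such a tree for the pattern [C,N;X3], which groups as weak_and(or(C, N), X3) because ";" binds more loosely than ",". `Primitive`, `Expression`, and `matches` are stand-ins written for this illustration, and the atom dictionary layout is an assumption; none of this is the MolPy matcher.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Primitive:
    """Stand-in for AtomPrimitiveIR: a named test on one atom record."""

    name: str
    test: Callable[[dict], bool]


@dataclass
class Expression:
    """Stand-in for AtomExpressionIR: an operator applied to child nodes."""

    op: str  # 'and', 'or', 'weak_and', or 'not'
    children: list = field(default_factory=list)


def matches(node, atom: dict) -> bool:
    """Evaluate a primitive or expression tree against one atom record."""
    if isinstance(node, Primitive):
        return node.test(atom)
    results = [matches(child, atom) for child in node.children]
    if node.op in ("and", "weak_and"):
        # Both AND forms mean the same thing; they differ only in precedence,
        # which the tree shape has already resolved.
        return all(results)
    if node.op == "or":
        return any(results)
    if node.op == "not":
        return not results[0]
    raise ValueError(f"unknown operator: {node.op!r}")


is_c = Primitive("C", lambda a: a["symbol"] == "C")
is_n = Primitive("N", lambda a: a["symbol"] == "N")
x3 = Primitive("X3", lambda a: a["neighbors"] == 3)

# [C,N;X3] -> weak_and(or(C, N), X3)
pattern = Expression("weak_and", [Expression("or", [is_c, is_n]), x3])

print(matches(pattern, {"symbol": "N", "neighbors": 3}))  # True
print(matches(pattern, {"symbol": "O", "neighbors": 3}))  # False
print(matches(pattern, {"symbol": "C", "neighbors": 4}))  # False
```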
AtomPrimitiveIR dataclass ¶
Represents a single primitive atom pattern in SMARTS.
Examples:
- symbol='C' (carbon atom)
- atomic_num=6 (atomic number 6)
- neighbor_count=3 (X3, exactly 3 neighbors)
- ring_size=6 (r6, in 6-membered ring)
- ring_count=2 (R2, in exactly 2 rings)
- has_label='%atomA' (has label %atomA)
- matches_smarts=SmartsIR(...) (recursive SMARTS)
SmartsAtomIR dataclass ¶
Represents a complete SMARTS atom with expression and optional label.
Attributes:
| Name | Type | Description |
|---|---|---|
| expression | AtomExpressionIR \| AtomPrimitiveIR | The atom pattern expression |
| label | int \| None | Optional numeric label for ring closures or references |
SmartsBondIR dataclass ¶
Represents a bond between two SMARTS atoms.
In SMARTS, bonds are implicit (single or aromatic) unless specified. Explicit bond types can be specified between atoms.
SmartsIR dataclass ¶
Complete SMARTS pattern intermediate representation.
Attributes:
| Name | Type | Description |
|---|---|---|
| atoms | list[SmartsAtomIR] | List of all atoms in the pattern |
| bonds | list[SmartsBondIR] | List of all bonds in the pattern |
SmartsParser ¶
Bases: GrammarParserBase
Main parser for SMARTS patterns.
Usage
parser = SmartsParser()
ir = parser.parse_smarts("[#6]")
ir = parser.parse_smarts("c1ccccc1")
ir = parser.parse_smarts("[C,N,O]")
SmartsTransformer ¶
Bases: Transformer
Transforms Lark parse tree into SmartsIR.
Handles
- Atom primitives (symbols, atomic numbers, properties)
- Logical expressions (AND, OR, NOT, weak AND)
- Branches
- Ring closures
- Recursive SMARTS patterns
atom ¶
Process complete atom: [expression] or bare_atom, with optional label.
Returns:
| Type | Description |
|---|---|
| SmartsAtomIR | SmartsAtomIR |
atom_id ¶
Process atom identifier (primitive).
Can be
- atom_symbol
- # + atomic_num (atomic number)
- $( + SMARTS + ) (recursive SMARTS)
- %label (has label)
- X + N? (neighbor count, optional number)
- x + N? (ring connectivity, optional number)
- r + N? (ring size, optional number)
- R + N? (ring count, optional number)
- H + N? (hydrogen count, optional number)
- h + N? (implicit hydrogen count, optional number)
- D + N? (degree, optional number)
- v + N? (valence, optional number)
- +/- + N? (charge)
- a (aromatic)
- A (aliphatic)
- @ / @@ (chirality)
- NUM + atom_symbol (isotope)
- atom_class (atom class reference)
branch ¶
Process branch: the content inside or after chain. This just returns the SmartsIR from _string.
implicit_and ¶
Process implicit AND: adjacent primitives without operator (e.g. #6X3r5).
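As an illustration of what "adjacent primitives without operator" means, the sketch below splits a bracket body such as #6X3r5 into its individual primitives with a regular expression. It covers only the atomic-number form and the letter-plus-optional-number forms; `split_primitives` and the pattern are assumptions made for this example, not the Lark grammar that MolPy uses.

```python
import re

# Atomic number (#N) or a single property letter with an optional count.
PRIMITIVE = re.compile(r"#\d+|[XxrRHhDv]\d*")


def split_primitives(body: str) -> list[str]:
    """Split the inside of a SMARTS bracket atom into adjacent primitives.

    Hypothetical helper covering a subset of primitives only.
    """
    tokens = PRIMITIVE.findall(body)
    if "".join(tokens) != body:
        raise ValueError(f"unsupported primitive in {body!r}")
    return tokens


print(split_primitives("#6X3r5"))  # ['#6', 'X3', 'r5']
```

Each token would then become one child of an 'and' expression, since juxtaposition carries the same meaning as the high-priority & operator.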
start ¶
Entry point: process complete SMARTS pattern.
The grammar produces a tree like: start atom ... atom ...
We need to build the IR from this flat or nested structure.
SMILES / BigSMILES / CGSmiles¶
smiles ¶
SMILES, BigSMILES, GBigSMILES, and CGSmiles parsers.
This module provides four explicit parser APIs:
- parse_smiles: Parse pure SMILES strings
- parse_bigsmiles: Parse BigSMILES strings
- parse_gbigsmiles: Parse GBigSMILES strings
- parse_cgsmiles: Parse CGSmiles strings
Each parser uses its own dedicated grammar and transformer.
BigSmilesMoleculeIR dataclass ¶
Top-level structural IR for BigSMILES strings.
BigSmilesSubgraphIR dataclass ¶
Structural fragment that carries atoms, bonds, and descriptors.
BondingDescriptorIR dataclass ¶
BondingDescriptorIR(id=_generate_id(), symbol=None, label=None, bond_order=1, role='internal', anchor_atom=None, non_covalent_context=None, extras=dict(), position_hint=None)
Standalone descriptor node for bonding points.
Per BigSMILES v1.1: bonding descriptors attach to atoms within repeat units. The anchor_atom field tracks which atom this descriptor is attached to. If anchor_atom is None, this is a terminal bonding descriptor at the stochastic object boundary.
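The anchor_atom convention can be sketched with a small stand-in dataclass. `DescriptorStub` copies a subset of the field names and defaults from the signature above, and `is_terminal` is a hypothetical helper; the anchor value used here is an arbitrary placeholder object, not a MolPy atom.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DescriptorStub:
    """Stand-in for BondingDescriptorIR with a subset of its fields."""

    symbol: Optional[str] = None
    bond_order: int = 1
    role: str = "internal"
    anchor_atom: Optional[object] = None


def is_terminal(descriptor: DescriptorStub) -> bool:
    """A descriptor without an anchor atom sits at the stochastic object boundary."""
    return descriptor.anchor_atom is None


attached = DescriptorStub(symbol="<", anchor_atom="placeholder-atom")
boundary = DescriptorStub(symbol=">")
print(is_terminal(attached), is_terminal(boundary))  # False True
```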
CGSmilesBondIR dataclass ¶
Intermediate representation for a CGSmiles bond.
Bonds directly reference NodeIR objects, not just IDs.
CGSmilesFragmentIR dataclass ¶
Fragment definition.
Maps a fragment name to its SMILES or CGSmiles representation.
CGSmilesGraphIR dataclass ¶
Coarse-grained graph representation.
Represents a molecular graph with CG nodes and bonds.
CGSmilesIR dataclass ¶
Root-level IR for CGSmiles parser.
Represents a complete CGSmiles string with base graph and fragment definitions. This is the output of the CGSmiles parser.
CGSmilesNodeIR dataclass ¶
Intermediate representation for a CGSmiles node.
A coarse-grained node with a label (e.g., "PEO", "PMA") and optional annotations.
DistributionIR dataclass ¶
Generative distribution applied to stochastic objects.
EndGroupIR dataclass ¶
Optional end-group fragments that terminate stochastic objects.
GBBondingDescriptorIR dataclass ¶
Weights associated with a bonding descriptor.
GBStochasticObjectIR dataclass ¶
Wraps a structural stochastic object plus optional distribution.
GBigSmilesComponentIR dataclass ¶
Single component entry in a gBigSMILES system.
GBigSmilesMoleculeIR dataclass ¶
GBigSmilesMoleculeIR(structure, descriptor_weights=list(), stochastic_metadata=list(), extras=dict())
gBigSMILES molecule = structure + generative metadata.
GBigSmilesSystemIR dataclass ¶
gBigSMILES system describing an ensemble of molecules.
PolymerSegment dataclass ¶
PolymerSegment(monomers, composition_type=None, distribution_params=None, end_groups=list(), repeat_units_ir=list(), end_groups_ir=list())
Polymer segment specification.
PolymerSpec dataclass ¶
Complete polymer specification.
RepeatUnitIR dataclass ¶
Repeat unit captured inside a stochastic object.
SmilesAtomIR dataclass ¶
SmilesAtomIR(id=_generate_id(), element=None, aromatic=False, charge=None, hydrogens=None, extras=dict())
Intermediate representation for a SMILES atom.
SmilesBondIR dataclass ¶
Intermediate representation for a SMILES bond.
Bonds directly reference AtomIR objects, not just IDs.
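The practical effect of holding object references instead of IDs is that a bond always sees the current state of its atoms, with no lookup table in between. The sketch below shows this with stand-in dataclasses; `AtomStub`, `BondStub`, and the field names `start`, `end`, and `order` are assumptions for illustration and may not match the real SmilesAtomIR and SmilesBondIR fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity-based equality: two carbons are distinct atoms
class AtomStub:
    """Stand-in for SmilesAtomIR."""

    element: str
    charge: Optional[int] = None


@dataclass
class BondStub:
    """Stand-in for SmilesBondIR; holds the atom objects themselves."""

    start: AtomStub
    end: AtomStub
    order: int = 1


# Ethanol-like chain C-C-O built by hand.
c1, c2, o = AtomStub("C"), AtomStub("C"), AtomStub("O")
bonds = [BondStub(c1, c2), BondStub(c2, o)]

# Editing an atom is visible through every bond that references it.
o.charge = -1
print(bonds[1].end.charge)             # -1
print(bonds[0].end is bonds[1].start)  # True
```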
SmilesGraphIR dataclass ¶
Root-level IR for SMILES parser.
Represents a molecular graph with atoms and bonds. This is the output of the SMILES parser.
StochasticObjectIR dataclass ¶
StochasticObjectIR(id=_generate_id(), terminals=TerminalDescriptorIR(), repeat_units=list(), end_groups=list(), extras=dict())
Container for repeat units, terminals, and end groups.
TerminalDescriptorIR dataclass ¶
Terminal brackets that hold descriptors for stochastic objects.
bigsmilesir_to_monomer ¶
Convert BigSmilesMoleculeIR to Atomistic structure (topology only).
Single responsibility: IR → Atomistic conversion only. Parsing should be done separately.
Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic structure with ports marked on atoms, NO positions |
Raises:
| Type | Description |
|---|---|
| ValueError | If IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead) |
Examples:
bigsmilesir_to_polymerspec ¶
Convert BigSmilesMoleculeIR to a complete polymer specification.
Single responsibility: IR → PolymerSpec conversion only. Parsing should be done separately.
Extracts monomers and analyzes polymer topology and composition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| PolymerSpec | PolymerSpec with segments, topology, and all monomers |
Examples:
parse_bigsmiles ¶
Parse a BigSMILES string into BigSmilesMoleculeIR.
This parser accepts BigSMILES syntax including stochastic objects, bonding descriptors, and repeat units. It does NOT accept GBigSMILES annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | BigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| BigSmilesMoleculeIR | BigSmilesMoleculeIR containing backbone and stochastic objects |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_cgsmiles ¶
Parse a CGSmiles string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | CGSmiles string (e.g., | required |
Returns:
| Type | Description |
|---|---|
| CGSmilesIR | CGSmilesIR with base graph and fragment definitions |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_gbigsmiles ¶
Parse a GBigSMILES string into GBigSmilesSystemIR.
This parser accepts GBigSMILES syntax including all BigSMILES features plus system size specifications and other generative annotations. Always returns GBigSmilesSystemIR, wrapping single molecules in a system structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | GBigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| GBigSmilesSystemIR | GBigSmilesSystemIR containing the parsed system |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors are detected |
Examples:
parse_smiles ¶
Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.
This parser only accepts pure SMILES syntax. It will reject BigSMILES or GBigSMILES constructs.
For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of SmilesGraphIR, one for each disconnected component.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | SMILES string (may contain dots for mixtures) | required |
Returns:
| Type | Description |
|---|---|
| SmilesGraphIR \| list[SmilesGraphIR] | SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures |
Raises:
| Type | Description |
|---|---|
| ValueError | If syntax errors or unclosed rings are detected |
Examples:
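The "unclosed rings" condition can be illustrated with a small standalone scan. The sketch below toggles ring-closure labels as they appear outside bracket atoms and reports any that remain open; `unclosed_ring_labels` is a hypothetical helper that assumes single-digit and two-digit %nn labels, and it is not how the MolPy parser detects the error.

```python
def unclosed_ring_labels(smiles: str) -> set[str]:
    """Return ring-closure labels that are opened but never closed.

    Hypothetical helper for illustration; not the MolPy grammar.
    """
    open_labels: set[str] = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside a bracket atom are isotopes, H counts, or charges.
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            label = smiles[i + 1 : i + 3]  # two-digit label, e.g. %12
            i += 3
        elif ch.isdigit():
            label = ch
            i += 1
        else:
            i += 1
            continue
        # First occurrence opens the ring, second occurrence closes it.
        open_labels ^= {label}
    return open_labels


print(unclosed_ring_labels("c1ccccc1"))  # set()
print(unclosed_ring_labels("C1CC"))      # {'1'}
```

A non-empty result corresponds to the kind of input for which parse_smiles is documented to raise ValueError.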
smilesir_to_atomistic ¶
Convert SmilesGraphIR to Atomistic structure (topology only, no 3D coordinates).
Single responsibility: IR → Atomistic conversion only. Parsing should be done separately using parse_smiles().
This is a simple conversion function for pure SMILES (no BigSMILES features like ports or descriptors). For BigSMILES with ports, use bigsmilesir_to_monomer() instead.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | SmilesGraphIR | SmilesGraphIR from parse_smiles() | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic structure with atoms and bonds (no 3D coordinates, no ports) |
Examples: