Parser¶
The parser module handles parsing of chemical strings and formats (e.g. SMILES, SMARTS).
Base¶
GrammarConfig
dataclass
¶
GrammarConfig(grammar_path, start, parser='lalr', propagate_positions=False, maybe_placeholders=False, auto_reload=True)
Configuration for the grammar-backed parser.
GrammarParserBase ¶
GrammarParserBase(config)
Bases: ABC
Base class for parsers backed by an external Lark grammar file.
Lifecycle
- Construct with a GrammarConfig
- Call parse_tree(text) to get a Lark Tree
- Implement build(tree) in subclasses to map Tree -> IR
- Call parse(text) to get your IR
Features
- Grammar is compiled once and cached
- If auto_reload=True, grammar file mtime is checked before each parse
Source code in src/molpy/parser/base.py
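The lifecycle and caching behaviour described above can be sketched without Lark. This is a minimal illustration under stated assumptions, not the molpy implementation: `SketchConfig`, `SketchParserBase`, `TokenCounter`, and the toy "grammar" (a file listing allowed characters) are all hypothetical stand-ins. It shows the template-method split between `parse_tree`, `build`, and `parse`, plus an mtime check before each parse.

```python
from __future__ import annotations

import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchConfig:
    """Hypothetical stand-in for GrammarConfig (two fields only)."""
    grammar_path: str
    auto_reload: bool = True


class SketchParserBase(ABC):
    """Template method: parse(text) == build(parse_tree(text))."""

    def __init__(self, config: SketchConfig) -> None:
        self.config = config
        self._compiled: Any = None
        self._mtime: float | None = None
        self.compile_count = 0  # exposed only to make caching observable

    def _compile(self, grammar_text: str) -> set[str]:
        # Stand-in for compiling a real grammar: keep the allowed characters.
        self.compile_count += 1
        return set(grammar_text.strip())

    def _ensure_compiled(self) -> None:
        mtime = os.path.getmtime(self.config.grammar_path)
        stale = self.config.auto_reload and mtime != self._mtime
        if self._compiled is None or stale:
            with open(self.config.grammar_path, encoding="utf-8") as fh:
                self._compiled = self._compile(fh.read())
            self._mtime = mtime

    def parse_tree(self, text: str) -> list[str]:
        self._ensure_compiled()
        bad = [ch for ch in text if ch not in self._compiled]
        if bad:
            raise ValueError(f"unexpected characters: {bad}")
        return list(text)  # toy "tree": one token per character

    @abstractmethod
    def build(self, tree: list[str]) -> Any:
        """Map the parse tree to an IR (implemented by subclasses)."""

    def parse(self, text: str) -> Any:
        return self.build(self.parse_tree(text))


class TokenCounter(SketchParserBase):
    def build(self, tree: list[str]) -> dict[str, int]:
        return {"n_tokens": len(tree)}


def demo() -> tuple[int, int]:
    """Return compile counts before and after the grammar file changes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.grammar")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("CNO")
        parser = TokenCounter(SketchConfig(path))
        parser.parse("CCO")
        parser.parse("CN")  # second parse reuses the cached grammar
        first = parser.compile_count
        with open(path, "w", encoding="utf-8") as fh:
            fh.write("CNOS")
        os.utime(path, (0, 12345))  # force a different mtime
        parser.parse("CS")  # mtime changed, so the grammar is recompiled
        return first, parser.compile_count
```

With `auto_reload=False` the same sketch would compile once and never look at the file again, which is the usual trade-off between development convenience and per-parse overhead.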
parse_tree ¶
parse_tree(text)
Parse input string into a Lark parse tree.
Source code in src/molpy/parser/base.py
SMARTS¶
AtomExpressionIR
dataclass
¶
AtomExpressionIR(op, children=list(), id=(lambda: id(AtomExpressionIR))())
Represents logical expressions combining atom primitives.
Operators
- 'and' (&): high-priority AND
- 'or' (,): OR
- 'weak_and' (;): low-priority AND
- 'not' (!): negation
Examples:
- AtomExpressionIR(op='and', children=[primitive1, primitive2])
- AtomExpressionIR(op='not', children=[primitive])
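To make the operator semantics concrete, here is a small evaluator over an expression tree. It is a sketch under assumptions: `Prim` and `Expr` are hypothetical stand-ins for `AtomPrimitiveIR` and `AtomExpressionIR`, atoms are plain dicts, and primitives match by simple equality. Because precedence is already encoded in the tree shape, `'and'` and `'weak_and'` evaluate identically here.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Prim:
    """Hypothetical stand-in for AtomPrimitiveIR."""
    type: str
    value: object = None


@dataclass
class Expr:
    """Hypothetical stand-in for AtomExpressionIR."""
    op: str
    children: list = field(default_factory=list)


Node = Union[Prim, Expr]


def matches(node: Node, atom: dict) -> bool:
    """Evaluate an expression tree against an atom described as a dict."""
    if isinstance(node, Prim):
        return atom.get(node.type) == node.value
    if node.op == "not":
        return not matches(node.children[0], atom)
    if node.op in ("and", "weak_and"):
        # Both are logical AND; they differ only in parsing precedence.
        return all(matches(child, atom) for child in node.children)
    if node.op == "or":
        return any(matches(child, atom) for child in node.children)
    raise ValueError(f"unknown operator: {node.op!r}")


# Tree for a pattern like [C,N;X3]: (C or N) weak-and (three neighbors)
pattern = Expr("weak_and", [
    Expr("or", [Prim("symbol", "C"), Prim("symbol", "N")]),
    Prim("neighbor_count", 3),
])
```

Calling `matches(pattern, {"symbol": "N", "neighbor_count": 3})` succeeds, while an oxygen with three neighbors or a carbon with four does not match.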
AtomPrimitiveIR
dataclass
¶
AtomPrimitiveIR(type, value=None, id=(lambda: id(AtomPrimitiveIR))())
Represents a single primitive atom pattern in SMARTS.
Examples:
- symbol='C' (carbon atom)
- atomic_num=6 (atomic number 6)
- neighbor_count=3 (X3, exactly 3 neighbors)
- ring_size=6 (r6, in 6-membered ring)
- ring_count=2 (R2, in exactly 2 rings)
- has_label='%atomA' (has label %atomA)
- matches_smarts=SmartsIR(...) (recursive SMARTS)
SmartsAtomIR
dataclass
¶
SmartsAtomIR(expression, label=None, id=(lambda: id(SmartsAtomIR))())
Represents a complete SMARTS atom with expression and optional label.
Attributes:
| Name | Type | Description |
|---|---|---|
| expression | AtomExpressionIR \| AtomPrimitiveIR | The atom pattern expression |
| label | int \| None | Optional numeric label for ring closures or references |
SmartsBondIR
dataclass
¶
SmartsBondIR(itom, jtom, bond_type='-')
Represents a bond between two SMARTS atoms.
In SMARTS, bonds are implicit (single or aromatic) unless specified. Explicit bond types can be specified between atoms.
SmartsIR
dataclass
¶
SmartsIR(atoms=list(), bonds=list())
Complete SMARTS pattern intermediate representation.
Attributes:
| Name | Type | Description |
|---|---|---|
| atoms | list[SmartsAtomIR] | List of all atoms in the pattern |
| bonds | list[SmartsBondIR] | List of all bonds in the pattern |
SmartsParser ¶
SmartsParser()
Bases: GrammarParserBase
Main parser for SMARTS patterns.
Usage
parser = SmartsParser()
ir = parser.parse_smarts("[#6]")
ir = parser.parse_smarts("c1ccccc1")
ir = parser.parse_smarts("[C,N,O]")
Source code in src/molpy/parser/smarts.py
parse_smarts ¶
parse_smarts(smarts)
Parse SMARTS string into SmartsIR.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smarts | str | SMARTS pattern string | required |
Returns:
| Type | Description |
|---|---|
| SmartsIR | SmartsIR representing the pattern |
Raises:
| Type | Description |
|---|---|
| ValueError | if parsing fails or rings are unclosed |
Examples:
>>> parser = SmartsParser()
>>> ir = parser.parse_smarts("C")
>>> len(ir.atoms)
1
>>> ir = parser.parse_smarts("[#6]")
>>> ir.atoms[0].expression.children[0].type
'atomic_num'
Source code in src/molpy/parser/smarts.py
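The "unclosed rings" error condition can be illustrated with a toy scanner. This is a sketch, not the logic in `smarts.py`: `pair_ring_closures` is a hypothetical helper that assumes single-letter atoms and single-digit ring labels only, and rejects everything else.

```python
def pair_ring_closures(pattern: str) -> list[tuple[int, int]]:
    """Pair ring-closure digits and return (opening_atom, closing_atom) indices.

    Toy assumptions: every letter is one atom, every digit is a ring label.
    """
    open_rings: dict[str, int] = {}
    bonds: list[tuple[int, int]] = []
    atom_index = -1
    for ch in pattern:
        if ch.isalpha():
            atom_index += 1
        elif ch.isdigit():
            if atom_index < 0:
                raise ValueError("ring label before any atom")
            if ch in open_rings:
                # Second occurrence of the label closes the ring.
                bonds.append((open_rings.pop(ch), atom_index))
            else:
                # First occurrence opens the ring at the current atom.
                open_rings[ch] = atom_index
        else:
            raise ValueError(f"unsupported character {ch!r} in this toy scanner")
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return bonds
```

For `"c1ccccc1"` the scanner pairs atom 0 with atom 5; for `"C1CC"` the label `1` is never closed, so it raises `ValueError`.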
SmartsTransformer ¶
SmartsTransformer()
Bases: Transformer
Transforms Lark parse tree into SmartsIR.
Handles
- Atom primitives (symbols, atomic numbers, properties)
- Logical expressions (AND, OR, NOT, weak AND)
- Branches
- Ring closures
- Recursive SMARTS patterns
Source code in src/molpy/parser/smarts.py
and_expression ¶
and_expression(children)
Process high-priority AND expression (&).
Source code in src/molpy/parser/smarts.py
atom ¶
atom(children)
Process complete atom: [expression] or symbol, with optional label.
Returns:
| Type | Description |
|---|---|
| SmartsAtomIR | SmartsAtomIR |
Source code in src/molpy/parser/smarts.py
atom_id ¶
atom_id(children)
Process atom identifier (primitive).
Can be
- atom_symbol
- # + atomic_num (atomic number)
- $( + SMARTS + ) (recursive SMARTS)
- %label (has label)
- X + N (neighbor count)
- r + N (ring size)
- R + N (ring count)
Source code in src/molpy/parser/smarts.py
atom_label ¶
atom_label(children)
Extract atom label (numeric).
Source code in src/molpy/parser/smarts.py
atom_symbol ¶
atom_symbol(children)
Process atom symbol (element or wildcard).
Source code in src/molpy/parser/smarts.py
atomic_num ¶
atomic_num(children)
Extract atomic number.
Source code in src/molpy/parser/smarts.py
branch ¶
branch(children)
Process branch: the content inside or after chain. This just returns the SmartsIR from _string.
Source code in src/molpy/parser/smarts.py
has_label ¶
has_label(children)
Extract label.
Source code in src/molpy/parser/smarts.py
hydrogen_count ¶
hydrogen_count(children)
Extract explicit hydrogen count.
Source code in src/molpy/parser/smarts.py
implicit_hydrogen_count ¶
implicit_hydrogen_count(children)
Extract implicit hydrogen count.
Source code in src/molpy/parser/smarts.py
matches_string ¶
matches_string(children)
Extract recursive SMARTS pattern.
Source code in src/molpy/parser/smarts.py
neighbor_count ¶
neighbor_count(children)
Extract neighbor count.
Source code in src/molpy/parser/smarts.py
nonlastbranch ¶
nonlastbranch(children)
Process non-last branch: (branch_content).
Source code in src/molpy/parser/smarts.py
not_expression ¶
not_expression(children)
Process NOT expression (!).
Source code in src/molpy/parser/smarts.py
or_expression ¶
or_expression(children)
Process OR expression (,).
Source code in src/molpy/parser/smarts.py
ring_count ¶
ring_count(children)
Extract ring count.
Source code in src/molpy/parser/smarts.py
ring_size ¶
ring_size(children)
Extract ring size.
Source code in src/molpy/parser/smarts.py
start ¶
start(children)
Entry point: process complete SMARTS pattern.
The grammar produces a tree like: start atom ... atom ...
We need to build the IR from this flat or nested structure.
Source code in src/molpy/parser/smarts.py
weak_and_expression ¶
weak_and_expression(children)
Process low-priority AND expression (;).
Source code in src/molpy/parser/smarts.py
SMILES¶
SMILES, BigSMILES, GBigSMILES, and CGSmiles parsers.
This module provides four explicit parser APIs:
- parse_smiles: Parse pure SMILES strings
- parse_bigsmiles: Parse BigSMILES strings
- parse_gbigsmiles: Parse GBigSMILES strings
- parse_cgsmiles: Parse CGSmiles strings
Each parser uses its own dedicated grammar and transformer.
BigSmilesMoleculeIR
dataclass
¶
BigSmilesMoleculeIR(backbone=BigSmilesSubgraphIR(), stochastic_objects=list())
Top-level structural IR for BigSMILES strings.
BigSmilesSubgraphIR
dataclass
¶
BigSmilesSubgraphIR(atoms=list(), bonds=list(), descriptors=list())
Structural fragment that carries atoms, bonds, and descriptors.
BondingDescriptorIR
dataclass
¶
BondingDescriptorIR(id=_generate_id(), symbol=None, label=None, bond_order=1, role='internal', anchor_atom=None, non_covalent_context=None, extras=dict(), position_hint=None)
Standalone descriptor node for bonding points.
Per BigSMILES v1.1: bonding descriptors attach to atoms within repeat units. The anchor_atom field tracks which atom this descriptor is attached to. If anchor_atom is None, this is a terminal bonding descriptor at the stochastic object boundary.
CGSmilesBondIR
dataclass
¶
CGSmilesBondIR(node_i, node_j, order=1, id=_generate_id())
Intermediate representation for a CGSmiles bond.
Bonds directly reference NodeIR objects, not just IDs.
CGSmilesFragmentIR
dataclass
¶
CGSmilesFragmentIR(name='', body='')
Fragment definition.
Maps a fragment name to its SMILES or CGSmiles representation.
CGSmilesGraphIR
dataclass
¶
CGSmilesGraphIR(nodes=list(), bonds=list())
Coarse-grained graph representation.
Represents a molecular graph with CG nodes and bonds.
CGSmilesIR
dataclass
¶
CGSmilesIR(base_graph=CGSmilesGraphIR(), fragments=list())
Root-level IR for CGSmiles parser.
Represents a complete CGSmiles string with base graph and fragment definitions. This is the output of the CGSmiles parser.
CGSmilesNodeIR
dataclass
¶
CGSmilesNodeIR(id=_generate_id(), label='', annotations=dict())
Intermediate representation for a CGSmiles node.
A coarse-grained node with a label (e.g., "PEO", "PMA") and optional annotations.
DistributionIR
dataclass
¶
DistributionIR(name, params=dict())
Generative distribution applied to stochastic objects.
EndGroupIR
dataclass
¶
EndGroupIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())
Optional end-group fragments that terminate stochastic objects.
GBBondingDescriptorIR
dataclass
¶
GBBondingDescriptorIR(structural, global_weight=None, pair_weights=None, extras=dict())
Weights associated with a bonding descriptor.
GBStochasticObjectIR
dataclass
¶
GBStochasticObjectIR(structural, distribution=None)
Wraps a structural stochastic object plus optional distribution.
GBigSmilesComponentIR
dataclass
¶
GBigSmilesComponentIR(molecule, target_mass=None, mass_is_fraction=False, extras=dict())
Single component entry in a gBigSMILES system.
GBigSmilesMoleculeIR
dataclass
¶
GBigSmilesMoleculeIR(structure, descriptor_weights=list(), stochastic_metadata=list(), extras=dict())
gBigSMILES molecule = structure + generative metadata.
GBigSmilesSystemIR
dataclass
¶
GBigSmilesSystemIR(molecules=list(), total_mass=None)
gBigSMILES system describing an ensemble of molecules.
PolymerSegment
dataclass
¶
PolymerSegment(monomers, composition_type=None, distribution_params=None, end_groups=list(), repeat_units_ir=list(), end_groups_ir=list())
Polymer segment specification.
PolymerSpec
dataclass
¶
PolymerSpec(segments, topology, start_group_ir=None, end_group_ir=None)
Complete polymer specification.
all_monomers ¶
all_monomers()
Get all structures from all segments.
Source code in src/molpy/parser/smiles/converter.py
RepeatUnitIR
dataclass
¶
RepeatUnitIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())
Repeat unit captured inside a stochastic object.
SmilesAtomIR
dataclass
¶
SmilesAtomIR(id=_generate_id(), element=None, aromatic=False, charge=None, hydrogens=None, extras=dict())
Intermediate representation for a SMILES atom.
SmilesBondIR
dataclass
¶
SmilesBondIR(itom, jtom, order=1, stereo=None, id=_generate_id())
Intermediate representation for a SMILES bond.
Bonds directly reference AtomIR objects, not just IDs.
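The practical effect of bonds holding atom objects rather than IDs is that graph traversal needs no lookup table. The sketch below uses hypothetical stand-ins (`AtomStub`, `BondStub`, `neighbors`) that mirror only the `itom`/`jtom`/`order` fields shown in the signature above; the real IR classes carry more fields.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # identity-based equality, so two carbons stay distinct
class AtomStub:
    """Hypothetical stand-in for SmilesAtomIR."""
    element: str


@dataclass(eq=False)
class BondStub:
    """Hypothetical stand-in for SmilesBondIR: endpoints are objects, not IDs."""
    itom: AtomStub
    jtom: AtomStub
    order: int = 1


def neighbors(atom: AtomStub, bonds: list[BondStub]) -> list[AtomStub]:
    """Collect bonded partners by object identity, with no ID lookup."""
    found = []
    for bond in bonds:
        if bond.itom is atom:
            found.append(bond.jtom)
        elif bond.jtom is atom:
            found.append(bond.itom)
    return found


# Ethanol-like chain C-C-O built by hand
c1, c2, o = AtomStub("C"), AtomStub("C"), AtomStub("O")
bonds = [BondStub(c1, c2), BondStub(c2, o)]
```

Here `neighbors(c2, bonds)` returns the first carbon and the oxygen as the same objects that sit in the atom list, so any mutation of an atom is visible through every bond that references it.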
SmilesGraphIR
dataclass
¶
SmilesGraphIR(atoms=list(), bonds=list())
Root-level IR for SMILES parser.
Represents a molecular graph with atoms and bonds. This is the output of the SMILES parser.
StochasticObjectIR
dataclass
¶
StochasticObjectIR(id=_generate_id(), terminals=TerminalDescriptorIR(), repeat_units=list(), end_groups=list(), extras=dict())
Container for repeat units, terminals, and end groups.
TerminalDescriptorIR
dataclass
¶
TerminalDescriptorIR(descriptors=list(), extras=dict())
Terminal brackets that hold descriptors for stochastic objects.
bigsmilesir_to_monomer ¶
bigsmilesir_to_monomer(ir)
Convert BigSmilesMoleculeIR to Atomistic structure (topology only).
Single responsibility: IR → Atomistic conversion only. Parsing should be done separately.
Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| Atomistic | Atomistic structure with ports marked on atoms, NO positions |
Raises:
| Type | Description |
|---|---|
| ValueError | If the IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead) |
Examples:
>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> struct = bigsmilesir_to_monomer(ir)
>>> # Ports are marked on atoms: atom["port"] = "<" or ">"
Source code in src/molpy/parser/smiles/converter.py
bigsmilesir_to_polymerspec ¶
bigsmilesir_to_polymerspec(ir)
Convert BigSmilesMoleculeIR to a complete polymer specification.
Single responsibility: IR -> PolymerSpec conversion only. Parsing should be done separately.
Extracts monomers and analyzes polymer topology and composition.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| ir | BigSmilesMoleculeIR | BigSmilesMoleculeIR from parser | required |
Returns:
| Type | Description |
|---|---|
| PolymerSpec | PolymerSpec with segments, topology, and all monomers |
Examples:
>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> spec = bigsmilesir_to_polymerspec(ir)
>>> spec.topology
'homopolymer'
Source code in src/molpy/parser/smiles/converter.py
parse_bigsmiles ¶
parse_bigsmiles(src)
Parse a BigSMILES string into BigSmilesMoleculeIR.
This parser accepts BigSMILES syntax including stochastic objects, bond descriptors, and repeat units. It does NOT accept GBigSMILES annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | BigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| BigSmilesMoleculeIR | BigSmilesMoleculeIR containing backbone and stochastic objects |
Raises:
| Type | Description |
|---|---|
| ValueError | if syntax errors detected |
Examples:
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> len(ir.stochastic_objects)
1
Source code in src/molpy/parser/smiles/__init__.py
parse_cgsmiles ¶
parse_cgsmiles(src)
Parse a CGSmiles string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | CGSmiles string (e.g., "{[#PEO][#PMA]}.{#PEO=[$]COC[$]}") | required |
Returns:
| Type | Description |
|---|---|
| CGSmilesIR | CGSmilesIR with base graph and fragment definitions |
Raises:
| Type | Description |
|---|---|
| ValueError | if syntax errors detected |
Examples:
>>> result = parse_cgsmiles("{[#PEO][#PMA][#PEO]}")
>>> len(result.base_graph.nodes)
3
>>> result = parse_cgsmiles("{[#PEO]|5}")
>>> len(result.base_graph.nodes)
5
Source code in src/molpy/parser/smiles/cgsmiles_parser.py
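The node counts in the examples above follow from the `|N` repeat operator expanding one node into N copies. A toy expansion, assuming only bracketed `[#LABEL]` nodes with an optional `|N` suffix inside a single `{...}` block, might look like this; `expand_nodes` is a hypothetical helper and covers a small subset of what the real grammar handles.

```python
import re

# One node: [#LABEL] optionally followed by |N (repeat count)
_NODE = re.compile(r"\[#(?P<label>\w+)\](?:\|(?P<count>\d+))?")


def expand_nodes(src: str) -> list[str]:
    """Return node labels of a toy CGSmiles graph with repeats expanded."""
    if not (src.startswith("{") and src.endswith("}")):
        raise ValueError("expected a single {...} graph block")
    body = src[1:-1]
    labels: list[str] = []
    pos = 0
    for match in _NODE.finditer(body):
        if match.start() != pos:
            # Something between nodes that this toy scanner does not know.
            raise ValueError(f"unexpected text at offset {pos}: {body[pos:match.start()]!r}")
        repeat = int(match.group("count") or 1)
        labels.extend([match.group("label")] * repeat)
        pos = match.end()
    if pos != len(body):
        raise ValueError(f"unexpected trailing text: {body[pos:]!r}")
    return labels
```

Under these assumptions, `expand_nodes("{[#PEO]|5}")` yields five `"PEO"` labels and `expand_nodes("{[#PEO][#PMA][#PEO]}")` yields three, matching the node counts shown in the doctest.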
parse_gbigsmiles ¶
parse_gbigsmiles(src)
Parse a GBigSMILES string into GBigSmilesSystemIR.
This parser accepts GBigSMILES syntax including all BigSMILES features plus system size specifications and other generative annotations. Always returns GBigSmilesSystemIR, wrapping single molecules in a system structure.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | GBigSMILES string | required |
Returns:
| Type | Description |
|---|---|
| GBigSmilesSystemIR | GBigSmilesSystemIR containing the parsed system |
Raises:
| Type | Description |
|---|---|
| ValueError | if syntax errors detected |
Examples:
>>> ir = parse_gbigsmiles("{[<]CC[>]}|5e5|")
>>> isinstance(ir, GBigSmilesSystemIR)
True
Source code in src/molpy/parser/smiles/__init__.py
parse_smiles ¶
parse_smiles(src)
Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.
This parser only accepts pure SMILES syntax. It will reject BigSMILES or GBigSMILES constructs.
For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of SmilesGraphIR, one for each disconnected component.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| src | str | SMILES string (may contain dots for mixtures) | required |
Returns:
| Type | Description |
|---|---|
| SmilesGraphIR \| list[SmilesGraphIR] | SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures |
Raises:
| Type | Description |
|---|---|
| ValueError | if syntax errors detected or unclosed rings |
Examples:
>>> ir = parse_smiles("CCO")
>>> len(ir.atoms)
3
>>> irs = parse_smiles("C.C")
>>> len(irs)
2
Source code in src/molpy/parser/smiles/__init__.py
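Because the return type depends on whether the input contains dots, calling code often wants a uniform shape. One possible normalization is sketched below; `as_component_list` is a hypothetical helper, not part of molpy, and `GraphStub` stands in for `SmilesGraphIR` so the snippet runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class GraphStub:
    """Hypothetical stand-in for SmilesGraphIR."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def as_component_list(result):
    """Wrap a single graph in a list; pass a list of graphs through as a copy."""
    if isinstance(result, list):
        return list(result)
    return [result]


# Shapes corresponding to a single molecule and to a dot-separated mixture
single = GraphStub(atoms=["C", "C", "O"])
mixture = [GraphStub(atoms=["C"]), GraphStub(atoms=["C"])]
```

With real parser output the call would be `as_component_list(parse_smiles(src))`, after which the caller can always iterate over components regardless of input.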