Parser

The parser module handles parsing of chemical strings and formats (e.g. SMILES, SMARTS).

Base

GrammarConfig dataclass

GrammarConfig(grammar_path, start, parser='lalr', propagate_positions=False, maybe_placeholders=False, auto_reload=True)

Configuration for the grammar-backed parser.

GrammarParserBase

GrammarParserBase(config)

Bases: ABC

Base class for parsers backed by an external Lark grammar file.

Lifecycle
  • Construct with a GrammarConfig
  • Call parse_tree(text) to get a Lark Tree
  • Implement build(tree) in subclasses to map Tree -> IR
  • Call parse(text) to get your IR
Features
  • Grammar is compiled once and cached
  • If auto_reload=True, grammar file mtime is checked before each parse
Source code in src/molpy/parser/base.py
def __init__(self, config: GrammarConfig):
    self.config = config
    self._lark: Lark | None = None
    self._mtime: float | None = None
    self._compile_grammar(force=True)
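
The lifecycle and the mtime-based reload described above can be sketched without Lark. This is a minimal illustration, not molpy's implementation: `ReloadingParserSketch`, `UpperParser`, and `compile_count` are hypothetical names, and "compiling" here only reads the file where a real parser would build a Lark instance.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional


class ReloadingParserSketch(ABC):
    """Hypothetical stand-in for the lifecycle above; not molpy's class."""

    def __init__(self, grammar_path: Path, auto_reload: bool = True):
        self.grammar_path = grammar_path
        self.auto_reload = auto_reload
        self._compiled: Optional[str] = None
        self._mtime: Optional[float] = None
        self.compile_count = 0
        self._compile(force=True)

    def _compile(self, force: bool = False) -> None:
        mtime = self.grammar_path.stat().st_mtime
        if force or mtime != self._mtime:
            # A real implementation would build a Lark parser here.
            self._compiled = self.grammar_path.read_text()
            self._mtime = mtime
            self.compile_count += 1

    def parse_tree(self, text: str):
        if self.auto_reload:
            self._compile()  # recompiles only if the file mtime changed
        assert self._compiled is not None
        return (self._compiled, text)  # placeholder for a parse tree

    @abstractmethod
    def build(self, tree):
        """Map the tree to an IR; subclasses decide what the IR is."""

    def parse(self, text: str):
        return self.build(self.parse_tree(text))


class UpperParser(ReloadingParserSketch):
    def build(self, tree):
        return tree[1].upper()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.grammar"
    path.write_text("v1")
    parser = UpperParser(path)
    first = parser.parse("cco")
    count_before = parser.compile_count
    # Rewrite the file and bump its mtime explicitly so the change is seen.
    path.write_text("v2")
    stat = path.stat()
    os.utime(path, (stat.st_atime, stat.st_mtime + 10))
    tree_after = parser.parse_tree("cco")
    count_after = parser.compile_count
```

The grammar is compiled once in the constructor and again only after the file changes, which is the behaviour the Features list describes for auto_reload=True.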

parse_tree

parse_tree(text)

Parse input string into a Lark parse tree.

Source code in src/molpy/parser/base.py
def parse_tree(self, text: str) -> Tree:
    """
    Parse input string into a Lark parse tree.
    """
    self._maybe_reload()
    assert self._lark is not None
    return self._lark.parse(text)

SMARTS

AtomExpressionIR dataclass

AtomExpressionIR(op, children=list(), id=(lambda: id(AtomExpressionIR))())

Represents logical expressions combining atom primitives.

Operators
  • 'and' (&): high-priority AND
  • 'or' (,): OR
  • 'weak_and' (;): low-priority AND
  • 'not' (!): negation

Examples:

  • AtomExpressionIR(op='and', children=[primitive1, primitive2])
  • AtomExpressionIR(op='not', children=[primitive])
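
To show how the four operators combine, here is a small evaluator over such a tree. It is a sketch under stated assumptions: `Prim` and `Expr` are hypothetical stand-ins for AtomPrimitiveIR and AtomExpressionIR, atoms are plain dicts, and `matches` is not part of the module.

```python
from dataclasses import dataclass, field


@dataclass
class Prim:
    """Hypothetical stand-in for AtomPrimitiveIR."""
    type: str
    value: object = None


@dataclass
class Expr:
    """Hypothetical stand-in for AtomExpressionIR."""
    op: str
    children: list = field(default_factory=list)


def matches(node, atom: dict) -> bool:
    """Evaluate an expression tree against a toy atom given as a dict."""
    if isinstance(node, Prim):
        if node.type == "wildcard":
            return True
        return atom.get(node.type) == node.value
    if node.op == "not":
        return not matches(node.children[0], atom)
    if node.op == "or":
        return any(matches(c, atom) for c in node.children)
    if node.op in {"and", "weak_and", "primitive"}:
        # '&' and ';' differ only in how tightly they bind while parsing;
        # once the tree is built, both are a plain conjunction.
        return all(matches(c, atom) for c in node.children)
    raise ValueError(f"unknown operator: {node.op!r}")


# [C,N;X3] groups as (C or N) and X3, because ';' binds more loosely than ','.
pattern = Expr("weak_and", [
    Expr("or", [Prim("symbol", "C"), Prim("symbol", "N")]),
    Prim("neighbor_count", 3),
])

amine_n = {"symbol": "N", "neighbor_count": 3}
carbonyl_o = {"symbol": "O", "neighbor_count": 1}
```

With these definitions, `matches(pattern, amine_n)` is true and `matches(pattern, carbonyl_o)` is false.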

AtomPrimitiveIR dataclass

AtomPrimitiveIR(type, value=None, id=(lambda: id(AtomPrimitiveIR))())

Represents a single primitive atom pattern in SMARTS.

Examples:

  • symbol='C' (carbon atom)
  • atomic_num=6 (atomic number 6)
  • neighbor_count=3 (X3, exactly 3 neighbors)
  • ring_size=6 (r6, in 6-membered ring)
  • ring_count=2 (R2, in exactly 2 rings)
  • has_label='%atomA' (has label %atomA)
  • matches_smarts=SmartsIR(...) (recursive SMARTS)

SmartsAtomIR dataclass

SmartsAtomIR(expression, label=None, id=(lambda: id(SmartsAtomIR))())

Represents a complete SMARTS atom with expression and optional label.

Attributes:

  • expression (AtomExpressionIR | AtomPrimitiveIR): The atom pattern expression
  • label (int | None): Optional numeric label for ring closures or references

SmartsBondIR dataclass

SmartsBondIR(itom, jtom, bond_type='-')

Represents a bond between two SMARTS atoms.

In SMARTS, bonds are implicit (single or aromatic) unless specified. Explicit bond types can be specified between atoms.

SmartsIR dataclass

SmartsIR(atoms=list(), bonds=list())

Complete SMARTS pattern intermediate representation.

Attributes:

  • atoms (list[SmartsAtomIR]): List of all atoms in the pattern
  • bonds (list[SmartsBondIR]): List of all bonds in the pattern

SmartsParser

SmartsParser()

Bases: GrammarParserBase

Main parser for SMARTS patterns.

Usage

parser = SmartsParser()
ir = parser.parse_smarts("[#6]")
ir = parser.parse_smarts("c1ccccc1")
ir = parser.parse_smarts("[C,N,O]")

Source code in src/molpy/parser/smarts.py
def __init__(self):
    config = GrammarConfig(
        grammar_path=Path(__file__).parent / "grammar" / "smarts.lark",
        start="start",
        parser="earley",
        propagate_positions=True,
        maybe_placeholders=False,
        auto_reload=True,
    )
    super().__init__(config)

parse_smarts

parse_smarts(smarts)

Parse SMARTS string into SmartsIR.

Parameters:

  • smarts (str, required): SMARTS pattern string

Returns:

  • SmartsIR: SmartsIR representing the pattern

Raises:

  • ValueError: if parsing fails or rings are left unclosed

Examples:

>>> parser = SmartsParser()
>>> ir = parser.parse_smarts("C")
>>> len(ir.atoms)
1
>>> ir = parser.parse_smarts("[#6]")
>>> ir.atoms[0].expression.children[0].type
'atomic_num'
Source code in src/molpy/parser/smarts.py
def parse_smarts(self, smarts: str) -> SmartsIR:
    """
    Parse SMARTS string into SmartsIR.

    Args:
        smarts: SMARTS pattern string

    Returns:
        SmartsIR representing the pattern

    Raises:
        ValueError: if parsing fails or rings are unclosed

    Examples:
        >>> parser = SmartsParser()
        >>> ir = parser.parse_smarts("C")
        >>> len(ir.atoms)
        1
        >>> ir = parser.parse_smarts("[#6]")
        >>> ir.atoms[0].expression.children[0].type
        'atomic_num'
    """
    tree = self.parse_tree(smarts)
    transformer = SmartsTransformer()
    ir: SmartsIR = transformer.transform(tree)

    # Check for unclosed rings
    if transformer.ring_openings:
        unclosed = list(transformer.ring_openings.keys())
        raise ValueError(f"Unclosed rings in SMARTS: {unclosed}")

    return ir
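
The unclosed-ring check above relies on bookkeeping done during the transform: a ring digit is recorded when first seen and removed when seen again. A stripped-down sketch of that idea on raw text follows. It assumes only single-digit closures outside brackets (no `%nn` closures, no bond symbols), and `unclosed_rings` and `check_rings` are hypothetical helpers, not molpy functions.

```python
def unclosed_rings(pattern: str) -> list:
    """Return ring digits that were opened but never closed, in opening order."""
    open_digits = []
    in_bracket = False
    for ch in pattern:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket and ch.isdigit():
            # Digits inside brackets (e.g. #6, X3) are not ring closures.
            if ch in open_digits:
                open_digits.remove(ch)  # second occurrence closes the ring
            else:
                open_digits.append(ch)  # first occurrence opens it
    return open_digits


def check_rings(pattern: str) -> None:
    """Raise ValueError in the same style as the check shown above."""
    unclosed = unclosed_rings(pattern)
    if unclosed:
        raise ValueError(f"Unclosed rings in SMARTS: {unclosed}")
```

For example, `unclosed_rings("c1ccccc1")` is empty, while `check_rings("C1CC")` raises because ring 1 is never closed.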

SmartsTransformer

SmartsTransformer()

Bases: Transformer

Transforms Lark parse tree into SmartsIR.

Handles
  • Atom primitives (symbols, atomic numbers, properties)
  • Logical expressions (AND, OR, NOT, weak AND)
  • Branches
  • Ring closures
  • Recursive SMARTS patterns
Source code in src/molpy/parser/smarts.py
def __init__(self):
    super().__init__()
    # Track ring openings: {ring_id: (atom, bond_type)}
    self.ring_openings: dict[str, tuple[SmartsAtomIR, str | None]] = {}

and_expression

and_expression(children)

Process high-priority AND expression (&).

Source code in src/molpy/parser/smarts.py
def and_expression(self, children: list) -> AtomExpressionIR:
    """Process high-priority AND expression (&)."""
    # Filter out operator tokens like &
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        # Single primitive, wrap it
        child = filtered[0]
        if isinstance(child, AtomPrimitiveIR):
            return AtomExpressionIR(op="primitive", children=[child])
        return child

    # Flatten nested AND expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "and":
            # Flatten nested AND
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    # Multiple children connected by &
    return AtomExpressionIR(op="and", children=flat_children)
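
The flattening step is the same for every associative operator, so it can be shown in isolation. This sketch uses a hypothetical `Expr` stand-in for AtomExpressionIR and a hypothetical `flatten` helper; like the method above, it lifts only direct children with the same operator.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    """Hypothetical stand-in for AtomExpressionIR."""
    op: str
    children: list = field(default_factory=list)


def flatten(op: str, children: list) -> Expr:
    """Merge directly nested nodes with the same operator into one node."""
    flat = []
    for child in children:
        if isinstance(child, Expr) and child.op == op:
            flat.extend(child.children)  # lift grandchildren one level
        else:
            flat.append(child)  # different operator or a leaf: keep as is
    return Expr(op, flat)


# Strings stand in for primitives to keep the example short.
nested = [Expr("and", ["a", "b"]), "c", Expr("or", ["d", "e"])]
result = flatten("and", nested)
```

Here `result.children` is `["a", "b", "c", Expr("or", ["d", "e"])]`: the nested AND is merged, while the OR node is kept intact because merging it would change the meaning.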

atom

atom(children)

Process complete atom: [expression] or symbol, with optional label.

Returns:

  • SmartsAtomIR

Source code in src/molpy/parser/smarts.py
def atom(self, children: list) -> SmartsAtomIR:
    """
    Process complete atom: [expression] or symbol, with optional label.

    Returns:
        SmartsAtomIR
    """
    # Filter out bracket tokens
    filtered = [
        c
        for c in children
        if not (isinstance(c, Token) and c.type in {"LSQB", "RSQB"})
    ]

    if not filtered:
        # Empty bracketed atom, use wildcard
        prim = AtomPrimitiveIR(type="wildcard")
        expr = AtomExpressionIR(op="primitive", children=[prim])
        return SmartsAtomIR(expression=expr)

    expression = filtered[0]
    label = filtered[1] if len(filtered) > 1 else None

    # Wrap primitive in expression if needed
    if isinstance(expression, AtomPrimitiveIR):
        expression = AtomExpressionIR(op="primitive", children=[expression])

    return SmartsAtomIR(expression=expression, label=label)

atom_id

atom_id(children)

Process atom identifier (primitive).

Can be
  • atom_symbol
  • # + atomic_num (atomic number)
  • $( + SMARTS + ) (recursive SMARTS)
  • %label (has label)
  • X + N (neighbor count)
  • r + N (ring size)
  • R + N (ring count)
Source code in src/molpy/parser/smarts.py
def atom_id(self, children: list) -> AtomPrimitiveIR:
    """
    Process atom identifier (primitive).

    Can be:
        - atom_symbol
        - # + atomic_num (atomic number)
        - $( + SMARTS + ) (recursive SMARTS)
        - %label (has label)
        - X + N (neighbor count)
        - r + N (ring size)
        - R + N (ring count)
    """
    # Filter out operator tokens like #, X, r, R, H, h, $(, )
    filtered = [
        c
        for c in children
        if not (
            isinstance(c, Token)
            and c.value in {"#", "X", "r", "R", "H", "h", "$(", ")", "$", "("}
        )
    ]

    if not filtered:
        return AtomPrimitiveIR(type="wildcard")

    child = filtered[0]

    # Already processed atom_symbol
    if isinstance(child, AtomPrimitiveIR):
        return child

    # Check the first token in original children to determine type
    first_token = next((c for c in children if isinstance(c, Token)), None)

    if first_token and first_token.value == "#":
        # Atomic number (#N)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="atomic_num", value=child)

    elif first_token and first_token.value in {"$", "$("}:
        # Recursive SMARTS: $(...)
        if isinstance(child, SmartsIR):
            return AtomPrimitiveIR(type="matches_smarts", value=child)

    elif first_token and first_token.value == "X":
        # Neighbor count (XN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="neighbor_count", value=child)

    elif first_token and first_token.value == "r":
        # Ring size (rN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="ring_size", value=child)

    elif first_token and first_token.value == "R":
        # Ring count (RN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="ring_count", value=child)

    elif first_token and first_token.value == "H":
        # Explicit hydrogen count (H2)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="hydrogen_count", value=child)

    elif first_token and first_token.value == "h":
        # Implicit hydrogen count (h2)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="implicit_hydrogen_count", value=child)

    # Label (%label)
    if isinstance(child, str) and child.startswith("%"):
        return AtomPrimitiveIR(type="has_label", value=child)

    # Fallback to symbol
    return AtomPrimitiveIR(type="symbol", value=str(child))
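
The prefix-to-type mapping used by atom_id can also be written as a lookup table. The sketch below works on raw text rather than Lark tokens and leaves out recursive SMARTS. `PREFIX_TYPES` and `classify` are hypothetical names for illustration only, while the type strings are the ones that appear in the listing above.

```python
import re

# Prefix -> primitive type, mirroring the branches of atom_id.
PREFIX_TYPES = {
    "#": "atomic_num",
    "X": "neighbor_count",
    "r": "ring_size",
    "R": "ring_count",
    "H": "hydrogen_count",
    "h": "implicit_hydrogen_count",
}

# One prefix character followed by a non-empty run of digits.
_NUMERIC = re.compile(r"([#XrRHh])(\d+)")


def classify(token: str) -> tuple:
    """Map one primitive written as text to a (type, value) pair."""
    if token == "*":
        return ("wildcard", None)
    if token.startswith("%"):
        return ("has_label", token)
    match = _NUMERIC.fullmatch(token)
    if match:
        return (PREFIX_TYPES[match.group(1)], int(match.group(2)))
    # Anything else is treated as an element symbol, as in the fallback above.
    return ("symbol", token)
```

For instance, `classify("#6")` gives `("atomic_num", 6)` and `classify("X3")` gives `("neighbor_count", 3)`.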

atom_label

atom_label(children)

Extract atom label (numeric).

Source code in src/molpy/parser/smarts.py
def atom_label(self, children: list[int]) -> int:
    """Extract atom label (numeric)."""
    return children[0]

atom_symbol

atom_symbol(children)

Process atom symbol (element or wildcard).

Source code in src/molpy/parser/smarts.py
def atom_symbol(self, children: list[Token]) -> AtomPrimitiveIR:
    """Process atom symbol (element or wildcard)."""
    symbol = children[0] if isinstance(children[0], str) else children[0].value
    if symbol == "*":
        return AtomPrimitiveIR(type="wildcard")
    return AtomPrimitiveIR(type="symbol", value=symbol)

atomic_num

atomic_num(children)

Extract atomic number.

Source code in src/molpy/parser/smarts.py
def atomic_num(self, children: list[int]) -> int:
    """Extract atomic number."""
    return children[0]

branch

branch(children)

Process branch: the content inside or after chain. This just returns the SmartsIR from _string.

Source code in src/molpy/parser/smarts.py
def branch(self, children: list) -> SmartsIR:
    """
    Process branch: the content inside or after chain.
    This just returns the SmartsIR from _string.
    """
    # Branch contains the result of _string
    return self._build_ir_from_children(children)

has_label

has_label(children)

Extract label.

Source code in src/molpy/parser/smarts.py
def has_label(self, children: list[str]) -> str:
    """Extract label."""
    return children[0]

hydrogen_count

hydrogen_count(children)

Extract explicit hydrogen count.

Source code in src/molpy/parser/smarts.py
def hydrogen_count(self, children: list[int]) -> int:
    """Extract explicit hydrogen count."""
    return children[0]

implicit_hydrogen_count

implicit_hydrogen_count(children)

Extract implicit hydrogen count.

Source code in src/molpy/parser/smarts.py
def implicit_hydrogen_count(self, children: list[int]) -> int:
    """Extract implicit hydrogen count."""
    return children[0]

matches_string

matches_string(children)

Extract recursive SMARTS pattern.

Source code in src/molpy/parser/smarts.py
def matches_string(self, children: list) -> SmartsIR:
    """Extract recursive SMARTS pattern."""
    return children[0]

neighbor_count

neighbor_count(children)

Extract neighbor count.

Source code in src/molpy/parser/smarts.py
def neighbor_count(self, children: list[int]) -> int:
    """Extract neighbor count."""
    return children[0]

nonlastbranch

nonlastbranch(children)

Process non-last branch: (branch_content).

Source code in src/molpy/parser/smarts.py
def nonlastbranch(self, children: list) -> tuple:
    """Process non-last branch: (branch_content)."""
    # Filter parentheses and get branch IR
    filtered = [
        c
        for c in children
        if not (isinstance(c, Token) and c.type in {"LPAR", "RPAR"})
    ]
    if filtered and isinstance(filtered[0], SmartsIR):
        return ("branch", filtered[0], None)
    return ("branch", SmartsIR(), None)

not_expression

not_expression(children)

Process NOT expression (!).

Source code in src/molpy/parser/smarts.py
def not_expression(self, children: list) -> AtomExpressionIR:
    """Process NOT expression (!)."""
    # Filter out operator tokens like !
    filtered = [c for c in children if not isinstance(c, Token)]
    if not filtered:
        return AtomExpressionIR(op="not", children=[])
    return AtomExpressionIR(op="not", children=[filtered[0]])

or_expression

or_expression(children)

Process OR expression (,).

Source code in src/molpy/parser/smarts.py
def or_expression(self, children: list) -> AtomExpressionIR:
    """Process OR expression (,)."""
    # Filter out operator tokens like ,
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        return filtered[0]

    # Flatten nested OR expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "or":
            # Flatten nested OR
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    return AtomExpressionIR(op="or", children=flat_children)

ring_count

ring_count(children)

Extract ring count.

Source code in src/molpy/parser/smarts.py
def ring_count(self, children: list[int]) -> int:
    """Extract ring count."""
    return children[0]

ring_size

ring_size(children)

Extract ring size.

Source code in src/molpy/parser/smarts.py
def ring_size(self, children: list[int]) -> int:
    """Extract ring size."""
    return children[0]

start

start(children)

Entry point: process complete SMARTS pattern.

The grammar produces a tree like:

start
  atom ...
  atom ...

We need to build the IR from this flat or nested structure.

Source code in src/molpy/parser/smarts.py
def start(self, children: list) -> SmartsIR:
    """Entry point: process complete SMARTS pattern.

    The grammar produces a tree like:
    start
      atom ...
      atom ...

    We need to build the IR from this flat or nested structure.
    """
    if not children:
        return SmartsIR()

    # children from start contains result from _string
    # which could be atoms or nested structures
    return self._build_ir_from_children(children)

weak_and_expression

weak_and_expression(children)

Process low-priority AND expression (;).

Source code in src/molpy/parser/smarts.py
def weak_and_expression(self, children: list) -> AtomExpressionIR:
    """Process low-priority AND expression (;)."""
    # Filter out operator tokens like ;
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        return filtered[0]

    # Flatten nested weak AND expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "weak_and":
            # Flatten nested weak AND
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    return AtomExpressionIR(op="weak_and", children=flat_children)

SMILES

SMILES, BigSMILES, GBigSMILES, and CGSmiles parsers.

This module provides four explicit parser APIs:
  • parse_smiles: Parse pure SMILES strings
  • parse_bigsmiles: Parse BigSMILES strings
  • parse_gbigsmiles: Parse GBigSMILES strings
  • parse_cgsmiles: Parse CGSmiles strings

Each parser uses its own dedicated grammar and transformer.

BigSmilesMoleculeIR dataclass

BigSmilesMoleculeIR(backbone=BigSmilesSubgraphIR(), stochastic_objects=list())

Top-level structural IR for BigSMILES strings.

BigSmilesSubgraphIR dataclass

BigSmilesSubgraphIR(atoms=list(), bonds=list(), descriptors=list())

Structural fragment that carries atoms, bonds, and descriptors.

BondingDescriptorIR dataclass

BondingDescriptorIR(id=_generate_id(), symbol=None, label=None, bond_order=1, role='internal', anchor_atom=None, non_covalent_context=None, extras=dict(), position_hint=None)

Standalone descriptor node for bonding points.

Per BigSMILES v1.1: bonding descriptors attach to atoms within repeat units. The anchor_atom field tracks which atom this descriptor is attached to. If anchor_atom is None, this is a terminal bonding descriptor at the stochastic object boundary.

CGSmilesBondIR dataclass

CGSmilesBondIR(node_i, node_j, order=1, id=_generate_id())

Intermediate representation for a CGSmiles bond.

Bonds directly reference NodeIR objects, not just IDs.

CGSmilesFragmentIR dataclass

CGSmilesFragmentIR(name='', body='')

Fragment definition.

Maps a fragment name to its SMILES or CGSmiles representation.

CGSmilesGraphIR dataclass

CGSmilesGraphIR(nodes=list(), bonds=list())

Coarse-grained graph representation.

Represents a molecular graph with CG nodes and bonds.

CGSmilesIR dataclass

CGSmilesIR(base_graph=CGSmilesGraphIR(), fragments=list())

Root-level IR for CGSmiles parser.

Represents a complete CGSmiles string with base graph and fragment definitions. This is the output of the CGSmiles parser.

CGSmilesNodeIR dataclass

CGSmilesNodeIR(id=_generate_id(), label='', annotations=dict())

Intermediate representation for a CGSmiles node.

A coarse-grained node with a label (e.g., "PEO", "PMA") and optional annotations.

DistributionIR dataclass

DistributionIR(name, params=dict())

Generative distribution applied to stochastic objects.

EndGroupIR dataclass

EndGroupIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())

Optional end-group fragments that terminate stochastic objects.

GBBondingDescriptorIR dataclass

GBBondingDescriptorIR(structural, global_weight=None, pair_weights=None, extras=dict())

Weights associated with a bonding descriptor.

GBStochasticObjectIR dataclass

GBStochasticObjectIR(structural, distribution=None)

Wraps a structural stochastic object plus optional distribution.

GBigSmilesComponentIR dataclass

GBigSmilesComponentIR(molecule, target_mass=None, mass_is_fraction=False, extras=dict())

Single component entry in a gBigSMILES system.

GBigSmilesMoleculeIR dataclass

GBigSmilesMoleculeIR(structure, descriptor_weights=list(), stochastic_metadata=list(), extras=dict())

gBigSMILES molecule = structure + generative metadata.

GBigSmilesSystemIR dataclass

GBigSmilesSystemIR(molecules=list(), total_mass=None)

gBigSMILES system describing an ensemble of molecules.

PolymerSegment dataclass

PolymerSegment(monomers, composition_type=None, distribution_params=None, end_groups=list(), repeat_units_ir=list(), end_groups_ir=list())

Polymer segment specification.

PolymerSpec dataclass

PolymerSpec(segments, topology, start_group_ir=None, end_group_ir=None)

Complete polymer specification.

all_monomers

all_monomers()

Get all structures from all segments.

Source code in src/molpy/parser/smiles/converter.py
def all_monomers(self) -> list[Atomistic]:
    """Get all structures from all segments."""
    return [struct for segment in self.segments for struct in segment.monomers]

RepeatUnitIR dataclass

RepeatUnitIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())

Repeat unit captured inside a stochastic object.

SmilesAtomIR dataclass

SmilesAtomIR(id=_generate_id(), element=None, aromatic=False, charge=None, hydrogens=None, extras=dict())

Intermediate representation for a SMILES atom.

SmilesBondIR dataclass

SmilesBondIR(itom, jtom, order=1, stereo=None, id=_generate_id())

Intermediate representation for a SMILES bond.

Bonds directly reference AtomIR objects, not just IDs.
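
Holding atom objects rather than IDs means a bond can be followed without a lookup table. A minimal sketch of that design follows; `Atom`, `Bond`, and `neighbors` are hypothetical stand-ins that borrow only the `itom`/`jtom`/`order` field names from the signature above.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: atoms compare by identity, not by field values
class Atom:
    """Hypothetical stand-in for SmilesAtomIR."""
    element: str


@dataclass(eq=False)
class Bond:
    """Hypothetical stand-in for SmilesBondIR; holds the atoms themselves."""
    itom: Atom
    jtom: Atom
    order: int = 1


def neighbors(atom: Atom, bonds: list) -> list:
    """Collect atoms bonded to `atom` by following object references."""
    found = []
    for bond in bonds:
        if bond.itom is atom:
            found.append(bond.jtom)
        elif bond.jtom is atom:
            found.append(bond.itom)
    return found


# Ethanol-like chain C-C-O; the two carbons are distinct objects.
c1, c2, o = Atom("C"), Atom("C"), Atom("O")
bonds = [Bond(c1, c2), Bond(c2, o)]
middle_neighbors = [a.element for a in neighbors(c2, bonds)]
```

Identity comparison (`is`) keeps the two carbon atoms apart even though their fields are equal, which is why the sketch disables dataclass equality.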

SmilesGraphIR dataclass

SmilesGraphIR(atoms=list(), bonds=list())

Root-level IR for SMILES parser.

Represents a molecular graph with atoms and bonds. This is the output of the SMILES parser.

StochasticObjectIR dataclass

StochasticObjectIR(id=_generate_id(), terminals=TerminalDescriptorIR(), repeat_units=list(), end_groups=list(), extras=dict())

Container for repeat units, terminals, and end groups.

TerminalDescriptorIR dataclass

TerminalDescriptorIR(descriptors=list(), extras=dict())

Terminal brackets that hold descriptors for stochastic objects.

bigsmilesir_to_monomer

bigsmilesir_to_monomer(ir)

Convert BigSmilesMoleculeIR to Atomistic structure (topology only).

Single responsibility: IR → Atomistic conversion only. Parsing should be done separately.

Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)

Parameters:

  • ir (BigSmilesMoleculeIR, required): BigSmilesMoleculeIR from parser

Returns:

  • Atomistic: Atomistic structure with ports marked on atoms, NO positions

Raises:

  • ValueError: If the IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead) or no repeat units

Examples:

>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> struct = bigsmilesir_to_monomer(ir)
>>> # Ports are marked on atoms: atom["port"] = "<" or ">"
Source code in src/molpy/parser/smiles/converter.py
def bigsmilesir_to_monomer(ir: BigSmilesMoleculeIR) -> Atomistic:
    """
    Convert BigSmilesMoleculeIR to Atomistic structure (topology only).

    Single responsibility: IR → Atomistic conversion only.
    Parsing should be done separately.

    Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)

    Args:
        ir: BigSmilesMoleculeIR from parser

    Returns:
        Atomistic structure with ports marked on atoms, NO positions

    Raises:
        ValueError: If IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead)

    Examples:
        >>> from molpy.parser.smiles import parse_bigsmiles
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> struct = bigsmilesir_to_monomer(ir)
        >>> # Ports are marked on atoms: atom["port"] = "<" or ">"
    """
    monomers = []

    # Extract from stochastic objects
    for stoch_obj in ir.stochastic_objects:
        for repeat_unit in stoch_obj.repeat_units:
            monomer = create_monomer_from_repeat_unit(repeat_unit, stoch_obj)
            if monomer is not None:
                monomers.append(monomer)

    if len(monomers) == 1:
        return monomers[0]
    elif len(monomers) > 1:
        raise ValueError(
            f"BigSmilesMoleculeIR contains {len(monomers)} repeat units. "
            "Use bigsmilesir_to_polymerspec() for multiple repeat units."
        )

    raise ValueError(
        "BigSmilesMoleculeIR contains no repeat units. Use {[<]...[>]} format."
    )

bigsmilesir_to_polymerspec

bigsmilesir_to_polymerspec(ir)

Convert BigSmilesMoleculeIR to a complete polymer specification.

Single responsibility: IR -> PolymerSpec conversion only. Parsing should be done separately.

Extracts monomers and analyzes polymer topology and composition.

Parameters:

  • ir (BigSmilesMoleculeIR, required): BigSmilesMoleculeIR from parser

Returns:

  • PolymerSpec: PolymerSpec with segments, topology, and all monomers

Examples:

>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> spec = bigsmilesir_to_polymerspec(ir)
>>> spec.topology
'homopolymer'
Source code in src/molpy/parser/smiles/converter.py
def bigsmilesir_to_polymerspec(ir: BigSmilesMoleculeIR) -> PolymerSpec:
    """
    Convert BigSmilesIR to complete polymer specification.

    Single responsibility: IR -> PolymerSpec conversion only.
    Parsing should be done separately.

    Extracts monomers and analyzes polymer topology and composition.

    Args:
        ir: BigSmilesIR from parser

    Returns:
        PolymerSpec with segments, topology, and all monomers

    Examples:
        >>> from molpy.parser.smiles import parse_bigsmiles
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> spec = bigsmilesir_to_polymerspec(ir)
        >>> spec.topology
        'homopolymer'
    """
    return extract_polymerspec_from_ir(ir)

parse_bigsmiles

parse_bigsmiles(src)

Parse a BigSMILES string into BigSmilesMoleculeIR.

This parser accepts BigSMILES syntax including stochastic objects, bonding descriptors, and repeat units. It does NOT accept GBigSMILES annotations.

Parameters:

  • src (str, required): BigSMILES string

Returns:

  • BigSmilesMoleculeIR: BigSmilesMoleculeIR containing backbone and stochastic objects

Raises:

  • ValueError: if syntax errors are detected

Examples:

>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> len(ir.stochastic_objects)
1
Source code in src/molpy/parser/smiles/__init__.py
def parse_bigsmiles(src: str) -> BigSmilesMoleculeIR:
    """
    Parse a BigSMILES string into BigSmilesMoleculeIR.

    This parser accepts BigSMILES syntax including stochastic objects,
    bonding descriptors, and repeat units. It does NOT accept GBigSMILES
    annotations.

    Args:
        src: BigSMILES string

    Returns:
        BigSmilesMoleculeIR containing backbone and stochastic objects

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> len(ir.stochastic_objects)
        1
    """
    return _bigsmiles_parser.parse(src)

parse_cgsmiles

parse_cgsmiles(src)

Parse a CGSmiles string.

Parameters:

  • src (str, required): CGSmiles string (e.g., "{[#PEO][#PMA]}.{#PEO=[$]COC[$]}")

Returns:

  • CGSmilesIR: CGSmilesIR with base graph and fragment definitions

Raises:

  • ValueError: if syntax errors are detected

Examples:

>>> result = parse_cgsmiles("{[#PEO][#PMA][#PEO]}")
>>> len(result.base_graph.nodes)
3
>>> result = parse_cgsmiles("{[#PEO]|5}")
>>> len(result.base_graph.nodes)
5
Source code in src/molpy/parser/smiles/cgsmiles_parser.py
def parse_cgsmiles(src: str) -> CGSmilesIR:
    """Parse a CGSmiles string.

    Args:
        src: CGSmiles string (e.g., "{[#PEO][#PMA]}.{#PEO=[$]COC[$]}")

    Returns:
        CGSmilesIR with base graph and fragment definitions

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> result = parse_cgsmiles("{[#PEO][#PMA][#PEO]}")
        >>> len(result.base_graph.nodes)
        3
        >>> result = parse_cgsmiles("{[#PEO]|5}")
        >>> len(result.base_graph.nodes)
        5
    """
    return _parser.parse(src)
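
The node counts in the examples suggest that `|N` repeats the preceding node N times. The toy expander below illustrates that reading for a single `{...}` block. It is an assumption-laden sketch, not the library's grammar: it ignores bonds, branches, annotations, and fragment definitions, and `expand_nodes` is a hypothetical helper.

```python
import re

# '[#LABEL]' optionally followed by '|N' (assumed here to mean "repeat N times").
_NODE = re.compile(r"\[#(\w+)\](?:\|(\d+))?")


def expand_nodes(src: str) -> list:
    """Expand the nodes of one '{...}' block into a flat list of labels."""
    body = src.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("expected a single '{...}' block")
    labels = []
    for label, repeat in _NODE.findall(body[1:-1]):
        # findall yields '' for the repeat group when '|N' is absent.
        labels.extend([label] * int(repeat or 1))
    return labels
```

Under that assumption, `expand_nodes("{[#PEO]|5}")` yields five `"PEO"` labels, matching the node count shown in the docstring example.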

parse_gbigsmiles

parse_gbigsmiles(src)

Parse a GBigSMILES string into GBigSmilesSystemIR.

This parser accepts GBigSMILES syntax including all BigSMILES features plus system size specifications and other generative annotations. Always returns GBigSmilesSystemIR, wrapping single molecules in a system structure.

Parameters:

  • src (str, required): GBigSMILES string

Returns:

  • GBigSmilesSystemIR: GBigSmilesSystemIR containing the parsed system

Raises:

  • ValueError: if syntax errors are detected

Examples:

>>> ir = parse_gbigsmiles("{[<]CC[>]}|5e5|")
>>> isinstance(ir, GBigSmilesSystemIR)
True
Source code in src/molpy/parser/smiles/__init__.py
def parse_gbigsmiles(src: str) -> GBigSmilesSystemIR:
    """
    Parse a GBigSMILES string into GBigSmilesSystemIR.

    This parser accepts GBigSMILES syntax including all BigSMILES
    features plus system size specifications and other generative
    annotations. Always returns GBigSmilesSystemIR, wrapping single
    molecules in a system structure.

    Args:
        src: GBigSMILES string

    Returns:
        GBigSmilesSystemIR containing the parsed system

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> ir = parse_gbigsmiles("{[<]CC[>]}|5e5|")
        >>> isinstance(ir, GBigSmilesSystemIR)
        True
    """
    return _gbigsmiles_parser.parse(src)

parse_smiles

parse_smiles(src)

Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.

This parser only accepts pure SMILES syntax. It will reject BigSMILES or GBigSMILES constructs.

For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of SmilesGraphIR, one for each disconnected component.

Parameters:

  • src (str, required): SMILES string (may contain dots for mixtures)

Returns:

  • SmilesGraphIR | list[SmilesGraphIR]: SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures

Raises:

  • ValueError: if syntax errors are detected or rings are left unclosed

Examples:

>>> ir = parse_smiles("CCO")
>>> len(ir.atoms)
3
>>> irs = parse_smiles("C.C")
>>> len(irs)
2
Source code in src/molpy/parser/smiles/__init__.py
def parse_smiles(src: str) -> SmilesGraphIR | list[SmilesGraphIR]:
    """
    Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.

    This parser only accepts pure SMILES syntax. It will reject
    BigSMILES or GBigSMILES constructs.

    For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of
    SmilesGraphIR, one for each disconnected component.

    Args:
        src: SMILES string (may contain dots for mixtures)

    Returns:
        SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures

    Raises:
        ValueError: if syntax errors detected or unclosed rings

    Examples:
        >>> ir = parse_smiles("CCO")
        >>> len(ir.atoms)
        3
        >>> irs = parse_smiles("C.C")
        >>> len(irs)
        2
    """
    return _smiles_parser.parse(src)
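
Because the return type is either a single graph or a list of graphs, calling code may find it convenient to normalize the result before iterating. A minimal sketch follows, with `GraphSketch` as a hypothetical stand-in for SmilesGraphIR and `as_components` as a hypothetical helper that is not part of the module.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class GraphSketch:
    """Hypothetical stand-in for SmilesGraphIR."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def as_components(
    result: Union[GraphSketch, List[GraphSketch]],
) -> List[GraphSketch]:
    """Always return a list, whichever shape the parser produced."""
    return result if isinstance(result, list) else [result]


# Shapes corresponding to the docstring examples "CCO" and "C.C".
single = GraphSketch(atoms=["C", "C", "O"])
mixture = [GraphSketch(atoms=["C"]), GraphSketch(atoms=["C"])]

single_counts = [len(g.atoms) for g in as_components(single)]
mixture_counts = [len(g.atoms) for g in as_components(mixture)]
```

After normalization both inputs can be handled by the same loop, which avoids an isinstance check at every call site.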