Parser¶

The parser module turns chemical line notations (SMARTS, SMILES, BigSMILES, GBigSMILES, CGSmiles) into intermediate representations (IR).

Base¶

GrammarConfig `dataclass` ¶

GrammarConfig(grammar_path, start, parser='lalr', propagate_positions=False, maybe_placeholders=False, auto_reload=True)

Configuration for the grammar-backed parser.

GrammarParserBase ¶

GrammarParserBase(config)

Bases: ABC

Base class for parsers backed by an external Lark grammar file.

Lifecycle

Construct with a GrammarConfig
Call parse_tree(text) to get a Lark Tree
Implement build(tree) in subclasses to map Tree -> IR
Call parse(text) to get your IR

Features

Grammar is compiled once and cached
If auto_reload=True, grammar file mtime is checked before each parse
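
The caching and reload behaviour listed above can be illustrated without Lark. The following is a minimal sketch, not the actual `GrammarParserBase` code: `ReloadingTextCache`, `compile_count`, and `demo` are hypothetical names, and reading the file stands in for the expensive grammar compilation.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


class ReloadingTextCache:
    """Hypothetical stand-in: caches a 'compiled' artifact, rebuilds on mtime change."""

    def __init__(self, path: Path, auto_reload: bool = True):
        self.path = path
        self.auto_reload = auto_reload
        self._compiled: Optional[str] = None
        self._mtime: Optional[float] = None
        self.compile_count = 0  # only here to make the caching observable
        self._compile(force=True)

    def _compile(self, force: bool = False) -> None:
        mtime = self.path.stat().st_mtime
        if force or mtime != self._mtime:
            # Stand-in for the expensive grammar compilation step.
            self._compiled = self.path.read_text()
            self._mtime = mtime
            self.compile_count += 1

    def get(self) -> str:
        # With auto_reload enabled, the mtime is checked before every use.
        if self.auto_reload:
            self._compile()
        assert self._compiled is not None
        return self._compiled


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "toy.lark"
        path.write_text("start: A")
        cache = ReloadingTextCache(path)
        first = cache.get()  # served from the cache, no recompilation
        path.write_text("start: B")
        # Bump the mtime explicitly so the change is visible on coarse clocks.
        stat = path.stat()
        os.utime(path, (stat.st_atime, stat.st_mtime + 10))
        second = cache.get()  # mtime changed, so the artifact is rebuilt
        return first, second, cache.compile_count
```

The mtime check costs one `stat` call per parse, which is usually negligible next to parsing; setting `auto_reload=False` skips it entirely.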

Source code in src/molpy/parser/base.py

def __init__(self, config: GrammarConfig):
    self.config = config
    self._lark: Lark | None = None
    self._mtime: float | None = None
    self._compile_grammar(force=True)

parse_tree ¶

parse_tree(text)

Parse input string into a Lark parse tree.

Source code in src/molpy/parser/base.py

def parse_tree(self, text: str) -> Tree:
    """
    Parse input string into a Lark parse tree.
    """
    self._maybe_reload()
    assert self._lark is not None
    return self._lark.parse(text)

SMARTS¶

AtomExpressionIR `dataclass` ¶

AtomExpressionIR(op, children=list(), id=(lambda: id(AtomExpressionIR))())

Represents logical expressions combining atom primitives.

Operators

'and' (&): high-priority AND
'or' (,): OR
'weak_and' (;): low-priority AND
'not' (!): negation

Examples:

AtomExpressionIR(op='and', children=[primitive1, primitive2])
AtomExpressionIR(op='not', children=[primitive])
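
To make the operator semantics concrete, here is a small sketch that evaluates such an expression tree against a single atom. `Prim`, `Expr`, and `matches` are hypothetical stand-ins (not molpy classes), atoms are plain dicts, and the tree for `[C,N;X3]` is built by hand assuming the usual SMARTS precedence, where `;` binds more loosely than `,`.

```python
from dataclasses import dataclass, field


@dataclass
class Prim:
    """Hypothetical stand-in for AtomPrimitiveIR."""
    type: str
    value: object = None


@dataclass
class Expr:
    """Hypothetical stand-in for AtomExpressionIR."""
    op: str
    children: list = field(default_factory=list)


def matches(node, atom: dict) -> bool:
    """Evaluate an expression tree against an atom described as a dict."""
    if isinstance(node, Prim):
        if node.type == "wildcard":
            return True
        return atom.get(node.type) == node.value
    results = [matches(child, atom) for child in node.children]
    if node.op in {"and", "weak_and", "primitive"}:
        # Both AND flavours mean the same thing; they differ only in precedence.
        return all(results)
    if node.op == "or":
        return any(results)
    if node.op == "not":
        return not results[0]
    raise ValueError(f"unknown operator: {node.op}")


# [C,N;X3]: (C or N) and exactly three neighbors.
pattern = Expr("weak_and", [
    Expr("or", [Prim("symbol", "C"), Prim("symbol", "N")]),
    Prim("neighbor_count", 3),
])
```

For example, `matches(pattern, {"symbol": "N", "neighbor_count": 3})` is true, while an oxygen with three neighbors does not match.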

AtomPrimitiveIR `dataclass` ¶

AtomPrimitiveIR(type, value=None, id=(lambda: id(AtomPrimitiveIR))())

Represents a single primitive atom pattern in SMARTS.

Examples:

symbol='C' (carbon atom)
atomic_num=6 (atomic number 6)
neighbor_count=3 (X3, exactly 3 neighbors)
ring_size=6 (r6, in 6-membered ring)
ring_count=2 (R2, in exactly 2 rings)
has_label='%atomA' (has label %atomA)
matches_smarts=SmartsIR(...) (recursive SMARTS)

SmartsAtomIR `dataclass` ¶

SmartsAtomIR(expression, label=None, id=(lambda: id(SmartsAtomIR))())

Represents a complete SMARTS atom with expression and optional label.

Attributes:

Name	Type	Description
`expression`	`AtomExpressionIR \| AtomPrimitiveIR`	The atom pattern expression
`label`	`int \| None`	Optional numeric label for ring closures or references

SmartsBondIR `dataclass` ¶

SmartsBondIR(itom, jtom, bond_type='-')

Represents a bond between two SMARTS atoms.

In SMARTS, bonds are implicit (single or aromatic) unless specified. Explicit bond types can be specified between atoms.

SmartsIR `dataclass` ¶

SmartsIR(atoms=list(), bonds=list())

Complete SMARTS pattern intermediate representation.

Attributes:

Name	Type	Description
`atoms`	`list[SmartsAtomIR]`	List of all atoms in the pattern
`bonds`	`list[SmartsBondIR]`	List of all bonds in the pattern
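
Because bonds hold references to atom objects rather than indices, a pattern can be walked by identity. The sketch below uses hypothetical stand-ins (`Atom`, `Bond`, `Pattern`) that mirror the documented field names; the expression is simplified to a string, and using `"="` as the double-bond marker is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Atom:
    """Hypothetical stand-in for SmartsAtomIR (expression simplified to a string)."""
    expression: str
    label: Optional[int] = None


@dataclass(eq=False)
class Bond:
    """Hypothetical stand-in for SmartsBondIR; endpoints are atom objects."""
    itom: Atom
    jtom: Atom
    bond_type: str = "-"


@dataclass
class Pattern:
    """Hypothetical stand-in for SmartsIR."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def neighbors(pattern: Pattern, atom: Atom) -> list:
    """Return (neighbor, bond_type) pairs for one atom, comparing by identity."""
    found = []
    for bond in pattern.bonds:
        if bond.itom is atom:
            found.append((bond.jtom, bond.bond_type))
        elif bond.jtom is atom:
            found.append((bond.itom, bond.bond_type))
    return found


# Hand-built pattern for a carbonyl group, C=O.
carbon, oxygen = Atom("C"), Atom("O")
carbonyl = Pattern(atoms=[carbon, oxygen], bonds=[Bond(carbon, oxygen, "=")])
```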

SmartsParser ¶

SmartsParser()

Bases: GrammarParserBase

Main parser for SMARTS patterns.

Usage

parser = SmartsParser()
ir = parser.parse_smarts("[#6]")
ir = parser.parse_smarts("c1ccccc1")
ir = parser.parse_smarts("[C,N,O]")

Source code in src/molpy/parser/smarts.py

def __init__(self):
    config = GrammarConfig(
        grammar_path=Path(__file__).parent / "grammar" / "smarts.lark",
        start="start",
        parser="earley",
        propagate_positions=True,
        maybe_placeholders=False,
        auto_reload=True,
    )
    super().__init__(config)

parse_smarts ¶

parse_smarts(smarts)

Parse SMARTS string into SmartsIR.

Parameters:

Name	Type	Description	Default
`smarts`	`str`	SMARTS pattern string	required

Returns:

Type	Description
`SmartsIR`	SmartsIR representing the pattern

Raises:

Type	Description
`ValueError`	if parsing fails or rings are unclosed

Examples:

>>> parser = SmartsParser()
>>> ir = parser.parse_smarts("C")
>>> len(ir.atoms)
1
>>> ir = parser.parse_smarts("[#6]")
>>> ir.atoms[0].expression.children[0].type
'atomic_num'

Source code in src/molpy/parser/smarts.py

def parse_smarts(self, smarts: str) -> SmartsIR:
    """
    Parse SMARTS string into SmartsIR.

    Args:
        smarts: SMARTS pattern string

    Returns:
        SmartsIR representing the pattern

    Raises:
        ValueError: if parsing fails or rings are unclosed

    Examples:
        >>> parser = SmartsParser()
        >>> ir = parser.parse_smarts("C")
        >>> len(ir.atoms)
        1
        >>> ir = parser.parse_smarts("[#6]")
        >>> ir.atoms[0].expression.children[0].type
        'atomic_num'
    """
    tree = self.parse_tree(smarts)
    transformer = SmartsTransformer()
    ir: SmartsIR = transformer.transform(tree)

    # Check for unclosed rings
    if transformer.ring_openings:
        unclosed = list(transformer.ring_openings.keys())
        raise ValueError(f"Unclosed rings in SMARTS: {unclosed}")

    return ir
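
The unclosed-ring check relies on a dictionary of pending ring openings that should be empty once the whole pattern has been processed. A simplified sketch of that bookkeeping follows; `pair_ring_closures` is a hypothetical helper that works on a flat token list of atom symbols and ring digits, and it is not how `SmartsTransformer` is implemented.

```python
def pair_ring_closures(tokens: list) -> list:
    """Pair ring-closure digits and return (opening_atom, closing_atom) indices.

    tokens holds atom symbols and ring digits, e.g. list("c1ccccc1").
    """
    ring_openings: dict = {}  # ring digit -> index of the atom that opened it
    closures = []
    atom_index = -1
    for token in tokens:
        if token.isdigit():
            if atom_index < 0:
                raise ValueError("Ring digit appears before any atom")
            if token in ring_openings:
                # Second occurrence closes the ring and frees the digit.
                closures.append((ring_openings.pop(token), atom_index))
            else:
                ring_openings[token] = atom_index
        else:
            atom_index += 1
    if ring_openings:
        # Same condition that parse_smarts reports as an error.
        raise ValueError(f"Unclosed rings in SMARTS: {list(ring_openings)}")
    return closures
```

With this helper, `pair_ring_closures(list("c1ccccc1"))` pairs atom 0 with atom 5, while `list("C1CC")` raises `ValueError`.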

SmartsTransformer ¶

SmartsTransformer()

Bases: Transformer

Transforms Lark parse tree into SmartsIR.

Handles

Atom primitives (symbols, atomic numbers, properties)
Logical expressions (AND, OR, NOT, weak AND)
Branches
Ring closures
Recursive SMARTS patterns

Source code in src/molpy/parser/smarts.py

def __init__(self):
    super().__init__()
    # Track ring openings: {ring_id: (atom, bond_type)}
    self.ring_openings: dict[str, tuple[SmartsAtomIR, str | None]] = {}

and_expression ¶

and_expression(children)

Process high-priority AND expression (&).

Source code in src/molpy/parser/smarts.py

def and_expression(self, children: list) -> AtomExpressionIR:
    """Process high-priority AND expression (&)."""
    # Filter out operator tokens like &
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        # Single primitive, wrap it
        child = filtered[0]
        if isinstance(child, AtomPrimitiveIR):
            return AtomExpressionIR(op="primitive", children=[child])
        return child

    # Flatten nested AND expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "and":
            # Flatten nested AND
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    # Multiple children connected by &
    return AtomExpressionIR(op="and", children=flat_children)
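
The flattening step shared by the AND, OR, and weak-AND handlers can be written once. This is a sketch with hypothetical names (`Expr`, `combine`); it follows the single-operand shortcut of `or_expression` and `weak_and_expression`, and it does not reproduce the extra `"primitive"` wrapping that `and_expression` applies to a lone primitive.

```python
from dataclasses import dataclass, field


@dataclass
class Expr:
    """Hypothetical stand-in for AtomExpressionIR."""
    op: str
    children: list = field(default_factory=list)


def combine(op: str, operands: list):
    """Join operands under op, merging children of directly nested same-op nodes."""
    if len(operands) == 1:
        return operands[0]  # nothing to combine
    flat = []
    for operand in operands:
        if isinstance(operand, Expr) and operand.op == op:
            flat.extend(operand.children)  # a & (b & c) becomes and(a, b, c)
        else:
            flat.append(operand)
    return Expr(op, flat)


nested = combine("and", [Expr("and", ["C", "X3"]), "R1"])
```

Only nodes with the same operator are merged, so an `and` node inside an `or` keeps its grouping.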

atom ¶

atom(children)

Process complete atom: [expression] or symbol, with optional label.

Returns:

Type	Description
`SmartsAtomIR`	SmartsAtomIR

Source code in src/molpy/parser/smarts.py

def atom(self, children: list) -> SmartsAtomIR:
    """
    Process complete atom: [expression] or symbol, with optional label.

    Returns:
        SmartsAtomIR
    """
    # Filter out bracket tokens
    filtered = [
        c
        for c in children
        if not (isinstance(c, Token) and c.type in {"LSQB", "RSQB"})
    ]

    if not filtered:
        # Empty bracketed atom, use wildcard
        prim = AtomPrimitiveIR(type="wildcard")
        expr = AtomExpressionIR(op="primitive", children=[prim])
        return SmartsAtomIR(expression=expr)

    expression = filtered[0]
    label = filtered[1] if len(filtered) > 1 else None

    # Wrap primitive in expression if needed
    if isinstance(expression, AtomPrimitiveIR):
        expression = AtomExpressionIR(op="primitive", children=[expression])

    return SmartsAtomIR(expression=expression, label=label)

atom_id ¶

atom_id(children)

Process atom identifier (primitive).

Can be

atom_symbol
# + atomic_num (atomic number)
$( + SMARTS + ) (recursive SMARTS)
%label (has label)
X + N (neighbor count)
r + N (ring size)
R + N (ring count)

Source code in src/molpy/parser/smarts.py

def atom_id(self, children: list) -> AtomPrimitiveIR:
    """
    Process atom identifier (primitive).

    Can be:
        - atom_symbol
        - # + atomic_num (atomic number)
        - $( + SMARTS + ) (recursive SMARTS)
        - %label (has label)
        - X + N (neighbor count)
        - r + N (ring size)
        - R + N (ring count)
    """
    # Filter out operator tokens like #, X, r, R, H, h, $(, )
    filtered = [
        c
        for c in children
        if not (
            isinstance(c, Token)
            and c.value in {"#", "X", "r", "R", "H", "h", "$(", ")", "$", "("}
        )
    ]

    if not filtered:
        return AtomPrimitiveIR(type="wildcard")

    child = filtered[0]

    # Already processed atom_symbol
    if isinstance(child, AtomPrimitiveIR):
        return child

    # Check the first token in original children to determine type
    first_token = next((c for c in children if isinstance(c, Token)), None)

    if first_token and first_token.value == "#":
        # Atomic number (#N)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="atomic_num", value=child)

    elif first_token and first_token.value in {"$", "$("}:
        # Recursive SMARTS: $(...)
        if isinstance(child, SmartsIR):
            return AtomPrimitiveIR(type="matches_smarts", value=child)

    elif first_token and first_token.value == "X":
        # Neighbor count (XN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="neighbor_count", value=child)

    elif first_token and first_token.value == "r":
        # Ring size (rN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="ring_size", value=child)

    elif first_token and first_token.value == "R":
        # Ring count (RN)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="ring_count", value=child)

    elif first_token and first_token.value == "H":
        # Explicit hydrogen count (H2)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="hydrogen_count", value=child)

    elif first_token and first_token.value == "h":
        # Implicit hydrogen count (h2)
        if isinstance(child, int):
            return AtomPrimitiveIR(type="implicit_hydrogen_count", value=child)

    # Label (%label)
    if isinstance(child, str) and child.startswith("%"):
        return AtomPrimitiveIR(type="has_label", value=child)

    # Fallback to symbol
    return AtomPrimitiveIR(type="symbol", value=str(child))
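
The prefix-to-type mapping in `atom_id` can also be read as a lookup table. The sketch below is an alternative formulation for illustration only: `PREFIX_TO_TYPE` and `classify` are hypothetical names, recursive SMARTS is left out, and the result is a plain `(type, value)` tuple instead of an `AtomPrimitiveIR`.

```python
# Prefix token -> primitive type, taken from the branches of atom_id.
PREFIX_TO_TYPE = {
    "#": "atomic_num",
    "X": "neighbor_count",
    "r": "ring_size",
    "R": "ring_count",
    "H": "hydrogen_count",
    "h": "implicit_hydrogen_count",
}


def classify(prefix, value) -> tuple:
    """Return (primitive_type, value) for a prefix token and its operand."""
    if prefix in PREFIX_TO_TYPE and isinstance(value, int):
        return PREFIX_TO_TYPE[prefix], value
    # Labels are recognised by their leading percent sign.
    if isinstance(value, str) and value.startswith("%"):
        return "has_label", value
    # Anything else falls back to an element symbol, as in atom_id.
    return "symbol", str(value)
```

A table keeps the numeric primitives in one place, at the cost of handling the non-numeric cases (labels, recursive SMARTS) separately.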

atom_label ¶

atom_label(children)

Extract atom label (numeric).

Source code in src/molpy/parser/smarts.py

def atom_label(self, children: list[int]) -> int:
    """Extract atom label (numeric)."""
    return children[0]

atom_symbol ¶

atom_symbol(children)

Process atom symbol (element or wildcard).

Source code in src/molpy/parser/smarts.py

def atom_symbol(self, children: list[Token]) -> AtomPrimitiveIR:
    """Process atom symbol (element or wildcard)."""
    symbol = children[0] if isinstance(children[0], str) else children[0].value
    if symbol == "*":
        return AtomPrimitiveIR(type="wildcard")
    return AtomPrimitiveIR(type="symbol", value=symbol)

atomic_num ¶

atomic_num(children)

Extract atomic number.

Source code in src/molpy/parser/smarts.py

def atomic_num(self, children: list[int]) -> int:
    """Extract atomic number."""
    return children[0]

branch ¶

branch(children)

Process branch: the content inside or after chain. This just returns the SmartsIR from _string.

Source code in src/molpy/parser/smarts.py

def branch(self, children: list) -> SmartsIR:
    """
    Process branch: the content inside or after chain.
    This just returns the SmartsIR from _string.
    """
    # Branch contains the result of _string
    return self._build_ir_from_children(children)

has_label ¶

has_label(children)

Extract label.

Source code in src/molpy/parser/smarts.py

def has_label(self, children: list[str]) -> str:
    """Extract label."""
    return children[0]

hydrogen_count ¶

hydrogen_count(children)

Extract explicit hydrogen count.

Source code in src/molpy/parser/smarts.py

def hydrogen_count(self, children: list[int]) -> int:
    """Extract explicit hydrogen count."""
    return children[0]

implicit_hydrogen_count ¶

implicit_hydrogen_count(children)

Extract implicit hydrogen count.

Source code in src/molpy/parser/smarts.py

def implicit_hydrogen_count(self, children: list[int]) -> int:
    """Extract implicit hydrogen count."""
    return children[0]

matches_string ¶

matches_string(children)

Extract recursive SMARTS pattern.

Source code in src/molpy/parser/smarts.py

def matches_string(self, children: list) -> SmartsIR:
    """Extract recursive SMARTS pattern."""
    return children[0]

neighbor_count ¶

neighbor_count(children)

Extract neighbor count.

Source code in src/molpy/parser/smarts.py

def neighbor_count(self, children: list[int]) -> int:
    """Extract neighbor count."""
    return children[0]

nonlastbranch ¶

nonlastbranch(children)

Process non-last branch: (branch_content).

Source code in src/molpy/parser/smarts.py

def nonlastbranch(self, children: list) -> tuple:
    """Process non-last branch: (branch_content)."""
    # Filter parentheses and get branch IR
    filtered = [
        c
        for c in children
        if not (isinstance(c, Token) and c.type in {"LPAR", "RPAR"})
    ]
    if filtered and isinstance(filtered[0], SmartsIR):
        return ("branch", filtered[0], None)
    return ("branch", SmartsIR(), None)

not_expression ¶

not_expression(children)

Process NOT expression (!).

Source code in src/molpy/parser/smarts.py

def not_expression(self, children: list) -> AtomExpressionIR:
    """Process NOT expression (!)."""
    # Filter out operator tokens like !
    filtered = [c for c in children if not isinstance(c, Token)]
    if not filtered:
        return AtomExpressionIR(op="not", children=[])
    return AtomExpressionIR(op="not", children=[filtered[0]])

or_expression ¶

or_expression(children)

Process OR expression (,).

Source code in src/molpy/parser/smarts.py

def or_expression(self, children: list) -> AtomExpressionIR:
    """Process OR expression (,)."""
    # Filter out operator tokens like ,
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        return filtered[0]

    # Flatten nested OR expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "or":
            # Flatten nested OR
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    return AtomExpressionIR(op="or", children=flat_children)

ring_count ¶

ring_count(children)

Extract ring count.

Source code in src/molpy/parser/smarts.py

def ring_count(self, children: list[int]) -> int:
    """Extract ring count."""
    return children[0]

ring_size ¶

ring_size(children)

Extract ring size.

Source code in src/molpy/parser/smarts.py

def ring_size(self, children: list[int]) -> int:
    """Extract ring size."""
    return children[0]

start ¶

start(children)

Entry point: process complete SMARTS pattern.

The grammar produces a tree like:

start
  atom ...
  atom ...

We need to build the IR from this flat or nested structure.

Source code in src/molpy/parser/smarts.py

def start(self, children: list) -> SmartsIR:
    """Entry point: process complete SMARTS pattern.

    The grammar produces a tree like:
    start
      atom ...
      atom ...

    We need to build the IR from this flat or nested structure.
    """
    if not children:
        return SmartsIR()

    # children from start contains result from _string
    # which could be atoms or nested structures
    return self._build_ir_from_children(children)

weak_and_expression ¶

weak_and_expression(children)

Process low-priority AND expression (;).

Source code in src/molpy/parser/smarts.py

def weak_and_expression(self, children: list) -> AtomExpressionIR:
    """Process low-priority AND expression (;)."""
    # Filter out operator tokens like ;
    filtered = [c for c in children if not isinstance(c, Token)]

    if len(filtered) == 1:
        return filtered[0]

    # Flatten nested weak AND expressions
    flat_children = []
    for child in filtered:
        if isinstance(child, AtomExpressionIR) and child.op == "weak_and":
            # Flatten nested weak AND
            flat_children.extend(child.children)
        else:
            flat_children.append(child)

    return AtomExpressionIR(op="weak_and", children=flat_children)

SMILES¶

SMILES, BigSMILES, GBigSMILES, and CGSmiles parsers.

This module provides four explicit parser APIs:

parse_smiles: Parse pure SMILES strings
parse_bigsmiles: Parse BigSMILES strings
parse_gbigsmiles: Parse GBigSMILES strings
parse_cgsmiles: Parse CGSmiles strings

Each parser uses its own dedicated grammar and transformer.

BigSmilesMoleculeIR `dataclass` ¶

BigSmilesMoleculeIR(backbone=BigSmilesSubgraphIR(), stochastic_objects=list())

Top-level structural IR for BigSMILES strings.

BigSmilesSubgraphIR `dataclass` ¶

BigSmilesSubgraphIR(atoms=list(), bonds=list(), descriptors=list())

Structural fragment that carries atoms, bonds, and descriptors.

BondingDescriptorIR `dataclass` ¶

BondingDescriptorIR(id=_generate_id(), symbol=None, label=None, bond_order=1, role='internal', anchor_atom=None, non_covalent_context=None, extras=dict(), position_hint=None)

Standalone descriptor node for bonding points.

Per BigSMILES v1.1: bonding descriptors attach to atoms within repeat units. The anchor_atom field tracks which atom this descriptor is attached to. If anchor_atom is None, this is a terminal bonding descriptor at the stochastic object boundary.
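
The `anchor_atom` convention gives a simple way to separate terminal descriptors from those attached to atoms. A minimal sketch, using a hypothetical `Descriptor` stand-in that keeps only two of the documented fields:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """Hypothetical stand-in for BondingDescriptorIR (two fields only)."""
    symbol: Optional[str] = None
    anchor_atom: Optional[object] = None


def split_descriptors(descriptors: list) -> tuple:
    """Split descriptors into (terminal, attached) based on anchor_atom."""
    terminal = [d for d in descriptors if d.anchor_atom is None]
    attached = [d for d in descriptors if d.anchor_atom is not None]
    return terminal, attached


# One terminal descriptor and one attached to a placeholder atom object.
example = [Descriptor("<"), Descriptor(">", anchor_atom="atom-1")]
terminal, attached = split_descriptors(example)
```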

CGSmilesBondIR `dataclass` ¶

CGSmilesBondIR(node_i, node_j, order=1, id=_generate_id())

Intermediate representation for a CGSmiles bond.

Bonds directly reference NodeIR objects, not just IDs.

CGSmilesFragmentIR `dataclass` ¶

CGSmilesFragmentIR(name='', body='')

Fragment definition.

Maps a fragment name to its SMILES or CGSmiles representation.

CGSmilesGraphIR `dataclass` ¶

CGSmilesGraphIR(nodes=list(), bonds=list())

Coarse-grained graph representation.

Represents a molecular graph with CG nodes and bonds.

CGSmilesIR `dataclass` ¶

CGSmilesIR(base_graph=CGSmilesGraphIR(), fragments=list())

Root-level IR for CGSmiles parser.

Represents a complete CGSmiles string with base graph and fragment definitions. This is the output of the CGSmiles parser.

CGSmilesNodeIR `dataclass` ¶

CGSmilesNodeIR(id=_generate_id(), label='', annotations=dict())

Intermediate representation for a CGSmiles node.

A coarse-grained node with a label (e.g., "PEO", "PMA") and optional annotations.

DistributionIR `dataclass` ¶

DistributionIR(name, params=dict())

Generative distribution applied to stochastic objects.

EndGroupIR `dataclass` ¶

EndGroupIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())

Optional end-group fragments that terminate stochastic objects.

GBBondingDescriptorIR `dataclass` ¶

GBBondingDescriptorIR(structural, global_weight=None, pair_weights=None, extras=dict())

Weights associated with a bonding descriptor.

GBStochasticObjectIR `dataclass` ¶

GBStochasticObjectIR(structural, distribution=None)

Wraps a structural stochastic object plus optional distribution.

GBigSmilesComponentIR `dataclass` ¶

GBigSmilesComponentIR(molecule, target_mass=None, mass_is_fraction=False, extras=dict())

Single component entry in a gBigSMILES system.

GBigSmilesMoleculeIR `dataclass` ¶

GBigSmilesMoleculeIR(structure, descriptor_weights=list(), stochastic_metadata=list(), extras=dict())

gBigSMILES molecule = structure + generative metadata.

GBigSmilesSystemIR `dataclass` ¶

GBigSmilesSystemIR(molecules=list(), total_mass=None)

gBigSMILES system describing an ensemble of molecules.

PolymerSegment `dataclass` ¶

PolymerSegment(monomers, composition_type=None, distribution_params=None, end_groups=list(), repeat_units_ir=list(), end_groups_ir=list())

Polymer segment specification.

PolymerSpec `dataclass` ¶

PolymerSpec(segments, topology, start_group_ir=None, end_group_ir=None)

Complete polymer specification.

all_monomers ¶

all_monomers()

Get all structures from all segments.

Source code in src/molpy/parser/smiles/converter.py

def all_monomers(self) -> list[Atomistic]:
    """Get all structures from all segments."""
    return [struct for segment in self.segments for struct in segment.monomers]

RepeatUnitIR `dataclass` ¶

RepeatUnitIR(id=_generate_id(), graph=BigSmilesSubgraphIR(), extras=dict())

Repeat unit captured inside a stochastic object.

SmilesAtomIR `dataclass` ¶

SmilesAtomIR(id=_generate_id(), element=None, aromatic=False, charge=None, hydrogens=None, extras=dict())

Intermediate representation for a SMILES atom.

SmilesBondIR `dataclass` ¶

SmilesBondIR(itom, jtom, order=1, stereo=None, id=_generate_id())

Intermediate representation for a SMILES bond.

Bonds directly reference AtomIR objects, not just IDs.

SmilesGraphIR `dataclass` ¶

SmilesGraphIR(atoms=list(), bonds=list())

Root-level IR for SMILES parser.

Represents a molecular graph with atoms and bonds. This is the output of the SMILES parser.

StochasticObjectIR `dataclass` ¶

StochasticObjectIR(id=_generate_id(), terminals=TerminalDescriptorIR(), repeat_units=list(), end_groups=list(), extras=dict())

Container for repeat units, terminals, and end groups.

TerminalDescriptorIR `dataclass` ¶

TerminalDescriptorIR(descriptors=list(), extras=dict())

Terminal brackets that hold descriptors for stochastic objects.

bigsmilesir_to_monomer ¶

bigsmilesir_to_monomer(ir)

Convert BigSmilesMoleculeIR to Atomistic structure (topology only).

Single responsibility: IR → Atomistic conversion only. Parsing should be done separately.

Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)

Parameters:

Name	Type	Description	Default
`ir`	`BigSmilesMoleculeIR`	BigSmilesMoleculeIR from parser	required

Returns:

Type	Description
`Atomistic`	Atomistic structure with ports marked on atoms, NO positions

Raises:

Type	Description
`ValueError`	If IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead)

Examples:

>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> struct = bigsmilesir_to_monomer(ir)
>>> # Ports are marked on atoms: atom["port"] = "<" or ">"

Source code in src/molpy/parser/smiles/converter.py

def bigsmilesir_to_monomer(ir: BigSmilesMoleculeIR) -> Atomistic:
    """
    Convert BigSmilesMoleculeIR to Atomistic structure (topology only).

    Single responsibility: IR → Atomistic conversion only.
    Parsing should be done separately.

    Supports BigSMILES with stochastic object: {[<]CC[>]} (ONE repeat unit only)

    Args:
        ir: BigSmilesMoleculeIR from parser

    Returns:
        Atomistic structure with ports marked on atoms, NO positions

    Raises:
        ValueError: If IR contains multiple repeat units (use bigsmilesir_to_polymerspec instead)

    Examples:
        >>> from molpy.parser.smiles import parse_bigsmiles
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> struct = bigsmilesir_to_monomer(ir)
        >>> # Ports are marked on atoms: atom["port"] = "<" or ">"
    """
    monomers = []

    # Extract from stochastic objects
    for stoch_obj in ir.stochastic_objects:
        for repeat_unit in stoch_obj.repeat_units:
            monomer = create_monomer_from_repeat_unit(repeat_unit, stoch_obj)
            if monomer is not None:
                monomers.append(monomer)

    if len(monomers) == 1:
        return monomers[0]
    elif len(monomers) > 1:
        raise ValueError(
            f"BigSmilesMoleculeIR contains {len(monomers)} repeat units. "
            "Use bigsmilesir_to_polymerspec() for multiple repeat units."
        )

    raise ValueError(
        "BigSmilesMoleculeIR contains no repeat units. " "Use {[<]...[>]} format."
    )

bigsmilesir_to_polymerspec ¶

bigsmilesir_to_polymerspec(ir)

Convert BigSmilesMoleculeIR to a complete polymer specification.

Single responsibility: IR -> PolymerSpec conversion only. Parsing should be done separately.

Extracts monomers and analyzes polymer topology and composition.

Parameters:

Name	Type	Description	Default
`ir`	`BigSmilesMoleculeIR`	BigSmilesIR from parser	required

Returns:

Type	Description
`PolymerSpec`	PolymerSpec with segments, topology, and all monomers

Examples:

>>> from molpy.parser.smiles import parse_bigsmiles
>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> spec = bigsmilesir_to_polymerspec(ir)
>>> spec.topology
'homopolymer'

Source code in src/molpy/parser/smiles/converter.py

def bigsmilesir_to_polymerspec(ir: BigSmilesMoleculeIR) -> PolymerSpec:
    """
    Convert BigSmilesIR to complete polymer specification.

    Single responsibility: IR -> PolymerSpec conversion only.
    Parsing should be done separately.

    Extracts monomers and analyzes polymer topology and composition.

    Args:
        ir: BigSmilesIR from parser

    Returns:
        PolymerSpec with segments, topology, and all monomers

    Examples:
        >>> from molpy.parser.smiles import parse_bigsmiles
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> spec = bigsmilesir_to_polymerspec(ir)
        >>> spec.topology
        'homopolymer'
    """
    return extract_polymerspec_from_ir(ir)

parse_bigsmiles ¶

parse_bigsmiles(src)

Parse a BigSMILES string into BigSmilesMoleculeIR.

This parser accepts BigSMILES syntax including stochastic objects, bond descriptors, and repeat units. It does NOT accept GBigSMILES annotations.

Parameters:

Name	Type	Description	Default
`src`	`str`	BigSMILES string	required

Returns:

Type	Description
`BigSmilesMoleculeIR`	BigSmilesMoleculeIR containing backbone and stochastic objects

Raises:

Type	Description
`ValueError`	if syntax errors detected

Examples:

>>> ir = parse_bigsmiles("{[<]CC[>]}")
>>> len(ir.stochastic_objects)
1

Source code in src/molpy/parser/smiles/__init__.py

def parse_bigsmiles(src: str) -> BigSmilesMoleculeIR:
    """
    Parse a BigSMILES string into BigSmilesMoleculeIR.

    This parser accepts BigSMILES syntax including stochastic objects,
    bond descriptors, and repeat units. It does NOT accept GBigSMILES
    annotations.

    Args:
        src: BigSMILES string

    Returns:
        BigSmilesMoleculeIR containing backbone and stochastic objects

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> ir = parse_bigsmiles("{[<]CC[>]}")
        >>> len(ir.stochastic_objects)
        1
    """
    return _bigsmiles_parser.parse(src)

parse_cgsmiles ¶

parse_cgsmiles(src)

Parse a CGSmiles string.

Parameters:

Name	Type	Description	Default
`src`	`str`	CGSmiles string (e.g., "{[#PEO][#PMA]}.{#PEO=[$]COC[$]}")	required

Returns:

Type	Description
`CGSmilesIR`	CGSmilesIR with base graph and fragment definitions

Raises:

Type	Description
`ValueError`	if syntax errors detected

Examples:

>>> result = parse_cgsmiles("{[#PEO][#PMA][#PEO]}")
>>> len(result.base_graph.nodes)
3
>>> result = parse_cgsmiles("{[#PEO]|5}")
>>> len(result.base_graph.nodes)
5

Source code in src/molpy/parser/smiles/cgsmiles_parser.py

def parse_cgsmiles(src: str) -> CGSmilesIR:
    """Parse a CGSmiles string.

    Args:
        src: CGSmiles string (e.g., "{[#PEO][#PMA]}.{#PEO=[$]COC[$]}")

    Returns:
        CGSmilesIR with base graph and fragment definitions

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> result = parse_cgsmiles("{[#PEO][#PMA][#PEO]}")
        >>> len(result.base_graph.nodes)
        3
        >>> result = parse_cgsmiles("{[#PEO]|5}")
        >>> len(result.base_graph.nodes)
        5
    """
    return _parser.parse(src)
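
The second doctest shows that a repeat count such as `|5` yields five nodes. How the parser expands repeats internally is not shown in this page; the sketch below only illustrates one plausible expansion, assuming repeated nodes are chained linearly. `Node`, `Bond`, and `expand_linear` are hypothetical names that mirror the `CGSmilesNodeIR` and `CGSmilesBondIR` fields.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # simple id source, standing in for _generate_id()


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in for CGSmilesNodeIR."""
    label: str
    id: int = field(default_factory=lambda: next(_ids))


@dataclass(eq=False)
class Bond:
    """Hypothetical stand-in for CGSmilesBondIR; holds node objects, not ids."""
    node_i: Node
    node_j: Node
    order: int = 1


def expand_linear(units: list) -> tuple:
    """Expand (label, repeat) pairs into nodes and consecutive bonds.

    Assumption: repeats form a linear chain, one bond per adjacent pair.
    """
    nodes = [Node(label) for label, repeat in units for _ in range(repeat)]
    bonds = [Bond(a, b) for a, b in zip(nodes, nodes[1:])]
    return nodes, bonds


nodes, bonds = expand_linear([("PEO", 5)])
```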

parse_gbigsmiles ¶

parse_gbigsmiles(src)

Parse a GBigSMILES string into GBigSmilesSystemIR.

This parser accepts GBigSMILES syntax including all BigSMILES features plus system size specifications and other generative annotations. Always returns GBigSmilesSystemIR, wrapping single molecules in a system structure.

Parameters:

Name	Type	Description	Default
`src`	`str`	GBigSMILES string	required

Returns:

Type	Description
`GBigSmilesSystemIR`	GBigSmilesSystemIR containing the parsed system

Raises:

Type	Description
`ValueError`	if syntax errors detected

Examples:

>>> ir = parse_gbigsmiles("{[<]CC[>]}|5e5|")
>>> isinstance(ir, GBigSmilesSystemIR)
True

Source code in src/molpy/parser/smiles/__init__.py

def parse_gbigsmiles(src: str) -> GBigSmilesSystemIR:
    """
    Parse a GBigSMILES string into GBigSmilesSystemIR.

    This parser accepts GBigSMILES syntax including all BigSMILES
    features plus system size specifications and other generative
    annotations. Always returns GBigSmilesSystemIR, wrapping single
    molecules in a system structure.

    Args:
        src: GBigSMILES string

    Returns:
        GBigSmilesSystemIR containing the parsed system

    Raises:
        ValueError: if syntax errors detected

    Examples:
        >>> ir = parse_gbigsmiles("{[<]CC[>]}|5e5|")
        >>> isinstance(ir, GBigSmilesSystemIR)
        True
    """
    return _gbigsmiles_parser.parse(src)

parse_smiles ¶

parse_smiles(src)

Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.

This parser only accepts pure SMILES syntax. It will reject BigSMILES or GBigSMILES constructs.

For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of SmilesGraphIR, one for each disconnected component.

Parameters:

Name	Type	Description	Default
`src`	`str`	SMILES string (may contain dots for mixtures)	required

Returns:

Type	Description
`SmilesGraphIR \| list[SmilesGraphIR]`	SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures

Raises:

Type	Description
`ValueError`	if syntax errors detected or unclosed rings

Examples:

>>> ir = parse_smiles("CCO")
>>> len(ir.atoms)
3
>>> irs = parse_smiles("C.C")
>>> len(irs)
2

Source code in src/molpy/parser/smiles/__init__.py

def parse_smiles(src: str) -> SmilesGraphIR | list[SmilesGraphIR]:
    """
    Parse a SMILES string into SmilesGraphIR or list of SmilesGraphIR.

    This parser only accepts pure SMILES syntax. It will reject
    BigSMILES or GBigSMILES constructs.

    For dot-separated SMILES (e.g., "C.C", "CC.O"), returns a list of
    SmilesGraphIR, one for each disconnected component.

    Args:
        src: SMILES string (may contain dots for mixtures)

    Returns:
        SmilesGraphIR for single molecule, or list[SmilesGraphIR] for mixtures

    Raises:
        ValueError: if syntax errors detected or unclosed rings

    Examples:
        >>> ir = parse_smiles("CCO")
        >>> len(ir.atoms)
        3
        >>> irs = parse_smiles("C.C")
        >>> len(irs)
        2
    """
    return _smiles_parser.parse(src)
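
Because `parse_smiles` returns either a single graph or a list, calling code often normalizes the result first. A minimal sketch of that pattern, using a hypothetical `Graph` stand-in for `SmilesGraphIR` so that it runs without molpy; `as_graph_list` and `total_atoms` are illustrative helper names.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Graph:
    """Hypothetical stand-in for SmilesGraphIR."""
    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)


def as_graph_list(result: Union[Graph, list]) -> list:
    """Wrap a single graph in a list; pass a list of graphs through unchanged."""
    return result if isinstance(result, list) else [result]


def total_atoms(result: Union[Graph, list]) -> int:
    """Count atoms across all components, whatever shape the result has."""
    return sum(len(graph.atoms) for graph in as_graph_list(result))


single = Graph(atoms=["C", "C", "O"])           # shape returned for "CCO"
mixture = [Graph(atoms=["C"]), Graph(atoms=["C"])]  # shape returned for "C.C"
```

Normalizing once at the call site keeps downstream code free of repeated `isinstance` checks.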
