The dreid-typer library is organized as a pipeline in which data is transformed through a series of well-defined stages. The pipeline rests on three distinct data structures, each representing a different level of abstraction for a molecule. This document details these structures and their roles in the data flow.
The core philosophy is the evolution of information: from raw connectivity to a chemically-aware representation, and finally to a simulation-ready physical model.
The MolecularGraph is the sole entry point for user data into the system. It represents the lowest level of chemical abstraction: pure connectivity.
- Purpose: To define a molecule using only the most fundamental information: atoms (element and formal charge) and the bonds connecting them.
- Structure:
  - A list of `AtomNode`s.
  - A list of `BondEdge`s.
- Design Rationale:
  - Simplicity: The API is designed to be minimal and intuitive. Users are not required to pre-calculate any chemical properties like hybridization or ring membership. They simply define what they know: the atoms and their connections.
  - Decoupling: It decouples the user's representation of a molecule from the library's internal processing logic. This ensures a stable and predictable input contract.

```rust
// A conceptual representation of MolecularGraph's simplicity
pub struct MolecularGraph {
    pub atoms: Vec<AtomNode>, // each AtomNode { element, .. }
    pub bonds: Vec<BondEdge>, // each BondEdge { atom_ids, order, .. }
}
```

Once a MolecularGraph enters the pipeline, it is immediately converted into an AnnotatedMolecule (defined in perception::model). This is the most complex data structure in the library, serving as the central, chemically-aware workspace for the core algorithms.
- Purpose: To hold a rich, comprehensive set of perceived chemical properties for every atom. It is the single source of truth for the typing and building phases.
- Structure:
  - A list of `AnnotatedAtom`s, where each entry contains numerous fields:
    - Intrinsic properties (`element`, `formal_charge`).
    - Topological properties (`degree`, `is_in_ring`, `smallest_ring_size`).
    - Electronic properties (`lone_pairs`, `steric_number`, `hybridization`).
    - Aromaticity and resonance flags (`is_aromatic`, `is_anti_aromatic`, `is_resonant`).
  - An adjacency list for efficient neighbor traversal.
- Design Rationale:
  - Centralized Knowledge: By pre-calculating and storing all relevant properties in one place, the subsequent typing and building phases can be implemented as efficient, stateless queries against this data structure. This avoids redundant calculations.
  - Factual Immutability: The `AnnotatedMolecule` is constructed once during the Perception Phase and is treated as a read-only object thereafter. This immutability guarantees that the typing engine operates on a consistent and deterministic chemical context.
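
The "compute once, query afterwards" idea can be sketched roughly as follows. This is a minimal illustration and not the library's implementation: the trimmed-down structs, the `perceive_degrees` and `is_terminal` functions, and the use of plain index pairs for bonds are all assumptions made for the example. Only `degree` and the adjacency list are derived here.

```rust
/// Hypothetical, trimmed-down annotated atom: only two of the fields
/// listed above are modeled.
#[derive(Debug, Clone, PartialEq)]
pub struct AnnotatedAtom {
    pub element: String,
    pub degree: usize,
}

/// Hypothetical read-only workspace: atoms plus an adjacency list.
#[derive(Debug, Clone, PartialEq)]
pub struct AnnotatedMolecule {
    pub atoms: Vec<AnnotatedAtom>,
    pub adjacency: Vec<Vec<usize>>,
}

/// Builds the workspace once from raw connectivity. The input shape
/// (element symbols and bonds as pairs of atom indices) is an assumption.
pub fn perceive_degrees(elements: &[&str], bonds: &[(usize, usize)]) -> AnnotatedMolecule {
    let mut adjacency: Vec<Vec<usize>> = vec![Vec::new(); elements.len()];
    for &(a, b) in bonds {
        adjacency[a].push(b);
        adjacency[b].push(a);
    }
    let atoms: Vec<AnnotatedAtom> = elements
        .iter()
        .zip(&adjacency)
        .map(|(element, neighbors)| AnnotatedAtom {
            element: element.to_string(),
            degree: neighbors.len(),
        })
        .collect();
    AnnotatedMolecule { atoms, adjacency }
}

/// A later phase only reads from the workspace; nothing is recomputed.
pub fn is_terminal(molecule: &AnnotatedMolecule, atom: usize) -> bool {
    molecule.atoms[atom].degree == 1
}
```

Because `is_terminal` takes a shared reference, the compiler itself enforces the read-only treatment that the design rationale describes.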
The MolecularTopology is the final product of the pipeline. It is a clean, structured representation tailored specifically for consumption by molecular simulation engines.
- Purpose: To provide a complete list of all particles and interaction terms (bonds, angles, torsions, inversions) required to define a DREIDING force field model.
- Structure:
  - A list of final `Atom`s, now including their assigned `atom_type`.
  - Deduplicated lists of `Bond`s, `Angle`s, `Torsion`s, and `Inversion`s.
- Design Rationale:
  - Simulation-Oriented: The structure directly maps to the needs of a simulation setup. It discards intermediate perception data (like `lone_pairs` or `steric_number`) that is not directly part of the final force field definition.
  - Canonical Representation: Each topological component (`Angle`, `Torsion`, `Inversion`) is stored in a canonical form (e.g., atom indices are sorted). This simplifies consumption by downstream tools, as it eliminates ambiguity and the need for further deduplication.
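
To make the canonical-form idea concrete, here is a small sketch of how canonical construction lets duplicates collapse automatically. The source only states that indices are sorted as an example, so the specific ordering rules below (smaller end atom first for angles, lexicographically smaller direction for torsions) are assumptions. The `Angle` and `Torsion` definitions are hypothetical stand-ins, not the library's types.

```rust
use std::collections::BTreeSet;

/// Hypothetical angle term: `center` is the apex atom, `ends` the outer atoms.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Angle {
    pub ends: (usize, usize),
    pub center: usize,
}

impl Angle {
    /// Assumed convention: the smaller end index comes first, so
    /// i-j-k and k-j-i produce the same value.
    pub fn new(i: usize, center: usize, k: usize) -> Self {
        Angle { ends: (i.min(k), i.max(k)), center }
    }
}

/// Hypothetical torsion term over the chain i-j-k-l.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Torsion(pub [usize; 4]);

impl Torsion {
    /// Assumed convention: keep whichever of i-j-k-l and l-k-j-i is
    /// lexicographically smaller. The chain order itself is preserved.
    pub fn new(i: usize, j: usize, k: usize, l: usize) -> Self {
        Torsion(std::cmp::min([i, j, k, l], [l, k, j, i]))
    }
}

/// Collects raw (i, center, k) triples into a set; because every value is
/// built in canonical form, mirrored duplicates compare equal and collapse.
pub fn dedup_angles(raw: &[(usize, usize, usize)]) -> Vec<Angle> {
    let set: BTreeSet<Angle> = raw.iter().map(|&(i, j, k)| Angle::new(i, j, k)).collect();
    set.into_iter().collect()
}
```

Note that a torsion cannot simply have all four indices sorted, since that would change which atoms are bonded in the chain; reversing the whole chain is the only symmetry this sketch exploits.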
The entire process is a single, deterministic flow from the simplest input to the most structured output.

```mermaid
graph LR
    subgraph "User-Facing API"
        A(<b>MolecularGraph</b><br><i>Simple & Raw</i>)
    end
    subgraph "Internal Processing"
        B(<b>AnnotatedMolecule</b><br><i>Complex & Chemically-Aware</i>)
    end
    subgraph "Engine-Ready Output"
        C(<b>MolecularTopology</b><br><i>Structured & Canonical</i>)
    end
    A -- "<b>Phase 1: Perception</b><br>Information is created and annotated" --> B
    B -- "<b>Phase 2 & 3: Typing & Building</b><br>Information is queried and transformed" --> C
```

- Input to Workspace (`MolecularGraph` -> `AnnotatedMolecule`): The `perception::perceive` function acts as the constructor for the `AnnotatedMolecule`. It takes the minimal `MolecularGraph` and performs all necessary chemical computations to build a fully annotated, "intelligent" graph. This is the most computationally intensive part of the process, where raw data is converted into chemical knowledge.
- Workspace to Output (`AnnotatedMolecule` -> `MolecularTopology`): The `typing::engine::assign_types` and `builder::build_topology` functions work in concert to transform the rich `AnnotatedMolecule` into the final, lean `MolecularTopology`. This stage is not about discovering new information, but rather about querying the existing knowledge and formatting it according to the rules of the DREIDING model. The typing engine queries atomic properties to assign types, and the builder queries connectivity to generate geometric terms.
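
The shape of this three-stage flow can be illustrated with toy stand-ins. Everything below is an assumption made for the sketch: the `Graph`, `Annotated`, and `Topology` structs, the function signatures, and the placeholder type labels are invented here and do not reflect the real dreid-typer API or the DREIDING typing rules. The function names merely echo the stages named above.

```rust
/// Raw input: element symbols and bonds as index pairs (assumed shape).
pub struct Graph {
    pub elements: Vec<String>,
    pub bonds: Vec<(usize, usize)>,
}

/// Workspace: per-atom neighbor lists derived from the bonds.
pub struct Annotated {
    pub elements: Vec<String>,
    pub neighbors: Vec<Vec<usize>>,
}

/// Output: one type label per atom plus angle triples (i, center, k).
pub struct Topology {
    pub atom_types: Vec<String>,
    pub angles: Vec<(usize, usize, usize)>,
}

/// Perception: the only stage that derives new information.
pub fn perceive(graph: &Graph) -> Annotated {
    let mut neighbors: Vec<Vec<usize>> = vec![Vec::new(); graph.elements.len()];
    for &(a, b) in &graph.bonds {
        neighbors[a].push(b);
        neighbors[b].push(a);
    }
    for list in &mut neighbors {
        list.sort_unstable();
    }
    Annotated { elements: graph.elements.clone(), neighbors }
}

/// Typing: a read-only query. The label format is a placeholder,
/// not a DREIDING atom type.
pub fn assign_types(molecule: &Annotated) -> Vec<String> {
    molecule
        .elements
        .iter()
        .zip(&molecule.neighbors)
        .map(|(element, list)| format!("{}/degree{}", element, list.len()))
        .collect()
}

/// Building: pairs up the sorted neighbors of each center atom, so every
/// angle is emitted exactly once with its smaller end index first.
pub fn build_topology(molecule: &Annotated, atom_types: Vec<String>) -> Topology {
    let mut angles = Vec::new();
    for (center, list) in molecule.neighbors.iter().enumerate() {
        for (pos, &i) in list.iter().enumerate() {
            for &k in &list[pos + 1..] {
                angles.push((i, center, k));
            }
        }
    }
    Topology { atom_types, angles }
}
```

In this sketch only `perceive` touches the raw bonds; the two later functions take the workspace by shared reference, mirroring the query-only role described for the typing and building phases.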
By strictly separating these data representations, dreid-typer achieves a clean architecture that is both robust and easy to reason about.