Line notations describe a molecule in a single string. File formats describe the rest — coordinates, stereo bonds drawn explicitly, molecule sets, reactions, crystal cells, attached metadata, attached spectra. The PDB format (1971), MDL's Mol/SDF family (1981 onwards), and the IUCr's CIF (1991) are the load-bearing pieces of this ecosystem; everything else extends or substitutes for them. This is the second of three articles on how chemistry gets written down — the first covered line notations, the third covers spectroscopy formats.
Why file formats and notations are different problems
Caffeine is 28 characters in SMILES (CN1C=NC2=C1C(=O)N(C(=O)N2C)C) and several hundred bytes in a Mol V2000 file, as measured from the output of RDKit's Chem.MolToMolBlock. The Mol file is bigger because it carries more: 2D layout coordinates, a name field, and fixed slots for hydrogen counts, query atoms, R-group attachment points, and isotope labels, most of which a notation leaves implicit. The trade is what you'd expect: notations are short and computable, file formats are verbose and human-editable, and nearly every chemistry toolkit has to read both.
Three jobs
- Single-molecule: store one molecule with full 2D/3D detail. Mol files (V2000, V3000), Mol2, CDXML, XYZ.
- Multi-molecule: bundle many molecules with metadata. SDF, Mol2 collections, multi-record CDXML.
- Structural / crystallographic: proteins, polymers, crystals. PDB, mmCIF, CIF.
Reaction files (MDL RXN, RDfile) are covered in Reactions and retrosynthesis. They share the Mol-block format with the rest of this article, but the surrounding semantics belong with reaction notations and retrosynthesis tools.
A chronological tour
Same shape as the line-notations article: each format gets a numbered block with what it adds and how it differs from the row above. Ethanol is the running example where it makes sense; PDB and CIF use a small protein or crystal structure instead.
1. PDB — 1971
Protein Data Bank format. Fixed-column ASCII; every line begins with a record-type keyword (HEADER, ATOM, HETATM, CONECT). Designed for X-ray crystal structures of proteins; predates SMILES by 17 years. Still the lingua franca for biomolecular structure, despite the field's slow migration to mmCIF.
What it adds: 3D coordinates, residue chains, and secondary-structure annotations — none of which a small-molecule format tries to carry.
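Fixed-column means a reader slices each record by character position rather than splitting on whitespace, because neighbouring fields may touch (a coordinate such as -100.000 fills its whole 8-character field). The sketch below assumes the standard ATOM column layout and uses a made-up alanine atom. The helper names `format_atom` and `parse_atom` are illustrative, not part of any library.

```python
def format_atom(serial, name, res_name, chain, res_seq, x, y, z, element):
    """Build one ATOM record with each field in its fixed column range."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


def parse_atom(line):
    """Slice an ATOM/HETATM record by column position, not by whitespace."""
    return {
        "record": line[0:6].strip(),    # columns 1-6
        "serial": int(line[6:11]),      # columns 7-11
        "name": line[12:16].strip(),    # columns 13-16
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "element": line[76:78].strip(),
    }


# Made-up alanine alpha carbon, for illustration only.
line = format_atom(2, " CA", "ALA", "A", 1, 11.104, 6.134, -6.504, "C")
atom = parse_atom(line)
print(atom["name"], atom["chain"], atom["xyz"])
```

A production reader would also handle HETATM, CONECT, alternate locations, and insertion codes. This sketch covers only the fields named in the dictionary.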
2. MDL Mol — V2000 — 1981
Molfile, one member of the CTfile (connection-table file) family. Originally from MDL Information Systems, now maintained by BIOVIA / Dassault. The workhorse: nearly every cheminformatics tool reads it. Layout: header (3 lines) + counts line + atom block + bond block + property block + M  END.
What it adds: 2D / 3D coordinates per atom, explicit bond stereo (wedge, hash, or either), query bits in the property block, names, and comments, none of which fit in a plain SMILES.
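To make the block layout concrete, here is a minimal sketch that reads the counts line and then slices the atom and bond blocks by column. The ethanol record is hand-written with heavy atoms only and idealised coordinates, and `parse_v2000` is a hypothetical helper. It ignores the property block and most atom and bond columns.

```python
ETHANOL_MOL = "\n".join([
    "ethanol",                       # header line 1: molecule name
    "  hand-written example",        # header line 2: program / timestamp
    "",                              # header line 3: comment
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    2.2500    1.2990    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "  2  3  1  0",
    "M  END",
])


def parse_v2000(block):
    """Read name, atoms, and bonds from a V2000 Mol block."""
    lines = block.splitlines()
    counts = lines[3]
    # Atom and bond counts are 3-character fields, hence the 999 limit.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        atoms.append({
            "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
            "symbol": line[31:34].strip(),
        })
    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        bonds.append({
            "begin": int(line[0:3]),
            "end": int(line[3:6]),
            "order": int(line[6:9]),
            "stereo": int(line[9:12]),  # 0 = none, 1 = wedge, 6 = hash, 4 = either
        })
    return {"name": lines[0], "atoms": atoms, "bonds": bonds}


mol = parse_v2000(ETHANOL_MOL)
print(mol["name"], [a["symbol"] for a in mol["atoms"]])
```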
3. SDF — 1982
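Structure-Data File. Each record is a V2000 (or V3000) Mol block, followed by zero or more data items that each start with a > <tagname> header line, and closed by a $$$$ separator line. So each record is one molecule plus arbitrary tag/value metadata. Most public datasets ship as SDF, including PubChem, ChEMBL, and ZINC.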
What it adds: arbitrary key/value metadata per molecule, without inventing a new schema. Property tags can be anything: assay results, vendor IDs, computed descriptors. The format that turned the V2000 Mol from a structure container into a mini-database row.
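The record structure can be sketched in a few lines: split on `$$$$`, cut each record at `M  END`, and read the tagged data items that follow. The two records, the tag names `NAME` and `VENDOR_ID`, and the value `HYPO-0001` are invented for illustration. `iter_sdf_records` and `split_record` are hypothetical helpers.

```python
import re

SDF_TEXT = """\
methane
  hand-written example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
methane

> <VENDOR_ID>
HYPO-0001

$$$$
water
  hand-written example

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <NAME>
water

$$$$
"""


def iter_sdf_records(text):
    """Yield each record as a list of lines, split on the $$$$ separator."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)


def split_record(lines):
    """Return (mol_block, tags) for one SDF record."""
    end = next(i for i, l in enumerate(lines) if l.startswith("M  END"))
    tags, key = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            key = re.search(r"<([^>]+)>", line).group(1)
            tags[key] = []
        elif not line.strip():
            key = None                  # a blank line closes the data item
        elif key is not None:
            tags[key].append(line)
    mol_block = "\n".join(lines[:end + 1])
    return mol_block, {k: "\n".join(v) for k, v in tags.items()}


records = [split_record(r) for r in iter_sdf_records(SDF_TEXT)]
print([tags for _, tags in records])
```

All tag values come back as strings. SDF itself records no types, so any numeric interpretation is the reader's decision.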
4. Mol2 — 1991
Tripos Mol2. ASCII; sections delimited by @<TRIPOS> headers (ATOM, BOND, MOLECULE, etc.). Adds explicit atom types in the Tripos / Sybyl atom-typing scheme — useful for force-field assignment in molecular dynamics.
What it adds: first-class atom types, partial charges per atom, and an explicit substructure section. Ergonomic for docking and MD pipelines that need typed atoms before geometry.
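A rough sketch of how the sectioned layout reads: collect lines under each `@<TRIPOS>` header, then pull the atom type and charge out of the whitespace-separated ATOM fields. The ethanol record lists heavy atoms only, and its charges are placeholder numbers, not computed values. `parse_mol2_sections` and `mol2_atoms` are hypothetical helpers.

```python
MOL2_TEXT = """\
@<TRIPOS>MOLECULE
ethanol
3 2 1
SMALL
USER_CHARGES

@<TRIPOS>ATOM
1 C1 0.0000 0.0000 0.0000 C.3 1 ETH -0.10
2 C2 1.5000 0.0000 0.0000 C.3 1 ETH 0.30
3 O1 2.2500 1.2990 0.0000 O.3 1 ETH -0.20
@<TRIPOS>BOND
1 1 2 1
2 2 3 1
"""

TAG = "@<TRIPOS>"


def parse_mol2_sections(text):
    """Group non-blank lines under their @<TRIPOS> section name."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith(TAG):
            current = line[len(TAG):].strip()
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.split())
    return sections


def mol2_atoms(sections):
    """Extract name, coordinates, Sybyl atom type, and charge per atom."""
    atoms = []
    for f in sections.get("ATOM", []):
        atoms.append({
            "name": f[1],
            "xyz": tuple(float(v) for v in f[2:5]),
            "type": f[5],                                  # e.g. C.3 = sp3 carbon
            "charge": float(f[8]) if len(f) > 8 else None,  # charge is optional
        })
    return atoms


sections = parse_mol2_sections(MOL2_TEXT)
atoms = mol2_atoms(sections)
print([(a["name"], a["type"], a["charge"]) for a in atoms])
```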
5. CIF — 1991
Crystallographic Information File. IUCr-standardised. Self-describing: every value sits under a _data name, the dictionary defines the semantics. Standard for small-molecule crystallography (mmCIF is the macromolecular extension). Carries unit cell, symmetry operations, atom fractional coordinates, displacement parameters, refinement statistics.
What it adds: a self-describing format whose dictionary is the schema. New domains extend by registering new keys; old readers ignore unknown keys without breaking.
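The "unknown keys are harmless" property is easy to see in a toy reader. The sketch below handles only single-line `_name value` items and simple `loop_` tables (no multi-line text fields, no uncertainties such as `5.6402(3)`), so it covers a small slice of real CIF syntax. The sodium chloride values are illustrative, `_hypothetical_future_key` is an invented data name, and `parse_simple_cif` is a hypothetical helper.

```python
import shlex

CIF_TEXT = """\
data_NaCl_example
_chemical_name_common 'sodium chloride'
_cell_length_a 5.6402
_cell_length_b 5.6402
_cell_length_c 5.6402
_symmetry_space_group_name_H-M 'F m -3 m'
_hypothetical_future_key 42
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
Na1 0.0 0.0 0.0
Cl1 0.5 0.5 0.5
"""


def parse_simple_cif(text):
    """Return (items, loops) for a single-block CIF with one-line values."""
    items, loops = {}, []
    lines = [l.strip() for l in text.splitlines()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == "loop_":
            i += 1
            names = []
            while i < len(lines) and lines[i].startswith("_"):
                names.append(lines[i])
                i += 1
            rows = []
            while (i < len(lines) and lines[i]
                   and not lines[i].startswith(("_", "loop_", "data_", "#"))):
                rows.append(dict(zip(names, shlex.split(lines[i]))))
                i += 1
            loops.append(rows)
        elif line.startswith("_"):
            parts = shlex.split(line)   # shlex keeps quoted values together
            items[parts[0]] = parts[1] if len(parts) > 1 else None
            i += 1
        else:
            i += 1                      # data_ header, comment, or blank line
    return items, loops


items, loops = parse_simple_cif(CIF_TEXT)
# A reader asks only for the names it understands; extra keys are ignored.
cell = [float(items[f"_cell_length_{axis}"]) for axis in "abc"]
print(cell, [row["_atom_site_label"] for row in loops[0]])
```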
6. XYZ — 1995
Atom count + comment + N lines of element x y z. That's the whole format. No bonds, no charges, no stereo, just geometry. Widely accepted as a geometry input or export format by computational-chemistry packages.
What it adds: the simplest possible 3D-coordinates container. When you only need geometry and you're going to redo the bonding from scratch anyway, XYZ is the right minimum.
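Because the format is three parts and nothing else, a reader and a writer fit in a dozen lines. This sketch uses approximate water coordinates in ångströms. `parse_xyz` and `write_xyz` are hypothetical helper names.

```python
WATER_XYZ = """\
3
water, approximate geometry
O 0.000 0.000 0.000
H 0.757 0.586 0.000
H -0.757 0.586 0.000
"""


def parse_xyz(text):
    """Return (comment, atoms) where atoms is a list of (symbol, x, y, z)."""
    lines = text.splitlines()
    n_atoms = int(lines[0])             # line 1: atom count
    comment = lines[1]                  # line 2: free-text comment
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    return comment, atoms


def write_xyz(comment, atoms):
    """Serialise atoms back to XYZ text with six decimal places."""
    body = [f"{s} {x:.6f} {y:.6f} {z:.6f}" for s, x, y, z in atoms]
    return "\n".join([str(len(atoms)), comment, *body]) + "\n"


comment, atoms = parse_xyz(WATER_XYZ)
round_tripped = parse_xyz(write_xyz(comment, atoms))
print(atoms[0], round_tripped == (comment, atoms))
```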
7. CML — 1999
Chemical Markup Language. XML, namespaced. Murray-Rust et al., 1999. The ambition: an open, standards-based interchange format for everything (small molecules, polymers, reactions, spectra). Implemented widely in the Java cheminformatics world (CDK, JChemPaint); less common elsewhere.
What it adds: a single hierarchical schema for any chemistry payload. Strong validation, transformable with XSLT, but verbose (an ethanol record is ~30 lines).
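Since CML is ordinary namespaced XML, any XML library can read it. The sketch below parses a hand-written, heavy-atom ethanol record with Python's standard `xml.etree.ElementTree`. The record uses only a small subset of CML elements and attributes, and `parse_cml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

ETHANOL_CML = """\
<molecule xmlns="http://www.xml-cml.org/schema" id="ethanol">
  <atomArray>
    <atom id="a1" elementType="C" hydrogenCount="3"/>
    <atom id="a2" elementType="C" hydrogenCount="2"/>
    <atom id="a3" elementType="O" hydrogenCount="1"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a2 a3" order="1"/>
  </bondArray>
</molecule>
"""


def parse_cml(text):
    """Return ({atom_id: element}, [((begin_id, end_id), order), ...])."""
    root = ET.fromstring(text)
    atoms = {
        atom.get("id"): atom.get("elementType")
        for atom in root.findall("cml:atomArray/cml:atom", CML_NS)
    }
    bonds = [
        (tuple(bond.get("atomRefs2").split()), bond.get("order"))
        for bond in root.findall("cml:bondArray/cml:bond", CML_NS)
    ]
    return atoms, bonds


cml_atoms, cml_bonds = parse_cml(ETHANOL_CML)
print(cml_atoms, cml_bonds)
```

Bonds refer to atoms by id rather than by position, which is what lets the same document be reordered or transformed without renumbering.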
8. CDXML / CDX — 2002
ChemDraw's native format. CDX is binary, CDXML is the XML serialisation of the same. Carries layout (font, kerning, bond angles) as well as structure, because ChemDraw is a drawing tool first. Every chemist with a Mac knows this format whether they want to or not.
What it adds: publication-ready 2D drawings. The only format that round-trips through journal submission systems without losing the figure.
9. Mol V3000 — 2002
Extended Mol file. Lifts V2000's hard limit of 999 atoms / 999 bonds, and introduces a tagged-line format that's more parser-friendly than the fixed-column V2000 layout. Adds proper Sgroup support, enhanced stereochemistry, and templates.
What it adds: scale (peptides, polymers, complex natural products) and unambiguous extension points. V2000 is still more common in the wild; V3000 is what you reach for when V2000 runs out of columns.
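The tagged-line layout can be read by splitting on whitespace instead of counting columns. A minimal sketch, assuming a hand-written heavy-atom ethanol block and ignoring continuation lines, Sgroups, and per-atom key=value properties. `parse_v3000` is a hypothetical helper.

```python
ETHANOL_V3000 = "\n".join([
    "ethanol",
    "  hand-written example",
    "",
    "  0  0  0     0  0            999 V3000",
    "M  V30 BEGIN CTAB",
    "M  V30 COUNTS 3 2 0 0 0",
    "M  V30 BEGIN ATOM",
    "M  V30 1 C 0 0 0 0",
    "M  V30 2 C 1.5 0 0 0",
    "M  V30 3 O 2.25 1.299 0 0",
    "M  V30 END ATOM",
    "M  V30 BEGIN BOND",
    "M  V30 1 1 1 2",
    "M  V30 2 1 2 3",
    "M  V30 END BOND",
    "M  V30 END CTAB",
    "M  END",
])

PREFIX = "M  V30 "


def parse_v3000(block):
    """Collect atoms and bonds from the BEGIN/END sections of a V3000 block."""
    atoms, bonds, section = [], [], None
    for line in block.splitlines():
        if not line.startswith(PREFIX):
            continue                    # header lines and the final M  END
        fields = line[len(PREFIX):].split()
        if not fields:
            continue
        if fields[0] == "BEGIN":
            section = fields[1]
        elif fields[0] == "END":
            section = None
        elif section == "ATOM":         # index, type, x, y, z, atom-atom map
            atoms.append({
                "index": int(fields[0]),
                "symbol": fields[1],
                "xyz": tuple(float(v) for v in fields[2:5]),
            })
        elif section == "BOND":         # index, order, begin atom, end atom
            bonds.append({
                "index": int(fields[0]),
                "order": int(fields[1]),
                "begin": int(fields[2]),
                "end": int(fields[3]),
            })
    return atoms, bonds


v3_atoms, v3_bonds = parse_v3000(ETHANOL_V3000)
print([a["symbol"] for a in v3_atoms], v3_bonds)
```

Note that the bond line puts the order before the atom indices, the reverse of V2000.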
10. mmCIF — 2014
Macromolecular CIF. The PDB archive's official successor to the legacy PDB format. Same self-describing model as small-molecule CIF, plus dictionaries for polymers, ligands, refinement, and validation. As of 2014, the RCSB PDB and PDBe both treat mmCIF as the primary archive format.
What it adds: a single dictionary-driven format for every structural-biology payload, replacing PDB's column-fixed legacy. New record types extend the dictionary instead of the format.
11. CXSMILES — c. 2007
ChemAxon's extended SMILES. A SMILES string followed by a space and an extension block enclosed in vertical bars, which can hold 2D coordinates, atom labels, Sgroup definitions, and query bits: most of what a Mol file carries that a plain SMILES cannot. The bridge between "single-line notation" and "file format". Both RDKit and OpenBabel parse the standard CXSMILES subset.
What it adds: a one-line representation that retains the Mol file's structural detail. When you want SMILES ergonomics but cannot lose 2D coordinates, CXSMILES is the right tool.
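A small sketch of the split between the SMILES part and the extension block, reading only the coordinate feature. It assumes the coordinates come first inside the bars and that an empty z field means 0; real extension blocks can hold several comma-separated features in any order. `split_cxsmiles` and `parse_coordinates` are hypothetical helpers.

```python
def split_cxsmiles(text):
    """Return (smiles, extension) where extension is the text between bars."""
    smiles, _, rest = text.strip().partition(" ")
    rest = rest.strip()
    if rest.startswith("|") and rest.count("|") >= 2:
        return smiles, rest[1:rest.index("|", 1)]
    return smiles, ""                   # plain SMILES, no extension block


def parse_coordinates(extension):
    """Read a leading (x,y,z;x,y,z;...) feature into a list of float tuples."""
    if not extension.startswith("("):
        return []
    body = extension[1:extension.index(")")]
    coords = []
    for triple in body.split(";"):
        parts = (triple.split(",") + ["", "", ""])[:3]
        # Empty fields (commonly z in a 2D layout) are read as 0.0 here.
        coords.append(tuple(float(p) if p else 0.0 for p in parts))
    return coords


smiles, extension = split_cxsmiles("CCO |(0,0,;1.5,0,;2.25,1.299,)|")
coords = parse_coordinates(extension)
print(smiles, coords)
```

Coordinates are listed in the same order as the atoms appear in the SMILES string, so the pairing is positional.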
Side-by-side
What each format actually carries. The columns describe data the format is specified to encode, drawn from each format's published specification. Size comparisons are intentionally absent: they vary widely with coordinate precision, line endings, and embedded metadata, and a byte-level audit of verified sizes is still in progress.
| Format | Year | Kind | Carries | What it adds |
|---|---|---|---|---|
| PDB | 1971 | Structural | Atoms, residues, chains, secondary structure, het records | Lingua franca for biomolecular structure |
| Mol V2000 | 1981 | Single-mol | Atoms, bonds, 2D/3D coords, bond stereo, query bits | Workhorse 2D-aware container |
| SDF | 1982 | Multi-mol | N × Mol blocks, tag/value metadata per record | Per-record key/value metadata |
| Mol2 | 1991 | Single-mol | Tripos atom types, partial charges, substructures | First-class typed atoms for MD/docking |
| CIF | 1991 | Crystal | Unit cell, symmetry ops, atom fractional coords, dictionary | Self-describing dictionary-extensible format |
| XYZ | 1995 | Geometry | N atom-element-coord lines; nothing else | Minimum 3D-coords container |
| CML | 1999 | Single-mol | Atoms, bonds, namespaces; XML schema-validatable | Single hierarchical schema for chemistry XML |
| CDXML | 2002 | Single-mol | Atoms, bonds, fonts, kerning, drawing-element layout | Publication-ready 2D drawings |
| Mol V3000 | 2002 | Single-mol | V2000 features + tagged lines; Sgroups; >999 atoms | Scale + parser-friendly extension points |
| mmCIF | 2014 | Structural | PDB content via CIF dictionary; ligand + validation dicts | Dictionary-extensible structural archive |
| CXSMILES | c. 2007 | Single-mol | SMILES + 2D coords + Sgroups + atom labels in one line | Mol-file detail in a single SMILES line |
Bringing them together
File formats are the second axis of fragmentation. A chemist sketching a structure in ChemDraw, a curator enriching it in PubChem, and a docking pipeline pulling it from ZINC are looking at the same atoms, but the on-disk shape differs in bytes, fields, and conventions.
chempirical's plan: the same graph-shaped database engine that backs the search box also handles every format here, on the way in. Drop a Mol file, an SDF, a CDXML, a CIF, and the engine normalises into the same internal graph — connectivity is the source of truth; format is just a serialisation. On the way out, the same graph re-emits to whatever format the caller asks for. Round-trip fidelity is verified by hashing the InChI of the parsed structure, not the bytes on disk.
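The round-trip check described above can be sketched as a comparison of hashed InChI strings before and after re-emission. Everything here is an assumption made for illustration: `round_trip_ok` and `structure_hash` are hypothetical names, the `parse`, `emit`, and `to_inchi` callables stand in for whatever the engine provides, and the demo uses toy stand-ins (a whitespace-stripping "parser" and a one-entry InChI lookup) rather than a real toolkit.

```python
import hashlib

# Real InChI for ethanol; the lookup table stands in for an InChI generator.
TOY_INCHI = {"CCO": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"}


def structure_hash(inchi):
    """Hash the InChI string, so the check is about structure, not bytes."""
    return hashlib.sha256(inchi.encode("utf-8")).hexdigest()


def round_trip_ok(data, parse, emit, to_inchi):
    """Parse, re-emit, re-parse, and compare InChI hashes of both graphs."""
    before = parse(data)
    after = parse(emit(before))
    return structure_hash(to_inchi(before)) == structure_hash(to_inchi(after))


# Toy stand-ins: the input and the re-emitted text differ in line endings,
# but both normalise to the same structure.
ok = round_trip_ok(
    "CCO\r\n",
    parse=str.strip,
    emit=lambda graph: graph + "\n",
    to_inchi=TOY_INCHI.__getitem__,
)
print(ok)
```

The design point is that two files can differ in every byte and still pass, while a dropped stereo centre or changed bond changes the InChI and fails.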