Cheminformatics file formats — from 1971 to today

Line notations describe a molecule in a single string. File formats describe the rest — coordinates, stereo bonds drawn explicitly, molecule sets, reactions, crystal cells, attached metadata, attached spectra. The PDB format (1971), MDL's Mol/SDF family (1981 onwards), and the IUCr's CIF (1991) are the load-bearing pieces of this ecosystem; everything else extends or substitutes for them. This is the second of three articles on how chemistry gets written down — the first covered line notations, the third covers spectroscopy formats.

Why file formats and notations are different problems

Caffeine is 28 characters in SMILES (CN1C=NC2=C1C(=O)N(C(=O)N2C)C) and several hundred bytes in a Mol V2000 file — measured from RDKit's Chem.MolToMolBlock. The Mol file gets bigger because it carries more: 2D layout coordinates, an explicit hydrogen count, query atoms, R-group attachment points, isotope labels, and a name field — all of which a notation leaves implicit. The trade is exactly what you'd expect: notations are short and computable, file formats are verbose and human-editable, and every chemistry tool in the world has to read both.

Three jobs

Single-molecule — store one molecule with full 2D/3D detail. Mol files (V2000, V3000), Mol2, CDXML, XYZ.
Multi-molecule — bundle many molecules with metadata. SDF, MOL2 collections, multi-record CDXML.
Structural / crystallographic — proteins, polymers, crystals. PDB, mmCIF, CIF.

Reaction files (MDL RXN, RDfile) live in Reactions and retrosynthesis — they share the Mol-block format with the rest of this article but the surrounding semantics belong with reaction notations and retrosynthesis tools.

A chronological tour

Same shape as the line-notations article: each format gets a numbered block with what it adds and how it differs from the row above. Ethanol is the running example where it makes sense; PDB and CIF use a small protein or crystal structure instead.

1. PDB — 1971

Protein Data Bank format. Fixed-column ASCII; every line begins with a record-type keyword (HEADER, ATOM, HETATM, CONECT). Designed for X-ray crystal structures of proteins; predates SMILES by 17 years. Still the lingua franca for biomolecular structure, despite the field's slow migration to mmCIF.

What it adds: 3D coordinates, residue chains, and secondary-structure annotations — none of which a small-molecule format tries to carry.

2. MDL Mol — V2000 — 1981

Molfile, also known as the CTfile (connection-table file). Originally MDL Information Systems, now maintained by BIOVIA / Dassault. The workhorse: every cheminformatics tool reads it. Header (3 lines) + counts line + atom block + bond block + property block + M END.

What it adds: 2D / 3D coordinates per atom, explicit bond stereo (wedge/dash/up/down), query bits in the property block, names, and comments — none of which fit in a SMILES.

3. SDF — 1982

Structure-Data File. A V2000 (or V3000) Mol block followed by > <tagname> ... $$$$ separator. Each record is one molecule plus arbitrary tag/value metadata. Most public datasets ship as SDF — PubChem, ChEMBL, ZINC.

What it adds: arbitrary key/value metadata per molecule, without inventing a new schema. Property tags can be anything: assay results, vendor IDs, computed descriptors. The format that turned the V2000 Mol from a structure container into a mini-database row.

4. Mol2 — 1991

Tripos Mol2. ASCII; sections delimited by @<TRIPOS> headers (ATOM, BOND, MOLECULE, etc.). Adds explicit atom types in the Tripos / Sybyl atom-typing scheme — useful for force-field assignment in molecular dynamics.

What it adds: first-class atom types, partial charges per atom, and an explicit substructure section. Ergonomic for docking and MD pipelines that need typed atoms before geometry.

5. CIF — 1991

Crystallographic Information File. IUCr-standardised. Self-describing: every value sits under a _data name, the dictionary defines the semantics. Standard for small-molecule crystallography (mmCIF is the macromolecular extension). Carries unit cell, symmetry operations, atom fractional coordinates, displacement parameters, refinement statistics.

What it adds: a self-describing format whose dictionary is the schema. New domains extend by registering new keys; old readers ignore unknown keys without breaking.

6. XYZ — 1995

Atom count + comment + N lines of element x y z. That's the whole format. No bonds, no charges, no stereo — just geometry. The default input format for most computational-chemistry packages.

What it adds: the simplest possible 3D-coordinates container. When you only need geometry and you're going to redo the bonding from scratch anyway, XYZ is the right minimum.

7. CML — 1999

Chemical Markup Language. XML, namespaced. Murray-Rust et al., 1999. The ambition: an open, standards-based interchange format for everything (small molecules, polymers, reactions, spectra). Implemented widely in the Java cheminformatics world (CDK, JChemPaint); less common elsewhere.

What it adds: a single hierarchical schema for any chemistry payload. Strong validation, transformable with XSLT, but verbose (an ethanol record is ~30 lines).

8. CDXML / CDX — 2002

ChemDraw's native format. CDX is binary, CDXML is the XML serialisation of the same. Carries layout (font, kerning, bond angles) as well as structure, because ChemDraw is a drawing tool first. Every chemist with a Mac knows this format whether they want to or not.

What it adds: publication-ready 2D drawings. The only format that round-trips through journal submission systems without losing the figure.

9. Mol V3000 — 2002

Extended Mol file. Lifts V2000's hard limit of 999 atoms / 999 bonds, and introduces a tagged-line format that's more parser-friendly than the fixed-column V2000 layout. Adds proper Sgroup support, enhanced stereochemistry, and templates.

What it adds: scale (peptides, polymers, complex natural products) and unambiguous extension points. V2000 is still more common in the wild; V3000 is what you reach for when V2000 runs out of columns.

10. mmCIF — 2014

Macromolecular CIF. PDB's official replacement for the legacy PDB format. Same self-describing model as small-molecule CIF, plus dictionaries for polymers, ligands, refinement, and validation. As of 2014, the RCSB PDB and PDBe both treat mmCIF as the primary archive format.

What it adds: a single dictionary-driven format for every structural-biology payload, replacing PDB's column-fixed legacy. New record types extend the dictionary instead of the format.

11. CXSMILES — c. 2007

ChemAxon's extended SMILES. A SMILES string followed by a vertical bar and a structured extension block: 2D coordinates, atom labels, Sgroup definitions, query bits — basically everything a Mol file carries that a plain SMILES cannot. The bridge between "single-line notation" and "file format". Both RDKit and OpenBabel parse the standard CXSMILES subset.

What it adds: a one-line representation that retains the Mol file's structural detail. When you want SMILES ergonomics but cannot lose 2D coordinates, CXSMILES is the right tool.

Side-by-side

What each format actually carries. The columns describe data the format is specified to encode, drawn from each format's published specification. Size comparisons are intentionally absent — they vary widely with coordinate precision, line endings, and embedded metadata, and the verified-bytes audit is in flight.

Format	Year	Kind	Carries	What it adds
PDB	1971	Structural	Atoms, residues, chains, secondary structure, het records	Lingua franca for biomolecular structure
Mol V2000	1981	Single-mol	Atoms, bonds, 2D/3D coords, bond stereo, query bits	Workhorse 2D-aware container
SDF	1982	Multi-mol	N × Mol blocks, tag/value metadata per record	Per-record key/value metadata
Mol2	1991	Single-mol	Tripos atom types, partial charges, substructures	First-class typed atoms for MD/docking
CIF	1991	Crystal	Unit cell, symmetry ops, atom fractional coords, dictionary	Self-describing dictionary-extensible format
XYZ	1995	Geometry	N atom-element-coord lines; nothing else	Minimum 3D-coords container
CML	1999	Single-mol	Atoms, bonds, namespaces; XML schema-validatable	Single hierarchical schema for chemistry XML
CDXML	2002	Single-mol	Atoms, bonds, fonts, kerning, drawing-element layout	Publication-ready 2D drawings
Mol V3000	2002	Single-mol	V2000 features + tagged lines; Sgroups; >999 atoms	Scale + parser-friendly extension points
mmCIF	2014	Structural	PDB content via CIF dictionary; ligand + validation dicts	Dictionary-extensible structural archive
CXSMILES	c. 2007	Single-mol	SMILES + 2D coords + Sgroups + atom labels in one line	Mol-file detail in a single SMILES line

Sources: each row is taken from the format's published specification or vendor documentation. "Carries" lists data the format is specified to encode, not optional extensions.

Bringing them together

File formats are the second axis of fragmentation. A chemist sketching a structure in ChemDraw, a curator enriching it in PubChem, and a docking pipeline pulling it from ZINC are looking at the same atoms, but the on-disk shape — bytes, fields, conventions — disagrees byte-for-byte.

chempirical's plan: the same graph-shaped database engine that backs the search box also handles every format here, on the way in. Drop a Mol file, an SDF, a CDXML, a CIF, and the engine normalises into the same internal graph — connectivity is the source of truth; format is just a serialisation. On the way out, the same graph re-emits to whatever format the caller asks for. Round-trip fidelity is verified by hashing the InChI of the parsed structure, not the bytes on disk.