0 25 50 100 140 Caffeine string length 1950 1960 1970 1980 1990 2000 2010 2020 WLN 25 CAS-RN 7 IUPAC 31 SMILES 28 Canonical SMILES 26 InChI 67 InChIKey 27 DeepSMILES 24 SELFIES 116 Balsa 26 Group SELFIES 138 SAFE 26
Year × caffeine string length. Click a node to jump to its section.

Every cheminformatics workflow starts with the same question: how do you write a molecule down? It sounds trivial. It is not. The line notations chemists use today are the result of seventy-five years of arguments about what a molecule even is — a graph, a string, a hash, a fragment alphabet, a name in a 1500-page rulebook. Each one was invented to fix a specific shortcoming of what came before. None of them retired.

Notations covered

Why this is harder than it looks

A molecule is a graph: atoms are nodes, bonds are edges, plus a few side constraints (charge, isotope, stereo, aromaticity). To put it in a column of a database, you have to flatten that graph into a string. Flattening loses information unless you are careful, and even when you are, two implementations can write the same molecule down differently — which means a literal WHERE smiles = ? query misses half the matching records.

The history of line notations is a history of trying to fix this. Some fixes are algorithmic (canonicalisation). Some are structural (encode the graph differently). Some are bureaucratic (write a 1500-page rulebook). Each new notation augments — rather than replaces — the ones it was reacting to.

Three jobs, not one

Three categories sit underneath the rest of the article.

  • Notation — a string that describes this molecule. SMILES, SELFIES, Balsa.
  • Identifier — a string that names a molecule canonically, so two implementations agree byte-for-byte. InChI, InChIKey, CAS Registry Number.
  • Pattern / reaction — a string that describes a set of molecules, or a transformation between them. SMARTS, SMIRKS, RInChI.

The same character set (Latin letters, digits, brackets) is reused across all three, which is why a SMARTS string and a SMILES string look almost identical until you notice the wildcards.

A chronological tour

Caffeine is the running example. Each block shows what the notation looks like, what it adds over the row above, and links to the underlying paper.

1. WLN — 1949

Wiswesser Line Notation. William Wiswesser was an industrial chemist who needed to write molecule structures on a typewriter. WLN encoded fused ring systems with letter-and-digit qualifiers (T6 for a six-membered alicyclic ring, L for carbocyclic, M for nitrogen) and a deterministic citation order with multiple priority rules for selecting the start atom and traversal direction.1 A 1970s patent abstract may still contain a string like L66J BVQ DV1NN1. The most accurate modern parser2 recovers about 75% of WLN strings in real archives; the residual failure rate is dominated by malformed source strings.

What it adds: typewriter-era graph encoding; the first compact line representation that survived as a real-world index for forty years.

2. CAS Registry Number — 1965

CAS Registry Number, Chemical Abstracts Service. Three groups of digits separated by hyphens; the rightmost digit is a checksum.

58-08-2
Caffeine. The number is opaque — no structure can be recovered from it.

What it adds: institutional memory. Reagent bottles, safety data sheets, regulatory filings, and IUPAC's own indexing all reference CAS numbers. They are not derivable from structure — every new molecule receives a number assigned by humans at CAS.

3. IUPAC names — 1979 onwards

The IUPAC Nomenclature of Organic Chemistry is the most thorough molecular notation that exists. The current rulebook (the 2013 'Blue Book') runs over 1500 pages.

1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
Caffeine, IUPAC systematic name. The shorter 1,3,7-trimethylpurine-2,6-dione is also accepted.

Generating a name from a structure is computationally hard and rule-dependent. ACD/Labs, ChemAxon, and OpenEye all sell commercial IUPAC name generators because the open-source ones do not compete. The widely-used free option is STOUT (Rajan et al., 2021)3 — originally an LSTM trained on PubChem name/structure pairs. The 2021 paper reports an average BLEU of about 90% and Tanimoto similarity above 0.9. STOUT V2.0 (2024)4 is the transformer version, trained on near-1B SMILES/IUPAC pairs.

What it adds: a name a human can read aloud and a referee can verify by hand.

4. SMILES — 1988

Simplified Molecular-Input Line-Entry System. David Weininger, 1986–1988, at the U.S. EPA in Duluth.5 The brief was small: a notation a chemist could type in a single line of ASCII, that round-tripped through a parser, and that any modest computer of the day could read.

CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Caffeine, in SMILES. 28 characters describe a 24-atom molecule.

SMILES describes a depth-first traversal of the molecular graph. Atoms are single letters (or two-letter elements in brackets); bonds are inferred or annotated (=, #, :); ring closures are matched digits; branches are parenthesised; stereo is annotated with @, /, \.

What it adds: a grammar simple enough to teach in an afternoon, and a graph-shaped semantics that maps directly onto the data structures of every modern toolkit.

5. Canonical SMILES — 1989+

SMILES is not unique. Caffeine can be written equally validly as CN1C(=O)N(C)c2ncn(C)c2C1=O — same molecule, different traversal start, totally different string. To compare two SMILES strings for chemical identity you have to canonicalise both first. The Morgan algorithm6 ranks atoms by extended connectivity and gives a deterministic traversal order. Every toolkit ships its own implementation. They mostly agree; they do not always agree.78

RDKitCn1c(=O)c2c(ncn2C)n(C)c1=O
OpenEye / PubChemCN1C=NC2=C1C(=O)N(C(=O)N2C)C
Open BabelCn1cnc2c1c(=O)n(C)c(=O)n2C
IndigoCN1C(=O)C2=C(N=CN2C)N(C)C1=O
Four canonical SMILES for caffeine, four toolkits. All four parse back to the same molecule (InChIKey RYYVLZVUVIJVGH-UHFFFAOYSA-N); none are byte-equal. Aligned with star alignment; gaps shown as ·.9a

OpenSMILES9 sits alongside this work: a community effort started by Craig James and Andrew Dalke to document what every implementation actually does, so a parser written today matches a parser written in 1995.

What it adds: a deterministic dedup key inside a single toolkit; a path toward cross-toolkit agreement once all four canonicalisers converge.

6. InChI & InChIKey — 2005 / 2007

IUPAC International Chemical Identifier. Standardised in 2005,12 after roughly a decade of NIST/IUPAC work. The brief was the inverse of SMILES: a canonical identifier that two implementations anywhere in the world will compute identically for the same molecule.

InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
Caffeine, InChI standard form. Each /-section is a layer.

The layers are explicit: formula (C8H10N4O2), connectivity (c1-10-…), hydrogen positions (h4H,…), then optional stereo, isotope, charge, and tautomer layers. Two molecules are identical if and only if their standard InChIs are byte-equal.

InChI strings get long fast. The InChIKey is a 27-character SHA-256-based hash of the full InChI: fixed-length, URL-safe. The first 14 characters are a hash of the connectivity layer; the next 10 contain a hash of the remaining layers plus a version-and-protonation flag; the final character indicates protonation state.13 No real collision has been documented in the wild.14

RYYVLZVUVIJVGH-UHFFFAOYSA-N
Caffeine, InChIKey.

What they add: a single agreed-on canonicalisation algorithm — implemented once, in C, by the InChI Trust — and a short hash that fits in a database column.

7. DeepSMILES — 2018

Noel O'Boyle and Andrew Dalke, 2018.17 The minimum viable fix for SMILES' machine-learning problems: drop matched parentheses (use a single ) per branch close), drop matched ring digits. Same length as SMILES, only one class of mutation breaks validity instead of three.

Cnc=O)ccncn5C))))nC)c6=O
Caffeine, in DeepSMILES (rings + branches mode).

What it adds: a near-zero-cost retrofit for pipelines that already speak SMILES.

8. SELFIES — 2020

Self-Referencing Embedded Strings. Krenn, Häse, Nigam, Friederich, Aspuru-Guzik, 2020.19 Born of a real ML problem: SMILES strings break under mutation. If a generative model trained on SMILES outputs c1ccccc1C(, that is a syntax error — unbalanced parens, no valid molecule. Roughly 99% of random SMILES mutations produce invalid output. SELFIES fix this by design: every syntactically valid SELFIES string maps to a valid molecule.

[C][N][C][=Branch1][C][=O][C][=C][Branch1][#Branch1][N][=C][N][Ring1][Branch1][C][N][Branch1][C][C][C][Ring1][N][=O]
Caffeine, in SELFIES (v2.x grammar).

What it adds: a closed grammar where the parser, not a downstream filter, enforces validity.

9. Balsa — 2022

Balsa: A Compact Line Notation Based on SMILES. Richard L. Apodaca (Metamolecular, LLC), 2022.20 Balsa keeps SMILES' character set and looks identical for most molecules — ethanol is CCO in both, aspirin is CC(=O)Oc1ccccc1C(=O)O. The break is in the grammar, not the strings: every place where SMILES is informal, Balsa is rigid.

Three concrete differences, lifted from the Rust reference grammar at github.com/metamolecular/balsa:

  • No aromatic bond character. SMILES has six bond tokens — - = # : / \ — including : for aromatic. Balsa's BondKind enum has only five — Elided, Single, Double, Triple, Up, Down. Aromaticity is carried by the lowercase atom (Selection), never by the bond. SMILES allows benzene as either c1ccccc1 or C1=CC=CC=C1; Balsa specifies only the former.
  • Organic-subset shortcuts are an explicit grammar rule. SMILES "if it's in the organic subset, you can drop the brackets" lives in prose. Balsa's AtomKind::Shortcut is a closed enum: B, C, N, O, F, P, S, Cl, Br, I. Anything else must use Bracket notation. A parser cannot accept [Mg] and Mg as equivalent input.
  • Stereo is per-atom parity, not bond-direction soup. SMILES tetrahedral chirality uses @ / @@ on the atom plus implicit ordering of its neighbours; double-bond stereo uses / and \ on the surrounding bonds and depends on the parser's traversal order. Balsa's AtomParity is an explicit Clockwise / Counterclockwise token on the stereocentre's bracket atom; the canonical-order conventions are written into the grammar instead of the implementation.

What it adds: a parseable spec an implementer can target without reverse-engineering reference parsers. The strings look like SMILES; the difference is that two implementations targeting Balsa cannot disagree about what they decoded.

10. Group SELFIES — 2023

Cheng et al., 2023.21 Standard SELFIES tokenises by atom. Group SELFIES tokenises by chemically meaningful fragment — phenyl, amide, methyl ester, and so on. The vocabulary is larger, the strings shorter, and the model learns in chemistry-shaped chunks instead of one atom at a time.

What it adds: a shorter sequence and a more drug-like distribution out of the box, at the cost of fixing a fragment alphabet up front.

11. SAFE — 2024

Sequential Attachment-based Fragment Embedding. Noutahi et al., 2024.22 SAFE rewrites a SMILES into a sequence of disconnected fragments joined by attachment-point indices. The fragments are themselves valid SMILES; the joins are explicit.

What it adds: fragment-conditional generation (keep this scaffold, regrow the side chain) without leaving the SMILES grammar — a SAFE string is also a valid SMILES once the attachment indices are resolved.

Side-by-side

Two molecules, every notation we have a tool for. Ethanol shows the trivial case — three atoms, no aromaticity, no stereo, no fragments to slice — so most rows agree. Aspirin (acetylsalicylic acid, 21 atoms) is the smallest molecule where the algorithms start to diverge: the ring forces a choice between aromatic perception and Kekulé bonds, and SAFE has enough cuttable bonds to actually fragment. Each row is verified against its reference implementation: the SMILES family via RDKit 2026.3.1, SELFIES via selfies==2.1.1, DeepSMILES via deepsmiles==1.0.1, Group SELFIES via group-selfies with the essential alphabet, SAFE via safe-mol. WLN follows Smith's encoding rules: Q is hydroxyl, V is a carbonyl, R is a phenyl, B is the ortho-position locant on a ring. Balsa retains the SMILES character syntax for its base language; the canonical SMILES is also the canonical Balsa string.

NotationYearKindEthanolAspirinWhat it adds
WLN1949NotationQ2QVR BOV1Typewriter-friendly graph encoding
CAS-RN1965Identifier64-17-550-78-2Institutional registry; bottle labels
IUPAC name1979+Identifierethanol2-acetyloxybenzoic acidHuman-readable; verifiable by hand
SMILES1988NotationCCOCC(=O)Oc1ccccc1C(=O)OHuman-typeable graph traversal
Canonical SMILES (RDKit)1989+NotationCCOCC(=O)Oc1ccccc1C(=O)OLowercase aromatic perception
Canonical SMILES (PubChem/OEChem)1989+NotationCCOCC(=O)OC1=CC=CC=C1C(=O)OKekulé form — same molecule, different bytes
InChI2005IdentifierInChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)Cross-toolkit canonical identity
InChIKey2007IdentifierLFQSCWFLJHTTHZ-UHFFFAOYSA-NBSYNRYMUTXBXSQ-UHFFFAOYSA-NFixed-length hash of InChI
DeepSMILES2018NotationCCOCC=O)Occcccc6C=O)OMutation-friendlier SMILES retrofit
SELFIES2020Notation[C][C][O][C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O]Closed grammar; every string decodes
Balsa2022NotationCCOCC(=O)Oc1ccccc1C(=O)OFormally specified SMILES dialect
Group SELFIES2023Notation[C][C][O][C][C][=Branch][=O][pop][O][C][=C][C][=C][C][=C][Branch][C][=Branch][=O][pop][O][pop][Ring1][=Branch]Shorter sequences; drug-like priors
SAFE2024NotationCCOc13ccccc14.CC2=O.C4(=O)O.O23Scaffold-conditional generation

Bringing them together

Every line on the table above is the same molecule, written by a different tribe — archivists, chemists, ML researchers, polymer engineers, drug-design labs. Each one was right, locally, for the problem it solved. The cost is paid later: a chemist looking up RYYVLZVUVIJVGH-UHFFFAOYSA-N in PubChem, a chemist who only knows it as caffeine, and a chemist whose pipeline emits [C][N][C][=Branch1]… are all asking the same question and getting different answers from the same database.

chempirical exists to take that cost back. The work in flight: a bespoke graph-shaped database engine for molecules — indexed natively on connectivity, not on pre-canonicalised strings — sitting underneath a single search box that accepts every notation on this page. SMILES, InChI, InChIKey, CAS, IUPAC, SMARTS: type any one and land on the same record. Behind the search box, the same chemistry library runs in two places — a high-speed server build for indexing, deduplication, and bulk processing, and a WebAssembly build that ships to the browser so parsing, canonicalisation, and substructure search happen client-side, with no round-trip.

The end state we are building toward: every notation on this page is an interchangeable surface, the underlying graph is the source of truth, and the tribes never have to argue about it again.

The good news about chemistry notations is that there are so many to choose from.

— Adapted from Andrew Tanenbaum, on standards