Every cheminformatics workflow starts with the same question: how do you write a molecule down? It sounds trivial. It is not. The line notations chemists use today are the result of seventy-five years of arguments about what a molecule even is — a graph, a string, a hash, a fragment alphabet, a name in a 1500-page rulebook. Each one was invented to fix a specific shortcoming of what came before. None of them retired.
Notations covered
- WLN (1949): Notation
- CAS Registry Number (1965): Identifier
- IUPAC names (1979+): Identifier
- SMILES (1988): Notation
- Canonical SMILES (1989+): Notation
- InChI / InChIKey (2005): Identifier
- DeepSMILES (2018): Notation (ML)
- SELFIES (2020): Notation (ML)
- Balsa (2022): Notation
- Group SELFIES (2023): Notation (ML)
- SAFE (2024): Notation (ML)
Why this is harder than it looks
A molecule is a graph: atoms are nodes, bonds are edges, plus a few side constraints (charge, isotope, stereo, aromaticity). To put it in a column of a database, you have to flatten that graph into a string. Flattening loses information unless you are careful, and even when you are, two implementations can write the same molecule down differently — which means a literal WHERE smiles = ? query misses half the matching records.
The history of line notations is a history of trying to fix this. Some fixes are algorithmic (canonicalisation). Some are structural (encode the graph differently). Some are bureaucratic (write a 1500-page rulebook). Each new notation augments — rather than replaces — the ones it was reacting to.
Three jobs, not one
Three categories sit underneath the rest of the article.
- Notation — a string that describes this molecule. SMILES, SELFIES, Balsa.
- Identifier — a string that names a molecule canonically, so two implementations agree byte-for-byte. InChI, InChIKey, CAS Registry Number.
- Pattern / reaction — a string that describes a set of molecules, or a transformation between them. SMARTS, SMIRKS, RInChI.
The same character set (Latin letters, digits, brackets) is reused across all three, which is why a SMARTS string and a SMILES string look almost identical until you notice the wildcards.
A chronological tour
Caffeine is the running example. Each block shows what the notation looks like, what it adds over the row above, and links to the underlying paper.
1. WLN — 1949
Wiswesser Line Notation. William Wiswesser was an industrial chemist who needed to write molecule structures on a typewriter. WLN encoded fused ring systems with letter-and-digit qualifiers (T6 for a six-membered alicyclic ring, L for carbocyclic, M for nitrogen) and a deterministic citation order with multiple priority rules for selecting the start atom and traversal direction.1 A 1970s patent abstract may still contain a string like L66J BVQ DV1NN1. The most accurate modern parser2 recovers about 75% of WLN strings in real archives; the residual failure rate is dominated by malformed source strings.
What it adds: typewriter-era graph encoding; the first compact line representation that survived as a real-world index for forty years.
2. CAS Registry Number — 1965
CAS Registry Number, Chemical Abstracts Service. Three groups of digits separated by hyphens; the rightmost digit is a checksum.
58-08-2
What it adds: institutional memory. Reagent bottles, safety data sheets, regulatory filings, and IUPAC's own indexing all reference CAS numbers. They are not derivable from structure — every new molecule receives a number assigned by humans at CAS.
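The checksum mentioned above is simple enough to verify by hand. A minimal sketch, assuming the commonly documented CAS rule (each digit before the check digit is weighted by its position counted from the right, and the sum is taken modulo 10); the function names are illustrative, not part of any CAS tooling:

```python
def cas_check_digit(body_digits: str) -> int:
    """Weighted sum of the digits before the check digit, read right to left."""
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """True if `cas` has the shape N...N-NN-N and its check digit matches."""
    parts = cas.split("-")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        return False
    head, middle, check = parts
    if not (2 <= len(head) <= 7 and len(middle) == 2 and len(check) == 1):
        return False
    return cas_check_digit(head + middle) == int(check)


print(is_valid_cas("58-08-2"))  # caffeine: 8*1 + 0*2 + 8*3 + 5*4 = 52, so the check digit is 2
print(is_valid_cas("58-08-3"))  # same digits, wrong check digit
```

The check only catches transcription errors. It says nothing about whether the number has actually been assigned, which is the part that cannot be derived from structure.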
3. IUPAC names — 1979 onwards
The IUPAC Nomenclature of Organic Chemistry is the most thorough molecular notation that exists. The current rulebook (the 2013 'Blue Book') runs over 1500 pages.
1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
1,3,7-trimethylpurine-2,6-dione is also accepted.
Generating a name from a structure is computationally hard and rule-dependent. ACD/Labs, ChemAxon, and OpenEye all sell commercial IUPAC name generators because the open-source ones do not compete. The widely-used free option is STOUT (Rajan et al., 2021)3, originally an LSTM trained on PubChem name/structure pairs. The 2021 paper reports an average BLEU of about 90% and Tanimoto similarity above 0.9. STOUT V2.0 (2024)4 is the transformer version, trained on nearly one billion SMILES/IUPAC pairs.
What it adds: a name a human can read aloud and a referee can verify by hand.
4. SMILES — 1988
Simplified Molecular-Input Line-Entry System. David Weininger, 1986–1988, at the U.S. EPA in Duluth.5 The brief was small: a notation a chemist could type in a single line of ASCII, that round-tripped through a parser, and that any modest computer of the day could read.
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
SMILES describes a depth-first traversal of the molecular graph. Atoms in the organic subset (B, C, N, O, P, S, F, Cl, Br, I) are written bare; any other element, or any atom carrying an explicit charge, isotope, or hydrogen count, goes in square brackets. Bonds are inferred or annotated (=, #, :); ring closures are matched digits; branches are parenthesised; stereo is annotated with @, /, \.
What it adds: a grammar simple enough to teach in an afternoon, and a graph-shaped semantics that maps directly onto the data structures of every modern toolkit.
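The pairing rules in that grammar (parentheses for branches, matched digits for ring closures) can be checked without any chemistry at all. The sketch below is a hypothetical well-formedness scanner, not a SMILES parser: it ignores valence, aromaticity, and element symbols, and the function name is made up for illustration.

```python
DIGITS = "0123456789"


def smiles_syntax_errors(smiles: str) -> list[str]:
    """Report unbalanced branches and unpaired ring-closure labels."""
    errors = []
    depth = 0           # current branch nesting
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Skip bracket atoms whole, so digits such as the 13 in [13C] are not ring labels.
            end = smiles.find("]", i)
            if end == -1:
                errors.append("unclosed '['")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                errors.append("')' with no matching '('")
            else:
                depth -= 1
        elif ch in DIGITS or ch == "%":
            if ch == "%":  # two-digit label, e.g. %12
                label = smiles[i + 1:i + 3]
                i += 2
            else:
                label = ch
            open_rings ^= {label}  # first sighting opens the label, second closes it
        i += 1
    if depth:
        errors.append(f"{depth} unclosed '('")
    if open_rings:
        errors.append("unmatched ring closure: " + ", ".join(sorted(open_rings)))
    return errors


print(smiles_syntax_errors("CN1C=NC2=C1C(=O)N(C(=O)N2C)C"))  # caffeine: []
print(smiles_syntax_errors("c1ccccc1C("))                     # ["1 unclosed '('"]
```

Passing this check is necessary but nowhere near sufficient: a string can be perfectly balanced and still describe an impossible molecule.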
5. Canonical SMILES — 1989+
SMILES is not unique. Caffeine can be written equally validly as CN1C(=O)N(C)c2ncn(C)c2C1=O — same molecule, different traversal start, totally different string. To compare two SMILES strings for chemical identity you have to canonicalise both first. The Morgan algorithm6 ranks atoms by extended connectivity and gives a deterministic traversal order. Every toolkit ships its own implementation. They mostly agree; they do not always agree.7 8
| Toolkit | Canonical SMILES for caffeine |
|---|---|
| RDKit | Cn1c(=O)c2c(ncn2C)n(C)c1=O |
| OpenEye / PubChem | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| Open Babel | Cn1cnc2c1c(=O)n(C)c(=O)n2C |
| Indigo | CN1C(=O)C2=C(N=CN2C)N(C)C1=O |

Four toolkits, one molecule (InChIKey RYYVLZVUVIJVGH-UHFFFAOYSA-N); none of the strings are byte-equal.9a

OpenSMILES9 sits alongside this work: a community effort started by Craig James and Andrew Dalke to document what every implementation actually does, so a parser written today matches a parser written in 1995.
What it adds: a deterministic dedup key inside a single toolkit; a path toward cross-toolkit agreement once all four canonicalisers converge.
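To make "ranks atoms by extended connectivity" concrete, here is a minimal sketch of that one step of Morgan's algorithm on the heavy-atom graph of caffeine. It is an illustration under simplifying assumptions, not any toolkit's canonicaliser: atom types, bond orders, and tie-breaking are all ignored, and the bond list was written out by hand from the SMILES string in the comment.

```python
from collections import defaultdict

# Heavy-atom bonds of caffeine, numbered in the order the atoms appear in
# CN1C=NC2=C1C(=O)N(C(=O)N2C)C (atom 0 is the first C, atom 13 the last).
CAFFEINE_BONDS = [
    (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 1),
    (5, 6), (6, 7), (6, 8), (8, 9), (9, 10), (9, 11),
    (11, 4), (11, 12), (8, 13),
]


def build_adjacency(bonds):
    adjacency = defaultdict(list)
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return dict(adjacency)


def extended_connectivity(adjacency):
    """Start from atom degrees, then repeatedly replace each value by the sum
    of its neighbours' values until the number of distinct values stops growing."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        summed = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(summed.values()))
        if new_classes <= n_classes:
            return values  # keep the last round that still added resolution
        values, n_classes = summed, new_classes


ec = extended_connectivity(build_adjacency(CAFFEINE_BONDS))

# Same molecule, atoms numbered in the opposite order.
renumbered = [(13 - a, 13 - b) for a, b in CAFFEINE_BONDS]
ec_renumbered = extended_connectivity(build_adjacency(renumbered))

print(sorted(ec.values()) == sorted(ec_renumbered.values()))  # True
```

The values depend only on the graph, not on the input numbering, which is what makes them usable as a starting point for a deterministic traversal. Everything a real canonicaliser layers on top (how ties are broken, how aromaticity is perceived first) is where toolkits diverge.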
6. InChI & InChIKey — 2005 / 2007
IUPAC International Chemical Identifier. Standardised in 2005,12 after roughly a decade of NIST/IUPAC work. The brief was the inverse of SMILES: a canonical identifier that two implementations anywhere in the world will compute identically for the same molecule.
InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
Each /-section is a layer.
The layers are explicit: formula (C8H10N4O2), connectivity (c1-10-…), hydrogen positions (h4H,…), then optional charge, stereo, and isotope layers (plus a fixed-hydrogen layer for tautomers in non-standard InChI). Two molecules are identical if and only if their standard InChIs are byte-equal.
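Because the layers are delimited by slashes and each one after the formula starts with a one-letter prefix, pulling them apart is a few lines of string handling. A rough sketch, assuming a well-formed InChI with a formula layer; the function name is illustrative, and this is not a substitute for the reference InChI library:

```python
def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI into (layer name, content) pairs, in order of appearance."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version), ("formula", formula)]
    for layer in rest:
        if layer:  # every later layer is tagged by its first character
            layers.append((layer[0], layer[1:]))
    return layers


caffeine = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
for name, content in split_inchi_layers(caffeine):
    print(f"{name:8} {content}")
```

A list of pairs is used rather than a dictionary because a prefix can legitimately appear more than once in a longer InChI, and a dictionary would silently keep only the last occurrence.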
InChI strings get long fast. The InChIKey is a 27-character SHA-256-based hash of the full InChI: fixed-length, URL-safe. The first 14 characters are a hash of the connectivity layer; the next 10 are eight characters hashing the remaining layers, followed by a standard/non-standard flag (S or N) and a version character (A for InChI version 1); the final character, after the second hyphen, indicates protonation state (N for neutral).13 No real collision has been documented in the wild.14
RYYVLZVUVIJVGH-UHFFFAOYSA-N
What they add: a single agreed-on canonicalisation algorithm — implemented once, in C, by the InChI Trust — and a short hash that fits in a database column.
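The fixed layout of the key means it can be taken apart with a single regular expression. A small sketch, assuming the usual 14 + 8 + flag + version + protonation layout; the field names are this example's own, not official InChI Trust terminology:

```python
import re
from typing import NamedTuple


class InChIKeyParts(NamedTuple):
    skeleton_hash: str      # 14 letters: hash of the connectivity
    other_layers_hash: str  # 8 letters: hash of the remaining layers
    standard_flag: str      # "S" for standard InChI, "N" for non-standard
    version: str            # "A" for InChI version 1
    protonation: str        # "N" for neutral


_INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> InChIKeyParts:
    match = _INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(*match.groups())


caffeine = parse_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
aspirin = parse_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")

# Neither molecule has stereo or isotope layers, so the second block matches.
print(caffeine.other_layers_hash == aspirin.other_layers_hash)  # True
```

Matching on the first block alone is a common way to group records that share a skeleton but differ in stereochemistry or isotopes.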
7. DeepSMILES — 2018
Noel O'Boyle and Andrew Dalke, 2018.17 The minimum viable fix for SMILES' machine-learning problems: drop the paired parentheses (only closing parentheses remain, and their count encodes the branch length) and drop the paired ring-closure digits (a single digit at the closure gives the ring size). Roughly the same length as SMILES, and only one class of mutation breaks validity instead of three.
Cnc=O)ccncn5C))))nC)c6=O
What it adds: a near-zero-cost retrofit for pipelines that already speak SMILES.
8. SELFIES — 2020
Self-Referencing Embedded Strings. Krenn, Häse, Nigam, Friederich, Aspuru-Guzik, 2020.19 Born of a real ML problem: SMILES strings break under mutation. If a generative model trained on SMILES outputs c1ccccc1C(, that is a syntax error: the parenthesis is never closed, so no molecule can be decoded. Roughly 99% of random SMILES mutations produce invalid output. SELFIES fixes this by design: every syntactically valid SELFIES string maps to a valid molecule.
[C][N][C][=Branch1][C][=O][C][=C][Branch1][#Branch1][N][=C][N][Ring1][Branch1][C][N][Branch1][C][C][C][Ring1][N][=O]
What it adds: a closed grammar where the parser, not a downstream filter, enforces validity.
9. Balsa — 2022
Balsa: A Compact Line Notation Based on SMILES. Richard L. Apodaca (Metamolecular, LLC), 2022.20 Balsa keeps SMILES' character set and looks identical for most molecules — ethanol is CCO in both, aspirin is CC(=O)Oc1ccccc1C(=O)O. The break is in the grammar, not the strings: every place where SMILES is informal, Balsa is rigid.
Three concrete differences, lifted from the Rust reference grammar at github.com/metamolecular/balsa:
- No aromatic bond character. SMILES has six bond tokens (- = # : / \), including : for aromatic. Balsa's BondKind enum drops it: Elided (nothing written), Single, Double, Triple, Up, Down, which leaves only five bond characters. Aromaticity is carried by the lowercase atom (Selection), never by the bond. SMILES allows benzene as either c1ccccc1 or C1=CC=CC=C1; Balsa specifies only the former.
- Organic-subset shortcuts are an explicit grammar rule. SMILES "if it's in the organic subset, you can drop the brackets" lives in prose. Balsa's AtomKind::Shortcut is a closed enum: B, C, N, O, F, P, S, Cl, Br, I. Anything else must use Bracket notation. A parser cannot accept [Mg] and Mg as equivalent input.
- Stereo is per-atom parity, not bond-direction soup. SMILES tetrahedral chirality uses @/@@ on the atom plus implicit ordering of its neighbours; double-bond stereo uses / and \ on the surrounding bonds and depends on the parser's traversal order. Balsa's AtomParity is an explicit Clockwise/Counterclockwise token on the stereocentre's bracket atom; the canonical-order conventions are written into the grammar instead of the implementation.
What it adds: a parseable spec an implementer can target without reverse-engineering reference parsers. The strings look like SMILES; the difference is that two implementations targeting Balsa cannot disagree about what they decoded.
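The closed shortcut set is the easiest of the three differences to picture in code. The snippet below is an illustrative Python analogue, not the Rust reference grammar: the names are hypothetical, and lowercase aromatic (Selection) atoms are deliberately left out.

```python
# The ten symbols listed above for AtomKind::Shortcut.
SHORTCUT_SYMBOLS = frozenset({"B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"})


def classify_atom_token(token: str) -> str:
    """Return "shortcut" or "bracket"; reject anything else outright."""
    if len(token) > 2 and token.startswith("[") and token.endswith("]"):
        return "bracket"
    if token in SHORTCUT_SYMBOLS:
        return "shortcut"
    raise ValueError(f"{token!r} is not a shortcut atom and is not bracketed")


print(classify_atom_token("Cl"))    # shortcut
print(classify_atom_token("[Mg]"))  # bracket

try:
    classify_atom_token("Mg")
except ValueError as err:
    print(err)
```

The point of a closed set is the final case: there is no lenient path for a parser to take, so two implementations cannot quietly differ on whether a bare Mg is acceptable.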
10. Group SELFIES — 2023
Cheng et al., 2023.21 Standard SELFIES tokenises by atom. Group SELFIES tokenises by chemically meaningful fragment — phenyl, amide, methyl ester, and so on. The vocabulary is larger, the strings shorter, and the model learns in chemistry-shaped chunks instead of one atom at a time.
What it adds: a shorter sequence and a more drug-like distribution out of the box, at the cost of fixing a fragment alphabet up front.
11. SAFE — 2024
Sequential Attachment-based Fragment Embedding. Noutahi et al., 2024.22 SAFE rewrites a SMILES into a sequence of disconnected fragments joined by attachment-point indices. The fragments are themselves valid SMILES; the joins are explicit.
What it adds: fragment-conditional generation (keep this scaffold, regrow the side chain) without leaving the SMILES grammar. The attachment indices are written as ring-closure digits shared between dot-separated fragments, so a SAFE string still parses as a valid SMILES.
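The explicit joins can be read straight off the string. The following is a hedged sketch that assumes attachment points appear as ring-closure labels shared between dot-separated fragments, as in the aspirin string c13ccccc14.CC2=O.C4(=O)O.O23 used on this page; the helper names are made up for the example and are not the safe-mol API.

```python
def ring_labels(fragment: str) -> list[str]:
    """Ring-closure labels in one fragment, skipping digits inside bracket atoms."""
    labels = []
    i = 0
    while i < len(fragment):
        ch = fragment[i]
        if ch == "[":
            end = fragment.find("]", i)
            if end == -1:
                break  # malformed bracket atom; stop scanning
            i = end + 1
        elif ch == "%":  # two-digit label, e.g. %12
            labels.append(fragment[i + 1:i + 3])
            i += 3
        elif ch in "0123456789":
            labels.append(ch)
            i += 1
        else:
            i += 1
    return labels


def attachment_points(safe: str) -> list[tuple[str, int, int]]:
    """(label, fragment index, fragment index) for every label that
    opens in one fragment and closes in another."""
    open_labels = {}
    joins = []
    for index, fragment in enumerate(safe.split(".")):
        for label in ring_labels(fragment):
            if label in open_labels:
                start = open_labels.pop(label)
                if start != index:  # same fragment means an ordinary ring
                    joins.append((label, start, index))
            else:
                open_labels[label] = index
    return joins


print(attachment_points("c13ccccc14.CC2=O.C4(=O)O.O23"))
# [('4', 0, 2), ('2', 1, 3), ('3', 0, 3)]
```

In the aspirin example, label 1 closes inside the first fragment and is therefore the benzene ring itself, while labels 2, 3, and 4 are the three cuts that tie the acetyl group, the ester oxygen, and the carboxylic acid back onto it.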
Side-by-side
Two molecules, every notation we have a tool for. Ethanol shows the trivial case — three atoms, no aromaticity, no stereo, no fragments to slice — so most rows agree. Aspirin (acetylsalicylic acid, 21 atoms) is the smallest molecule where the algorithms start to diverge: the ring forces a choice between aromatic perception and Kekulé bonds, and SAFE has enough cuttable bonds to actually fragment. Each row is verified against its reference implementation: the SMILES family via RDKit 2026.3.1, SELFIES via selfies==2.1.1, DeepSMILES via deepsmiles==1.0.1, Group SELFIES via group-selfies with the essential alphabet, SAFE via safe-mol. WLN follows Smith's encoding rules: Q is hydroxyl, V is a carbonyl, R is a phenyl, B is the ortho-position locant on a ring. Balsa retains the SMILES character syntax for its base language; the canonical SMILES is also the canonical Balsa string.
| Notation | Year | Kind | Ethanol | Aspirin | What it adds |
|---|---|---|---|---|---|
| WLN | 1949 | Notation | Q2 | QVR BOV1 | Typewriter-friendly graph encoding |
| CAS-RN | 1965 | Identifier | 64-17-5 | 50-78-2 | Institutional registry; bottle labels |
| IUPAC name | 1979+ | Identifier | ethanol | 2-acetyloxybenzoic acid | Human-readable; verifiable by hand |
| SMILES | 1988 | Notation | CCO | CC(=O)Oc1ccccc1C(=O)O | Human-typeable graph traversal |
| Canonical SMILES (RDKit) | 1989+ | Notation | CCO | CC(=O)Oc1ccccc1C(=O)O | Lowercase aromatic perception |
| Canonical SMILES (PubChem/OEChem) | 1989+ | Notation | CCO | CC(=O)OC1=CC=CC=C1C(=O)O | Kekulé form — same molecule, different bytes |
| InChI | 2005 | Identifier | InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 | InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12) | Cross-toolkit canonical identity |
| InChIKey | 2007 | Identifier | LFQSCWFLJHTTHZ-UHFFFAOYSA-N | BSYNRYMUTXBXSQ-UHFFFAOYSA-N | Fixed-length hash of InChI |
| DeepSMILES | 2018 | Notation | CCO | CC=O)Occcccc6C=O)O | Mutation-friendlier SMILES retrofit |
| SELFIES | 2020 | Notation | [C][C][O] | [C][C][=Branch1][C][=O][O][C][=C][C][=C][C][=C][Ring1][=Branch1][C][=Branch1][C][=O][O] | Closed grammar; every string decodes |
| Balsa | 2022 | Notation | CCO | CC(=O)Oc1ccccc1C(=O)O | Formally specified SMILES dialect |
| Group SELFIES | 2023 | Notation | [C][C][O] | [C][C][=Branch][=O][pop][O][C][=C][C][=C][C][=C][Branch][C][=Branch][=O][pop][O][pop][Ring1][=Branch] | Shorter sequences; drug-like priors |
| SAFE | 2024 | Notation | CCO | c13ccccc14.CC2=O.C4(=O)O.O23 | Scaffold-conditional generation |
Bringing them together
Every row in the table above is the same molecule, written by a different tribe: archivists, chemists, ML researchers, polymer engineers, drug-design labs. Each one was right, locally, for the problem it solved. The cost is paid later: a chemist looking up RYYVLZVUVIJVGH-UHFFFAOYSA-N in PubChem, a chemist who only knows it as caffeine, and a chemist whose pipeline emits [C][N][C][=Branch1]… are all asking the same question and getting different answers from the same database.
chempirical exists to take that cost back. The work in flight: a bespoke graph-shaped database engine for molecules — indexed natively on connectivity, not on pre-canonicalised strings — sitting underneath a single search box that accepts every notation on this page. SMILES, InChI, InChIKey, CAS, IUPAC, SMARTS: type any one and land on the same record. Behind the search box, the same chemistry library runs in two places — a high-speed server build for indexing, deduplication, and bulk processing, and a WebAssembly build that ships to the browser so parsing, canonicalisation, and substructure search happen client-side, with no round-trip.
The end state we are building toward: every notation on this page is an interchangeable surface, the underlying graph is the source of truth, and the tribes never have to argue about it again.
The good news about chemistry notations is that there are so many to choose from.