
From WLN to Balsa: a guided tour of how we write down molecules

A field guide to the line notations chemists actually use — SMILES, InChI, IUPAC names, SELFIES, Group SELFIES, Balsa, DeepSMILES, and the half-forgotten 1949 grandparent of all of them.

Every cheminformatics workflow starts with the same question: how do you type a molecule? It sounds trivial. It is not. The line notations chemists use today are the result of seventy-five years of arguments about what a molecule even is — a graph, a string, a hash, a name in a 1500-page rulebook. Each notation makes a different trade between human readability, uniqueness, machine-friendliness, and round-trip fidelity. Pick wrong and your stereo gets mangled, your dataset can't dedupe, or your ML model learns to spell.

Why this is harder than it looks

A molecule is a graph: atoms are nodes, bonds are edges, plus a few side constraints (charge, isotope, stereo, aromaticity). To put it in a column of a database, you have to flatten that graph into a string. Flattening loses information unless you're careful, and even when you are, two different people writing down the same molecule can produce two different strings — which means your `WHERE smiles = ?` query misses half your hits.

The history of line notations is a history of trying to fix this. Some fixes are algorithmic (canonicalization). Some are structural (encode the graph differently). Some are bureaucratic (write a 1500-page rulebook). All of them survive in the wild, in different files, in different decades' datasets, on different desks at the same institution.

SMILES — the lingua franca

Simplified Molecular-Input Line-Entry System. David Weininger, 1986–1988, at the U.S. EPA in Duluth. The brief was disarmingly small: a notation a chemist could type in a single line of ASCII, that round-tripped through a parser, and that any modest computer of the day could read. Weininger delivered. The result is the most-used molecular notation in the world — and a perfect example of a worse-is-better design that won.

CN1C=NC2=C1C(=O)N(C(=O)N2C)C
Caffeine, in SMILES. 28 characters describe a 24-atom molecule.

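SMILES describes a depth-first traversal of the molecular graph. Atoms in the common organic subset (B, C, N, O, P, S, F, Cl, Br, I) are written as bare element symbols, lowercase when aromatic; any other element, and any atom carrying a charge, isotope, or explicit hydrogen count, goes in square brackets. Single and aromatic bonds are implied and the rest are annotated (`=`, `#`, `:`); ring closures are matched digits; branches are parenthesized; stereo is annotated with `@`, `/`, `\`. It is dense. It is human-readable after about a week of practice. And it has one critical flaw.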
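
The syntax just described is regular enough to tokenize with one regular expression. The sketch below is an illustration only, not a SMILES parser: the function names (`tokenize`, `is_atom`, `looks_well_formed`) are invented for this example, it covers the common token classes rather than the full grammar, and it checks only that parentheses and ring-closure labels pair up, not that valences make sense.

```python
import re

# One alternative per token class; adjacent raw strings are concatenated.
_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [NH4+] or [C@@H]
    r"|Br|Cl"          # two-letter atoms of the organic subset
    r"|[BCNOPSFI]"     # one-letter aliphatic atoms
    r"|[bcnops]"       # aromatic atoms
    r"|%\d{2}|\d"      # ring-closure labels (%10..%99, or a single digit)
    r"|[=#:/\\.\-]"    # bond symbols and the '.' disconnect
    r"|[()]"           # branch open / close
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; reject characters we do not know."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def is_atom(token: str) -> bool:
    """True for bracket atoms and bare element symbols."""
    return token[0] == "[" or token[0].isalpha()


def looks_well_formed(smiles: str) -> bool:
    """Check that branches are balanced and every ring label is used twice."""
    depth = 0
    open_rings: set[str] = set()
    for tok in tokenize(smiles):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
        elif tok[0] == "%" or tok.isdigit():
            # First occurrence opens the ring, second occurrence closes it.
            open_rings ^= {tok.lstrip("%")}
    return depth == 0 and not open_rings
```

On the caffeine string above, `tokenize` yields 28 tokens, 14 of which are heavy atoms (the ten hydrogens are implicit), and `looks_well_formed` returns `True`. A truncated string such as `c1ccccc1C(` fails the check because its branch never closes.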

The canonicalization problem

SMILES is not unique. Caffeine can be written equally validly as `O=C1N(C)C(=O)N(C)C2=C1N(C)C=N2`: same molecule, different traversal start, totally different string. To compare two SMILES strings for chemical identity you have to canonicalize both first. The Morgan algorithm (Harry Morgan, 1965, older than SMILES itself) ranks atoms by extended connectivity, which is the starting point for a deterministic traversal order. Every SMILES library has its own canonicalization implementation, so canonical strings are only comparable when they come from the same toolkit, and ideally the same version. When two implementations disagree, you find out at 2 a.m. when your dedup pipeline misses.
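
The extended-connectivity idea fits in a few lines. This is a minimal sketch of the ranking step only, assuming the molecule is given as a plain adjacency dictionary (a representation chosen for this example); the function name is invented here. Real canonicalizers also break ties on element, charge, and bond order, and then generate the string, none of which is attempted.

```python
def extended_connectivity(adjacency: dict[int, list[int]]) -> dict[int, int]:
    """Morgan-style extended connectivity values for each atom.

    Start from each atom's degree, then repeatedly replace every value with
    the sum of its neighbours' values. Stop as soon as an iteration no longer
    increases the number of distinct values, and keep the previous round.
    """
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_classes = len(set(new_values.values()))
        if new_classes <= n_classes:
            return values
        values, n_classes = new_values, new_classes


# The carbon chain of n-pentane, numbered two different ways.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
relabeled = {3: [0], 0: [3, 4], 4: [0, 1], 1: [4, 2], 2: [1]}

print(extended_connectivity(pentane))
print(extended_connectivity(relabeled))
```

Both numberings assign the same value to the central carbon and the same pair of values to the symmetric positions around it, which is exactly the property a canonical ordering needs: the ranking depends on the graph, not on how the atoms happened to be numbered.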

InChI — designed by committee, on purpose

IUPAC International Chemical Identifier. Version 1 was released in 2005, after about five years of joint IUPAC/NIST development. The brief was the inverse of SMILES: forget human readability, give us a canonical identifier that two implementations anywhere in the world will compute identically for the same molecule. The result is a layered string that you should never need to type.

InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
Caffeine, InChI standard form. Each /-section is a layer.

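The layers are explicit: formula (`C8H10N4O2`), connectivity (`c1-10-...`), hydrogen positions (`h4H,...`), then optional charge, stereo, and isotope layers. A fixed-hydrogen layer that tells tautomers apart exists only in non-standard InChI. Each layer is canonical by construction, so the same structure always yields a byte-identical standard InChI. The converse holds only up to InChI's own normalization: some tautomers that differ by mobile hydrogens are deliberately merged into one string.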
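
Because the layers are separated by `/` and every layer after the formula starts with a one-letter prefix, pulling them apart is a string split. The helper below is a sketch for inspection and debugging, with a name invented for this example; it does not validate the InChI. It also flattens everything into one dictionary, so an InChI that repeats a prefix in a later layer would overwrite the earlier entry.

```python
def split_inchi_layers(inchi: str) -> dict[str, str]:
    """Split an InChI string into its version, formula, and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version}
    # The formula layer is the only one without a lowercase prefix letter.
    if rest and rest[0] and not rest[0][0].islower():
        layers["formula"] = rest.pop(0)
    for layer in rest:
        if layer:
            layers[layer[0]] = layer[1:]
    return layers


caffeine = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
for name, value in split_inchi_layers(caffeine).items():
    print(f"{name:8} {value}")
```

For caffeine this prints the version (`1S`, where the `S` marks a standard InChI), the formula, the `c` connectivity layer, and the `h` hydrogen layer. There are no charge, stereo, or isotope layers because caffeine needs none.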

The InChIKey

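InChI strings get long fast: a 40-atom molecule can run 200 characters. The InChIKey is a 27-character SHA-256-based hash of the full InChI: fixed-length, URL-safe, dataset-friendly. The first 14 characters hash the core constitution (formula, connectivity, hydrogen positions). The second block is 10 characters: 8 hash the remaining layers (stereo, isotope, and so on), one flags standard versus non-standard InChI, and one gives the InChI version. The final character encodes protonation. Collisions are theoretically possible (it's a hash), and a key cannot be turned back into a structure without a lookup table. In practice, collisions between real compounds are vanishingly rare, even at the scale of PubChem.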
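
The fixed layout makes an InChIKey easy to validate and take apart. In the published key format the 10-character second block is itself 8 hash characters followed by a standard/non-standard flag (`S` or `N`) and a version character, and the last character of the key indicates protonation. The sketch below assumes that layout; the field names are labels invented for this example, not official terminology.

```python
import re
from typing import NamedTuple

_KEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


class InChIKeyParts(NamedTuple):
    skeleton: str      # hash of formula, connectivity, hydrogens
    details: str       # hash of the remaining layers (stereo, isotope, ...)
    is_standard: bool  # True when the flag character is 'S'
    version: str       # InChI version character
    protonation: str   # 'N' means no protons added or removed


def parse_inchikey(key: str) -> InChIKeyParts:
    """Validate the shape of an InChIKey and split it into its fields."""
    match = _KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, details, flag, version, protonation = match.groups()
    return InChIKeyParts(skeleton, details, flag == "S", version, protonation)


def same_skeleton(key_a: str, key_b: str) -> bool:
    """True when two keys share the first block, i.e. the same connectivity hash."""
    return parse_inchikey(key_a).skeleton == parse_inchikey(key_b).skeleton


print(parse_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))  # caffeine
```

Comparing only the first block, as `same_skeleton` does, is a common way to group stereoisomers and isotopologues of one compound.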

IUPAC names — the bureaucratic notation

The IUPAC Nomenclature of Organic Chemistry is the most thorough molecular notation that exists. It is also the one chemists most often get wrong. The current rulebook (the 2013 'Blue Book') runs over 1500 pages. It produces strings like:

1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
Caffeine, preferred IUPAC name (PIN).

Generating an IUPAC name from a structure is computationally hard and rule-dependent. ACD/Labs, ChemAxon, and OpenEye all sell commercial IUPAC name generators because the open-source ones don't compete. The only widely-used free option is STOUT (Kohulan Rajan et al., 2021) — a transformer model trained on PubChem name/structure pairs. It hits around 85% accuracy on simple molecules and degrades on heterocycles, stereo, and inorganics. For hard cases, even the commercial tools disagree about which of several valid names is the 'preferred' one.

SELFIES — SMILES for ML people

Self-Referencing Embedded Strings. Krenn, Häse, Nigam, Friederich, Aspuru-Guzik, 2020. Born of a real ML problem: SMILES strings break under mutation. If a generative model trained on SMILES outputs `c1ccccc1C(`, that's a syntax error: unbalanced parens, no valid molecule. ~99% of random SMILES mutations produce invalid output. SELFIES fixes this by design: every syntactically valid SELFIES string maps to a valid molecule.

[C][N][=C][N][=C][Branch1][=N][C][=Branch1][C][=O][N][Branch1][C][C][C][=Branch1][C][=O][N][Ring1][=C][C][C]
Caffeine, in SELFIES. Verbose — but you can mutate any token and still get a valid molecule.

The cost is verbosity. Each atom is a multi-character token in square brackets, and branches and ring closures get their own grammar tokens. A molecule that takes 28 characters in SMILES takes over 100 in SELFIES (the string above is 108 characters across 24 tokens). That's fine: SELFIES isn't meant for humans. It's meant for VAEs, RL agents, and genetic algorithms that need every output to round-trip.
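
For a model, the verbosity matters less than it looks, because the unit of input is the bracketed token rather than the character. The sketch below shows that preprocessing step with the standard library only: split on brackets, build a vocabulary, and map tokens to integer ids. The function names are invented for this example, and decoding tokens back into a molecule (the job of an actual SELFIES implementation) is not attempted.

```python
import re


def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens."""
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    if "".join(tokens) != selfies:
        raise ValueError("text found outside of [..] tokens")
    return tokens


def build_vocab(strings: list[str]) -> dict[str, int]:
    """Assign a stable integer id to every distinct token in the corpus."""
    symbols = sorted({tok for s in strings for tok in split_selfies(s)})
    return {tok: i for i, tok in enumerate(symbols)}


def encode(selfies: str, vocab: dict[str, int]) -> list[int]:
    """Turn a SELFIES string into the id sequence a model would consume."""
    return [vocab[tok] for tok in split_selfies(selfies)]


# The caffeine string quoted above.
caffeine = (
    "[C][N][=C][N][=C][Branch1][=N][C][=Branch1][C][=O][N][Branch1]"
    "[C][C][C][=Branch1][C][=O][N][Ring1][=C][C][C]"
)
vocab = build_vocab([caffeine])
ids = encode(caffeine, vocab)
print(len(caffeine), "characters,", len(ids), "tokens,", len(vocab), "distinct symbols")
```

The sequence length a model sees is the token count, which is in the same range as the SMILES character count, and the vocabulary stays small.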

Group SELFIES

Cheng et al., 2023. Standard SELFIES tokenizes by atom. Group SELFIES tokenizes by chemically meaningful fragment — phenyl, amide, methyl ester, etc. The vocabulary is larger, the strings shorter, and the model learns in chemistry-shaped chunks instead of one atom at a time. Early benchmarks show better sample efficiency for molecule generation, especially for drug-like targets.

Balsa — the new alternative

Balsa: Bidirectional, Atom-Level, Stereochemistry-Aware. Hoffman et al., ChemRxiv 2022 (DOI: 10.26434/chemrxiv-2022-01ltp). The pitch is that SELFIES achieves robustness by encoding a constructive grammar, but loses the locality property that makes SMILES interpretable for chemists. Balsa tries to recover both: atom-level tokens that a chemist can read AND a grammar that guarantees validity under mutation.

DeepSMILES — the easy compromise

Noel O'Boyle and Andrew Dalke, 2018. The minimum viable fix for SMILES' ML problems: drop paired parentheses (branches use only `)`, with the number of closing parentheses indicating the branch length) and drop paired ring-closure digits (a single number at the closing atom indicates the ring size, so benzene `c1ccccc1` becomes `cccccc6`). Result: strings about as long as SMILES, with fewer ways for a mutation to leave something unmatched. More robust than SMILES and nearly as readable, though without the validity guarantee SELFIES gives. Used in some cheminformatics pipelines that want a small change rather than a complete grammar rewrite.

Side-by-side

| Notation | Year | Unique? | ML-safe? | Use it for |
|---|---|---|---|---|
| SMILES | 1988 | No (needs canonicalization) | No | Anything human-facing |
| Canonical SMILES | 1989+ | Within one library | No | DB keys (single library) |
| InChI | 2005 | Yes | No | Cross-tool molecule identity |
| InChIKey | 2007 | Effectively yes | No | Database keys, hashing |
| IUPAC name | 1900s+ | No (multiple valid) | No | Papers, never DB keys |
| SELFIES | 2020 | No | Yes | Generative ML |
| Group SELFIES | 2023 | No | Yes | Drug-like generation |
| Balsa | 2022 | No | Yes | Generative ML w/ readability |
| DeepSMILES | 2018 | No | Mostly | Lightweight ML retrofit |
| WLN | 1949 | Within strict syntax | No | Nothing (historical only) |

Pick by use case. ML training and database keys want different things.

Practical advice

  • Database key: InChIKey. It's hashable, fixed-length, URL-safe, and cross-tool consistent.
  • Cross-paper / cross-tool molecule identity: InChI (full string, not just the key: the key is a one-way hash, so it can't be decoded back to a structure and a collision can never be fully ruled out).
  • User input field: SMILES. Chemists can type it, paste it, and read it.
  • ML model input/output: SELFIES if you need bulletproof validity; SMILES if you accept post-filtering invalid outputs; Balsa or Group SELFIES if you want generation quality at the cost of vocab size.
  • Display name to humans: trivial name first, IUPAC name as fallback. Never store IUPAC names as primary keys.
  • Old data: WLN strings show up in pre-1990 patents and Index Chemicus archives. Conversion tools exist but are partial — treat WLN data as legacy, not as a source of truth.
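
The first three bullets translate into a small schema: key on the InChIKey, keep the full InChI next to it for identity checks, and store the SMILES exactly as the user typed it. The sketch below uses SQLite from the standard library. The table and function names are invented for this example, and the InChI and InChIKey are assumed to be computed elsewhere by a cheminformatics toolkit, so here they are passed in as literals.

```python
import sqlite3
from typing import Optional

CAFFEINE_KEY = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"
CAFFEINE_INCHI = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create (or open) a molecule table keyed on InChIKey."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS molecule (
               inchikey     TEXT PRIMARY KEY,
               inchi        TEXT NOT NULL,
               input_smiles TEXT NOT NULL,
               display_name TEXT
           )"""
    )
    return db


def add_molecule(db: sqlite3.Connection, inchikey: str, inchi: str,
                 smiles: str, name: Optional[str] = None) -> bool:
    """Insert a molecule; return False if it is already stored.

    The full InChI is compared on a key hit, so a hash collision would be
    reported instead of silently merging two different compounds.
    """
    row = db.execute(
        "SELECT inchi FROM molecule WHERE inchikey = ?", (inchikey,)
    ).fetchone()
    if row is not None:
        if row[0] != inchi:
            raise ValueError(f"InChIKey collision on {inchikey}")
        return False
    db.execute(
        "INSERT INTO molecule VALUES (?, ?, ?, ?)",
        (inchikey, inchi, smiles, name),
    )
    return True


db = open_store()
# Two different SMILES spellings of caffeine map to the same key.
first = add_molecule(db, CAFFEINE_KEY, CAFFEINE_INCHI,
                     "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "caffeine")
second = add_molecule(db, CAFFEINE_KEY, CAFFEINE_INCHI,
                      "O=C1N(C)C(=O)N(C)C2=C1N(C)C=N2")
print(first, second)
```

The second insert is rejected as a duplicate even though its SMILES string differs, which a `WHERE smiles = ?` lookup would have missed.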

What we're building here

chempirical's converter and database use SMILES for input, InChIKey for indexing, and the in-house chempirical lib alongside RDKit for property calculation — both engines run on every new molecule, with results stored side-by-side so the divergences (LogP, TPSA, h-bond acceptors) stay visible instead of hidden. SELFIES and Balsa support are on the roadmap for the WASM core. We're not picking a winner; we're letting users see all of them at once.

The good news about chemistry notations is that there are so many to choose from.

— Adapted from Andrew Tanenbaum, on standards
