From WLN to Balsa: a guided tour of how we write down molecules
A field guide to the line notations chemists actually use — SMILES, InChI, IUPAC names, SELFIES, Group SELFIES, Balsa, DeepSMILES, and the half-forgotten 1949 grandparent of all of them.
Every cheminformatics workflow starts with the same question: how do you type a molecule? It sounds trivial. It is not. The line notations chemists use today are the result of seventy-five years of arguments about what a molecule even is — a graph, a string, a hash, a name in a 1500-page rulebook. Each notation makes a different trade between human readability, uniqueness, machine-friendliness, and round-trip fidelity. Pick wrong and your stereo gets mangled, your dataset can't dedupe, or your ML model learns to spell.
Why this is harder than it looks
A molecule is a graph: atoms are nodes, bonds are edges, plus a few side constraints (charge, isotope, stereo, aromaticity). To put it in a column of a database, you have to flatten that graph into a string. Flattening loses information unless you're careful, and even when you are, two different people writing down the same molecule can produce two different strings — which means your `WHERE smiles = ?` query misses half your hits.
The history of line notations is a history of trying to fix this. Some fixes are algorithmic (canonicalization). Some are structural (encode the graph differently). Some are bureaucratic (write a 1500-page rulebook). All of them survive in the wild, in different files, in different decades' datasets, on different desks at the same institution.
SMILES — the lingua franca
Simplified Molecular-Input Line-Entry System. David Weininger, 1986–1988, at the U.S. EPA in Duluth. The brief was disarmingly small: a notation a chemist could type in a single line of ASCII, that round-tripped through a parser, and that any modest computer of the day could read. Weininger delivered. The result is the most-used molecular notation in the world — and a perfect example of a worse-is-better design that won.
CN1C=NC2=C1C(=O)N(C(=O)N2C)C
SMILES describes a depth-first traversal of the molecular graph. Atoms in the common organic subset (B, C, N, O, P, S, F, Cl, Br, I) are written bare; any other element, and any atom carrying a charge, isotope, or explicit hydrogen count, goes in square brackets. Single and aromatic bonds are implied, other bonds are annotated (`=`, `#`, `:`); ring closures are matched digits; branches are parenthesized; stereo is annotated with `@`, `/`, `\`. It is dense. It is human-readable after about a week of practice. And it has one critical flaw.
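To make those token classes concrete, here is a minimal sketch of a regex tokenizer. The pattern and the `tokenize_smiles` name are illustrative assumptions, not part of any SMILES library, and the function only splits the string; it does no chemical validation.

```python
import re

# Illustrative pattern (an assumption, not a library API). Order matters:
# bracket atoms first, then two-letter halogens, then single characters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atom, e.g. [NH4+] or [13C@H]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|[BCNOPSFI]"       # one-letter aliphatic atoms
    r"|[bcnops]"         # aromatic atoms
    r"|%\d{2}|\d"        # ring-closure labels
    r"|[=#:\-/\\]"       # explicit bonds and cis/trans marks
    r"|[().]"            # branches and disconnections
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens; raise if any character is left over."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
tokens = tokenize_smiles(caffeine)
print(tokens[:6])   # ['C', 'N', '1', 'C', '=', 'N']
print(len(tokens))  # 28
```

Stereo marks such as `@` only appear inside bracket atoms, so the bracket-atom alternative covers them.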
The canonicalization problem
SMILES is not unique. Caffeine can be written equally validly as `O=C1N(C)C(=O)N(C)C2=C1N(C)C=N2`: same molecule, different traversal start, totally different string. To compare two SMILES strings for chemical identity you have to canonicalize both first. The Morgan algorithm (Harry Morgan, 1965, older than SMILES itself) ranks atoms by extended connectivity, and most canonicalizers build on that idea to pick a deterministic traversal order. Every SMILES library has its own canonicalization implementation. Each is consistent with itself, but canonical strings from different toolkits routinely differ. When you mix them, you find out at 2 a.m. when your dedup pipeline misses.
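The extended-connectivity idea can be sketched in a few lines. This is a simplified illustration, not a full canonicalizer: it ignores element types and bond orders and does no tie-breaking, all of which real implementations add. The `morgan_ec` name and the adjacency-dict input format are assumptions made for the example.

```python
def morgan_ec(adjacency: dict[int, list[int]]) -> dict[int, int]:
    """Extended-connectivity values in the style of Morgan (1965)."""
    # Start from each atom's degree (number of heavy-atom neighbours).
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        # New value for an atom = sum of its neighbours' current values.
        nxt = {atom: sum(ec[n] for n in nbrs) for atom, nbrs in adjacency.items()}
        n_next = len(set(nxt.values()))
        # Stop once another round no longer separates more atoms.
        if n_next <= n_classes:
            return ec
        ec, n_classes = nxt, n_next

# The n-pentane carbon chain, numbered two different ways.
pentane_a = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
pentane_b = {0: [3, 4], 1: [3], 2: [4], 3: [0, 1], 4: [0, 2]}

print(sorted(morgan_ec(pentane_a).values()))  # [2, 2, 3, 3, 4]
print(sorted(morgan_ec(pentane_b).values()))  # [2, 2, 3, 3, 4]
```

The values depend only on the graph, not on how the atoms were numbered, which is what lets a canonicalizer choose the same starting atom and traversal order every time.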
InChI — designed by committee, on purpose
IUPAC International Chemical Identifier. First released in 2005, after about five years of joint IUPAC/NIST work (the project started in 2000). The brief was the inverse of SMILES: forget human readability, give us a canonical identifier that two implementations anywhere in the world will compute identically for the same molecule. The result is a layered string that you should never need to type.
InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3
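The layers are explicit: formula (`C8H10N4O2`), connectivity (`c1-10-...`), hydrogens (`h4H,...`), then optional charge, stereo, and isotope layers. A fixed-hydrogen layer that separates individual tautomers exists only in non-standard InChI. Each layer is canonical by construction. For most practical purposes, two structures with byte-equal standard InChIs are the same compound; keep in mind that standard InChI deliberately merges many tautomers into one identifier.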
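Because the layers are separated by `/` and tagged with a one-letter prefix, they are easy to pull apart. The sketch below is a minimal illustration; `split_inchi` is a hypothetical helper name, it assumes the string has a formula layer, and it does not handle InChIs where a tag repeats (for example inside an isotopic layer).

```python
def split_inchi(inchi: str) -> dict[str, str]:
    """Split an InChI into its version, formula, and prefix-tagged layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Each remaining layer starts with a one-letter tag such as c, h, q, t, i.
        layers[layer[0]] = layer[1:]
    return layers

caffeine = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"
layers = split_inchi(caffeine)
print(layers["formula"])  # C8H10N4O2
print(layers["h"])        # 4H,1-3H3
```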
The InChIKey
InChI strings get long fast; a 40-atom molecule can run 200 characters. The InChIKey is a 27-character hash of the full InChI, built from a truncated SHA-256 digest: fixed-length, URL-safe, dataset-friendly. The first 14-character block hashes the core constitution (formula, connectivity, hydrogens). The second block has 8 hashed characters for the remaining layers (stereo, isotopes), followed by a standard/non-standard flag and a version character. A final single character records protonation. Collisions are theoretically possible (it's a hash). In practice they are vanishingly rare across the 100M+ compounds in PubChem.
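The fixed layout of a key (a 14-letter block, then 8 hashed letters plus a standard flag and a version letter, then one protonation letter) means it can be checked and split with a single regular expression. This is a sketch of the format only; it does not compute the hash, and `split_inchikey` and the dictionary field names are assumptions made for the example.

```python
import re

# 14 letters, hyphen, 8 letters + S/N flag + version letter, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")

def split_inchikey(key: str) -> dict[str, str]:
    """Split an InChIKey into its blocks and flag characters."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, other, std_flag, version, protonation = match.groups()
    return {
        "skeleton_block": skeleton,    # hash of the core constitution
        "other_block": other,          # hash of stereo, isotopes, etc.
        "standard_flag": std_flag,     # S = standard InChI, N = non-standard
        "version": version,
        "protonation": protonation,
    }

# Standard InChIKey for caffeine.
parts = split_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N")
print(parts["skeleton_block"])  # RYYVLZVUVIJVGH
print(parts["standard_flag"])   # S
```

Grouping a table by the first block alone is a cheap way to find records that share a skeleton but differ in stereo or isotopes.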
IUPAC names — the bureaucratic notation
The IUPAC Nomenclature of Organic Chemistry is the most thorough molecular notation that exists. It is also the one chemists most often get wrong. The current rulebook (the 2013 'Blue Book') runs over 1500 pages. It produces strings like:
1,3,7-trimethyl-3,7-dihydro-1H-purine-2,6-dione
Generating an IUPAC name from a structure is computationally hard and rule-dependent. ACD/Labs, ChemAxon, and OpenEye all sell commercial IUPAC name generators because the open-source ones don't compete. The best-known free option is STOUT (Kohulan Rajan et al., 2021), a neural machine-translation model trained on PubChem name/structure pairs. It hits around 85% accuracy on simple molecules and degrades on heterocycles, stereo, and inorganics. For hard cases, even the commercial tools disagree about which of several valid names is the 'preferred' one.
SELFIES — SMILES for ML people
Self-Referencing Embedded Strings. Krenn, Häse, Nigam, Friederich, Aspuru-Guzik, 2020. Born of a real ML problem: SMILES strings break under mutation. If a generative model trained on SMILES outputs `c1ccccc1C(`, that's a syntax error: an unbalanced parenthesis, no valid molecule. The large majority of random SMILES mutations produce invalid output. SELFIES fixes this by design: every syntactically valid SELFIES string maps to a valid molecule.
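The kind of breakage described here can be caught with a rough syntax check. The sketch below is a heuristic under stated assumptions: it only verifies that parentheses and ring-closure labels pair up, it says nothing about valence or chemistry, and `has_balanced_syntax` is a hypothetical helper name rather than a library function.

```python
import re

def has_balanced_syntax(smiles: str) -> bool:
    """Return True if parentheses and ring-closure labels pair up."""
    # Drop bracket atoms so digits inside them (isotopes, charges, H counts)
    # are not mistaken for ring-closure labels.
    body = re.sub(r"\[[^\]]*\]", "", smiles)
    depth = 0
    open_rings: set[str] = set()
    for label in re.findall(r"%\d{2}|\d|[()]", body):
        if label == "(":
            depth += 1
        elif label == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            # A ring label opens on first sight and closes on the second.
            open_rings ^= {label}
    return depth == 0 and not open_rings

print(has_balanced_syntax("c1ccccc1C"))   # True
print(has_balanced_syntax("c1ccccc1C("))  # False
print(has_balanced_syntax("c1ccccc"))     # False
```

A generative pipeline built on SMILES typically runs a filter like this, or a full parser, on every sample and discards the failures.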
[C][N][C][=N][C][=C][Ring1][Branch1][C][=Branch1][C][=O][N][Branch1][=Branch2][C][=Branch1][C][=O][N][Ring1][Branch2][C][C]
The cost is verbosity. Each atom is a multi-character token in square brackets, branch and ring closures get their own grammar tokens. A molecule that takes 28 characters in SMILES takes ~120 in SELFIES. That's fine — SELFIES isn't meant for humans. It's meant for VAEs, RL agents, and genetic algorithms that need every output to round-trip.
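The verbosity is easy to measure, because every SELFIES symbol is a bracketed token. The sketch below uses benzene, whose SELFIES form appears in the SELFIES documentation; `split_selfies` is a hypothetical helper name and the code does no decoding.

```python
import re

def split_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed symbols."""
    tokens = re.findall(r"\[[^\[\]]+\]", selfies)
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return tokens

benzene_smiles = "C1=CC=CC=C1"
benzene_selfies = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"

tokens = split_selfies(benzene_selfies)
print(len(tokens))                                  # 8
print(len(benzene_smiles), len(benzene_selfies))    # 11 38
```

For a model the token count matters more than the character count: benzene is 8 SELFIES symbols, six atoms plus a ring symbol and the symbol that encodes how far back the ring closes.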
Group SELFIES
Cheng et al., 2023. Standard SELFIES tokenizes by atom. Group SELFIES tokenizes by chemically meaningful fragment — phenyl, amide, methyl ester, etc. The vocabulary is larger, the strings shorter, and the model learns in chemistry-shaped chunks instead of one atom at a time. Early benchmarks show better sample efficiency for molecule generation, especially for drug-like targets.
Balsa — the new alternative
Balsa. Richard Apodaca, ChemRxiv 2022 (DOI: 10.26434/chemrxiv-2022-01ltp). Balsa is a compact line notation defined as a fully specified subset of SMILES: a Balsa string is meant to be readable by a SMILES parser, but its grammar and semantics are pinned down formally, so independent implementations have less room to disagree. The pitch is rigor rather than robustness. Unlike SELFIES, Balsa does not guarantee that a mutated string is still a valid molecule, but it keeps the locality that makes SMILES interpretable for chemists.
DeepSMILES — the easy compromise
Noel O'Boyle and Andrew Dalke, 2018. The minimum viable fix for SMILES' ML problems: write branches with close-parentheses only (the number of `)` gives the branch length), and write each ring closure once, at the closing atom, as a ring size instead of a matched pair of labels (`%` introduces multi-digit sizes). Benzene `c1ccccc1` becomes `cccccc6`. Result: strings about as long as SMILES, with fewer ways for a mutation to break the syntax. Not as robust as SELFIES, since invalid strings are still possible, but almost as readable as SMILES. Used in some cheminformatics pipelines that want a small change rather than a complete grammar rewrite.
Side-by-side
| Notation | Year | Unique? | ML-safe? | Use it for |
|---|---|---|---|---|
| SMILES | 1988 | No (needs canonicalization) | No | Anything human-facing |
| Canonical SMILES | 1989+ | Within one library | No | DB keys (single library) |
| InChI | 2005 | Yes | No | Cross-tool molecule identity |
| InChIKey | 2007 | Effectively yes | No | Database keys, hashing |
| IUPAC name | 1900s+ | No (multiple valid) | No | Papers, never DB keys |
| SELFIES | 2020 | No | Yes | Generative ML |
| Group SELFIES | 2023 | No | Yes | Drug-like generation |
| Balsa | 2022 | No | No | Strictly specified SMILES subset |
| DeepSMILES | 2018 | No | Mostly | Lightweight ML retrofit |
| WLN | 1949 | Within strict syntax | No | Nothing — historical only |
Practical advice
- Database key: InChIKey. It's hashable, fixed-length, URL-safe, and cross-tool consistent.
- Cross-paper / cross-tool molecule identity: InChI (full string, not just the key; the key is a hash, so it cannot rule out rare false positives at scale).
- User input field: SMILES. Chemists can type it, paste it, and read it.
- ML model input/output: SELFIES if you need bulletproof validity; SMILES if you accept post-filtering invalid outputs; Group SELFIES if you want generation quality at the cost of vocab size.
- Display name to humans: trivial name first, IUPAC name as fallback. Never store IUPAC names as primary keys.
- Old data: WLN strings show up in pre-1990 patents and Index Chemicus archives. Conversion tools exist but are partial — treat WLN data as legacy, not as a source of truth.
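The database-key advice can be sketched with the standard library's `sqlite3`. The table layout and helper names below are assumptions for illustration. The InChIKey is hardcoded here; in a real pipeline a toolkit such as RDKit would compute it from the user's SMILES input.

```python
import sqlite3

CAFFEINE_KEY = "RYYVLZVUVIJVGH-UHFFFAOYSA-N"  # precomputed standard InChIKey

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE compounds ("
    " inchikey TEXT PRIMARY KEY,"   # identity lives here
    " smiles   TEXT NOT NULL,"      # stored as entered, for display
    " name     TEXT)"
)
conn.execute(
    "INSERT INTO compounds VALUES (?, ?, ?)",
    (CAFFEINE_KEY, "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "caffeine"),
)

def name_by_smiles(smiles: str):
    row = conn.execute(
        "SELECT name FROM compounds WHERE smiles = ?", (smiles,)
    ).fetchone()
    return row[0] if row else None

def name_by_inchikey(key: str):
    row = conn.execute(
        "SELECT name FROM compounds WHERE inchikey = ?", (key,)
    ).fetchone()
    return row[0] if row else None

# A different but equally valid SMILES for caffeine misses on string equality.
print(name_by_smiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C"))  # None
# The key computed from either spelling is the same, so the lookup hits.
print(name_by_inchikey(CAFFEINE_KEY))                 # caffeine
```

The primary-key constraint also gives deduplication for free: inserting the same compound under a second SMILES spelling fails instead of creating a silent duplicate.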
What we're building here
chempirical's converter and database use SMILES for input, InChIKey for indexing, and the in-house chempirical lib alongside RDKit for property calculation. Both engines run on every new molecule, with results stored side-by-side so the divergences (LogP, TPSA, H-bond acceptors) stay visible instead of hidden. SELFIES and Balsa support is on the roadmap for the WASM core. We're not picking a winner; we're letting users see all of them at once.
The good news about chemistry notations is that there are so many to choose from.