Reactions and retrosynthesis — from 1985 to today

A reaction is two molecule sets and an arrow between them. The notations that describe reactions, the file formats that carry them, the databases that store millions of them, and the retrosynthesis engines that walk backward through them all coexist in this slice of the ecosystem. The MDL RXN file (1985), Reaction SMILES (1988), and Reaction InChI (2013) are the load-bearing carriers; the rest is tooling on top. This is the fourth article in the series; notations, molecule file formats, and spectroscopy formats are the earlier three.

Four jobs

Notation — describe one reaction concretely (Reaction SMILES) or as a pattern (SMARTS, SMIRKS).
File format — carry one reaction on disk with full layout (MDL RXN). See also file formats §.
Identifier — hash a reaction so two implementations agree byte-for-byte (RInChI, RInChIKey).
Database / tool — store millions of reactions (USPTO, ORD, Pistachio, Reaxys) and walk backward through them (ASKCOS, AiZynthFinder, IBM RXN).

A chronological tour

Each entry below gets a numbered block. The notation entries are short — Reaction SMILES is one extra character (>) on top of SMILES, SMIRKS adds atom-mapping numbers to SMARTS — so most of the article is the tooling stack.

1. MDL RXN file — 1985

MDL's reaction file format. Header, then N reactant Mol blocks, then M product Mol blocks. ASCII; same V2000 / V3000 dialect as the molecule Mol files. Atom-mapping numbers in the reactant + product blocks pair atoms across the arrow (the same convention SMIRKS uses inline).

What it adds: reaction direction, atom mapping, and stoichiometry, with the full Mol-file detail per side. See Cheminformatics file formats § for the structural-format side of this overlap.
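The layout described above can be sketched in a few lines of Python. This is a minimal reader for the V2000 dialect only, assuming well-formed input: the sample file, the writer name in it, and the function names `split_rxn` and `atom_maps` are all made up for illustration, and the atom lines are split on whitespace rather than on fixed columns, which a production reader should not rely on.

```python
from typing import List, Tuple

# Toy V2000 RXN file: methanol -> formaldehyde, one reactant, one product.
# Coordinates and the "hypothetical-writer" stamp are placeholders.
SAMPLE_RXN = "\n".join([
    "$RXN",
    "methanol oxidation (toy example)",
    "  hypothetical-writer",
    "",
    "  1  1",  # counts line: 3 chars reactant count, 3 chars product count
    "$MOL",
    "methanol",
    "  hypothetical-writer",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  1  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  2  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
    "$MOL",
    "formaldehyde",
    "  hypothetical-writer",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  1  0  0",
    "    1.2000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  2  0  0",
    "  1  2  2  0  0  0  0",
    "M  END",
])


def split_rxn(text: str) -> Tuple[List[List[str]], List[List[str]]]:
    """Split a V2000 RXN file into reactant and product Mol blocks."""
    lines = text.splitlines()
    if lines[0].strip() != "$RXN":
        raise ValueError("not a V2000 RXN file")
    n_reactants = int(lines[4][0:3])
    n_products = int(lines[4][3:6])
    blocks: List[List[str]] = []
    for line in lines[5:]:
        if line.strip() == "$MOL":
            blocks.append([])          # each $MOL starts a new Mol block
        elif blocks:
            blocks[-1].append(line)
    if len(blocks) != n_reactants + n_products:
        raise ValueError("counts line disagrees with number of $MOL blocks")
    return blocks[:n_reactants], blocks[n_reactants:]


def atom_maps(mol_block: List[str]) -> List[Tuple[str, int]]:
    """Return (element, atom-atom mapping number) for each atom line."""
    n_atoms = int(mol_block[3][0:3])   # Mol counts line is the 4th line
    pairs = []
    for line in mol_block[4:4 + n_atoms]:
        fields = line.split()          # sketch only: assumes fields never touch
        pairs.append((fields[3], int(fields[13])))
    return pairs


reactants, products = split_rxn(SAMPLE_RXN)
```

Calling `atom_maps` on a reactant block and on a product block gives two lists that can be joined on the mapping number, which is the pairing across the arrow that the format exists to carry.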

2. Reaction SMILES — 1988

SMILES with a single new character: >. Reactants on the left, optional reagents in the middle, products on the right (reactants>reagents>products). Inherits everything from SMILES — same atom-and-bond grammar, same canonicalisation problem — and adds reaction direction. Atom-mapping numbers are optional in vanilla Reaction SMILES; SMIRKS makes them mandatory.

OC(=O)C1=CC=CC=C1.OCC>>O=C(OCC)C1=CC=CC=C1

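Esterification of benzoic acid with ethanol → ethyl benzoate. No mapping numbers; this is the concrete-reaction form. The water by-product is left out here, which Reaction SMILES permits: nothing in the notation requires the two sides to balance.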

What it adds: a single-line container for reactions that every SMILES-aware tool reads for free.
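Because the only new syntax is the `>` separator, pulling a Reaction SMILES apart needs no chemistry toolkit. The sketch below splits the three fields and their dot-separated components; the function name `parse_reaction_smiles` is an illustrative choice, and the code does not validate the SMILES themselves.

```python
from typing import Dict, List


def parse_reaction_smiles(rxn: str) -> Dict[str, List[str]]:
    """Split reactants>reagents>products into lists of component SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # '.' separates disconnected components; an empty field gives an empty list.
    reactants, reagents, products = (
        [component for component in part.split(".") if component]
        for part in parts
    )
    return {"reactants": reactants, "reagents": reagents, "products": products}


parsed = parse_reaction_smiles("OC(=O)C1=CC=CC=C1.OCC>>O=C(OCC)C1=CC=CC=C1")
```

One caveat of splitting on `.`: the ions of a salt come out as separate components, so grouping them back together is left to whatever consumes the result.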

3. SMARTS — 1990s

Daylight published the original specification as a companion to SMILES.¹ It is to SMILES what regular expressions are to strings: the parent grammar extended with wildcards, logical operators, and topological constraints. Lives in this article (rather than the line-notations one) because the primary use of SMARTS is filtering and substructure search across reaction templates, not describing a specific molecule.

c matches any aromatic carbon. [#6] matches any carbon. [F,Cl,Br,I] matches any halogen. [#6]=[O] matches a carbonyl. Bonds: ~ is any bond, @ is any ring bond, !@ excludes ring bonds. Recursion is supported via $().

[#16](=[#8])(=[#8])[#7]

A sulfonamide — sulfur double-bonded to two oxygens, single-bonded to a nitrogen.

What it adds: substructure search, filter rules, and the atom-half of every reaction template.

4. SMIRKS — 1990s

SMARTS plus a reaction arrow plus mandatory atom-mapping numbers.² Reads like a chemistry rule: "a carboxylic acid plus an alcohol gives an ester, and the carbonyl carbon is the same atom on both sides". Reaction-template-driven tools (ASKCOS, RDKit's RunReactants, ChemAxon's Reactor, RetroPath) take SMIRKS or a close dialect as input; RDKit, for example, calls its variant reaction SMARTS.

[C:1](=[O:2])[OH:3].[OH:4][C:5]>>[C:1](=[O:2])[O:4][C:5].[OH2:3]

Esterification template. Map numbers (:1…:5) carry atoms from reactant side to product side; [OH2:3] is the leaving water.

What it adds: a parseable, executable, vendor-neutral way to describe "this kind of reaction" — the unit retrosynthesis engines compose into multi-step plans.
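Since the map numbers are what tie the two sides together, a template whose numbers do not pair up is usually a typo. The following is a rough lint under that assumption, not a SMIRKS parser: it pulls `:n]` suffixes out with a regular expression and compares the two sides. The names `MAP_RE`, `map_numbers`, and `check_mapping` are hypothetical.

```python
import re
from typing import Dict, List, Set

# Matches the ":n" map suffix at the end of a bracket atom, e.g. "[C:1]".
MAP_RE = re.compile(r":(\d+)\]")


def map_numbers(side: str) -> List[int]:
    """All atom-map numbers on one side of the arrow, in order of appearance."""
    return [int(n) for n in MAP_RE.findall(side)]


def check_mapping(smirks: str) -> Dict[str, Set[int]]:
    """Report map numbers that fail to pair up across the arrow."""
    parts = smirks.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    left, right = map_numbers(parts[0]), map_numbers(parts[2])
    duplicated = {n for side in (left, right) for n in side if side.count(n) > 1}
    return {
        "only_reactants": set(left) - set(right),
        "only_products": set(right) - set(left),
        "duplicated": duplicated,
    }


ESTERIFICATION = (
    "[C:1](=[O:2])[OH:3].[OH:4][C:5]>>[C:1](=[O:2])[O:4][C:5].[OH2:3]"
)
report = check_mapping(ESTERIFICATION)
```

For the esterification template all three sets come back empty, because every number from 1 to 5 appears exactly once on each side.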

5. USPTO reaction corpus — 2012+

Daniel Lowe's PhD thesis (Cambridge, 2012) extracted ~1.8M reactions from U.S. patent text using rule-based parsing. The corpus has been re-released, extended, and re-cleaned multiple times since;³ the most-cited public dump is the 1976–2016 grants + applications set (~3.7M reactions). This is the dataset most retrosynthesis models train on.

What it adds: a public, parseable, multi-million-reaction corpus that earlier reaction databases (Reaxys, CASREACT) did not permit redistributing. Most of modern ML retrosynthesis stands on it.

6. RInChI / RInChIKey — 2013

Reaction InChI. Grethe, Goodman, Allen, 2013.⁴ A canonicalised, layered identifier that encodes a complete reaction (reactants, products, agents) in a single string. The companion RInChIKey is the fixed-length URL-safe hash of that string, the way InChIKey is for InChI. Built on the InChI Trust's reference C library.

What it adds: a deterministic identifier for "is this the same reaction?" queries across implementations. Reaction SMILES doesn't canonicalise; RInChI does.
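To see why plain string comparison fails for Reaction SMILES, it helps to try the cheapest possible normalisation. The sketch below only sorts the components within each field; it is not RInChI's algorithm and the function name `sort_components` is made up. It removes one source of variation (component order) and leaves the other (atom order inside each SMILES) untouched, which is the part that needs real canonicalisation.

```python
def sort_components(rxn: str) -> str:
    """Sort dot-separated components within each '>'-delimited field."""
    return ">".join(
        ".".join(sorted(field.split("."))) for field in rxn.split(">")
    )


# Same reaction three times: original, components reordered, ethanol rewritten.
a = "OC(=O)c1ccccc1.OCC>>O=C(OCC)c1ccccc1.O"
b = "OCC.OC(=O)c1ccccc1>>O.O=C(OCC)c1ccccc1"
c = "OC(=O)c1ccccc1.CCO>>O=C(OCC)c1ccccc1.O"

same_after_sorting = sort_components(a) == sort_components(b)       # True
still_different = sort_components(a) != sort_components(c)          # True
```

Strings `a` and `c` differ only in writing ethanol as `OCC` versus `CCO`, and no amount of reordering fixes that; an identifier built on canonical structure layers does.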

7. ASKCOS — 2017

MIT's open-source retrosynthesis platform.⁵ Combines a template-based retrosynthesis engine (templates extracted from USPTO via SMIRKS), a forward-prediction model (graph neural net on USPTO), and condition / solvent / catalyst recommendations. Web service at askcos.mit.edu; source on GitHub.

What it adds: the first end-to-end open-source pipeline from "target molecule" to "step-by-step synthetic route plus conditions".

8. IBM RXN for Chemistry — 2018

IBM Research's transformer-based reaction prediction service.⁶ Treats reaction prediction as a sequence-to-sequence translation problem on Reaction SMILES — encoder reads reactants, decoder emits products. Trained originally on USPTO; the underlying RXN-Transformer architecture has been adapted for retrosynthesis, condition prediction, and yield estimation.

What it adds: end-to-end seq2seq prediction without explicit reaction templates, directly from atoms to atoms. The benchmark every template-based system gets compared against.
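Treating a reaction as a translation problem presupposes a way to cut the Reaction SMILES into tokens. A regex tokenizer of the general kind used for SMILES language models might look like the sketch below; the pattern and the name `tokenize_rxn` are assumptions for illustration, not IBM RXN's actual preprocessing.

```python
import re
from typing import List

# Order matters: bracket atoms and two-letter halogens must be tried
# before single-letter atoms, or "Cl" would split into "C" and "l".
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)


def tokenize_rxn(rxn: str) -> List[str]:
    """Split a Reaction SMILES into tokens, refusing to drop any characters."""
    tokens = TOKEN_RE.findall(rxn)
    if "".join(tokens) != rxn:
        raise ValueError("input contains characters the tokenizer does not know")
    return tokens


tokens = tokenize_rxn("OC(=O)C1=CC=CC=C1.OCC>>O=C(OCC)C1=CC=CC=C1")
```

Each `>` comes out as its own token, so a model sees the reaction arrow as ordinary vocabulary; for forward prediction the text left of the arrow would be the source sequence and the text right of it the target.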

9. AiZynthFinder — 2020

AstraZeneca's open-source Monte Carlo Tree Search retrosynthesis tool.⁷ Template-based with a neural-network policy that ranks template applicability. Designed to be vendor-friendly: ships with USPTO-trained models, but the policy and stock-molecule lookup are pluggable so internal compound libraries can be substituted in.

What it adds: a production-shaped retrosynthesis engine intended to be deployed inside a pharma stack with private stocks and private templates.
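The pluggable-stock idea can be shown with a toy search. The sketch below deliberately swaps Monte Carlo Tree Search for a plain depth-first search and replaces the neural policy with a hard-coded lookup table, so it illustrates only the interface: an `expand` callable that proposes precursors and an `in_stock` callable that decides where to stop. Every name and the disconnection table are hypothetical; none of this is AiZynthFinder's API.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Step = Tuple[str, Tuple[str, ...]]  # (product, precursors)

# Hypothetical one-step disconnections: product -> alternative precursor sets.
TOY_DISCONNECTIONS: Dict[str, List[Tuple[str, ...]]] = {
    "O=C(OCC)c1ccccc1": [("OC(=O)c1ccccc1", "OCC")],  # ester <- acid + alcohol
    "OC(=O)c1ccccc1": [("Cc1ccccc1",)],               # acid <- toluene
}


def toy_expand(target: str) -> Iterable[Tuple[str, ...]]:
    """Stand-in for a template policy: look the target up in the table."""
    return TOY_DISCONNECTIONS.get(target, [])


def find_route(
    target: str,
    in_stock: Callable[[str], bool],
    expand: Callable[[str], Iterable[Tuple[str, ...]]],
    max_depth: int = 3,
) -> Optional[List[Step]]:
    """Depth-first search for a route whose leaves are all in stock."""
    if in_stock(target):
        return []                       # nothing to make
    if max_depth == 0:
        return None                     # give up on this branch
    for precursors in expand(target):
        steps: List[Step] = [(target, precursors)]
        for precursor in precursors:
            sub_route = find_route(precursor, in_stock, expand, max_depth - 1)
            if sub_route is None:
                break                   # this disconnection fails; try the next
            steps.extend(sub_route)
        else:
            return steps                # every precursor was resolved
    return None


# Swapping the stock changes the route without touching the search.
public_stock = {"OC(=O)c1ccccc1", "OCC"}
private_stock = {"Cc1ccccc1", "OCC"}
short_route = find_route("O=C(OCC)c1ccccc1", public_stock.__contains__, toy_expand)
long_route = find_route("O=C(OCC)c1ccccc1", private_stock.__contains__, toy_expand)
```

With the first stock the route is a single step; with the second, benzoic acid is not purchasable, so the search goes one level deeper and returns two steps.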

10. Open Reaction Database — 2021

Open Reaction Database (ORD).⁸ Cross-industry effort (initiated by Pfizer with Mike Burke, Connor Coley, and others). Defines a structured Protocol Buffer schema for reactions: every reagent has structure plus role plus quantity plus solvent context, every step has temperature/time/atmosphere, every outcome has yield + purity + analytics. Public data on GitHub; reading clients in Python, JavaScript, and Java.

What it adds: a structured schema for reaction reporting that captures the metadata Reaction SMILES and RXN files leave implicit. Most useful for ML: ORD records yield + condition data the USPTO corpus lacks.
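To make concrete what "structure plus role plus quantity" buys over a bare Reaction SMILES, here is a small Python mirror of the record shape described above. The class and field names are illustrative only and are not the ORD Protocol Buffer schema; the numeric values are placeholders, not measured data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    smiles: str
    role: str                          # e.g. "reactant", "catalyst", "solvent"
    amount_mmol: Optional[float] = None


@dataclass
class Conditions:
    temperature_c: Optional[float] = None
    time_h: Optional[float] = None
    atmosphere: Optional[str] = None


@dataclass
class Outcome:
    product_smiles: str
    yield_percent: Optional[float] = None
    purity_percent: Optional[float] = None
    analytics: List[str] = field(default_factory=list)


@dataclass
class ReactionRecord:
    components: List[Component]
    conditions: Conditions
    outcomes: List[Outcome]

    def to_reaction_smiles(self) -> str:
        """Flatten to reactants>reagents>products, discarding all metadata."""
        reactants = [c.smiles for c in self.components if c.role == "reactant"]
        reagents = [c.smiles for c in self.components if c.role != "reactant"]
        products = [o.product_smiles for o in self.outcomes]
        return ">".join(".".join(group) for group in (reactants, reagents, products))


# Placeholder values throughout; nothing here is experimental data.
record = ReactionRecord(
    components=[
        Component("OC(=O)c1ccccc1", "reactant", amount_mmol=1.0),
        Component("OCC", "reactant", amount_mmol=5.0),
        Component("OS(=O)(=O)O", "catalyst"),
    ],
    conditions=Conditions(temperature_c=78.0, time_h=4.0, atmosphere="air"),
    outcomes=[Outcome("O=C(OCC)c1ccccc1", yield_percent=50.0, analytics=["NMR"])],
)
flat = record.to_reaction_smiles()
```

The flattened string keeps the structures and a coarse reactant/reagent split, and loses amounts, temperature, time, atmosphere, yield, and analytics: exactly the fields a yield-prediction model needs.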

Side-by-side

Same esterification — benzoic acid + ethanol → ethyl benzoate + water — rendered every way it can be. The Reaction SMILES is verifiable by parsing with RDKit; the SMIRKS is a template that matches and produces this reaction as one of many.

Form	Year	Kind	Esterification
MDL RXN	1985	File	multi-line; see file formats §
Reaction SMILES	1988	Notation	`OC(=O)c1ccccc1.OCC>>O=C(OCC)c1ccccc1.O`
SMARTS (acid group only)	1990s	Pattern	`[CX3](=O)[OX2H1]`
SMIRKS template	1990s	Pattern	`[C:1](=[O:2])[OH:3].[OH:4][C:5]>>[C:1](=[O:2])[O:4][C:5].[OH2:3]`
RInChI	2013	Identifier	(layered identifier string; see InChI Trust spec)
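The Reaction SMILES in this table includes water on the product side, while the one in the Reaction SMILES entry earlier does not. A quick way to spot that kind of difference without a toolkit is to count heavy atoms on each side. The sketch below handles only unbracketed organic-subset atoms and ignores hydrogens and charges, so it is a sanity check rather than a balance proof; the names `count_heavy_atoms` and `heavy_atom_imbalance` are made up.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# Organic-subset atoms written without brackets; two-letter symbols first.
ORGANIC_RE = re.compile(r"Cl|Br|[BCNOSPFI]|[bcnosp]")


def count_heavy_atoms(side: str) -> Counter:
    """Count heavy atoms by element on one side of a Reaction SMILES."""
    if "[" in side:
        raise ValueError("bracket atoms are outside the scope of this sketch")
    # capitalize() folds aromatic "c" into "C" and leaves "Cl"/"Br" intact.
    return Counter(m.group().capitalize() for m in ORGANIC_RE.finditer(side))


def heavy_atom_imbalance(rxn: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (atoms missing from products, atoms missing from reactants)."""
    reactants, _reagents, products = rxn.split(">")
    left, right = count_heavy_atoms(reactants), count_heavy_atoms(products)
    return dict(left - right), dict(right - left)


with_water = heavy_atom_imbalance("OC(=O)c1ccccc1.OCC>>O=C(OCC)c1ccccc1.O")
without_water = heavy_atom_imbalance("OC(=O)C1=CC=CC=C1.OCC>>O=C(OCC)C1=CC=CC=C1")
```

The table's version balances on both carbon and oxygen; the version without water comes back one oxygen short on the product side, which is the leaving water that the SMIRKS template writes as `[OH2:3]`.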

Bringing them together

Reaction data is another axis of fragmentation, on top of notations and file formats. The same esterification, written in three labs, ends up as a Reaction SMILES in one Jupyter notebook, an RXN file in a SAR dataset, and a yield-annotated ORD record in a paper's supplementary information, and a retrosynthesis engine that wants all three has to bridge them.

chempirical's plan: reactions are first-class records in the same graph-shaped engine that holds molecules. Reactant + product graphs share atoms with their molecule records via the same canonicalisation pipeline; map numbers stay explicit; yield + condition metadata sits in a sibling table. Imports normalise Reaction SMILES, RXN files, and ORD Protocol Buffer messages into the common representation; exports re-emit. Search by any of the four — paste a Reaction SMILES, an RInChIKey, a SMIRKS template, an ORD reaction ID — and land on the same record.