Generative Chemistry: AI-Designed Molecules

How diffusion and structure-based models design small molecules — and the docking, synthesizability, and ADMET gaps that decide what reaches a flask.

For most of the history of medicinal chemistry, the molecule came last. You picked a target, ran a screen, found a weak hit, and then a team of chemists spent years walking that hit through hundreds of analogues toward something potent, selective, and druggable enough to dose a human. Generative chemistry inverts the order: the model proposes the molecule first, conditioned on the protein you want to hit, and the medicinal chemistry becomes a filtering and verification problem rather than an invention problem.

That inversion is now producing clinical assets, not just papers. The most concrete proof point is Insilico Medicine’s rentosertib, a TNIK inhibitor for idiopathic pulmonary fibrosis where a generative pipeline proposed both the target and the molecule. Phase 2a results published in Nature Medicine in June 2025 showed the 60 mg once-daily arm gaining a mean +98.4 mL in forced vital capacity over 12 weeks against a -20.3 mL decline on placebo — a small trial, but a real readout from a molecule that started life inside a generative model. Whatever you think of the hype around AI drug design, the wet-lab end of the loop has closed at least once.

This is an engineering post, so the interesting part is not the headline. It is the pipeline: what the models actually generate, how candidates get scored, and where the whole thing quietly falls apart if you are not careful.

What “generative” actually means here

There are two dominant families, and they make very different assumptions.

The first is ligand-based generation — sequence and graph models that learn the distribution of drug-like molecules and sample new ones, optionally steered toward a property objective. SMILES-based language models and graph generators live here. They are fast, they produce valid chemistry, and they know nothing about your protein. You bolt on the target afterward through scoring.

The second, and the one that has pulled ahead, is structure-based generation: diffusion and flow models that grow a molecule directly inside a protein binding pocket. The model conditions on the 3D pocket geometry and denoises atoms into a pose that should complement it. This is the small-molecule analogue of the structure-prediction work that came out of the AlphaFold lineage, and it is the bet Isomorphic Labs has made most publicly. Their pitch is that predicting binding geometry well enough to design against it is the hard, valuable part, and the firm has raised heavily against that thesis and expects its first AI-designed molecules in clinical trials around the end of 2026.

Recursion sits in a third camp worth naming: less about generating from a pocket, more about generating and then experimentally triaging at industrial volume, with cellular imaging and omics readouts feeding the models. The strategies differ, but they all share the same downstream problem — a generative model will happily hand you a molecule that scores beautifully and cannot survive contact with reality.

Docking and the ML scoring trap

Once you have candidates, you need to rank them, and ranking means estimating how well each molecule binds. Physics-based docking is the workhorse: pose the ligand in the pocket, score the interaction. It is cheap relative to a free-energy calculation and it scales to millions of compounds, which is exactly what generative output demands.

The problem is that docking scores are a notoriously leaky proxy for affinity. A docking function is tuned to find a plausible pose, not to rank potency across chemically diverse molecules, and it rewards compounds that fill space and stack interactions whether or not those interactions are real. Feed a generative model a docking score as its reward and it will learn to exploit the score function — producing greasy, over-decorated molecules that game the geometry and would never reproduce in an assay. This is reward hacking in a lab coat.
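One simple guard against size-driven score inflation is to look at the score per heavy atom, in the spirit of ligand efficiency. This is a swapped-in illustration rather than a method the post prescribes. The sketch below assumes docking scores in kcal/mol where more negative is better, and the two molecules and their numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Docked:
    name: str         # hypothetical identifier
    score: float      # docking score in kcal/mol; more negative = better (assumed)
    heavy_atoms: int  # count of non-hydrogen atoms


def size_normalized(d: Docked) -> float:
    """Docking score per heavy atom, in the spirit of ligand efficiency."""
    if d.heavy_atoms <= 0:
        raise ValueError("heavy_atoms must be positive")
    return d.score / d.heavy_atoms


# Invented numbers: the larger molecule wins on raw score only.
candidates = [
    Docked("compact", score=-8.0, heavy_atoms=20),
    Docked("bloated", score=-10.5, heavy_atoms=42),
]

by_raw = min(candidates, key=lambda d: d.score)
by_normalized = min(candidates, key=size_normalized)
```

Here the raw score prefers the larger molecule and the normalized score prefers the smaller one. Normalization does not make docking a potency predictor; it only removes one easy way to game it.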

The current answer is to layer better scoring on top of docking: ML potency models and, crucially, physics-based rescoring of the top tier with free-energy perturbation (FEP). FEP is expensive enough that you cannot run it on a million molecules, so the pipeline becomes a funnel: generate broadly, dock cheaply, rescore the survivors with something closer to physics, and only then trust the ranking. Any team treating a raw docking score as ground truth is shipping false positives downstream, and false positives in this business cost synthesis slots and assay weeks.
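The funnel can be sketched as a list of stages, each with a scorer and a survivor budget, so the expensive scorer only ever sees what the cheap one let through. This is a minimal sketch: the score tables stand in for real docking and free-energy calculations, and every name and number is hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable

Scorer = Callable[[str], float]  # lower score = better binding (assumed convention)
calls: Counter = Counter()       # how many times each stage's scorer ran


def table_scorer(stage: str, table: dict[str, float]) -> Scorer:
    """Wrap a lookup table as a scorer and count its invocations."""
    def score(mol: str) -> float:
        calls[stage] += 1
        return table[mol]
    return score


def funnel(molecules: Iterable[str], stages: list[tuple[Scorer, int]]) -> list[str]:
    """Score the survivors of each stage and keep only the best `keep`."""
    survivors = list(molecules)
    for scorer, keep in stages:
        survivors = sorted(survivors, key=scorer)[:keep]
    return survivors


# Invented scores: cheap docking for everything, costly rescoring for a few.
dock_scores = {"m0": -9.1, "m1": -7.2, "m2": -8.8, "m3": -6.0, "m4": -9.5, "m5": -5.1}
fep_scores = {"m0": -6.0, "m1": -9.0, "m2": -10.2, "m3": -4.0, "m4": -5.5, "m5": -3.0}

shortlist = funnel(
    dock_scores,
    [
        (table_scorer("dock", dock_scores), 3),  # cheap stage: keep top 3
        (table_scorer("fep", fep_scores), 1),    # expensive stage: keep top 1
    ],
)
```

In this toy data the docking favorite (`m4`) is not the molecule that survives rescoring (`m2`), and the expensive scorer runs on three molecules instead of six. The example also shows the cost of a hard cutoff: `m1` has the second-best rescoring value but never reaches that stage.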

The number that matters is hit rate, not score

A useful discipline borrowed from the protein-design world: judge a generative pipeline by its experimental hit rate, not by how good its in silico scores look. A model that produces gorgeous predicted affinities and a hit rate below 10% in the wet lab is worse than a cruder model with a higher confirmed rate. The score distribution is marketing; the confirmation rate is the product.
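Confirmed hit rates usually come from small batches, so it helps to report them with an interval and not as a bare percentage. A minimal sketch using the Wilson score interval for a binomial proportion follows. The batch sizes and hit counts are invented, and the choice of a 95% interval (`z = 1.96`) is an assumption.

```python
import math


def wilson_interval(hits: int, tested: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a hit rate of `hits` out of `tested`."""
    if tested <= 0 or not 0 <= hits <= tested:
        raise ValueError("need tested > 0 and 0 <= hits <= tested")
    p = hits / tested
    denom = 1 + z**2 / tested
    centre = (p + z**2 / (2 * tested)) / denom
    half = z * math.sqrt(p * (1 - p) / tested + z**2 / (4 * tested**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical wet-lab batches from two pipelines.
batches = {"pipeline_a": (3, 40), "pipeline_b": (9, 40)}

for name, (hits, tested) in batches.items():
    low, high = wilson_interval(hits, tested)
    print(f"{name}: {hits}/{tested} = {hits / tested:.1%} (95% CI {low:.1%} to {high:.1%})")
```

With batches this small the intervals are wide, which is a useful reminder not to declare one pipeline better than another on a handful of confirmations.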

The synthesizability gap

Here is the failure mode that humbles most newcomers. A generative model optimizing for binding will discover that exotic, strained, or simply unmakeable structures often score well — and it has no native concept of whether a human or a robot can actually make the thing. You end up with a ranked list of molecules where the top entries are, in practice, chemistry fan fiction.

The historical patch is the synthetic accessibility (SA) score, a fast heuristic based on fragment frequency and structural complexity. It is better than nothing and it is not enough. As recent work in Chemical Science spells out, the SA score does not guarantee that a synthetic route actually exists — it correlates with makeability without proving it. The stronger approach is to put a retrosynthesis model in the loop, planning an actual route from purchasable starting materials and rejecting molecules that have none.

That fix has its own trap. Data-driven retrosynthesis planners are prone to hallucinating reactions that look valid and do not run, so a route the model is “confident” about can still be fiction. The defensible pattern in 2026 is to pair a retrosynthetic planner with a forward reaction predictor and only accept a route both agree on — duality as a sanity check. Whatever the mechanism, the principle is fixed: if synthesizability is not a first-class objective inside the generation loop, you are optimizing for molecules your chemists will quietly veto, and you will not find out until the route-scoping meeting.
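The round-trip check can be sketched as follows: a route is accepted only if every input is purchasable or made by an earlier step, and a forward predictor independently reproduces each step's product. This is a sketch under stated assumptions, not a real planner. `Step`, `route_is_consistent`, and the lookup-table forward model are hypothetical stand-ins, and the single-letter identifiers take the place of canonical structure strings.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Step:
    reactants: frozenset[str]  # identifiers; canonical SMILES in a real system
    product: str


# A forward model maps a set of reactants to a predicted product, or None.
ForwardModel = Callable[[frozenset[str]], Optional[str]]


def route_is_consistent(
    route: list[Step], target: str, forward: ForwardModel, purchasable: set[str]
) -> bool:
    """Accept a retrosynthetic route only if the forward model agrees step by step."""
    if not route or route[-1].product != target:
        return False
    made: set[str] = set()
    for step in route:
        # Every input must be purchasable or produced by an earlier step.
        if not all(r in purchasable or r in made for r in step.reactants):
            return False
        # The forward predictor must independently arrive at the same product.
        if forward(step.reactants) != step.product:
            return False
        made.add(step.product)
    return True


# Toy forward model: a lookup table standing in for a trained predictor.
forward_table = {frozenset({"A", "B"}): "C", frozenset({"C", "D"}): "T"}
purchasable = {"A", "B", "D"}

good_route = [Step(frozenset({"A", "B"}), "C"), Step(frozenset({"C", "D"}), "T")]
bad_route = [Step(frozenset({"A", "D"}), "T")]  # planner proposed a step forward disagrees with

accepted = route_is_consistent(good_route, "T", forward_table.get, purchasable)
rejected = route_is_consistent(bad_route, "T", forward_table.get, purchasable)
```

Agreement between two models is still not proof that a reaction runs. It only filters out routes where the two disagree, which is the cheap part of the sanity check.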

ADMET: the gap nobody can model away

Suppose you have a molecule that binds, that you can rank with physics-grade scoring, and that you can actually synthesize. You are still nowhere near a drug, because the molecule has to be absorbed and distributed, must not be metabolized into oblivion or chewed up by CYP enzymes, and must not be toxic or block the hERG channel. This is ADMET (absorption, distribution, metabolism, excretion, toxicity), and it is where most candidates die.

ADMET prediction is genuinely hard for reasons that are structural, not temporary. The endpoints are noisy, the public datasets are small and biased toward whatever pharma has historically chosen to measure, and many properties depend on whole-organism behavior that no amount of molecular structure fully encodes. A model can learn solubility and permeability passably; it learns idiosyncratic hepatotoxicity far less well, because the data to learn from barely exists. Be honest about this. A generative pipeline that claims to optimize for ADMET is, in most cases, optimizing for a handful of well-measured proxies and hoping the rest correlate.

The practical posture: use ADMET models as early de-risking filters, not as gates you trust to pass a molecule. They are excellent at killing obviously bad chemotypes early and saving assay budget. They are poor at certifying that a survivor is clean. The wet lab remains the arbiter, and the role of the model is to make sure the wet lab spends its time on candidates that have a chance.
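One way to encode "filter, not gate" is to give the triage step only two outcomes: reject, or send to assay. There is deliberately no "pass" state for the model to award. The sketch below is illustrative, and the endpoint names and thresholds are hypothetical rather than recommended cutoffs.

```python
from enum import Enum


class Triage(Enum):
    REJECT = "reject"
    ASSAY = "send_to_assay"  # note: there is intentionally no PASS outcome


# Hypothetical endpoints and cutoffs: reject only when the model is confident
# that a liability is present.
REJECT_ABOVE = {"herg_block": 0.9, "cyp3a4_inhibition": 0.9}


def triage(predictions: dict[str, float], reject_above: dict[str, float] = REJECT_ABOVE) -> Triage:
    """Reject on a confidently predicted liability; otherwise defer to the wet lab.

    `predictions` maps endpoint name to a predicted liability probability in [0, 1].
    A missing prediction never clears a molecule; it just fails to reject it.
    """
    for endpoint, threshold in reject_above.items():
        p = predictions.get(endpoint)
        if p is not None and p >= threshold:
            return Triage.REJECT
    return Triage.ASSAY


flagged = triage({"herg_block": 0.95, "cyp3a4_inhibition": 0.10})
survivor = triage({"herg_block": 0.20, "cyp3a4_inhibition": 0.10})
unmodelled = triage({})
```

A survivor here has earned an assay slot, nothing more.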

What this means for building a real pipeline

Strip away the model architectures and a generative chemistry program is a data and orchestration problem, the same discipline we bring to any AI implementation for regulated industries. The generation step is a single stage in a multi-stage funnel, and the engineering value lives in the connective tissue: a clean store of assay results, versioned models, deterministic reruns, and a feedback path that pushes confirmed wet-lab outcomes back into the next training round. The teams winning here have built a proper data platform under the chemistry, with provenance on every molecule, every score, and every assay, and have treated operational automation of the design-score-filter loop as core infrastructure rather than a notebook someone runs by hand.

That is also where most efforts stall. The models are increasingly commodity; the disciplined loop around them is not. A generative model is only as good as the experimental feedback it learns from, and that feedback is worthless if you cannot trace which model version proposed which molecule that produced which assay number. Get the plumbing right and the chemistry compounds. Get it wrong and you have an expensive random number generator with a binding-affinity theme.
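The traceability requirement reduces to a join: every assay number should resolve to the model version that proposed the molecule, and anything that does not resolve should be surfaced, not silently dropped. A minimal in-memory sketch follows. The record types, field names, and values are hypothetical, and a real system would keep them in a database.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    molecule_id: str
    model_version: str


@dataclass(frozen=True)
class AssayResult:
    molecule_id: str
    assay: str
    value: float  # assay readout in whatever unit the assay reports


def trace(
    proposals: list[Proposal], results: list[AssayResult]
) -> tuple[dict[str, list[AssayResult]], list[AssayResult]]:
    """Group assay results by the model version that proposed each molecule.

    Returns (results_by_version, orphans), where orphans are results whose
    molecule cannot be traced back to any recorded proposal.
    """
    version_of = {p.molecule_id: p.model_version for p in proposals}
    by_version: dict[str, list[AssayResult]] = defaultdict(list)
    orphans: list[AssayResult] = []
    for r in results:
        version = version_of.get(r.molecule_id)
        if version is None:
            orphans.append(r)
        else:
            by_version[version].append(r)
    return dict(by_version), orphans


proposals = [Proposal("mol-001", "gen-v1"), Proposal("mol-002", "gen-v2")]
results = [
    AssayResult("mol-001", "binding", 120.0),
    AssayResult("mol-002", "binding", 35.0),
    AssayResult("mol-999", "binding", 8.0),  # no recorded proposal
]

by_version, orphans = trace(proposals, results)
```

The orphan list is the important output: an assay result with no provenance cannot be fed back into training with any confidence about what it is evidence for.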

The honest summary for 2026: generative chemistry has stopped being speculative and started shipping clinical candidates, but the model is the easy 20%. Docking discipline, synthesizability in the loop, and sober ADMET expectations are the 80% that decides whether anything you generate ever sees a flask.


Building a generative or computational chemistry pipeline and want the funnel — not just the model — engineered properly? Talk to our team. We build the data platforms and automation loops that turn model output into verifiable candidates.