De Novo Protein Design with Diffusion Models
RFdiffusion generates backbones, ProteinMPNN writes the sequence, AlphaFold2 filters. Inside the loop that actually ships designed binders and enzymes.
For most of its history, protein engineering meant editing what nature already wrote. You took an existing enzyme or antibody and mutated it, screened the variants, and kept the ones that worked better. The protein you started from constrained everything you could reach. Designing a brand-new protein from a blank backbone — one with no evolutionary ancestor — was the kind of thing the field talked about as a someday goal.
Someday arrived. The current generation of generative models designs proteins de novo: backbones that have never existed, folded to bind a chosen target or catalyze a chosen reaction. The engine of this shift is the same diffusion machinery now reshaping structure prediction, pointed in the opposite direction. Instead of going from sequence to structure, it generates a structure from random noise, and a second model then works out a sequence that will fold into it.
The three-model pipeline
The thing to understand first is that there is no single model that designs a protein. There is a pipeline of three, each doing a job the others cannot, and the discipline is in how they hand off.
RFdiffusion proposes the backbone
RFdiffusion, from the Baker Lab at the University of Washington, is a generative model built by fine-tuning the RoseTTAFold structure-prediction network on a denoising task. It starts from a cloud of noise (residues scattered in space) and iteratively denoises them into a coherent protein backbone. Crucially, it can be conditioned. Point it at the surface of a target protein and ask for a backbone that docks against a specific patch, and it generates folds shaped to that interface. Across topology-constrained monomer design, binder design, symmetric oligomers, and enzyme active-site scaffolding, it outperformed the prior generation of design tools, and one reported binder reached picomolar affinity by computation alone.
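As a rough intuition for "start from noise and iteratively denoise", the toy loop below refines random 3D points toward a backbone-like trace. It is not RFdiffusion: the real model uses the fine-tuned RoseTTAFold network to predict the denoised structure, whereas `oracle_denoiser` here is a hypothetical stand-in that simply returns a known answer. The function names, step count, noise scales, and helix-like target are all assumptions chosen for illustration.

```python
import math
import random

def rmsd(a, b):
    """Root-mean-square deviation between two equal-length point lists (no superposition)."""
    total = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(total / len(a))

def oracle_denoiser(noisy_coords, target):
    """Hypothetical stand-in for a trained network.
    A real model would predict the clean structure from noisy_coords;
    this toy ignores its input and returns the known target."""
    return target

def toy_reverse_diffusion(target, steps=50, seed=0):
    rng = random.Random(seed)
    # Start from pure noise: one random 3D point per residue.
    coords = [[rng.gauss(0.0, 10.0) for _ in range(3)] for _ in target]
    start = [list(p) for p in coords]
    for step in range(steps, 0, -1):
        guess = oracle_denoiser(coords, target)
        blend = 1.0 / step                  # move a growing fraction toward the guess
        jitter = 0.5 * (step - 1) / steps   # injected noise shrinks to zero at the last step
        coords = [
            [c + blend * (g - c) + rng.gauss(0.0, jitter) for c, g in zip(point, clean)]
            for point, clean in zip(coords, guess)
        ]
    return start, coords

# Helix-like trace of 20 points standing in for a backbone (illustrative geometry only).
target = [[2.3 * math.cos(1.745 * i), 2.3 * math.sin(1.745 * i), 1.5 * i] for i in range(20)]
start, final = toy_reverse_diffusion(target)
print(f"RMSD to target before: {rmsd(start, target):.2f}, after: {rmsd(final, target):.6f}")
```

The only point of the sketch is the shape of the loop: each step takes the current noisy coordinates, asks a denoiser for its best guess, and moves partway there while adding less and less fresh noise.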
What RFdiffusion produces is geometry only — a chain of backbone coordinates with no amino acids assigned. It has decided the shape. It has not decided what the protein is made of.
ProteinMPNN writes the sequence
That second question — given this backbone, what sequence of amino acids will actually fold into it — is what ProteinMPNN answers. It is a fast graph neural network that reads the backbone geometry and outputs a sequence predicted to adopt it, and it has largely displaced the older physics-based Rosetta approach for this step because it is faster and empirically more reliable. For designs that include DNA, ligands, or metal ions, LigandMPNN extends the same idea to reason about those non-protein neighbors when choosing residues.
The division of labor is clean and deliberate. RFdiffusion owns the fold. ProteinMPNN owns the chemistry. Keeping them separate means each can be swapped or upgraded without rebuilding the other. It is the same modularity we insist on in any Operational Automation pipeline, where the worst architectures are the ones that fuse two concerns into one untestable block.
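The swap-without-rebuilding property can be sketched with plain callables. Everything below is hypothetical scaffolding: the type aliases, the `design` function, and the stub stages do not correspond to the real RFdiffusion or ProteinMPNN interfaces, and the stubs return placeholder data rather than real designs.

```python
from typing import Callable, List, Tuple

Backbone = List[Tuple[float, float, float]]     # one coordinate per residue (simplified)
BackboneGenerator = Callable[[str], Backbone]   # target name -> backbone geometry
SequenceDesigner = Callable[[Backbone], str]    # backbone geometry -> amino-acid sequence

def design(target: str,
           generate: BackboneGenerator,
           write_sequence: SequenceDesigner) -> Tuple[Backbone, str]:
    """Run the two stages in order; neither stage knows how the other works."""
    backbone = generate(target)
    sequence = write_sequence(backbone)
    return backbone, sequence

# Placeholder stages (assumptions for illustration, not real models).
def stub_generator(target: str) -> Backbone:
    return [(float(i), 0.0, 0.0) for i in range(8)]

def poly_alanine_designer(backbone: Backbone) -> str:
    return "A" * len(backbone)

def alternating_designer(backbone: Backbone) -> str:
    return "".join("AL"[i % 2] for i in range(len(backbone)))

# Swapping the sequence stage leaves the backbone stage untouched.
bb1, seq1 = design("hypothetical_target", stub_generator, poly_alanine_designer)
bb2, seq2 = design("hypothetical_target", stub_generator, alternating_designer)
print(seq1, seq2)
```

Because the only contract between stages is the backbone data passed across the handoff, each stage can be tested on its own with a stub on the other side.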

AlphaFold2 filters the designs
Now you have thousands of candidate sequences, almost all of which are wrong. The third model is the filter. You take each designed sequence, run it through AlphaFold2 as if it were a natural protein, and ask: does the predicted structure match the backbone you designed? This self-consistency check is the workhorse of the whole field. Metrics like pLDDT, backbone RMSD, and the predicted aligned error across the interface turn out to track experimental binding well enough to rank designs, which means most of the filtering happens in silico, before anything touches a bench.
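A minimal sketch of that filter, assuming the three metrics have already been computed for each design. The cutoff values and the candidate numbers are made up for illustration; real campaigns tune thresholds per target, and nothing here calls AlphaFold2 itself.

```python
from dataclasses import dataclass

@dataclass
class DesignMetrics:
    name: str
    plddt: float          # predicted confidence, 0-100 (higher is better)
    backbone_rmsd: float  # angstroms between designed and predicted backbone (lower is better)
    interface_pae: float  # predicted aligned error across the interface (lower is better)

# Illustrative cutoffs only (assumptions, not values from the article).
PLDDT_MIN = 80.0
RMSD_MAX = 2.0
PAE_MAX = 10.0

def passes_self_consistency(m: DesignMetrics) -> bool:
    """Keep a design only if the predictor is confident AND agrees with the intended fold."""
    return (m.plddt >= PLDDT_MIN
            and m.backbone_rmsd <= RMSD_MAX
            and m.interface_pae <= PAE_MAX)

# Made-up candidates, each failing a different criterion except the first.
candidates = [
    DesignMetrics("design_a", plddt=91.2, backbone_rmsd=0.8, interface_pae=5.1),
    DesignMetrics("design_b", plddt=72.0, backbone_rmsd=1.1, interface_pae=6.0),
    DesignMetrics("design_c", plddt=88.5, backbone_rmsd=3.4, interface_pae=7.2),
    DesignMetrics("design_d", plddt=90.1, backbone_rmsd=1.2, interface_pae=14.8),
]

shortlist = [m.name for m in candidates if passes_self_consistency(m)]
print(shortlist)
```

Requiring all three conditions at once is a deliberately conservative choice: a confident prediction of the wrong fold, or the right fold with an uncertain interface, is rejected.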
This is the part that makes the economics work, and it is worth dwelling on because it is far from obvious. AlphaFold2 was trained to predict the structures of real proteins. Nothing guaranteed that it would also score synthetic ones usefully. That bet paid off, and it is what makes computational protein design a practical engineering discipline rather than a curiosity.
The validation loop is the whole point
Here is the part the press releases skip. None of these models knows whether a protein works. They predict geometry and self-consistency. Whether a binder actually binds, whether an enzyme actually catalyzes, is a wet-lab question, and the answer comes back as a yield number that humbles everyone.
The loop, stated plainly: generate thousands of designs, filter in silico down to a tractable shortlist, express the survivors in cells, and measure. Then feed what you learned back into the next round. A widely cited early binder campaign filtered roughly 70,000 designs computationally, screened around 3,400 of them experimentally by yeast display, and recovered seven real binders — a hit rate near 0.2%. That sounds dismal until you remember the alternative was a hit rate of essentially zero by any prior method.
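The funnel arithmetic from that campaign can be checked directly. The three counts are the rounded figures quoted above; the variable names are just labels for this sketch.

```python
# Rounded counts quoted in the text for the early binder campaign.
designs_generated = 70_000
designs_screened = 3_400
binders_recovered = 7

in_silico_pass_rate = designs_screened / designs_generated     # share surviving computational filtering
experimental_hit_rate = binders_recovered / designs_screened   # share of screened designs that bind
end_to_end_rate = binders_recovered / designs_generated        # binders per design generated

print(f"in silico pass rate:   {in_silico_pass_rate:.1%}")    # 4.9%
print(f"experimental hit rate: {experimental_hit_rate:.2%}")  # 0.21%
print(f"end-to-end rate:       {end_to_end_rate:.3%}")        # 0.010%
```

Read end to end, roughly one design in ten thousand became a confirmed binder, which is why the quality of the computational filter dominates the cost of a campaign.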
The trajectory since is the real story. As the filtering metrics improved, the number of designs you must physically test to find a good binder dropped by orders of magnitude, to the point where recent pipelines can screen fewer than 100 designs and still recover high-affinity binders. The models did not get better at designing proteins in some abstract sense. They got better at knowing which of their own designs to throw away. The win is in the filter, not the generator.
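To see why the hit rate drives bench cost, treat each tested design as an independent trial with the same success probability h. That is a simplifying assumption, not a claim about any real campaign. The chance of at least one hit in n tests is then 1 - (1 - h)^n, so reaching confidence p requires n >= ln(1 - p) / ln(1 - h). The 5% hit rate below is hypothetical, used only for contrast with the 0.2% figure from the early campaign.

```python
import math

def designs_needed(hit_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of tested designs giving at least `confidence`
    probability of one or more hits, under independent identical trials."""
    if not 0.0 < hit_rate < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("hit_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hit_rate))

print(designs_needed(0.002))  # 1497 tests at a 0.2% hit rate
print(designs_needed(0.05))   # 59 tests at a hypothetical 5% hit rate
```

Under this model, a 25-fold improvement in hit rate cuts the number of designs that must reach the bench by about the same factor, which matches the direction of the trend described above.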

Where it works and where it does not
Binders against a defined protein surface are the mature case: the field has produced and experimentally confirmed many, and the method now extends to designing antibodies, long considered the hardest class of binder to design because of their flexible loops. Enzymes are harder. Binding is a geometry problem; catalysis is a geometry-and-dynamics-and-chemistry problem, and arranging a backbone so that catalytic residues sit in exactly the right place to stabilize a transition state is a far steeper ask. Progress is real: the latest open-source generation of these models is now designing enzymes and DNA binders at the all-atom level, and runs an order of magnitude faster than its predecessor. But enzyme design remains the frontier, not the routine.
The honest framing for anyone scoping a program: treat de novo design as a way to dramatically widen the funnel of plausible candidates, not as a way to skip the funnel. The models change how many shots you get and how cheaply, not whether you have to take the shots.
There is also a sober operational point hiding in the success-rate numbers. A pipeline that recovers binders from fewer than 100 tested designs is only that efficient if your expression, purification, and binding assays are themselves reliable and fast. The computational half can propose candidates faster than most labs can characterize them, and the bottleneck quietly migrates from the model to the bench. Teams that invest only in the generative models and neglect the throughput of their validation loop end up with a backlog of untested designs and no way to close the feedback loop that made the models good in the first place. The loop is a system, and a system runs at the speed of its slowest stage.
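The slowest-stage point is easy to make concrete. The stage names and weekly rates below are hypothetical numbers picked for illustration, not measurements from any lab.

```python
def loop_throughput(stage_rates):
    """Sustained rate of a serial loop equals the rate of its slowest stage.
    Returns (bottleneck_stage_name, designs_per_week)."""
    bottleneck = min(stage_rates, key=stage_rates.get)
    return bottleneck, stage_rates[bottleneck]

def backlog_after(weeks, proposal_rate, characterization_rate):
    """Untested designs accumulated when proposals outpace characterization."""
    return max(0, proposal_rate - characterization_rate) * weeks

# Hypothetical designs-per-week capacity of each stage.
rates = {
    "generate_and_filter": 500,
    "express_and_purify": 48,
    "binding_assay": 96,
}

stage, rate = loop_throughput(rates)
backlog = backlog_after(12, rates["generate_and_filter"], rate)
print(f"bottleneck: {stage} at {rate}/week; backlog after 12 weeks: {backlog}")
```

With these made-up rates, doubling the speed of the generator changes nothing about how many designs get characterized; only raising the capacity of the slowest bench stage does.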
Why the order of the pipeline matters
It is tempting to think you could collapse these three models into one — a single network that takes a target and emits a finished, sequenced, validated binder. The field has not, and the reason is instructive. Each stage is trained on a different objective and fails in a different way, and keeping them separate means each failure is visible and correctable. RFdiffusion can propose a beautiful fold that no sequence will ever realize. ProteinMPNN can write a sequence that scores well yet folds into something subtly wrong. AlphaFold2 can be fooled into confidence by a design that exploits its blind spots. Chained together, with an independent check at each handoff, those failure modes partly cancel: a design has to survive three different models that are wrong in three different ways before it earns a place on the bench. A monolith would hide exactly the disagreements that the pipeline surfaces. This is a general principle of reliable systems, not a quirk of biology — when you can decompose a hard prediction into stages with independent error, do it, and put a gate between each stage rather than trusting one end-to-end black box.
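A back-of-the-envelope way to see why independent gates help: if each gate lets a flawed design through with some probability, and the gates err independently, the chance of slipping past all of them is the product of those probabilities. The gate labels and rates below are invented for illustration, and full independence is an idealization. Real models share some blind spots, which is why the cancellation is only partial.

```python
import math

def false_pass_all(false_pass_rates):
    """Probability a flawed design passes every gate, assuming independent errors."""
    return math.prod(false_pass_rates)

# Hypothetical per-gate chance of waving through a design that would fail at the bench.
gates = {
    "backbone_gate": 0.30,
    "sequence_gate": 0.20,
    "structure_prediction_gate": 0.10,
}

combined = false_pass_all(gates.values())
best_single = min(gates.values())
print(f"best single gate: {best_single:.1%}, all three chained: {combined:.1%}")
```

Under these assumed numbers the chain is far stricter than its best individual gate. If the gates' errors were strongly correlated, as they would be inside a single end-to-end model, the product would overstate the benefit.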
The engineering lesson generalizes
Step back from the biology and the architecture is one we deploy constantly. A generator proposes candidates. A learned filter scores them against a cheap proxy for the expensive ground truth. A small, costly validation step confirms the survivors and produces labeled data that sharpens the next round. The value is almost entirely in the filter and the feedback loop, not in the raw generative step that gets the headlines.
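The feedback half of that pattern can be sketched as a filter whose cutoff is recalibrated from validated outcomes. This is a toy under stated assumptions: the `recalibrate_threshold` helper, the precision target, and the labeled data are all hypothetical, and a real program would refit a richer model than a single score cutoff.

```python
def recalibrate_threshold(labeled, min_precision=0.7):
    """Given validated (filter_score, worked) pairs, return the lowest score cutoff
    whose survivors still meet the precision target, or None if no cutoff does.
    A lower cutoff keeps more candidates, so the lowest acceptable one is preferred."""
    best = None
    for cutoff in sorted({score for score, _ in labeled}, reverse=True):
        kept = [worked for score, worked in labeled if score >= cutoff]
        precision = sum(kept) / len(kept)
        if precision >= min_precision:
            best = cutoff
    return best

# Made-up results from one validation round: (proxy score, confirmed to work?).
round_one = [
    (0.9, True), (0.8, True), (0.7, False),
    (0.6, True), (0.5, False), (0.4, False),
]

cutoff = recalibrate_threshold(round_one)
print(f"cutoff for the next round: {cutoff}")
```

Each validation round adds labeled pairs, so the cutoff applied to the next batch of generated candidates reflects what actually worked rather than what the proxy predicted.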
That pattern shows up far from a wet lab. A Data Platform that flags anomalous transactions, a School ERP that predicts which students are at risk, a forecasting service inside a logistics stack — all of them live or die on the quality of the filter and the tightness of the loop between prediction and confirmed outcome, not on the cleverness of the model that emits candidates. Protein design is a particularly vivid instance because the validation step is so visibly, physically expensive. It forces the discipline that cheaper domains let teams skip: a model is only as useful as the loop you build around it.
Generators are easy; filters and feedback loops are the engineering. We build the generate-score-validate pipelines that turn a model’s output into something you can trust.