AI Vaccine Design: From Sequence to Shot

How ML drives antigen selection, epitope and MHC-binding prediction, and mRNA sequence optimization — and where the pipeline breaks before a shot.

A vaccine is, from an engineering standpoint, a search problem wrapped in a manufacturing problem. You are searching a combinatorial space — which fragment of a pathogen to show the immune system, in what form, encoded by which nucleotides — and then you have to build the winner reliably and at scale. For most of the field’s history both halves were run by intuition and brute-force wet-lab screening. That is no longer true. Machine learning now drives the search, and in places it drives the build. This post walks the pipeline from antigen to encoded sequence, marks where the models are genuinely good, and is honest about where prediction still outruns protection.

The pipeline, stated plainly

Designing an immunogen is four coupled decisions. Which antigen, or part of an antigen, do you present? Which epitopes within it will the immune system actually see? Will those epitopes bind the right MHC molecules across a diverse human population? And, for an mRNA or DNA platform, what nucleotide sequence encodes the chosen protein while surviving long enough to be translated? Each decision has a learned model attached to it now, and the interesting failures happen at the seams between them.

The reason this matters beyond academic interest is pandemic preparedness. The bottleneck in 2020 was not manufacturing mRNA — that turned out to be fast. The bottleneck was design certainty: knowing which sequence to commit a billion doses to. Compress the design loop and you compress the time between a novel pathogen sequence appearing in a database and a candidate entering trials. That is the real prize, and it is an AI implementation problem as much as a biology one.

Epitope and MHC-binding prediction is the mature core

The most battle-tested part of the stack is predicting which short peptides bind major histocompatibility complex molecules. This is the gatekeeping step for T-cell immunity: a peptide that never gets presented on MHC is invisible to T cells, no matter how conserved or clever it looks on paper. The workhorse here is NetMHCpan, a pan-allele neural network that predicts peptide binding to any MHC class I molecule of known sequence. “Pan-allele” is the load-bearing word — it learns a shared representation across alleles and peptide lengths, so it can score allele-peptide pairs it never saw in training. That generalization is what makes it useful for populations whose HLA types are underrepresented in the training data.

The field has moved past hand-built motifs into protein-language-model territory. A systematic review of AI-driven epitope prediction in npj Vaccines catalogs the shift from position-specific scoring matrices to transformers and convolutional networks that rank B-cell and T-cell epitopes by immunogenicity, cross-strain conservation, and binding affinity in one pass. The scale of the underlying data is what changed: one recent effort assembled over 650,000 human HLA-peptide interactions to train T-cell epitope predictors, well beyond what any single assay campaign could produce.
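For orientation, here is roughly what the older position-specific scoring matrix approach looks like in code: slide a 9-residue window along a protein and sum per-position scores. This is a toy sketch, not NetMHCpan or any published predictor. The matrix `TOY_PSSM`, its weights, and the function names are all invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy matrix: 0-based position -> residue -> score.
# Only two "anchor" positions carry weight, and the values are made up.
TOY_PSSM: Dict[int, Dict[str, float]] = {
    1: {"L": 2.0, "M": 1.5, "V": 0.5},  # second residue of the 9-mer
    8: {"V": 2.0, "L": 1.5, "I": 1.0},  # last residue of the 9-mer
}


def score_peptide(peptide: str, pssm: Dict[int, Dict[str, float]]) -> float:
    """Sum per-position scores; anything absent from the matrix scores 0."""
    return sum(pssm.get(i, {}).get(aa, 0.0) for i, aa in enumerate(peptide))


def rank_9mers(
    protein: str, pssm: Dict[int, Dict[str, float]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Score every overlapping 9-mer and return the top_k by score."""
    k = 9
    peptides = [protein[i:i + k] for i in range(len(protein) - k + 1)]
    scored = [(p, score_peptide(p, pssm)) for p in peptides]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```

A matrix like this treats positions as independent and needs one matrix per allele, which is exactly the limitation that pan-allele networks were built to remove.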

Two cautions an engineer should internalize. First, binding is necessary but not sufficient. A peptide can bind MHC beautifully and still fail to provoke a T-cell response, because presentation depends on upstream proteasomal cleavage and transport that the binding model does not see. The honest pipelines chain a processing predictor in front of the binding predictor rather than treating affinity as destiny. Second, benchmark accuracy is dominated by well-studied alleles, many of them common in European-ancestry cohorts, so it overstates real-world accuracy. Performance degrades on alleles thin in the training set, which maps directly onto which populations a vaccine will protect. If you are building for global deployment, allele coverage is a fairness property, not a footnote.
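One way to make the coverage point measurable is to report it per population and watch the worst case, not the average. The sketch below assumes a single HLA locus and Hardy-Weinberg proportions, so the fraction of people carrying at least one covered allele is 1 - (1 - f)^2, where f is the summed frequency of covered alleles. The populations and frequencies are made up for illustration, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical allele frequencies per population (illustrative, not real data).
# Frequencies need not sum to 1; untracked alleles count as uncovered.
POPULATIONS: Dict[str, Dict[str, float]] = {
    "pop_A": {"HLA-A*02:01": 0.30, "HLA-A*01:01": 0.20, "HLA-A*24:02": 0.10},
    "pop_B": {"HLA-A*02:01": 0.05, "HLA-A*24:02": 0.35, "HLA-A*11:01": 0.25},
}


def locus_coverage(allele_freqs: Dict[str, float], covered: Set[str]) -> float:
    """Fraction of people with at least one covered allele at one locus,
    assuming two independent allele draws (Hardy-Weinberg)."""
    f = sum(freq for allele, freq in allele_freqs.items() if allele in covered)
    return 1.0 - (1.0 - f) ** 2


def worst_covered(
    populations: Dict[str, Dict[str, float]], covered: Set[str]
) -> Tuple[str, float]:
    """Return the population with the lowest coverage and that coverage."""
    per_pop = {name: locus_coverage(f, covered) for name, f in populations.items()}
    name = min(per_pop, key=per_pop.get)
    return name, per_pop[name]
```

With these toy numbers, a design whose epitopes bind only `HLA-A*02:01` and `HLA-A*01:01` covers 75% of `pop_A` and under 10% of `pop_B`, which is the gap a single headline benchmark number hides.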

Antigen and immunogen selection: generative, with guardrails

Choosing the antigen used to mean picking the obvious surface protein and hoping. The newer move is to design a multi-epitope immunogen — stitch the highest-value epitopes into a single construct that the immune system reads as one antigen. Generative models, including GAN-style architectures, now propose peptide sequences that satisfy several constraints at once: antigenic, predicted-immunogenic, MHC-binding, and crucially low in self-reactivity so you do not provoke autoimmunity. A broad review of deep learning across vaccine target selection, design, and characterization lays out how these constraints get folded into a single scoring objective.

This is where I get opinionated. Generation is the easy part; the guardrails are the product. A model that emits a thousand candidate immunogens is worthless without a filter that rejects the self-reactive, the unmanufacturable, and the structurally implausible. Structure prediction earns its keep here: fold a proposed immunogen and check that the epitopes you care about are surface-exposed and conformationally intact, not buried in a hydrophobic core where no antibody will ever reach them. The same operational-automation discipline we bring to any data pipeline applies: the value is in the validated reject path, not the generous propose path.
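A reject path worth trusting records why each candidate was dropped, so the filter itself can be audited. Here is a minimal sketch of that idea. The `Candidate` fields, the thresholds, and the reason labels are hypothetical stand-ins for whatever upstream predictors and structure checks a real pipeline would supply.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str
    immunogenicity: float    # hypothetical predictor output in [0, 1]
    self_similarity: float   # hypothetical similarity to human proteins in [0, 1]
    epitope_exposure: float  # hypothetical fraction of epitope residues exposed
    manufacturable: bool     # hypothetical process-feasibility flag


def reject_reasons(
    c: Candidate, max_self_similarity: float = 0.3, min_exposure: float = 0.6
) -> List[str]:
    """Hard constraints. Thresholds are illustrative, not recommendations."""
    reasons = []
    if c.self_similarity > max_self_similarity:
        reasons.append("self-reactive")
    if c.epitope_exposure < min_exposure:
        reasons.append("epitope buried")
    if not c.manufacturable:
        reasons.append("unmanufacturable")
    return reasons


def triage(
    candidates: List[Candidate],
) -> Tuple[List[Candidate], List[Tuple[Candidate, List[str]]]]:
    """Split into accepted (ranked by immunogenicity) and rejected with reasons."""
    accepted, rejected = [], []
    for c in candidates:
        reasons = reject_reasons(c)
        if reasons:
            rejected.append((c, reasons))
        else:
            accepted.append(c)
    accepted.sort(key=lambda c: c.immunogenicity, reverse=True)
    return accepted, rejected
```

The design choice is that hard constraints are never traded against the score: a candidate with excellent predicted immunogenicity and a buried epitope is rejected, not down-weighted.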

mRNA sequence optimization is where AI clearly wins

If one result should convince a skeptical engineer, it is mRNA sequence design. The protein you want fixes the amino acids, but the genetic code is redundant — most amino acids have several synonymous codons — so an astronomically large set of nucleotide sequences all encode the identical protein. Those sequences are not biologically equal. Some fold into floppy, unstable structures that degrade fast and translate poorly; others fold tight and persist.
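To put a number on "astronomically large", multiply the codon degeneracy of each residue. The sketch below uses the standard genetic code. The function names are mine, and the count covers only the coding region (no stop codon, no UTRs).

```python
import math

# Number of synonymous codons per amino acid in the standard genetic code.
DEGENERACY = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def synonymous_count(protein: str) -> int:
    """Exact number of nucleotide sequences encoding this protein."""
    return math.prod(DEGENERACY[aa] for aa in protein)


def log10_synonymous_count(protein: str) -> float:
    """Same quantity on a log10 scale, handy for long proteins."""
    return sum(math.log10(DEGENERACY[aa]) for aa in protein)
```

Even the three-residue peptide `LSR` has 6 x 6 x 6 = 216 encodings, and the count grows multiplicatively with length, which is why enumerating candidates is hopeless for a full-length antigen.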

LinearDesign, from Baidu Research and published in Nature in 2023, treats this as a joint optimization of mRNA stability and codon usage, borrowing lattice-parsing techniques from computational linguistics: finding the best sequence is analogous to finding the most likely sentence among similar-sounding alternatives. The reported results are hard to wave away. Optimized constructs improved mRNA half-life and protein expression, and raised antibody titre in mice by up to roughly 128-fold versus a codon-optimization baseline for COVID-19 and varicella-zoster constructs, with the spike sequence solved in about eleven minutes. One caveat on labels: the core of LinearDesign is a dynamic-programming search over a thermodynamic folding model, not a trained neural network. It is still the cleanest example in the whole pipeline of a formally specified, computable objective beating human heuristics on a metric that maps to a real outcome.

Why does this work so well when epitope prediction stays noisy? Because the objective is physical and well-defined — minimum free energy of folding plus codon adaptation — rather than a proxy for the messy biology of an immune response. The lesson generalizes: AI does best in vaccine design where the target is a computable physical quantity, and worst where the target is “will a human immune system respond,” which we cannot yet simulate end to end.
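The codon half of that objective is simple enough to write down. The codon adaptation index (CAI) is the geometric mean of per-codon weights, where each weight is a codon's usage relative to the most-used synonymous codon. The sketch below uses a made-up weight table, and `joint_score` is only an illustrative way to combine CAI with a folding energy, not the published formulation. The free energy itself would come from an RNA folding tool and is passed in as a plain number here.

```python
import math
from typing import Dict

# Hypothetical relative-adaptiveness weights (illustrative values only).
TOY_WEIGHTS: Dict[str, float] = {
    "GCC": 1.0, "GCT": 0.5,    # alanine codons
    "AAG": 1.0, "AAA": 0.25,   # lysine codons
}


def cai(coding_seq: str, weights: Dict[str, float]) -> float:
    """Geometric mean of codon weights over an in-frame coding sequence."""
    if not coding_seq or len(coding_seq) % 3 != 0:
        raise ValueError("sequence length must be a positive multiple of 3")
    codons = [coding_seq[i:i + 3] for i in range(0, len(coding_seq), 3)]
    log_sum = sum(math.log(weights[c]) for c in codons)
    return math.exp(log_sum / len(codons))


def joint_score(mfe_kcal_per_mol: float, cai_value: float, lam: float) -> float:
    """Illustrative combined objective; lower is better.
    A more negative folding energy and a higher CAI both reduce the score."""
    return mfe_kcal_per_mol - lam * math.log(cai_value)
```

Both terms are deterministic functions of the sequence, which is what makes the search well-posed. Nothing comparable exists yet for "will this provoke a protective response".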

Where the pipeline breaks

The seams are the problem. Each module is validated on its own benchmark, but a vaccine is the composition of all of them, and errors compound. A high-affinity predicted epitope that is never processed, presented on an allele your target population lacks, embedded in an immunogen that misfolds, encoded by an mRNA that degrades — every stage can pass its own test while the whole fails. There is no end-to-end benchmark that scores “sequence in, protective immunity out,” because that benchmark is a clinical trial, and clinical trials do not parallelize.
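A back-of-the-envelope way to see the compounding: if each stage were right independently with some probability, the chance that a candidate is right at every stage is the product. Independence is a simplifying assumption (real stage errors are correlated), and the 0.9 figure is a placeholder, not a measured accuracy.

```python
import math
from typing import Sequence


def end_to_end_success(stage_probs: Sequence[float]) -> float:
    """Probability that every stage is correct, assuming independent stages."""
    return math.prod(stage_probs)


# Four stages (processing, binding, folding, mRNA stability), each assumed
# correct 90% of the time: 0.9 ** 4, so roughly one candidate in three fails
# somewhere along the way.
four_stage = end_to_end_success([0.9, 0.9, 0.9, 0.9])
```

Four modules that each look fine on their own benchmark already leave the composed pipeline right only about two times in three under this toy model.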

The second break is distribution shift, which for pathogens is not a metaphor — they mutate. A model trained on yesterday’s variants will quietly lose calibration on tomorrow’s, and unlike a recommender system you do not get a fast feedback signal telling you it has drifted. Conservation-aware design helps; targeting epitopes that the pathogen cannot mutate without paying a fitness cost is the right instinct, but conservation itself is estimated from historical data and inherits its biases.
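Conservation is usually estimated column by column from an alignment of known variants. Here is a minimal sketch using Shannon entropy, where 0 bits means the position never varied in the sample. It assumes the sequences are already aligned to equal length, and the helper names are hypothetical. The caveat from the paragraph above applies directly: the estimate is only as good as the variants that happened to be sequenced.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy in bits of the residues observed at one position."""
    n = len(column)
    counts = Counter(column)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def conservation_profile(aligned: List[str]) -> List[float]:
    """Per-position entropy for equal-length, pre-aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return [column_entropy("".join(col)) for col in zip(*aligned)]


def most_conserved_window(profile: List[float], k: int) -> int:
    """Start index of the length-k window with the lowest total entropy."""
    starts = range(len(profile) - k + 1)
    return min(starts, key=lambda i: sum(profile[i:i + k]))
```

Low entropy in historical data is evidence of constraint, not proof of it. A position can look conserved simply because the lineages that mutate it were never sampled.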

The third is the one teams underweight: manufacturability and immunogenicity are sometimes in tension. The most stable mRNA fold or the tightest-binding immunogen may be the one that triggers innate immune sensing you did not want, or the one your process cannot transcribe cleanly. Optimizing any single objective to its limit tends to break a constraint you forgot to encode. Treat the whole thing as a constrained search with humans owning the constraints, the same way we treat a production hospital management system rollout, where the model proposes and validated rules dispose.

What to actually build

If you are standing up a vaccine-design capability in 2026, the unglamorous advice holds. Wire the modules into one pipeline with explicit interfaces and versioned models, so you can trace any candidate back to the exact predictors that scored it. Keep a held-out wet-lab validation set the models never train on, and treat agreement with it as your only trustworthy metric. Audit allele and strain coverage as a first-class concern, not a final check. And accept the division of labor: let the models do the brutal combinatorial search across epitopes, codons, and folds, and let immunologists own the questions the models cannot answer yet. The win is not autonomy. The win is a design loop measured in days instead of months, with humans still holding the parts that matter.
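Traceability can start very small: store every score next to the version of the model that produced it, and derive a stable fingerprint from the whole record. The sketch below is one possible shape for that record, using only the standard library. The field names, the version strings in the usage note, and the `leaked` helper for checking the held-out set are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class ScoredCandidate:
    sequence: str
    scores: Dict[str, float]         # predictor name -> score
    model_versions: Dict[str, str]   # predictor name -> version identifier

    def fingerprint(self) -> str:
        """Stable SHA-256 over the sequence, scores, and model versions."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def leaked(train: Iterable[str], held_out: Iterable[str]) -> Set[str]:
    """Sequences present in both sets; should be empty before any evaluation."""
    return set(train) & set(held_out)
```

For example, `ScoredCandidate("AUGGCC", {"binding": 0.8}, {"binding": "v1.2"}).fingerprint()` changes whenever the sequence, a score, or a model version changes, so two candidates with the same fingerprint were scored by the same predictors. Exact-match leakage checks are a floor, not a ceiling: near-duplicate sequences need a similarity threshold that this sketch does not attempt.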


Building a prediction-to-validation pipeline that won’t lie to you? That’s our wheelhouse — let’s talk.