Protein Language Models: ESM and the Grammar of Life
ESM treats protein sequences like text, learning a grammar of life by masked prediction. What that buys you, what ESMFold and ESM3 deliver, and the limits.
A protein is a string. Twenty letters, the amino acids, written left to right from one end of the chain to the other. Stare at that fact long enough and an idea becomes irresistible: if it is a string, and we have spent a decade building models that learn the structure of strings, why not point those models at proteins? Train a transformer to predict masked amino acids the way we train one to predict masked words, and see what it learns.
That is the entire premise of protein language models, and the ESM family from Meta’s FAIR lab is the line that proved the premise pays. The bet was that evolution has written a grammar into protein sequences — rules about which residues can sit where, which combinations fold and which fall apart — and that a model trained only to fill in blanks would have to internalize that grammar to do the job. The bet was right. The interesting work is understanding exactly what “grammar of life” means here, what it buys you, and where the language metaphor quietly stops being true.
Masked prediction, no labels, no alignment
ESM-2 is a transformer trained with one objective: mask out a fraction of the amino acids in a sequence and predict them from the rest. The largest version reaches 15 billion parameters. The training-set figures often quoted for the family, 86 billion amino acids across roughly 250 million protein sequences, come from the earlier ESM-1b work; ESM-2 trains on tens of millions of sequences sampled from UniRef clusters, which span the breadth of known evolutionary diversity. No structural labels. No functional annotations. Just sequences and the fill-in-the-blank task, at scale.
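To make the fill-in-the-blank task concrete, here is a minimal sketch of the data-corruption step in plain Python. The 15% mask rate is an assumption (the conventional BERT-style value), the `<mask>` token string and the function name are illustrative, and real training recipes typically add refinements such as sometimes substituting a random residue instead of a mask token.

```python
import random

MASK = "<mask>"  # illustrative token string, not taken from any specific tokenizer

def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of residues; return corrupted tokens and the hidden answers."""
    rng = random.Random(seed)
    n_masked = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_masked))
    tokens = list(seq)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]  # what the model would be asked to predict
        tokens[i] = MASK        # what the model would actually see
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ")
```

The model's only supervision is `targets`: the loss rewards assigning high probability to the hidden residue at each masked position, given everything left visible in `tokens`.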
To predict a masked residue well, the model cannot memorize. It has to learn that two positions far apart in the sequence are constrained to vary together because they touch in the folded structure, that certain motifs imply certain local geometry, that some substitutions are tolerated and others are lethal. These are the regularities evolution enforces, and masked prediction is a forcing function that drags them into the model’s weights. The payoff is the same one that made language models useful: the internal representations — the embeddings — turn out to be a dense, transferable encoding of what a protein is.
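The "constrained to vary together" signal has a classical, explicit form: mutual information between two columns of a sequence alignment. The sketch below computes it on an invented four-sequence toy alignment. This is not what ESM computes, and the language model is never handed an alignment; it is only an illustration of the kind of statistical dependency that masked prediction pushes a model to pick up implicitly.

```python
import math
from collections import Counter

def column_mutual_information(sequences, i, j):
    """Mutual information in bits between alignment columns i and j."""
    n = len(sequences)
    pairs = Counter((s[i], s[j]) for s in sequences)
    left = Counter(s[i] for s in sequences)
    right = Counter(s[j] for s in sequences)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi

# Invented toy alignment: columns 1 and 3 swap K/E together, column 0 never changes.
toy = ["AKLE", "AELK", "AKLE", "AELK"]
coupled = column_mutual_information(toy, 1, 3)    # positions that co-vary
uncoupled = column_mutual_information(toy, 0, 1)  # a constant column carries no signal
```

In the toy data the coupled pair carries one full bit of information about each other, while the constant column carries none. Knowing the residue at one coupled position tells you exactly what to predict at the other, which is the advantage a masked-prediction model gains by learning the dependency.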
Embeddings are the real product
For most engineering purposes the headline capability of ESM is not any single prediction. It is the embedding: one vector per residue, or a single fixed-length vector for a whole protein (typically obtained by averaging the per-residue vectors), that captures structural and functional properties learned during pre-training. Feed those vectors into a small downstream model and you can predict stability, function, localization, or the effect of a mutation, with a fraction of the labeled data a from-scratch model would demand.
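A common way to get one fixed-length vector for a whole protein is to average its per-residue vectors, so proteins of any length map to the same dimensionality. The sketch below shows that pooling step on made-up three-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and mean pooling is one reasonable choice among several rather than a prescribed method.

```python
def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("need at least one residue embedding")
    dim = len(residue_embeddings[0])
    n = len(residue_embeddings)
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]

# Made-up 3-dimensional embeddings for a 4-residue protein (illustrative values only).
per_residue = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
protein_vector = mean_pool(per_residue)
```

The output length depends only on the embedding dimension, not on the sequence length, which is what lets a small downstream model consume proteins of different sizes.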
This is exactly how we use foundation-model embeddings in any AI implementation outside biology. You do not retrain the large model. You treat it as a feature extractor and put a light, task-specific head on top. A protein embedding from ESM plays the same role in a bioinformatics Data Platform that a text embedding plays in a document-search stack: the expensive, general representation is computed once, and a hundred cheap downstream tasks ride on it. If you take one transferable idea from the protein-LM world into your own systems, it is this — the embedding is the asset, not the prediction.
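The compute-once, reuse-many pattern can be sketched in a few lines. Everything here is a stand-in: `embed` is a hypothetical placeholder that returns two hand-crafted composition features instead of calling a real model, and the head weights are invented rather than trained. The point is the shape of the system, in which an expensive call sits behind a cache and several cheap heads share its output.

```python
from functools import lru_cache

def embed(sequence):
    """Hypothetical stand-in for the expensive foundation-model call."""
    embed.calls += 1
    hydrophobic = sum(sequence.count(a) for a in "AILMFVW")
    charged = sum(sequence.count(a) for a in "DEKR")
    return (hydrophobic / len(sequence), charged / len(sequence))

embed.calls = 0  # count how often the expensive step actually runs

@lru_cache(maxsize=None)
def cached_embed(sequence):
    """Compute the representation once per distinct sequence."""
    return embed(sequence)

def linear_head(weights, bias):
    """Build a cheap task-specific scorer on top of the shared embedding."""
    def predict(sequence):
        vec = cached_embed(sequence)
        return sum(w * x for w, x in zip(weights, vec)) + bias
    return predict

# Invented weights; in practice each head is fitted on a small labeled set.
stability_head = linear_head(weights=(1.0, -0.5), bias=0.0)
solubility_head = linear_head(weights=(-1.0, 1.0), bias=0.5)

seq = "MKKLLVAD"
scores = (stability_head(seq), solubility_head(seq))
```

Two tasks were scored, but `embed.calls` ends at 1. Adding a third or a hundredth head changes only the cheap part of the cost.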

ESMFold: structure from a single sequence
The most dramatic demonstration that ESM-2 had learned real structure is ESMFold, which predicts a 3D structure directly from one sequence. This is a sharp contrast with AlphaFold2, which depends on a multiple-sequence alignment: a search across evolutionary relatives assembled at query time. That search is slow, and it yields little signal for sequences with few known homologs.
ESMFold needs no alignment. The evolutionary signal AlphaFold2 gathers at runtime from an alignment, ESMFold has already absorbed into the language model’s weights during pre-training. That trade has a cost and a benefit. The benefit is speed: skipping the alignment search makes ESMFold roughly an order of magnitude faster by its authors’ account, and on the hardest sequences, orphans with no relatives to align, it can work where alignment-based methods have nothing to chew on. The cost is accuracy: on proteins with deep, rich alignments, ESMFold is generally a step below AlphaFold2.
That trade is precisely what made a particular project possible. Meta turned ESMFold loose on metagenomic sequences — DNA pulled from soil, oceans, and guts, encoding proteins from organisms no one has cultured — and built the ESM Metagenomic Atlas, hundreds of millions of predicted structures for the “dark matter” of the protein universe. You could only attempt that at the speed a single-sequence model allows. Run an alignment search for hundreds of millions of unknown proteins and you are still waiting next year.
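The scale argument is plain arithmetic, sketched below. Every number is a placeholder chosen for illustration, not a measurement of ESMFold or of any alignment pipeline: the per-sequence times, the worker count, and the sequence count are all assumptions, and the function assumes perfect parallelism.

```python
def wall_clock_days(n_sequences, seconds_per_sequence, n_workers):
    """Days of wall-clock time, assuming work splits perfectly across workers."""
    total_seconds = n_sequences * seconds_per_sequence / n_workers
    return total_seconds / 86_400  # seconds in a day

N_SEQUENCES = 300_000_000  # placeholder for "hundreds of millions"
N_WORKERS = 1_000          # placeholder cluster size

# Placeholder per-sequence costs: a fast single-sequence predictor
# versus a pipeline that first runs an alignment search.
fast_days = wall_clock_days(N_SEQUENCES, seconds_per_sequence=10, n_workers=N_WORKERS)
slow_days = wall_clock_days(N_SEQUENCES, seconds_per_sequence=600, n_workers=N_WORKERS)
```

With these placeholders the fast route finishes in about five weeks and the slow one in several years. Swap in your own measured per-sequence costs; the lesson is that a constant-factor speedup per sequence decides whether a project at this scale is feasible at all.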
Picking the right tool
The practical guidance is unglamorous and important. If you have a well-studied protein with many homologs and you need the best possible structure, alignment-based prediction is still the more accurate choice. If you are screening millions of sequences, or working with proteins that have no evolutionary neighbors, or you need embeddings rather than coordinates, the language-model route is the one that scales. This is a workload decision, not a quality ranking — the same posture we bring to any model selection, where the right answer depends on your throughput and your data, not on which model won a benchmark.
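That guidance can be written down as a small routing function. The thresholds are hypothetical knobs, not published cutoffs, and the "try-both" fallback for the middle ground is an assumption added here to make the function total; tune all of them against your own workload.

```python
def choose_route(n_sequences, n_homologs, need_embeddings,
                 bulk_threshold=100_000, deep_alignment=100):
    """Pick a prediction route from workload shape (thresholds are hypothetical)."""
    if need_embeddings:
        return "language-model"   # representations, not coordinates
    if n_sequences >= bulk_threshold:
        return "language-model"   # throughput dominates
    if n_homologs == 0:
        return "language-model"   # nothing to align against
    if n_homologs >= deep_alignment:
        return "alignment-based"  # deep alignment, accuracy dominates
    return "try-both"             # shallow alignment: compare the two outputs

single_well_studied = choose_route(n_sequences=1, n_homologs=5_000, need_embeddings=False)
metagenomic_screen = choose_route(n_sequences=2_000_000, n_homologs=0, need_embeddings=False)
```

Note that the function never asks which model is "better". Its inputs are throughput, data availability, and the kind of output you need.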

From reading to writing: ESM3
The newest turn takes the language metaphor to its logical end. If a model can read protein grammar well enough to predict masked residues, can it write fluently enough to compose new proteins? EvolutionaryScale — founded by members of the original ESM team — launched ESM3 in June 2024 as a generative model that reasons jointly over sequence, structure, and function. You can prompt it with partial information in any of those three channels — fix a few catalytic residues, specify a fold, name a function — and have it generate the rest.
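Prompting with partial information amounts to handing the model a template with holes. The sketch below builds such a template for the sequence channel only. The helper name, the positions, and the residues are invented for illustration, and the underscore as the mask character is an assumption about the prompt convention; structure and function prompts are not shown, and no model is called.

```python
MASK_CHAR = "_"  # assumed mask character for unspecified positions

def build_sequence_prompt(length, fixed_residues):
    """Return a prompt with fixed residues in place and all other positions masked."""
    prompt = [MASK_CHAR] * length
    for position, residue in fixed_residues.items():
        if not 0 <= position < length:
            raise ValueError(f"position {position} is outside a length-{length} prompt")
        prompt[position] = residue
    return "".join(prompt)

# Hypothetical design brief: pin three catalytic residues, leave the rest open.
prompt = build_sequence_prompt(12, {2: "H", 5: "D", 9: "S"})
```

A generative model would then fill the masked positions, conditioned on the pinned residues and on whatever was specified in the other channels.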
The proof point made the rounds for good reason. The team prompted ESM3 to design a new green fluorescent protein and produced esmGFP, a working fluorescent protein that shares only about 58% of its sequence with the closest known fluorescent protein. The authors estimate that a gap of that size would take on the order of 500 million years of natural evolution to traverse. A model trained on the proteins evolution did produce had generated a functional one evolution had not, sitting far outside the explored sequence space. That is the writing capability made concrete, and it is a genuinely different thing from prediction.
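Sequence distance claims are usually stated as percent identity, and identity and difference are complements that are easy to swap by accident. The sketch below computes identity for two pre-aligned toy sequences (invented here, not esmGFP). Skipping gapped columns in the denominator is one convention among several, so treat that choice as an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of non-gap aligned columns where the two residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # convention assumed here: gapped columns are not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

# Invented toy pair differing at three of ten positions.
identity = percent_identity("MKTAYIAKQR", "MKSAYLAKHR")
difference = 100.0 - identity
```

For the toy pair, identity is 70% and difference is 30%. When reading a headline number, check which of the two is being reported.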
Scale was the unlock, and the warning
One more thing the ESM line established is worth keeping in view, because it is a result about scale, not just about proteins. The capabilities (accurate single-sequence structure, embeddings that transfer cleanly to downstream tasks) were weak at small model sizes and sharpened steadily as the models grew, the same scaling behavior the text-LM world documented. ESM-2 had to reach into the billions of parameters before single-sequence structure prediction became genuinely useful. That is the encouraging half.
The cautionary half is that scale bought fluency in the training distribution, and nothing more. A bigger model is a better mimic of the proteins evolution has already explored; it is not a better physicist. The regularities it captures are statistical, learned from sequences that survived selection, and off that beaten path the model can grow more confident without growing more correct. Anyone porting the scaling lesson into their own domain should carry both halves: more parameters and more data reliably buy better performance on the distribution you trained on, and they promise nothing about ground truth you never showed the model.
Where the metaphor breaks
It is worth being precise about the limits, because “language model for proteins” is a metaphor and metaphors mislead when you forget they are metaphors.
A protein is not actually a sentence. Its meaning is its three-dimensional fold and its dynamics in a watery, crowded cellular environment, not a sequence of tokens. The language model never sees that environment. It sees co-occurrence statistics in evolved sequences and infers structure indirectly. That is why these models are excellent at the regularities evolution has explored repeatedly and far shakier off that manifold — generating a sequence that looks statistically plausible is not the same as one that folds, stays folded, and works in a cell. The model’s fluency is a fluency in the patterns of the training distribution, and proteins that matter often live at its edges.
The second limit is the one that ends every honest discussion of computational biology: a confident prediction is a hypothesis, not a result. An embedding that scores a mutation as stabilizing, a structure predicted from a single sequence, a generated protein that the model is sure about — each is a starting point for an experiment, not a substitute for one. The metaphor that proteins are a language is productive precisely up to the point where you remember that languages do not have to fold, fail, or kill a cell, and proteins do. Use the models for what they are unreasonably good at — turning sequences into rich, cheap, transferable representations at scale — and keep the wet lab as the arbiter of truth.