Robot Foundation Models: Vision-Language-Action
RT-2, pi0, Gemini Robotics and Figure Helix made robot control a foundation-model problem. The hard part isn't the model; it's the missing data corpus.
For a decade, robot manipulation meant one engineered policy per task. Pick the part, walk to the conveyor, place the part — each behavior hand-tuned, each environment a fresh integration project. That world is ending. The frame has shifted to vision-language-action (VLA) models: single neural networks that take camera frames plus a natural-language instruction and emit motor commands directly. The robotics field is doing what NLP did in 2019 — collapsing a zoo of task-specific systems into one generalist policy.
We don’t build humanoids. We build the data pipelines and back-office systems around teams that do, and we deploy the same class of model in less glamorous places. So here’s the engineering reality of robot foundation models in 2026: the architecture is largely solved, and the bottleneck is data that physics refuses to give us for free.
What a VLA model actually is
The cleanest origin point is RT-2 from Google DeepMind, published in 2023. The trick is almost insultingly simple: treat robot actions as just another token vocabulary. RT-2 takes a pretrained vision-language model — PaLI-X or PaLM-E — and fine-tunes it so that gripper translations, rotations, and open/close commands are discretized into tokens the model predicts the same way it predicts words. Action generation becomes text generation.
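To make "actions as tokens" concrete, here is a minimal sketch of uniform binning: each continuous action dimension is clipped to a range, mapped to an integer bin index, and decoded back to the bin center. The bin count of 256, the normalized range of [-1, 1], and the 7-dimensional example command are assumptions for illustration, not a reproduction of RT-2's actual tokenizer.

```python
import numpy as np

N_BINS = 256            # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range


def actions_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer bin index."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)  # now in [0, 1]
    # The top edge (scaled == 1.0) would land in bin N_BINS, so cap it.
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)


def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Decode bin indices back to the center of each bin."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / N_BINS


# Hypothetical 7-D command: xyz translation, rpy rotation, gripper.
action = np.array([0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0])
tokens = actions_to_tokens(action)
decoded = tokens_to_actions(tokens)
max_error = np.abs(decoded - action).max()
```

The round trip is lossy by construction: the decoded value can differ from the original by up to half a bin width, which is the precision ceiling a token-based policy lives under.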
The payoff is transfer. Because the backbone absorbed internet-scale image and text data before it ever saw a robot, RT-2 inherited semantic knowledge that pure robot data could never supply. It could pick “the extinct animal” from a lineup of toys, or move an object toward a logo it had only ever read about. That emergent generalization to novel objects and instructions — not raw success rate — is the entire point of the foundation-model approach.
Physical Intelligence’s pi0 refined the recipe. Instead of discretizing actions into tokens, pi0 bolts a flow-matching “action expert” onto a PaliGemma backbone, generating continuous action chunks at 50Hz. That matters for dexterity: token-based action decoding is coarse, while flow matching produces the smooth, high-frequency trajectories that folding laundry or bagging groceries actually require. At roughly 3.3 billion parameters, pi0 is small by LLM standards and trained across seven robot platforms and 68 tasks.
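The sampling side of flow matching can be sketched in a few lines: start from Gaussian noise and integrate a velocity field toward an action chunk with a handful of Euler steps. In a real system the velocity comes from a trained network conditioned on images and language; here `toy_velocity` is a hypothetical analytic stand-in that points at one known target chunk, and the horizon, action dimension, and step count are assumptions.

```python
import numpy as np


def toy_velocity(x: np.ndarray, t: float, target: np.ndarray) -> np.ndarray:
    """Stand-in for the learned network (hypothetical, not a trained model).

    For straight-line paths x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t onto the target at t = 1 is (target - x_t) / (1 - t).
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(target: np.ndarray, n_steps: int = 10, seed: int = 0) -> np.ndarray:
    """Integrate the velocity field from noise (t=0) to an action chunk (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * toy_velocity(x, t, target)  # forward Euler step
    return x


horizon, action_dim = 50, 7  # assumed chunk length and action dimension
# Hypothetical target: every joint ramps smoothly from 0 to 1 over the chunk.
target_chunk = np.linspace(0.0, 1.0, horizon)[:, None] * np.ones((1, action_dim))
chunk = sample_action_chunk(target_chunk)
```

The output is a whole continuous trajectory segment, not one quantized step, which is where the smoothness comes from. The price is several network evaluations per chunk instead of one token decode.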
Then Gemini Robotics, released March 2025, brought a frontier VLM backbone to the problem and split the stack: Gemini Robotics-ER for embodied reasoning and spatial understanding, paired with an action model for low-level control. By June 2025 DeepMind shipped an on-device variant that runs locally on the robot — a tell that latency and connectivity, not just capability, now drive design.

The dual-system pattern is winning
The most interesting convergence in 2026 is architectural. Figure’s Helix made it explicit with a dual-system design borrowed straight from cognitive psychology. System 2 is an onboard VLM running at 7-9Hz: it reads the scene, parses the instruction, and plans. System 1 is a reactive visuomotor policy running at 200Hz that turns System 2’s latent intent into motor commands across 35 degrees of freedom — fingers, wrists, torso, head.
This split exists because of an unavoidable tension. The reasoning you want from a 7-billion-parameter VLM is too slow to close a control loop; a 200Hz motor policy is too dumb to understand “clear the table except the mug.” Running them as separate clocks, with the slow system conditioning the fast one through a latent vector, resolves it. You see echoes of the same idea in pi0’s action expert and in Gemini Robotics-ER feeding a control head. If you are designing an embodied system today, this is the default skeleton.
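The two-clock idea can be sketched as a loop in which the fast policy ticks every cycle and always reads the most recent latent, while the slow system refreshes that latent only occasionally. This is a single-threaded simulation for clarity; a real controller would run the two systems asynchronously. The class, the method names, and the 8Hz slow rate (picked from inside the 7-9Hz range above) are hypothetical.

```python
from dataclasses import dataclass, field

FAST_HZ = 200  # System 1 rate quoted in the text
SLOW_HZ = 8    # assumed value inside the 7-9Hz range quoted in the text


@dataclass
class DualRateController:
    latent: int = 0        # stand-in for the latent intent vector
    slow_updates: int = 0
    commands: list = field(default_factory=list)

    def slow_step(self, tick: int) -> None:
        """Hypothetical System 2: refresh the latent from scene + instruction."""
        self.latent = tick  # placeholder for a VLM forward pass
        self.slow_updates += 1

    def fast_step(self, tick: int) -> None:
        """Hypothetical System 1: emit a command conditioned on the latest latent."""
        self.commands.append((tick, self.latent))


def run(seconds: float = 1.0) -> DualRateController:
    ctrl = DualRateController()
    ticks_per_slow = FAST_HZ // SLOW_HZ  # 25 fast ticks per slow update
    for tick in range(int(seconds * FAST_HZ)):
        if tick % ticks_per_slow == 0:
            ctrl.slow_step(tick)
        ctrl.fast_step(tick)  # the fast loop never waits on the slow one
    return ctrl


ctrl = run(1.0)
staleness = max(tick - latent for tick, latent in ctrl.commands)
```

The number to watch is `staleness`: how many fast ticks a motor command can lag behind the plan it is conditioned on. That lag is the design cost of letting the reasoning model run at its own pace.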
Crucially, the control path in these systems is moving onto embedded onboard compute with no cloud round-trip: Helix runs both of its systems on the robot’s own GPUs, and Gemini Robotics On-Device exists for the same reason. A robot that needs a network call to decide where to put its hand is a robot that freezes when the WiFi drops. The same discipline shows up in our own work: when we wire an inference service into an Operational Automation pipeline, we treat the network as an adversary, not an assumption.
The data bottleneck nobody can buy their way out of
Here is the uncomfortable truth underneath every impressive demo. The models are good. The data is not there.
LLMs work because the internet is a pre-existing, near-free corpus of trillions of tokens. Robotics has no equivalent. There is no internet-scale archive of paired observations and actions — what the robot saw, and what its joints did in response, at control frequency. That data only exists if someone records it on hardware, and recording it is slow, expensive, and embodiment-specific.
Open X-Embodiment, the 2023 effort that pooled 60 existing datasets from 34 research labs, is the field’s best attempt at a shared corpus. It gathers over a million trajectories across 22 robot embodiments. That sounds large until you notice it is also brutally imbalanced: a large majority of real trajectories come from a handful of robot types, mostly Franka and xArm arms. A policy trained on that distribution generalizes across tasks far better than it generalizes across bodies. Cross-embodiment transfer remains hard precisely because the data is lopsided.
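One common mitigation for a lopsided corpus is to reweight sampling so that rare embodiments show up more often in each training batch. The sketch below measures per-embodiment shares and computes inverse-frequency sampling weights. The function names and the 70/25/5 mix are made up for illustration and are not Open X-Embodiment statistics.

```python
from collections import Counter


def embodiment_shares(embodiments: list[str]) -> dict[str, float]:
    """Fraction of trajectories contributed by each embodiment."""
    counts = Counter(embodiments)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_sampling_weights(embodiments: list[str], temperature: float = 1.0) -> list[float]:
    """Per-trajectory sampling weights, normalized to sum to 1.

    temperature=1.0 gives every embodiment equal total probability;
    temperature=0.0 leaves the original (imbalanced) distribution untouched.
    """
    counts = Counter(embodiments)
    raw = [counts[e] ** -temperature for e in embodiments]
    norm = sum(raw)
    return [w / norm for w in raw]


# Hypothetical corpus: 70 Franka, 25 xArm, 5 trajectories from a rarer arm.
corpus = ["franka"] * 70 + ["xarm"] * 25 + ["rare_arm"] * 5
shares = embodiment_shares(corpus)
weights = balanced_sampling_weights(corpus)

# Total sampling probability per embodiment after reweighting.
mass = Counter()
for name, w in zip(corpus, weights):
    mass[name] += w
```

Reweighting changes what the model sees per batch, not what exists in the corpus. Five trajectories sampled fourteen times more often are still five trajectories, so this is a patch on the imbalance, not a substitute for collecting data on the underrepresented body.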
So where does the data come from? Three sources, each with a tax.
Teleoperation
A human drives the robot through the task while every frame and joint command is logged. This yields the highest-quality demonstrations — clean, on-distribution, correctly labeled by construction. Helix was trained on roughly 500 hours of teleoperated demonstrations, augmented with a VLM that auto-generated hindsight instructions. The catch is obvious: teleoperation costs operator-hours linearly. There is no shortcut where one human generates a thousand demos an hour. This is why a teleop operation is, in headcount terms, closer to a data-labeling floor than a robotics lab.
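The linear cost is easy to put numbers on. The sketch below is back-of-envelope arithmetic with a hypothetical function and made-up inputs (30-second episodes, 15-second resets, a 90% keep rate), not figures from any of the projects above.

```python
def teleop_operator_hours(
    episodes: int,
    episode_seconds: float,
    reset_seconds: float,
    success_rate: float,
) -> float:
    """Operator-hours needed to collect `episodes` usable demonstrations.

    Failed attempts still cost time, so the number of attempts is
    episodes / success_rate, and each attempt costs demo time plus scene reset.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    attempts = episodes / success_rate
    return attempts * (episode_seconds + reset_seconds) / 3600.0


# Hypothetical inputs: 1,000 usable demos, 30 s each, 15 s reset, 90% kept.
hours_1k = teleop_operator_hours(1_000, 30.0, 15.0, 0.9)
hours_2k = teleop_operator_hours(2_000, 30.0, 15.0, 0.9)
```

Doubling the target doubles the hours; there is no term in the formula that bends the curve. The only levers are shorter resets, a higher keep rate, or more operators.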
Simulation
Generate trajectories in a physics engine, where you can run thousands of environments in parallel for the cost of GPU time. The catch is the reality gap: a policy that is flawless in sim can fail on hardware because the simulator’s friction, mass, and latency don’t match the real robot. Sim is essential for scale and useless without discipline. The reality gap is enough of a problem that it deserves its own treatment, which we give it separately.
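One widely used form of that discipline is domain randomization, a technique not covered in this article: instead of trusting one set of simulator parameters, sample them per episode so the policy cannot overfit to a single, wrong guess. The sketch below randomizes the three quantities named above. The parameter ranges, the `SimParams` class, and the function name are assumptions, and no particular physics engine is implied.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float    # contact friction coefficient
    mass_kg: float     # mass of the manipulated object
    latency_ms: float  # delay between command and actuation


# Assumed ranges; in practice these are bracketed around hardware measurements.
RANGES = {
    "friction": (0.4, 1.2),
    "mass_kg": (0.05, 0.5),
    "latency_ms": (5.0, 40.0),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one independent, uniformly sampled parameter set for an episode."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)  # seeded so a training run can be reproduced
episode_params = [sample_sim_params(rng) for _ in range(1000)]
```

Each `SimParams` would be handed to the simulator at episode reset. Ranges that are too narrow leave the gap in place, and ranges that are too wide make the task harder than reality, so the ranges themselves become something to tune and version.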
Human video
The dream is to learn from the billions of hours of humans doing tasks on the open web. The problem is that video shows the what, not the how — no joint torques, no gripper state, no action labels, and a human hand that doesn’t map to a parallel-jaw gripper. It is a weak signal that helps pretraining and cannot, on its own, teach control.

Action representation is a real design choice
One decision quietly shapes everything downstream: how you represent actions. RT-2’s discretize-into-tokens approach is elegant because it reuses the language-model machinery wholesale — the same decoder, the same loss, the same training stack. But discretization is lossy. Bin a continuous joint trajectory into tokens and you cap the precision and smoothness of what the policy can express, which is fine for coarse pick-and-place and painful for anything dexterous.
Flow matching, as in pi0, and diffusion-style action heads exist precisely to recover that lost resolution. They model the action distribution continuously and generate short trajectory chunks rather than single steps, which also helps with temporal consistency — the robot commits to a coherent motion instead of re-deciding every timestep. The cost is a more complex training objective and an inference step that is heavier than a single token decode. For a 50Hz control loop that heaviness is a real constraint, which is part of why the dual-system split exists: keep the expensive generative head off the tightest loop. There is no free answer here. Pick token actions when your tasks are coarse and you want to ride the LLM stack; pick a continuous head when dexterity is the product.
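The timing trade-off behind action chunks is simple arithmetic. If the policy emits a chunk and re-plans only every few actions, the generative head gets that many control periods to produce the next chunk, assuming inference runs concurrently with execution. The function name and the numbers (a 50-action chunk, re-planning every 25 actions) are illustrative assumptions.

```python
def chunk_timing(control_hz: float, chunk_len: int, replan_every: int) -> tuple[float, float]:
    """Return (chunk_duration_s, inference_budget_s).

    chunk_duration_s: how much wall-clock time one chunk of actions covers.
    inference_budget_s: time available to compute the next chunk if the
    policy re-plans after executing `replan_every` actions.
    """
    if not 0 < replan_every <= chunk_len:
        raise ValueError("replan_every must be in [1, chunk_len]")
    period = 1.0 / control_hz
    return chunk_len * period, replan_every * period


# Assumed setup: 50Hz control, 50-action chunks, re-plan after 25 actions.
chunk_s, budget_s = chunk_timing(50.0, 50, 25)

# Same loop without chunking: a fresh decision every control step.
_, per_step_budget_s = chunk_timing(50.0, 1, 1)
```

Under these assumptions the chunked policy has half a second per inference call, while the per-step policy has 20 milliseconds. That gap is what makes a multi-step generative head viable at all, at the cost of reacting more slowly to surprises mid-chunk.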
What this means if you’re building
A pattern worth internalizing: pretraining changes the economics of the teleop budget. A policy that starts from cross-embodiment pretraining plus simulation can converge on a new task with a few hundred to a couple thousand teleop episodes. The same policy trained from scratch can need an order of magnitude more. The foundation model isn’t replacing data collection; it’s making each collected episode count for more. That is the same argument we make for transfer learning in any AI Implementation: the base model multiplies the value of your scarce, expensive, in-domain examples.
Two engineering takeaways stand out.
First, own your data pipeline before your model. The teams shipping in 2026 treat demonstration capture, labeling, versioning, and replay as first-class infrastructure — the same rigor a serious Data Platforms group applies to a feature store. A VLA model is only as good as the trajectory dataset behind it, and that dataset is a living asset that needs schema, lineage, and quality gates. The same instinct that makes a Hospital Management System trustworthy — every record traceable, every change auditable — is what separates a robot data lake from a folder of rosbags.
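As a rough sketch of what "schema, lineage, and quality gates" can mean for a single episode, the record below carries a schema version, the IDs of the episodes it was derived from, and a content hash, and a gate function returns the reasons an episode should be rejected. Every field name, threshold, and allowed source value here is a hypothetical choice, not an established format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

ALLOWED_SOURCES = {"teleop", "sim", "human_video"}  # assumed taxonomy


@dataclass(frozen=True)
class EpisodeRecord:
    episode_id: str
    embodiment: str
    instruction: str
    control_hz: float
    n_steps: int
    source: str
    schema_version: str
    parent_ids: tuple[str, ...] = ()  # lineage: episodes this one derives from


def quality_gate(ep: EpisodeRecord, min_steps: int = 10) -> list[str]:
    """Return the list of reasons to reject the episode (empty means pass)."""
    problems = []
    if ep.n_steps < min_steps:
        problems.append("too_short")
    if not ep.instruction.strip():
        problems.append("missing_instruction")
    if ep.control_hz <= 0:
        problems.append("bad_control_rate")
    if ep.source not in ALLOWED_SOURCES:
        problems.append("unknown_source")
    return problems


def content_hash(ep: EpisodeRecord) -> str:
    """Stable hash of the metadata, usable as a version key for audits."""
    payload = json.dumps(asdict(ep), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


good = EpisodeRecord("ep-0001", "franka", "put the mug on the shelf", 50.0, 400, "teleop", "1.0")
bad = EpisodeRecord("ep-0002", "franka", "  ", 50.0, 3, "teleop", "1.0", parent_ids=("ep-0001",))
```

Here `quality_gate(good)` returns an empty list and `quality_gate(bad)` returns `["too_short", "missing_instruction"]`. The point of returning reasons instead of a boolean is that rejection rates per reason become a metric you can track across dataset versions.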
Second, respect the embodiment boundary. A policy trained on one arm does not transfer cleanly to a different one; cross-embodiment is an active research frontier, not a deployment guarantee. Budget for fine-tuning on your exact hardware. The marketing implies generalist robots; the reality is generalist backbones that still need task- and body-specific adaptation.
The honest 2026 picture: VLA models have made robot software a foundation-model discipline, and the architectures — token actions, flow-matching experts, dual-system controllers — have largely converged. The constraint has moved entirely to data. Whoever builds the most efficient pipeline from physical interaction to training-ready trajectories, not whoever has the biggest model, wins the next round.
Building a robot data pipeline or evaluating a VLA stack for a real deployment? We design the trajectory infrastructure and inference services that sit under embodied AI. Talk to our engineers.