Multimodal on a Power Budget: AI Inside Smart Glasses and Wearables
Engineer-to-engineer on the milliwatt constraints, sensor fusion, and on-device vs. offloaded inference behind always-on wearable AI.
The interesting engineering in wearable AI is not the model. It is everything around the model: the power rail, the thermal envelope against skin, the question of which milliwatt-hour you spend on perception and which you save for radio. A smart-glasses assistant that answers a visual question in two seconds is not a triumph of a language model. It is a triumph of plumbing — of moving the right bytes to the right compute at the right time, under a budget measured in single-digit milliwatts for the always-on path.
This is a domain where pdpspectra’s worldview holds almost literally: latency budgets are data problems, and the AI implementation is the easy part once the data path is honest. Let’s walk the stack from sensor to inference and look at where real products — Ray-Ban Meta, Apple Watch, the Qualcomm silicon underneath — make their tradeoffs.
The sensing stack feeding always-on AI
A head-worn device carries three sensor classes that matter for multimodal perception. The camera gives an egocentric view — the assistant sees roughly what you see, which is the entire reason natural visual question answering works at all. The microphone array does beamforming and voice activity detection. The IMU (accelerometer plus gyroscope) supplies head pose, gesture, and motion context that gates everything else. On the wrist, swap the camera for an optical PPG stack and skin-contact electrodes, but the architectural shape is identical: a few continuous low-rate streams and one expensive high-rate stream you only turn on when you must.
The cardinal rule is that you never run the expensive sensor speculatively. The camera is the power villain on glasses; the ISP and image pipeline dwarf the audio front end. So the design question becomes: what cheap signal can authorize the expensive one? A wake word from the mic. A deliberate head turn or tap from the IMU. A manual capture. Each of these is a gate, and the gates live on different silicon than the main application processor precisely so the main processor can stay asleep.

DSP islands and the always-on perception loop
The trick that makes “always-on” survive a battery the size of a coin is partitioning. Modern wearable platforms run continuous sensing on a low-power DSP or sensor-hub core while the high-performance CPU and NPU stay power-gated until something earns the wake-up; Qualcomm’s Hexagon DSP plays exactly this always-on-island role in its wearable and AR platforms. Sensor fusion, motion detection, gesture recognition, and a compact wake-word model all run on that low-power core, independently of the main application processor. Vendors like Sensory and Aspinity push the same idea further: wake-word and voice-activity detection that draws less than a milliamp, and in some analog front ends only microamps.
This is a staged-trigger architecture. Stage zero is an analog or micro-power MEMS path that decides “is this even speech.” Stage one is a small wake-word model on the DSP. Stage two wakes the NPU and, only now, the camera or the radio. Each stage has a tighter false-accept budget than the last, because each successive wake-up costs more energy than the one before it. Get the stage-one threshold wrong and you do not just get a worse user experience; you get a dead battery by lunch.
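A minimal sketch of the staged-trigger idea, assuming each gate exposes a confidence score and a threshold. The gate names, thresholds, and energy figures are hypothetical placeholders, not measurements from any product; the point is only that a rejection at an early gate stops the cascade before the expensive action is paid for.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Gate:
    name: str
    score: Callable[[dict], float]  # confidence in [0, 1] for this sample
    threshold: float                # minimum score needed to wake the next stage
    energy_mj: float                # hypothetical cost of evaluating this gate


def run_cascade(gates: List[Gate], sample: dict,
                action_energy_mj: float) -> Tuple[bool, float, str]:
    """Evaluate gates in order and stop at the first rejection.

    Returns (accepted, energy_spent_mj, name_of_last_stage_reached).
    """
    spent = 0.0
    for gate in gates:
        spent += gate.energy_mj
        if gate.score(sample) < gate.threshold:
            return False, spent, gate.name
    # Every gate passed: only now do we pay for the NPU, camera, or radio.
    return True, spent + action_energy_mj, "action"


# Hypothetical gates: later stages cost more and demand higher confidence.
GATES = [
    Gate("vad", lambda s: s["speech_prob"], threshold=0.5, energy_mj=0.001),
    Gate("wake_word", lambda s: s["wake_prob"], threshold=0.9, energy_mj=0.05),
]
ACTION_ENERGY_MJ = 50.0  # assumed cost of waking the NPU and camera

if __name__ == "__main__":
    print(run_cascade(GATES, {"speech_prob": 0.2, "wake_prob": 0.99}, ACTION_ENERGY_MJ))
    print(run_cascade(GATES, {"speech_prob": 0.8, "wake_prob": 0.95}, ACTION_ENERGY_MJ))
```

In this toy setup a non-speech sample costs only the first gate's energy, while a false accept at the wake-word gate costs roughly a thousand times more, which is why the later threshold is the one worth tuning hardest.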
The engineering discipline here will look familiar to anyone who has built a real-time data platform. It is backpressure and admission control, applied to electrons. You are deciding, continuously, what is allowed to consume the next unit of a scarce resource.
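One way to make "admission control, applied to electrons" concrete is a token bucket whose tokens are millijoules. This is a sketch under assumed numbers, not a description of how any shipping device budgets energy; `EnergyBucket`, its capacity, and its refill rate are all hypothetical.

```python
class EnergyBucket:
    """Token bucket where tokens are millijoules of discretionary energy."""

    def __init__(self, capacity_mj: float, refill_mw: float):
        self.capacity_mj = capacity_mj
        self.refill_mw = refill_mw      # 1 mW refills 1 mJ per second
        self.level_mj = capacity_mj     # start with a full budget
        self.last_t_s = 0.0

    def try_admit(self, cost_mj: float, now_s: float) -> bool:
        """Admit an expensive operation only if the budget covers it."""
        elapsed = max(0.0, now_s - self.last_t_s)
        self.level_mj = min(self.capacity_mj,
                            self.level_mj + elapsed * self.refill_mw)
        self.last_t_s = now_s
        if cost_mj > self.level_mj:
            return False                # rejected: this is the backpressure
        self.level_mj -= cost_mj
        return True


if __name__ == "__main__":
    bucket = EnergyBucket(capacity_mj=100.0, refill_mw=2.0)
    print(bucket.try_admit(60.0, now_s=0.0))   # budget is full, admitted
    print(bucket.try_admit(60.0, now_s=1.0))   # only ~42 mJ left, rejected
    print(bucket.try_admit(60.0, now_s=10.0))  # refilled to 60 mJ, admitted
```

The refill rate plays the role of the sustainable average draw, and the capacity bounds how large a burst the system will tolerate at once.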
The hard constraints: power, thermal, battery
Three constraints dominate, and they fight each other.
Power. The always-on path has to live in a tiny envelope — think single-digit milliwatts averaged — because anything more drains a wearable cell before the day is out. The instant you fire the camera and NPU, draw spikes by orders of magnitude. So the entire system is designed around duty cycle: keep the average low by keeping the expensive states rare and short.
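The duty-cycle argument is simple arithmetic, sketched here with illustrative numbers. The 3 mW idle and 1000 mW burst figures are assumptions chosen to show the shape of the problem, not measured values for any device.

```python
from typing import List, Tuple


def average_power_mw(states: List[Tuple[float, float]]) -> float:
    """Time-weighted average power.

    states: (power_mw, fraction_of_time) pairs; fractions must sum to 1.
    """
    total = sum(fraction for _, fraction in states)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(power * fraction for power, fraction in states)


def max_burst_duty(target_avg_mw: float, idle_mw: float, burst_mw: float) -> float:
    """Largest fraction of time the burst state may run under a target average.

    Solves idle*(1 - d) + burst*d <= target for d.
    """
    if burst_mw <= idle_mw:
        raise ValueError("burst power must exceed idle power")
    if target_avg_mw <= idle_mw:
        return 0.0
    return min(1.0, (target_avg_mw - idle_mw) / (burst_mw - idle_mw))


if __name__ == "__main__":
    # Assumed: 3 mW always-on path, 1000 mW with camera and NPU active.
    print(average_power_mw([(3.0, 0.99), (1000.0, 0.01)]))
    print(max_burst_duty(target_avg_mw=5.0, idle_mw=3.0, burst_mw=1000.0))
```

With these assumed figures, spending just 1% of the time in the burst state lifts the average to about 12.97 mW, and holding a 5 mW average allows the burst state only about 0.2% of the time.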
Thermal. A glasses temple or a watch back is pressed against skin. There is no fan, no heatsink worth the name, and a hard comfort ceiling on surface temperature. Sustained inference generates heat the device cannot shed, which means you are throttled not by compute but by thermals. This is why on-device large-model inference on glasses is bursty by necessity: you can sprint, you cannot run a marathon against bare skin.
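To see why sprinting works and a marathon does not, here is a first-order lumped thermal model stepped with forward Euler. It is a deliberately crude sketch: the skin-side ambient temperature, thermal resistance, time constant, and 40 °C comfort ceiling are all hypothetical values picked for illustration.

```python
from typing import List


def simulate_surface_temp(power_w_per_step: List[float],
                          ambient_c: float = 32.0,
                          r_th_c_per_w: float = 10.0,
                          tau_s: float = 60.0,
                          dt_s: float = 1.0) -> List[float]:
    """Forward-Euler simulation of dT/dt = (T_ambient + P*R_th - T) / tau."""
    temp = ambient_c
    trace = []
    for power in power_w_per_step:
        steady_state = ambient_c + power * r_th_c_per_w
        temp += dt_s * (steady_state - temp) / tau_s
        trace.append(temp)
    return trace


CEILING_C = 40.0                                # assumed comfort limit
SUSTAINED = [1.0] * 600                         # 1 W for ten minutes
BURSTY = ([1.0] * 10 + [0.0] * 50) * 10         # 10 s sprint, 50 s rest

if __name__ == "__main__":
    print(max(simulate_surface_temp(SUSTAINED)) > CEILING_C)  # exceeds ceiling
    print(max(simulate_surface_temp(BURSTY)) < CEILING_C)     # stays under it
```

Under these assumptions the same peak power is fine in short bursts and unacceptable when sustained, which is the throttling behavior described above.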
Battery. The cell is small because the form factor demands it, and energy density has not magically improved. Every architectural decision — offload or not, which model size, how often to sample — is ultimately a negotiation with that fixed coulomb count.
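As a rough worked example of that negotiation, runtime is stored energy divided by average draw. The cell capacity, nominal voltage, and power figures below are hypothetical, chosen only to show the scale of the effect.

```latex
t_{\text{runtime}} = \frac{C \cdot V_{\text{nom}}}{P_{\text{avg}}}
\qquad
C = 150\,\text{mAh},\; V_{\text{nom}} = 3.8\,\text{V}
\;\Rightarrow\; E = 570\,\text{mWh}

P_{\text{avg}} = 20\,\text{mW} \;\Rightarrow\; t_{\text{runtime}} = \frac{570}{20} = 28.5\,\text{h}
\qquad
P_{\text{avg}} = 60\,\text{mW} \;\Rightarrow\; t_{\text{runtime}} = \frac{570}{60} = 9.5\,\text{h}
```

Tripling the average draw takes this assumed cell from more than a day to less than a working day, before accounting for conversion losses or battery aging.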
These three turn every feature into a budget conversation. That is not a constraint to lament; it is the actual design space.
On-device vs. offloaded inference
Here is the central tradeoff, and it is genuinely a tradeoff — no free lunch.
Meta’s own engineering writeups on Ray-Ban Meta describe a multi-tier architecture: some perception runs on the glasses, the paired phone does real work, and heavier multimodal reasoning goes to the cloud. The reported shape is that on-device commands like photo capture return in well under a second, while a full cloud-backed multimodal answer lands in a few seconds after hopping glasses → phone → cloud and back. That gap is the whole story. On-device buys you latency and privacy; offload buys you model capacity at the cost of both.
The Qualcomm AR1 and AR1+ Gen 1 platforms push the on-device frontier — their NPUs can run small language models such as Llama-3.2-1B directly on the glasses, which keeps simple assistance off the network entirely. But a 1B-parameter model is not a frontier model. So the honest architecture is a router: trivial, latency-sensitive, or privacy-sensitive requests stay local; anything needing real reasoning or fresh world knowledge gets shipped off. The hard part is the routing policy, and it is a data problem — you need to know, per request class, the on-device success rate, the tail latency, and the energy cost of each path. Without that telemetry you are guessing.
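A routing policy driven by per-class telemetry might look like the sketch below. Every name and number here is an assumption made for illustration: the `PathStats` fields mirror the three quantities named above (success rate, tail latency, energy), and the request classes and their statistics are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    success_rate: float      # fraction of requests answered acceptably
    p95_latency_ms: float    # tail latency for this request class
    energy_mj: float         # device-side energy, including radio for cloud


@dataclass(frozen=True)
class Request:
    request_class: str
    privacy_sensitive: bool
    latency_budget_ms: float


def route(req: Request, local: PathStats, cloud: PathStats,
          min_success: float = 0.9) -> str:
    """Pick 'local' or 'cloud' from measured per-class telemetry."""
    if req.privacy_sensitive:
        return "local"  # privacy-sensitive requests never leave the device

    def viable(stats: PathStats) -> bool:
        return (stats.success_rate >= min_success
                and stats.p95_latency_ms <= req.latency_budget_ms)

    local_ok, cloud_ok = viable(local), viable(cloud)
    if local_ok and cloud_ok:
        return "local" if local.energy_mj <= cloud.energy_mj else "cloud"
    if local_ok:
        return "local"
    if cloud_ok:
        return "cloud"
    # Neither path meets the bar: fall back to whichever succeeds more often.
    return "local" if local.success_rate >= cloud.success_rate else "cloud"


# Hypothetical telemetry table keyed by request class: (local, cloud).
TELEMETRY = {
    "capture_photo": (PathStats(0.95, 300.0, 40.0), PathStats(0.99, 2500.0, 120.0)),
    "visual_qa": (PathStats(0.40, 900.0, 200.0), PathStats(0.92, 3000.0, 150.0)),
}

if __name__ == "__main__":
    for req in (Request("capture_photo", False, 1000.0),
                Request("visual_qa", False, 5000.0),
                Request("visual_qa", True, 5000.0)):
        print(req.request_class, route(req, *TELEMETRY[req.request_class]))
```

The policy itself is a few comparisons; the work is in keeping the telemetry table honest, since stale or missing statistics turn the router back into a guess.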
This is also where privacy stops being a slogan and becomes a system property. Meta’s design reportedly waits for a complete utterance before sending anything to the cloud, while still doing predictive on-device processing during the utterance for speed. That is the right instinct: do the speculative, cheap work locally, and make the network hop a deliberate, auditable event. For a wearable feeding regulated systems — health telemetry flowing into a Hospital Management System, or campus-safety signals into a School ERP — that boundary is exactly where your compliance and observability hooks belong.

Visual question answering and live translation pipelines
Two flagship multimodal features show the pipeline clearly.
For visual question answering, the egocentric camera capture is the anchor. A frame (or a short burst) is captured on wake, run through the ISP, optionally pre-processed on-device, and paired with the transcribed query. Because the camera shares the user’s viewpoint, the prompt can be deictic — “what is this,” “how much sun does this plant need” — and the shared context just works. The optimization that matters is predictive: begin processing intent before the utterance finishes, pre-load models, and pipeline the vision and speech paths so they overlap rather than queue.
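The overlap-rather-than-queue point can be sketched with `asyncio.gather`. The two coroutines below are stand-ins with made-up delays and return values; real capture and transcription would run on the ISP, DSP, or NPU, not in Python.

```python
import asyncio
import time


async def capture_and_encode_frame() -> str:
    """Stand-in for camera capture plus on-device vision pre-processing."""
    await asyncio.sleep(0.2)   # hypothetical vision-path latency
    return "frame_embedding"


async def transcribe_utterance() -> str:
    """Stand-in for streaming speech recognition of the spoken query."""
    await asyncio.sleep(0.3)   # hypothetical speech-path latency
    return "what is this"


async def build_vqa_request() -> dict:
    # Run both paths concurrently so total latency tracks the slower path,
    # not the sum of the two.
    frame, query = await asyncio.gather(capture_and_encode_frame(),
                                        transcribe_utterance())
    return {"frame": frame, "query": query}


def run_pipeline():
    start = time.perf_counter()
    request = asyncio.run(build_vqa_request())
    return request, time.perf_counter() - start


if __name__ == "__main__":
    request, elapsed_s = run_pipeline()
    print(request, round(elapsed_s, 2))
```

Run sequentially, these two stand-ins would take about half a second; overlapped, the elapsed time is close to the slower of the two.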
For live translation, the constraint is conversational latency. Meta has shipped real-time speech translation through the open-ear speakers, and the meaningful detail for engineers is the push toward offline translation. Offline means the ASR → MT → speech path runs locally, which removes the network tail entirely — the single biggest source of latency variance. You trade model quality for predictability, and in a face-to-face conversation predictability wins. This is the same lesson that shows up in every Operational Automation system we build: a fast, consistent answer beats a marginally better one that arrives unpredictably late.
Sensor fusion on the wrist
Wrist wearables are the cleaner illustration of fusion-as-architecture because the signals are weaker and noisier. Apple Watch’s irregular-rhythm notification leans on photoplethysmography — optical measurement of blood-flow timing between beats — and only escalates to the electrode-based single-lead ECG when the user deliberately takes a reading. That is a two-tier sensing design again: a cheap continuous optical signal that gates an expensive, higher-fidelity, user-initiated one. Apple is careful to frame it as detection that should prompt a clinician visit, not diagnosis, which is the correct honesty boundary for a consumer sensor.
The IMU does the unglamorous but critical work of motion artifact rejection. PPG is wrecked by movement; without accelerometer-informed gating you get false positives every time the wearer walks. Fusion here is not a buzzword — it is the difference between a usable signal and noise. Activity classification, fall detection, and energy expenditure estimates all fall out of the same accelerometer-plus-physiology fusion running on a low-power core, with the application processor untouched most of the day.
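The simplest form of accelerometer-informed gating is to discard PPG samples taken while the wrist is moving. The sketch below does only that, using deviation of the acceleration magnitude from 1 g as a crude motion proxy; the 0.15 g threshold is a hypothetical value, and production pipelines typically use more sophisticated artifact handling than a hard gate.

```python
import math
from typing import List, Sequence, Tuple

Accel = Tuple[float, float, float]  # (x, y, z) in units of g


def accel_magnitude_g(sample: Accel) -> float:
    ax, ay, az = sample
    return math.sqrt(ax * ax + ay * ay + az * az)


def gate_ppg(ppg_samples: Sequence[float],
             accel_samples: Sequence[Accel],
             motion_threshold_g: float = 0.15) -> List[float]:
    """Keep PPG samples only where the wrist looks approximately still.

    A still wrist reads close to 1 g (gravity alone), so the deviation of
    the acceleration magnitude from 1 g serves as a rough motion score.
    """
    kept = []
    for ppg, accel in zip(ppg_samples, accel_samples):
        motion = abs(accel_magnitude_g(accel) - 1.0)
        if motion < motion_threshold_g:
            kept.append(ppg)
    return kept


if __name__ == "__main__":
    ppg = [0.9, 1.0, 1.1, 0.95]
    accel = [(0.0, 0.0, 1.0), (0.5, 0.5, 1.2), (0.0, 0.05, 1.0), (1.0, 1.0, 1.0)]
    print(gate_ppg(ppg, accel))  # second and fourth samples are dropped
```

Even this blunt gate illustrates the architecture: the cheap inertial stream decides which slices of the optical stream are worth trusting.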
The plumbing-first takeaway
Step back and the pattern is consistent across glasses and watch: a cheap always-on tier that gates an expensive bursty tier, a routing decision between local and remote inference, and a relentless accounting of energy per request. None of that is model work. It is data-path engineering — the exact discipline pdpspectra brings to any AI implementation, on a wearable or in a data center.
Which means the non-negotiables travel unchanged. You need evals on the on-device models, because a wake-word false-accept rate is a measurable, regression-prone number, not a vibe. You need observability on the routing layer, because the local-vs-cloud split is where your latency and privacy guarantees live or die. And you need cost tracking, except here the currency is milliwatt-hours and skin temperature rather than dollars per token. Same plumbing, different units. Built to ship.
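Treating the wake-word false-accept rate as a regression-tested number could start as small as the sketch below. The function names, score lists, and limits are hypothetical; a real eval would run over many hours of recorded negative audio and a labeled set of true wake-word utterances.

```python
from typing import Sequence


def false_accepts_per_hour(negative_scores: Sequence[float],
                           threshold: float, audio_hours: float) -> float:
    """False accepts per hour on audio that contains no wake word."""
    if audio_hours <= 0:
        raise ValueError("audio_hours must be positive")
    accepts = sum(1 for score in negative_scores if score >= threshold)
    return accepts / audio_hours


def false_reject_rate(positive_scores: Sequence[float], threshold: float) -> float:
    """Fraction of true wake-word utterances the detector misses."""
    if not positive_scores:
        raise ValueError("need at least one positive utterance")
    misses = sum(1 for score in positive_scores if score < threshold)
    return misses / len(positive_scores)


def passes_gate(fa_per_hour: float, frr: float,
                max_fa_per_hour: float, max_frr: float) -> bool:
    """Release gate: both error rates must stay within their limits."""
    return fa_per_hour <= max_fa_per_hour and frr <= max_frr


if __name__ == "__main__":
    negatives = [0.1, 0.95, 0.3, 0.92, 0.2]   # detector scores, no wake word
    positives = [0.99, 0.85, 0.93, 0.97]      # detector scores, real wake word
    fa = false_accepts_per_hour(negatives, threshold=0.9, audio_hours=10.0)
    frr = false_reject_rate(positives, threshold=0.9)
    print(fa, frr, passes_gate(fa, frr, max_fa_per_hour=0.5, max_frr=0.3))
```

Wiring a check like `passes_gate` into CI for every model or threshold change is what turns the false-accept rate from a vibe into a tracked number.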
Sources: Engineering at Meta — building multimodal AI for Ray-Ban Meta glasses, Qualcomm Snapdragon AR1 Gen 1 platform, Engadget — Snapdragon AR1+ Gen 1 on-glass AI, Apple Support — heart health notifications on Apple Watch, Sensory — on-device wake-word technology.
Constraints are the spec, not the obstacle: bring us the power budget and the latency tail, and we’ll architect the data path that fits.