SRE for AI Systems
AI systems break in ways traditional SRE doesn't anticipate. The disciplines for site reliability engineering on production AI.
Site Reliability Engineering for AI systems is different from SRE for traditional services. AI systems have different failure modes (model drift, hallucination, cost spikes, eval regressions) and different recovery patterns. Treating AI like another service produces incident response that misses what’s actually wrong.
What follows are the SRE disciplines specific to AI.
Failure modes that SRE has to catch
Beyond traditional service health:
Quality drift. Model accuracy degrading over time without a code change.
Cost spikes. A misconfigured prompt that doubles token usage; a runaway agent loop; a vendor API that re-priced.
Hallucination in production. The model produces confidently wrong answers, and users complain.
Eval-to-production gap. Performance in eval doesn’t match production because production sees inputs the eval set missed.
Vendor degradation. Hosted model provider performance changes; effects cascade.
Memory bloat. RAG indexes growing; embeddings re-computed; storage cost climbing silently.
Compliance drift. Bias metrics shifting; audit gaps appearing.
Traditional SLOs (uptime, latency) don’t catch these.
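One of the cost failure modes above, the runaway agent loop, can be bounded in code before it reaches the bill. The following is a minimal sketch, not a definitive implementation: `RunBudget`, `BudgetExceeded`, `run_agent`, and the `step_fn` callable are all hypothetical names introduced for illustration, and the limits are placeholders.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent run uses up its step or token budget."""


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps: int = 0
    tokens: int = 0

    def record(self, tokens_used: int) -> None:
        """Account for one completed agent step."""
        self.steps += 1
        self.tokens += tokens_used

    def enforce(self) -> None:
        """Raise if another step would start with the budget already spent."""
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} reached")


def run_agent(step_fn, budget: RunBudget) -> RunBudget:
    """Call step_fn until it reports completion or the budget is spent.

    step_fn is a hypothetical callable returning (done, tokens_used).
    """
    while True:
        done, tokens_used = step_fn()
        budget.record(tokens_used)
        if done:
            return budget
        budget.enforce()
```

A run that never reports completion is stopped after `max_steps` steps or once `max_tokens` is reached, whichever comes first, so the failure shows up as an exception rather than as a cost alert hours later.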
The metrics that matter
AI-specific SLOs:
- Quality SLO. Eval score above threshold. Measured continuously on a production-representative sample.
- Cost-per-task SLO. Median and p99 within budget bands.
- Hallucination rate SLO. Tracked where it is measurable, for example citation accuracy for RAG or schema compliance for structured output.
- Latency SLO. Per-stage and end-to-end.
- Cache hit rate. A drop suggests the input distribution is changing.
The traditional uptime and latency SLOs still apply, but they’re not enough.
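To make the quality and cost-per-task SLOs concrete, here is a rough sketch of a check that could run on a sample of recent production traffic. The names `SloTargets`, `p99`, and `check_slos` are assumptions made for this example, as are the thresholds; the percentile uses `statistics.quantiles` from the Python standard library, which needs at least two data points.

```python
from dataclasses import dataclass
from statistics import median, quantiles


@dataclass(frozen=True)
class SloTargets:
    min_eval_score: float   # quality SLO: mean eval score must stay at or above this
    max_median_cost: float  # cost-per-task SLO: median budget (USD)
    max_p99_cost: float     # cost-per-task SLO: p99 budget (USD)


def p99(values):
    """99th percentile using the stdlib's inclusive quantile method."""
    return quantiles(values, n=100, method="inclusive")[98]


def check_slos(eval_scores, task_costs, targets: SloTargets):
    """Return the names of the SLOs currently in breach (empty list if none)."""
    breaches = []
    if sum(eval_scores) / len(eval_scores) < targets.min_eval_score:
        breaches.append("quality")
    if median(task_costs) > targets.max_median_cost:
        breaches.append("cost_median")
    if p99(task_costs) > targets.max_p99_cost:
        breaches.append("cost_p99")
    return breaches
```

Checking median and p99 separately matters because a small share of very expensive tasks can blow the p99 band while the median still looks healthy.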
The runbook differences
AI incidents don’t have the same playbook as traditional service incidents:
“Restart the service” doesn’t fix model drift. You need to investigate and potentially roll back the model.
“Scale up the service” doesn’t fix quality regression. It just produces more bad responses faster.
“Roll back” might mean rolling back the model, the prompt, the index, the routing config, or all of them.
Runbooks need to cover the AI-specific failure axes. Generic SRE runbooks miss them.
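Because a rollback can target the model, the prompt, the index, or the routing config, it helps to pin all four in a single release record so a runbook can see which axes changed and revert only those. The sketch below is one possible shape, assuming hypothetical names (`Release`, `changed_axes`, `roll_back`) and made-up version strings.

```python
from dataclasses import dataclass, replace

AXES = ("model", "prompt", "index", "routing")


@dataclass(frozen=True)
class Release:
    model: str
    prompt: str
    index: str
    routing: str


def changed_axes(current: Release, previous: Release) -> dict:
    """Map each changed axis to its (previous, current) version pair."""
    return {
        axis: (getattr(previous, axis), getattr(current, axis))
        for axis in AXES
        if getattr(previous, axis) != getattr(current, axis)
    }


def roll_back(current: Release, previous: Release, axes) -> Release:
    """Return a new release with only the named axes reverted."""
    return replace(current, **{axis: getattr(previous, axis) for axis in axes})


previous = Release(model="m-1", prompt="p-7", index="idx-3", routing="r-2")
current = Release(model="m-1", prompt="p-8", index="idx-4", routing="r-2")

# Only the prompt and index changed, so those are the rollback candidates.
suspects = changed_axes(current, previous)
reverted = roll_back(current, previous, ["prompt"])
```

Keeping the record immutable means every rollback produces a new, auditable release rather than mutating the one under investigation.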
The observability stack
For AI systems, beyond traditional observability:
- Inference traces. Inputs, outputs, model versions, costs per inference
- Eval results over time. Quality trend; alert on degradation
- Vendor-side metrics. Their reported latency and error rate (when available)
- Cost dashboard by feature and team
- Drift detection on inputs and outputs (see our drift detection notes)
Tools like LangSmith and Helicone cover much of this. They are often paired with traditional APM for the broader system.
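As a vendor-neutral illustration of the first two items, here is a sketch of an inference trace record and a simple alert on eval degradation. It does not use the LangSmith or Helicone APIs; `InferenceTrace`, `EvalTrendAlert`, and every field name are assumptions chosen for the example.

```python
import json
import time
from collections import deque
from dataclasses import asdict, dataclass, field


@dataclass
class InferenceTrace:
    feature: str
    model_version: str
    prompt_version: str
    input_text: str
    output_text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize to one JSON line suitable for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)


class EvalTrendAlert:
    """Flags degradation when the rolling mean drops too far below a baseline."""

    def __init__(self, baseline: float, max_drop: float, window: int):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record a score; return True if the full window shows degradation."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data for a stable mean yet
        recent = sum(self.scores) / len(self.scores)
        return (self.baseline - recent) > self.max_drop
```

Recording the model and prompt versions on every trace is what later lets an incident responder tie a quality drop to a specific change.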
The incident-response patterns
AI-specific incidents:
Quality regression. Detect via eval. Roll back model/prompt/index. Investigate cause. Patch. Re-deploy with new eval gate.
Cost spike. Detect via cost alerting. Throttle or disable feature. Identify cause (bad prompt? input distribution? vendor pricing?). Fix. Resume.
Hallucination wave. Detect via user reports or sampling. Add guardrails (RAG required, schema enforcement). Patch prompt. Eval. Re-deploy.
Vendor outage. Detect via vendor monitoring. Failover to alternate provider (if architected for it). Degrade gracefully if not.
Each pattern has different MTTR characteristics, so rehearse each one until the response is muscle memory.
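The vendor-outage pattern is the easiest of these to sketch in code. The example below tries providers in order and degrades to a static response when all of them fail. It is a simplified illustration under stated assumptions: `ProviderError`, `complete_with_failover`, and the provider callables are hypothetical stand-ins for real vendor clients.

```python
class ProviderError(Exception):
    """Raised by a provider callable when the vendor call fails."""


def complete_with_failover(prompt, providers, fallback_text):
    """Try each provider in order; degrade to a static fallback if all fail.

    providers: list of (name, callable) pairs. Each callable takes the prompt
    and returns text, or raises ProviderError.
    Returns (source_name, text, errors), where errors lists the failed attempts.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt), errors
        except ProviderError as exc:
            errors.append((name, str(exc)))
    return "fallback", fallback_text, errors


def primary(prompt):
    # Simulates an outage at the primary vendor.
    raise ProviderError("503 from primary")


def secondary(prompt):
    return f"answer to: {prompt}"


source, text, errors = complete_with_failover(
    "status?",
    [("primary", primary), ("secondary", secondary)],
    fallback_text="This feature is temporarily unavailable.",
)
```

Returning the list of failed attempts alongside the answer keeps the failover visible to monitoring instead of silently masking the vendor problem.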
Where AI SRE fits organizationally
In small orgs: one engineer does AI SRE alongside other SRE work.
In mid-sized orgs: dedicated AI SRE within the broader SRE team.
In large orgs: AI Platform team has its own SRE function, plus AI infrastructure SRE.
The wrong answer: “the data science team is on call for the models they ship.” Data scientists usually lack SRE skills. The right answer is collaboration — SRE owns the operational discipline, data scientists own the model.
What we ship for AI-heavy teams
For AI SRE engagements via our DevOps automation service:
- AI-specific SLO definition
- Observability stack covering quality, cost, latency, drift
- Runbooks for AI-specific incident classes
- On-call rotation appropriate for the AI surface
- Quarterly post-mortem review with model and prompt iteration plans
The enterprise context
Our enterprise AI rollout roadmap treats SRE-for-AI as a Phase 3 capability: once you have multiple AI features in production, a dedicated SRE function earns its place.
Skipping this produces AI features that “work in the demo” and degrade quietly in production. The downstream cost is large.
The 2026 maturity
SRE for AI in 2026 is past the figuring-it-out phase. The disciplines are documented; the tooling is mature; the runbooks are reusable. The orgs that haven’t built this capability are running with operational risk that compounds.
AI SRE is its own discipline. Traditional SRE alone doesn’t catch AI-specific failure modes. Our team builds SRE practices for production AI systems. Tell us about the program.