Three things every production AI system needs (that demos don't show)

Most AI demos look great. Most AI in production doesn't. The gap is three pieces of infrastructure nobody mentions in the launch tweet.

There’s a strange asymmetry in AI right now. Demos are trivially easy. Production is brutally hard.

You can build an impressive RAG demo on a Saturday afternoon. The model ingests your docs, answers questions, looks smart. Then you try to ship it to actual users and discover the three pieces of infrastructure that nobody mentioned in the launch tweet.

Here are the three we put into every production AI system before we call it done.

1. Evals as code

Most teams test their AI system by trying it. They ask it five questions, the answers look good, they ship. Then a model upgrade silently regresses one of those answers and they find out from a customer.

Evals are tests for AI systems. A fixed set of inputs, expected outputs (or judging criteria), and a number at the end that tells you whether the system got better or worse.

The non-obvious thing: you need evals before you have a model that works. Otherwise, every prompt tweak is a guess, every model swap is a roll of the dice, and you have no objective answer to “is this version better?” The eval is the gate that lets you iterate confidently.

We typically wire this up as a CLI step in CI. Pull request changes a prompt → evals run → if the score drops below the threshold, the PR doesn’t merge. Same discipline as any other automated test.
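Concretely, that gate can be a script small enough to read in one sitting. Here's a minimal sketch; the cases, the keyword judge, and the 0.9 threshold are all illustrative, and call_model() is a placeholder for whatever client you actually use (real evals usually lean on an LLM judge or richer scoring).

```python
# eval_gate.py -- a minimal sketch of "evals as code", run as a CI step.
# Everything here is illustrative: swap in your own cases, judge, and threshold.
import sys

# Fixed inputs and judging criteria, checked into the repo next to the prompts.
CASES = [
    {"input": "How do I reset my password?", "must_mention": ["reset link", "email"]},
    {"input": "What plans do you offer?",    "must_mention": ["free", "pro"]},
]

THRESHOLD = 0.9  # fail the build if the pass rate drops below this


def call_model(prompt: str) -> str:
    """Placeholder: call your real model client here."""
    raise NotImplementedError


def passes(case: dict) -> bool:
    # Crude keyword judge: the answer must mention every required term.
    answer = call_model(case["input"]).lower()
    return all(term in answer for term in case["must_mention"])


def main() -> None:
    passed = sum(passes(case) for case in CASES)
    rate = passed / len(CASES)
    print(f"eval pass rate: {rate:.2%} ({passed}/{len(CASES)})")
    if rate < THRESHOLD:
        sys.exit(1)  # non-zero exit blocks the merge


if __name__ == "__main__":
    main()
```

The point isn't the scoring method; it's that the score is computed the same way every time and a drop stops the merge.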

If your AI system doesn’t have evals, you’re flying blind. The model could be silently degrading and your only signal will be a customer complaint.

2. Observability per inference

When something goes wrong in a traditional app, you look at logs. When something goes wrong in an AI app, you need to look at:

  • The exact prompt that was sent (including system prompt, user message, and any context retrieved)
  • The full model response (not just the final answer — the whole thing)
  • Which model version answered, at what temperature, and with which tools available
  • How long it took, how many tokens it used, and what they cost
  • Whether the user accepted the response, regenerated, or abandoned

This isn’t optional. When a customer complains “the AI got it wrong”, you need to be able to find that specific inference and explain what happened. Without per-inference observability, every bug report becomes folklore — “users say it’s bad sometimes” — and you have nothing to fix.
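
Whatever tooling you use, the underlying record looks roughly the same. Here's a sketch of one structured row per inference; the field names are ours for illustration, not any particular vendor's schema.

```python
# A minimal sketch of a per-inference record, assuming you can ship structured
# JSON somewhere queryable. Field names are illustrative.
import json
import time
import uuid
from dataclasses import dataclass, asdict, field


@dataclass
class InferenceRecord:
    prompt: str           # full prompt: system + user message + retrieved context
    response: str         # the whole model output, not just the parsed answer
    model: str            # the exact model/version string you called
    temperature: float
    tools: list[str]      # tools that were available for this call
    latency_ms: int
    input_tokens: int
    output_tokens: int
    cost_usd: float
    user_action: str      # "accepted" | "regenerated" | "abandoned"
    inference_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def log_inference(record: InferenceRecord) -> None:
    # One queryable row per call; route it to your log pipeline of choice.
    print(json.dumps(asdict(record)))
```

Once every call emits one of these, "the AI got it wrong" becomes a lookup instead of an investigation.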

LangSmith, Helicone, Phoenix, and a half-dozen open-source options handle this. Pick one, wire it in on day one. Don’t wait for production traffic to be the trigger.

3. Cost and latency tracking — with alarms

LLM systems have a habit of getting expensive in non-obvious ways. A retry loop on a 1000-token prompt that costs 2¢ becomes a $20,000 bill when a deployment bug triggers a million retries overnight.

You need three things:

  1. Per-feature cost attribution. Not just “this month’s OpenAI bill was $X” — which feature spent it. Tag every inference call with the user, feature, and request type, then roll those tags up into dashboards.
  2. Latency percentiles, not averages. p50 and p95 matter; the average lies. Especially for chat, where a 12-second response feels broken even if your “average” is 3 seconds.
  3. Alarms that page someone. Token spend per hour, error rate, latency degradation — wire them into PagerDuty or whatever you use. You’ll find out about regressions in minutes, not at month-end. A rough sketch of all three follows below.
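
Here's a rough sketch of all three, assuming each inference already emits a structured record like the one in the observability section. The feature tag, the $50/hour limit, and page_on_call() are placeholders; in real life the alert goes to PagerDuty or whatever already pages your team.

```python
# Sketch only: per-feature cost rollups, p50/p95 latency, and a spend alarm.
# Assumes `records` are dicts shaped like the per-inference record above,
# plus a "feature" tag on every call.
import statistics
from collections import defaultdict


def cost_by_feature(records: list[dict]) -> dict[str, float]:
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["feature"]] += r["cost_usd"]   # attribution comes from the tag
    return dict(totals)


def latency_percentiles(records: list[dict]) -> tuple[float, float]:
    latencies = [r["latency_ms"] for r in records]
    cuts = statistics.quantiles(latencies, n=100)  # 99 cut points
    return cuts[49], cuts[94]                      # p50, p95


def check_hourly_spend(records_last_hour: list[dict], limit_usd: float = 50.0) -> None:
    spend = sum(r["cost_usd"] for r in records_last_hour)
    if spend > limit_usd:
        page_on_call(f"LLM spend ${spend:.2f} in the last hour (limit ${limit_usd:.2f})")


def page_on_call(message: str) -> None:
    """Placeholder: send to PagerDuty / Opsgenie / Slack in real life."""
    print("ALERT:", message)
```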

None of these are exotic. They’re the same instincts you’d apply to any production system. AI doesn’t get a pass just because it’s the new shiny.

The pattern

If you squint at all three, they’re really one thing: AI systems are software systems. The discipline you’d apply to a payment processor or a search engine applies here. Evals are unit tests. Observability is logs and traces. Cost/latency tracking is APM.

Most AI failures we’re called in to debug aren’t model failures. They’re missing infrastructure. The model is doing its best with no evals to say whether it’s doing well, no observability to show when it isn’t, and no alarms when something goes sideways.

The teams that ship AI reliably aren’t the ones with better prompts. They’re the ones who treated the system like production software from week one.


We help teams ship AI to production with the infrastructure that actually keeps it running. If you’re stuck somewhere between demo and reliable, our AI & LLM integration service is built around exactly this. Or tell us what’s broken and we’ll see if we can help.