Building Production AI Agents: The Architecture Patterns That Actually Ship

Most agent demos collapse under real traffic. Five architectural patterns we use to ship AI agents that survive contact with actual users.

The agent demo is one of the most misleading artifacts in modern software. A model with five tools and a goal works beautifully in a notebook. The same agent in production, on real user traffic, hits failure modes the demo never surfaced: tool calls that succeed but return garbage, infinite tool-call loops, context windows that overflow on the 12th turn, retries that double-charge a customer.

We’ve shipped production agents for hospital intake, banking compliance review, and logistics dispatch. Almost none of them look like the LangChain “ReAct” tutorial. Here are the five patterns we use to keep agents shipped.

Pattern 1: deterministic shell, agentic core

The first thing we strip out of most demos is the assumption that the agent runs the whole show.

In production, the outer layer of an agent system should be deterministic code. Validate inputs, check permissions, fetch context, log the start of the turn. Then hand off to the agent. When the agent returns, the deterministic shell validates the output, logs the end of the turn, and writes the result.

[deterministic] auth + input validation

[deterministic] context fetch (Postgres, vector store, APIs)

[AGENTIC]      reason + call tools + produce structured output

[deterministic] output validation + side-effect application

[deterministic] audit log + telemetry

This isn’t pedantry. It’s the only way to bound an agent’s blast radius. The agent doesn’t write to your database — the deterministic post-step does, after validating the agent’s structured proposal against a schema and your business rules.

For a banking compliance review agent, this looks like: the agent proposes a classification + rationale + supporting evidence. The deterministic shell verifies the cited evidence actually exists, the classification is one of the allowed values, and the rationale fits the length constraints. Only then does it write to the audit table.
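A minimal sketch of that post-step, using only the standard library. The classification labels, the length limits, and the names Proposal, validate_proposal, and apply_if_valid are illustrative assumptions, not the schema of any real deployment; the audit table is a plain list standing in for a database write.

```python
from dataclasses import dataclass

ALLOWED_CLASSIFICATIONS = {"clear", "needs_review", "escalate"}  # hypothetical labels
MIN_RATIONALE, MAX_RATIONALE = 20, 600  # illustrative length limits


@dataclass(frozen=True)
class Proposal:
    """What the agent hands back: a proposal, never a write."""
    classification: str
    rationale: str
    evidence_ids: list[str]


def validate_proposal(proposal: Proposal, known_evidence: set[str]) -> list[str]:
    """Return rule violations; an empty list means the proposal may be applied."""
    problems = []
    if proposal.classification not in ALLOWED_CLASSIFICATIONS:
        problems.append(f"unknown classification: {proposal.classification!r}")
    if not MIN_RATIONALE <= len(proposal.rationale) <= MAX_RATIONALE:
        problems.append("rationale length out of bounds")
    missing = [e for e in proposal.evidence_ids if e not in known_evidence]
    if missing:
        problems.append(f"cited evidence not found: {missing}")
    return problems


def apply_if_valid(proposal: Proposal, known_evidence: set[str], audit_table: list) -> bool:
    """Deterministic post-step: the only code path that writes, and only after validation."""
    if validate_proposal(proposal, known_evidence):
        return False  # a real shell would route this to human review
    audit_table.append(proposal)
    return True
```

The agent never receives a handle to the audit table. The worst it can do is produce a proposal that gets rejected.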

Pattern 2: typed tool contracts, typed agent output

The agent shouldn’t speak in prose. It should speak in structured types.

For every tool, we define:

  • Input schema (Pydantic / Zod). Required, validated before call. If the agent sends malformed input, the call fails fast with a useful error the agent can read on the next turn.
  • Output schema. Tools return typed responses, not free-form strings.
  • Error schema. Failures return a TypedError the agent can route on (AUTH_FAILED, RECORD_NOT_FOUND, RATE_LIMITED, RETRY_LATER), not a 500 with a stack trace.
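The three schemas above can be sketched with plain dataclasses (in practice Pydantic or Zod would do the validation). The lookup_customer tool, the CustomerRecord type, the in-memory store, and the INVALID_INPUT code are hypothetical additions for illustration; the other error codes are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class ErrorCode(str, Enum):
    AUTH_FAILED = "AUTH_FAILED"
    RECORD_NOT_FOUND = "RECORD_NOT_FOUND"
    RATE_LIMITED = "RATE_LIMITED"
    RETRY_LATER = "RETRY_LATER"
    INVALID_INPUT = "INVALID_INPUT"  # assumed extra code for the fail-fast input check


@dataclass(frozen=True)
class TypedError:
    code: ErrorCode
    message: str  # short and readable, so the agent can act on it next turn


@dataclass(frozen=True)
class CustomerRecord:
    customer_id: str
    name: str


_FAKE_DB = {"C-001": "Ada Example"}  # stand-in for a real data store


def lookup_customer(customer_id: object) -> Union[CustomerRecord, TypedError]:
    """Validate the input first, then return a typed result or a typed error."""
    if not isinstance(customer_id, str) or not customer_id.startswith("C-"):
        return TypedError(ErrorCode.INVALID_INPUT, "customer_id must be a string like 'C-001'")
    name = _FAKE_DB.get(customer_id)
    if name is None:
        return TypedError(ErrorCode.RECORD_NOT_FOUND, f"no customer with id {customer_id}")
    return CustomerRecord(customer_id, name)
```

Because failures come back as values with a code, the orchestrator can branch on result.code instead of parsing a stack trace.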

For the agent’s final output, we use the same discipline: with_structured_output(Decision), LangChain’s structured-output method, instead of “please respond in JSON.” The model’s last turn must produce a value that satisfies the schema or the turn fails.

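from typing import Literal

from pydantic import BaseModel, Field

# Department is assumed to be an Enum of valid departments defined elsewhere.
class IntakeDecision(BaseModel):
    triage_level: Literal["urgent", "routine", "non_clinical"]
    primary_symptoms: list[str]
    suggested_department: Department
    confidence: float = Field(ge=0, le=1)  # must lie in [0, 1]
    needs_human_review: bool
    rationale: str = Field(min_length=20, max_length=600)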

The schema is the contract between the agent and the surrounding system. The agent can take any path it likes inside the loop — but it leaves through a typed door.

In our experience, this single discipline kills roughly 70% of the “agent did something weird” bugs. Most of the rest are in the tools.

Pattern 3: explicit state in a graph, not implicit state in a list of messages

The first generation of agent frameworks treated state as “the conversation so far.” This works for demos. It does not scale.

For non-trivial agents we use LangGraph (or a similar graph-based orchestrator) where state is an explicit Pydantic model and nodes are explicit functions. The graph captures the real shape of the workflow:

  • Which nodes run when
  • What state they read and write
  • Where the conditional branches are
  • Where human-in-the-loop pauses live
  • Where retries happen, with what budget

This makes the agent debuggable. A failing run isn’t “the model went weird” — it’s “the verify_evidence node returned EVIDENCE_NOT_FOUND on turn 3, and the retry_with_broader_search branch ran out of budget.” You can replay it, modify it, set breakpoints.
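To make the idea concrete without depending on any framework, here is a toy graph runner in plain Python. It is a sketch of the concept, not LangGraph's API: RunState, run, and the stand-in broader_search function (which always returns nothing) are assumptions made for illustration. The node names mirror the failure described above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunState:
    query: str
    evidence: list[str] = field(default_factory=list)
    retries_left: int = 2                             # retry budget for the broader search
    trace: list[str] = field(default_factory=list)    # node names, in execution order


def broader_search(query: str) -> list[str]:
    """Hypothetical search tool; this stand-in never finds anything."""
    return []


def verify_evidence(state: RunState) -> str:
    # Conditional edge: the next node is chosen from explicit state, not chat history.
    if state.evidence:
        return "done"
    return "retry_with_broader_search" if state.retries_left > 0 else "escalate"


def retry_with_broader_search(state: RunState) -> str:
    state.retries_left -= 1
    state.evidence.extend(broader_search(state.query))
    return "verify_evidence"


NODES: dict[str, Callable[[RunState], str]] = {
    "verify_evidence": verify_evidence,
    "retry_with_broader_search": retry_with_broader_search,
}
TERMINAL = {"done", "escalate"}


def run(state: RunState, start: str = "verify_evidence", max_steps: int = 20) -> str:
    """Walk the graph, recording each node visited, until a terminal node is reached."""
    node = start
    for _ in range(max_steps):
        if node in TERMINAL:
            return node
        state.trace.append(node)
        node = NODES[node](state)
    return "escalate"  # step cap reached without converging
```

After a failed run, state.trace and state.retries_left show which node ran, in what order, and that the retry budget was spent, which is the information needed to replay the run.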

The cost: you write more orchestration code. The benefit: when something breaks at 2am, you can find it.

For agents simpler than this — say, a doc-Q&A agent that retrieves and answers — a graph is overkill. A single LLM call with retrieval and structured output is fine. But the moment you have branches, retries, or multi-step state, graphs are the saner unit of abstraction.

Pattern 4: hard budgets, not soft hopes

Every production agent we ship has hard, enforced budgets:

  • Max turns per request. If the agent hasn’t converged in N turns, abort and escalate to a human. We default to 8 for most workflows and 20 for research-heavy ones.
  • Max tokens per request. Sum of prompt + completion across all turns. Caps cost variance per request and protects against runaway loops.
  • Max tool calls per turn. Prevents the “loop of doom” where the agent calls the same tool 40 times in one turn.
  • Per-tool timeouts. If a tool hangs, the agent gets a TIMEOUT error and can route around it.
  • Per-request wall clock. Hard cap on how long a single request can take, agent or no.

Without these, you get a 3am page about a $400 single-request charge because the agent decided the answer was just one more search away. With them, the worst case is bounded and predictable.

These aren’t graceful degradation; they’re hard stops. The graceful degradation lives in the deterministic shell: when a budget trips, the shell returns a user-friendly “we’re escalating this to a human” message and writes a high-priority alert. No agent gets to take the system down.
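A rough sketch of budget enforcement living in the shell rather than in the agent. The Budget class, the token and wall-clock figures, and the agent_turn callable (assumed to return a result-or-None plus the tokens it used) are all hypothetical; only the default of 8 turns comes from the text above.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised the moment any hard limit is crossed."""


@dataclass
class Budget:
    max_turns: int = 8
    max_tokens: int = 50_000         # illustrative figure
    max_wall_seconds: float = 60.0   # illustrative figure
    turns: int = 0
    tokens: int = 0
    started: float = 0.0

    def start(self) -> None:
        self.started = time.monotonic()

    def before_turn(self) -> None:
        if self.turns >= self.max_turns:
            raise BudgetExceeded("max_turns")
        if time.monotonic() - self.started > self.max_wall_seconds:
            raise BudgetExceeded("wall_clock")
        self.turns += 1

    def after_turn(self, tokens_used: int) -> None:
        self.tokens += tokens_used
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("max_tokens")


def handle_request(agent_turn, budget: Budget) -> dict:
    """Deterministic shell: run turns until the agent finishes or a budget trips."""
    budget.start()
    try:
        while True:
            budget.before_turn()
            result, tokens_used = agent_turn()
            budget.after_turn(tokens_used)
            if result is not None:
                return {"status": "ok", "result": result}
    except BudgetExceeded as exc:
        # Graceful degradation happens here, outside the agent.
        return {"status": "escalated", "reason": str(exc),
                "message": "We're escalating this to a human."}
```

Note that this wall-clock check only runs between turns, so it cannot interrupt a turn that hangs. That case still needs the per-tool timeouts listed above.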

Pattern 5: observability that’s per-turn, not per-request

If you can’t answer “what did the agent do on turn 3 of request X” in under a minute, you can’t operate an agent.

For each turn, we log:

  • The full prompt sent to the model (system + user + assistant + tool messages)
  • The full model response (including any reasoning content)
  • Token counts and cost
  • Tool calls made, with arguments and responses
  • Latency, broken down by model call vs tool call
  • The graph state before and after
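One possible shape for such a per-turn record, sketched with standard-library dataclasses. The field names and the find_turn helper are assumptions for illustration rather than the schema of LangSmith or any other tool; a homegrown setup might store the JSON in a Postgres column.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ToolCallLog:
    name: str
    arguments: dict
    response: dict
    latency_ms: float


@dataclass
class TurnLog:
    request_id: str
    turn: int
    prompt_messages: list[dict]      # system + user + assistant + tool messages
    model_response: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    model_latency_ms: float
    tool_calls: list[ToolCallLog] = field(default_factory=list)
    state_before: dict = field(default_factory=dict)
    state_after: dict = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() also converts the nested ToolCallLog entries to plain dicts.
        return json.dumps(asdict(self), sort_keys=True)


def find_turn(logs: list[TurnLog], request_id: str, turn: int) -> Optional[TurnLog]:
    """Answer 'what did the agent do on turn N of request X' with one lookup."""
    for entry in logs:
        if entry.request_id == request_id and entry.turn == turn:
            return entry
    return None
```

Keying every record on (request_id, turn) is what makes the one-minute lookup realistic; in a database that pair would be the index.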

Tools we use in production: LangSmith (best if you’re on LangChain), Helicone (great for OpenAI proxy patterns), Arize Phoenix (open source, OpenTelemetry-native), or a homegrown Postgres + JSON setup for projects that don’t justify a third-party tool.

The bar to clear: someone on the team can pull up any agent run, walk through it turn-by-turn, see exactly what the model said and what each tool returned, and form an opinion about whether the agent made a reasonable choice. If that workflow takes more than two minutes, the observability isn’t good enough yet.

(We go deeper on this in our LLM observability piece.)

What we strip out of every demo

Beyond the five patterns above, here are the things we routinely cut from impressive agent demos before shipping:

  • Open-ended tool descriptions. “Use this tool when you need to search.” Too vague — the agent uses it for everything. Replace with: “Use this tool only when looking up a customer by ID. Returns 404 if not found. Do not retry on 404 — escalate.”
  • Free-form planning steps. “Think step by step about what to do” produces poetry. Replace with: a planning node that outputs a typed Plan with at most 5 typed PlanStep objects, validated before execution.
  • Unbounded retries. “Try the tool again if it fails” is how the loop of doom starts. Replace with: explicit retry policies per tool, with budgets and backoff.
  • Self-correcting agents that critique their own output. Sometimes works. Often adds 4 turns of cost for marginal accuracy. Measure on your evals before keeping it.
  • Memory. A long-term memory store is a feature, not a default. Most demos don’t need it; many that include it use it wrong (rolling everything into a vector store and hoping). If you’re going to do memory, design it explicitly — what gets remembered, what gets forgotten, who decides.
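As one example from the list above, an explicit per-tool retry policy might look like the following sketch. RetryPolicy, call_with_policy, and the convention that a tool returns an (ok, value_or_error_code) pair are assumptions for illustration; the sleep function is injectable so the policy can be tested without waiting.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_attempts: int = 3
    base_delay_s: float = 0.5
    # Only transient errors are worth retrying; everything else fails immediately.
    retry_on: frozenset = frozenset({"RATE_LIMITED", "RETRY_LATER"})


def call_with_policy(tool, policy: RetryPolicy, sleep=time.sleep):
    """Call `tool` until it succeeds, returns a non-retryable error, or exhausts the budget.

    `tool` is a zero-argument callable returning (ok, value_or_error_code).
    """
    for attempt in range(1, policy.max_attempts + 1):
        ok, value = tool()
        if ok:
            return True, value
        if value not in policy.retry_on or attempt == policy.max_attempts:
            return False, value
        sleep(policy.base_delay_s * 2 ** (attempt - 1))  # exponential backoff
    return False, "NO_ATTEMPTS"  # only reachable when max_attempts < 1
```

A tool that returns RECORD_NOT_FOUND is called exactly once under this policy, which matches the "do not retry on 404" instruction in the tool description above.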

A reference shape

For a typical production agent we deploy, the shape looks something like:

  • Entry handler (FastAPI / API Gateway / Lambda). Auth, rate limit, input validation.
  • Context loader. Pull the relevant records from Postgres / vector store / external APIs. Cache the result for the duration of the request.
  • Graph orchestrator (LangGraph). Explicit state, ~5-12 nodes, conditional edges.
  • Tools. Typed inputs and outputs. Each tool is its own module with its own tests and its own retry policy.
  • Output validator. Pydantic schema. Anything the agent produces gets validated; failures route to a human-review queue.
  • Side-effect applier. Writes to DB, calls external services. Idempotency keys. Audit log.
  • Telemetry. Per-turn logs to LangSmith, request-level metrics to Prometheus / Datadog. Cost and token spend rolled up per feature.
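The side-effect applier is where retries could double-charge a customer, so here is a small sketch of the idempotency-key idea. SideEffectApplier and its key derivation (a SHA-256 hash of the request ID plus the canonical JSON payload) are illustrative assumptions; the in-memory dict stands in for a table with a unique constraint on the key.

```python
import hashlib
import json


class SideEffectApplier:
    """Applies each validated decision at most once, keyed by an idempotency key."""

    def __init__(self) -> None:
        self._applied: dict[str, dict] = {}   # stand-in for a table with a unique key column
        self.audit_log: list[dict] = []

    @staticmethod
    def idempotency_key(request_id: str, payload: dict) -> str:
        # sort_keys makes the key independent of dict ordering.
        body = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(f"{request_id}:{body}".encode()).hexdigest()

    def apply(self, request_id: str, payload: dict) -> dict:
        key = self.idempotency_key(request_id, payload)
        if key in self._applied:
            # A retry of the same request: record it, but do not repeat the write.
            self.audit_log.append({"key": key, "event": "duplicate_ignored"})
            return self._applied[key]
        receipt = {"key": key, "status": "applied", "payload": payload}
        self._applied[key] = receipt
        self.audit_log.append({"key": key, "event": "applied"})
        return receipt
```

Calling apply twice with the same request ID and payload returns the same receipt and performs one write, while the audit log still records both attempts.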

This is more boilerplate than the demo. It’s also what survives a hospital’s compliance review or a bank’s change-control board.

The pattern of patterns

Production agents are deterministic systems with an LLM as one component, not LLM systems with some code around the edges.

The framing matters. When you treat the LLM as the brain, every failure mode becomes “the model did something weird” and the fix is prompt engineering. When you treat the LLM as a constrained component inside a system you control, every failure mode is debuggable — and most of them turn out to be schema issues, missing budgets, or tool contracts that were too loose.

The teams that ship agents successfully aren’t the ones with the cleverest prompts. They’re the ones who spent more time on the surrounding code than on the agent itself.


The demo is the easy 20% of the work. If you’re past the demo and stuck on the production gap, our AI & LLM integration service is built around exactly that gap. Tell us what’s broken and we’ll see what’s recoverable.