Agent Evaluation: What to Measure Beyond Task Completion

Task completion rate is the most over-cited and least useful agent metric. Here are the five evaluation dimensions we score on every production agent.

“The agent completed the task” is the floor, not the ceiling. An agent can complete the task while spending 10x what it should, calling the wrong tool first, generating sycophantic explanations, or producing output that’s technically correct but operationally unusable. If you only measure completion, you’ll ship those.

These are the five dimensions we evaluate on every production agent.

1. Task correctness

Did the output match the ground-truth result, within tolerance? For deterministic tasks (classification, extraction) this is straightforward — match the label, match the field. For generative tasks, define a rubric: completeness, accuracy, format compliance.

Golden set: 30–100 tasks covering the easy, median, and hard slices of your distribution. Hand-labeled. Versioned in your repo, not in a notebook.
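A minimal sketch of how a golden set might be scored per slice. The `GoldenTask` fields, the `tolerance` handling, and the `correctness_by_slice` helper are illustrative assumptions, not a fixed format; generative tasks would need a rubric grader in place of `is_correct`.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    slice: str              # "easy", "median", or "hard"
    expected: Any           # hand-labeled ground truth
    tolerance: float = 0.0  # only applied to numeric outputs


def is_correct(task: GoldenTask, actual: Any) -> bool:
    """Exact match, or numeric match within the task's tolerance."""
    numeric = (int, float)
    both_numeric = (
        isinstance(task.expected, numeric)
        and isinstance(actual, numeric)
        and not isinstance(task.expected, bool)
        and not isinstance(actual, bool)
    )
    if both_numeric:
        return abs(actual - task.expected) <= task.tolerance
    return actual == task.expected


def correctness_by_slice(tasks, outputs):
    """Return {slice: fraction correct}. A missing output counts as wrong."""
    totals, hits = {}, {}
    for task in tasks:
        totals[task.slice] = totals.get(task.slice, 0) + 1
        if task.task_id in outputs and is_correct(task, outputs[task.task_id]):
            hits[task.slice] = hits.get(task.slice, 0) + 1
    return {name: hits.get(name, 0) / count for name, count in totals.items()}
```

Reporting per slice rather than one overall number keeps a regression on the hard slice from hiding behind a large, easy majority.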

2. Tool-call quality

Did the agent pick the right tools in the right order? Did it call any unnecessary tools? Did it fail-and-retry productively, or fail-and-retry-the-same-thing?

Score:

  • Tool selection accuracy (did the chosen tools match the golden trajectory?)
  • Redundant calls (same tool, same args, called twice)
  • Wasted calls (called a tool then ignored the result)

This dimension catches drift that task-correctness misses. An agent that gets the right answer the wrong way is one model update away from getting the wrong answer.
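The three tool-call scores can be sketched as below. This assumes the agent trace is available as a list of calls and that your tracing layer can flag whether a result was used later; `ToolCall`, `result_used`, and `score_trajectory` are hypothetical names. Position-wise comparison against the golden trajectory is one simple choice among several.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict
    result_used: bool = True  # assumed to be filled in by your trace tooling


def score_trajectory(calls, golden_tools):
    """Score one agent run against a golden list of tool names."""
    # Tool selection accuracy: position-wise agreement with the golden order.
    matches = sum(1 for call, gold in zip(calls, golden_tools) if call.name == gold)
    longest = max(len(calls), len(golden_tools))
    selection_accuracy = matches / longest if longest else 1.0

    # Redundant calls: same tool with the same args seen earlier in the run.
    seen = set()
    redundant = 0
    for call in calls:
        key = (call.name, json.dumps(call.args, sort_keys=True))
        if key in seen:
            redundant += 1
        seen.add(key)

    # Wasted calls: the result was never used downstream.
    wasted = sum(1 for call in calls if not call.result_used)

    return {
        "selection_accuracy": selection_accuracy,
        "redundant_calls": redundant,
        "wasted_calls": wasted,
    }
```

Dividing by the longer of the two sequences penalizes both extra and missing calls; a stricter or looser alignment (for example, edit distance) may suit agents with several valid orderings.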

3. Cost efficiency

Cost per successful task, broken out by:

  • Tokens (input vs output)
  • Tool-call count
  • Wall-clock duration

Track p50, p95, and p99. A median task costing $0.02 with a p99 of $4 means 1% of tasks cost at least 200x the median. Find them and see what triggers them.
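A small sketch of the percentile breakdown, assuming each successful task is logged as a dict of numeric fields. The field names (`cost_usd`, `tool_calls`) are illustrative, and nearest-rank is only one of several percentile definitions.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    if not values:
        raise ValueError("no values to summarize")
    ordered = sorted(values)
    # Integer ceiling of p * n / 100 avoids float rounding at the boundary.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(records, fields):
    """Return {field: {"p50": ..., "p95": ..., "p99": ...}} over all records."""
    return {
        field: {
            f"p{p}": percentile([record[field] for record in records], p)
            for p in (50, 95, 99)
        }
        for field in fields
    }
```

Calling `summarize(records, ("cost_usd", "tool_calls"))` gives one row per cost driver, which makes it easier to see whether an expensive tail comes from tokens, tool calls, or duration.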

4. Failure handling

Inject failures into your eval set:

  • Tool returns empty
  • Tool returns malformed data
  • Tool errors out (timeout, rate limit, server error)
  • Required input is missing or ambiguous

Score: did the agent recover gracefully, escalate to human, or loop forever? Most production incidents come from the agent’s behavior on inputs the eval set never included. Add them deliberately.
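One way to sketch failure injection is to wrap each tool before handing it to the agent. Everything here is an assumption for illustration: the agent is modeled as a callable that receives a tool and returns either the string `"escalate"` or a result, and `classify_run`, `ToolTimeout`, and the call budget are hypothetical. The `"crashed"` bucket is an extra outcome added for unhandled exceptions.

```python
class ToolTimeout(Exception):
    """Injected tool failure (stands in for timeout, rate limit, server error)."""


class CallBudgetExceeded(BaseException):
    """BaseException, so an agent's `except Exception` cannot swallow it."""


def inject_failure(tool, mode):
    """Wrap a tool so it misbehaves according to `mode`."""
    def faulty(*args, **kwargs):
        if mode == "empty":
            return []
        if mode == "malformed":
            return "{not valid json"
        if mode == "error":
            raise ToolTimeout("injected timeout")
        return tool(*args, **kwargs)
    return faulty


def with_call_budget(tool, max_calls):
    """Abort the run once the tool has been called more than max_calls times."""
    calls = 0

    def budgeted(*args, **kwargs):
        nonlocal calls
        calls += 1
        if calls > max_calls:
            raise CallBudgetExceeded()
        return tool(*args, **kwargs)
    return budgeted


def classify_run(agent, tool, mode, max_calls=5):
    """Return 'recovered', 'escalated', 'looped', or 'crashed' for one run."""
    faulty = with_call_budget(inject_failure(tool, mode), max_calls)
    try:
        outcome = agent(faulty)
    except CallBudgetExceeded:
        return "looped"
    except Exception:
        return "crashed"
    return "escalated" if outcome == "escalate" else "recovered"
```

Note that "recovered" here only means the run ended without escalating or looping; whether the final answer is acceptable still has to be checked by the correctness grader.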

5. Operational fitness

Does the agent’s output integrate with downstream systems? An agent that produces a beautiful response that humans love but breaks the CRM’s JSON schema is operationally unfit.

Score on:

  • Schema compliance (does the output validate?)
  • Side-effect correctness (did the right record update?)
  • Audit completeness (is every action logged in a way you can replay?)
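Schema compliance is the easiest of these to grade deterministically. The sketch below uses only the standard library and a hand-written field check; the `REQUIRED_FIELDS` record shape is a made-up CRM example, and a real deployment would more likely validate against the downstream system's actual schema with a dedicated validator.

```python
import json

# Hypothetical CRM record shape: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "contact_id": str,
    "status": str,
    "amount": (int, float),
}


def schema_errors(raw_output, required=REQUIRED_FIELDS):
    """Return a list of problems; an empty list means the output validates."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    errors = []
    for field, expected_type in required.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
            continue
        value = payload[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            errors.append(f"wrong type for {field}")
    return errors
```

Returning the list of problems rather than a bare boolean makes failures easier to triage when the same schema error shows up across many eval tasks.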

The eval cadence

  • Per change: run the golden set in CI. Block merge on regressions.
  • Weekly: run the full eval suite, including failure-injected tasks.
  • Monthly: audit 50–100 production tasks against the eval rubric; new failure patterns become new golden cases.
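The per-change merge gate can be sketched as a comparison between the current run's metrics and a stored baseline. The metric names, the absolute `tolerance`, and the `find_regressions` helper are assumptions for illustration; where the baseline lives and how CI invokes the check are left open.

```python
def find_regressions(baseline, current, tolerance=0.01,
                     lower_is_better=("cost_p95_usd",)):
    """Return the names of metrics that got worse by more than `tolerance`."""
    regressions = []
    for metric, base_value in baseline.items():
        if metric not in current:
            # A metric that silently disappears is treated as a regression.
            regressions.append(metric)
            continue
        if metric in lower_is_better:
            worsened_by = current[metric] - base_value
        else:
            worsened_by = base_value - current[metric]
        if worsened_by > tolerance:
            regressions.append(metric)
    return regressions
```

In a CI job, a non-empty result would typically translate into a non-zero exit code so the merge is blocked. A single absolute tolerance is a simplification; metrics on different scales usually need their own thresholds.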

If your eval cadence is “I’ll check when something breaks,” your evals are reactive, which means they are mostly post-mortems.

What goes wrong

Eval-prompt coupling. The eval prompts the agent in exactly the way the production app does — except production has a wrapper that doesn’t appear in eval. Your eval scores the agent; production scores the wrapper. Always eval the deployed surface, not the underlying agent.

LLM-as-judge alone. Using GPT to grade GPT is acceptable for cheap iteration; relying on it for production gate decisions is not. Pair LLM-as-judge with deterministic graders for any pass/fail decision you actually care about.
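One possible way to pair the two, sketched under assumptions: deterministic checks alone decide pass or fail, and the judge score only flags runs for human review. The `gate_decision` function, the check names, and the 0.7 threshold are hypothetical.

```python
def gate_decision(deterministic_results, judge_score, judge_threshold=0.7):
    """Combine deterministic graders with an advisory LLM-judge score.

    deterministic_results: {check_name: bool} from rule-based graders.
    judge_score: float in [0, 1] from an LLM judge; never overrides a failure.
    """
    failed = [name for name, ok in deterministic_results.items() if not ok]
    if failed:
        return {"passed": False, "failed_checks": failed, "needs_review": False}
    # Deterministic checks passed; a low judge score requests a human look
    # instead of blocking on the judge's opinion alone.
    return {
        "passed": True,
        "failed_checks": [],
        "needs_review": judge_score < judge_threshold,
    }
```

With this split, a high judge score cannot rescue an output that fails a hard check, and a low judge score cannot block a merge by itself.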

Eval rot. Golden set written in week one, never updated. Production drifts past the eval set within months. Refresh quarterly.

What we ship by default

For agent engagements via our AI & LLM integration service:

  • Golden set of 30–100 tasks at launch
  • Five-dimension scoring (correctness, tool quality, cost, failure handling, operational fitness)
  • CI integration with merge gates
  • Monthly production audit feeding eval refresh
  • Failure-injected tasks in every suite

Eval discipline is what separates a demo from a system. Build it before you build the agent.

