Tool-Use Design for Production AI Agents
Tool design is where agents succeed or fail. Five rules we apply to every tool surface before an agent goes near production data.
The model is now reliable enough that the failure mode has moved. In 2026, most agent failures are tool-design failures: tools too broad, schemas too loose, error handling too quiet, side effects too coupled. Get the tools right and a mid-tier model behaves; get the tools wrong and even a top-tier model misbehaves.
These are the five rules we apply.
Rule 1: narrow over general
run_sql(query) is general. get_customer_orders(customer_id, date_range) is narrow. The narrow tool has a smaller surface, validatable inputs, predictable cost, and an audit trail that means something.
In our experience, models pick the right narrow tool about 95% of the time, but write the right SQL query only about 70% of the time. The cost of writing narrow tools is paid back in reliability within the first week.
The exception: exploratory agents (data analysis, code review) genuinely need general tools. Even then, wrap them: expose run_sql_with_dry_run_first, not raw SQL.
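A minimal sketch of the contrast, using SQLite from the Python standard library. The `orders` table, its columns, the `C-<digits>` ID format, and the 90-day cap are all assumptions made for illustration, and the dry run here is just SQLite's `EXPLAIN QUERY PLAN`, one possible way to reject bad SQL before running it.

```python
import re
import sqlite3
from datetime import date

CUSTOMER_ID = re.compile(r"C-\d+")  # assumed ID format, for illustration only
MAX_RANGE_DAYS = 90                 # assumed cap that keeps query cost predictable


def get_customer_orders(conn, customer_id, start, end):
    """Narrow tool: one fixed, parameterized query with validated inputs."""
    if not CUSTOMER_ID.fullmatch(customer_id):
        raise ValueError(f"customer_id {customer_id!r} does not match C-<digits>")
    if end < start or (end - start).days > MAX_RANGE_DAYS:
        raise ValueError(f"date range must span 0 to {MAX_RANGE_DAYS} days")
    rows = conn.execute(
        "SELECT id, total_cents FROM orders "
        "WHERE customer_id = ? AND order_date BETWEEN ? AND ? ORDER BY id",
        (customer_id, start.isoformat(), end.isoformat()),
    )
    return [{"order_id": r[0], "total_cents": r[1]} for r in rows]


def run_sql_with_dry_run_first(conn, query):
    """Wrapped general tool: SELECT only, and planned before it is executed."""
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are allowed")
    # Dry run: SQLite compiles the statement and raises on invalid SQL.
    conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, customer_id TEXT, order_date TEXT, total_cents INTEGER)"
)
conn.execute("INSERT INTO orders VALUES (1, 'C-1234', '2026-01-15', 50000)")
orders = get_customer_orders(conn, "C-1234", date(2026, 1, 1), date(2026, 1, 31))
```

The narrow tool can reject a malformed ID or an oversized date range before touching the database. The wrapped general tool can only apply coarse guards, which is why it is reserved for exploratory work.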
Rule 2: schemas with semantic constraints
A tool’s JSON schema should encode every constraint you can encode statically. Enums for closed choices. Format strings for IDs. Min/max for ranges. Descriptions that say when to use the tool, not just what it does.
Bad: query: string
Good: query: string // free-text search across the product catalog; for SKU lookup use get_product_by_sku instead
The model reads the description. Make it earn its keep.
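As a sketch of what such a schema might look like, here is a tool definition written as a Python dict, plus a tiny validator that understands only the handful of JSON Schema keywords used. The tool name `search_products`, the `category`, `supplier_id`, and `limit` fields, and their allowed values are hypothetical; a real deployment would more likely lean on a full JSON Schema validator.

```python
import re

SEARCH_PRODUCTS_TOOL = {
    "name": "search_products",  # hypothetical tool name
    "description": (
        "Free-text search across the product catalog. "
        "For SKU lookup use get_product_by_sku instead."
    ),
    "parameters": {
        "type": "object",
        "required": ["query"],
        "properties": {
            "query": {"type": "string"},
            "category": {"type": "string", "enum": ["hardware", "software", "services"]},
            "supplier_id": {"type": "string", "pattern": "^S-[0-9]+$"},
            "limit": {"type": "integer", "minimum": 1, "maximum": 50},
        },
    },
}

_TYPES = {"string": str, "integer": int}


def validate_args(schema, args):
    """Check args against the small subset of JSON Schema keywords used above."""
    errors = [f"{name}: required" for name in schema.get("required", []) if name not in args]
    for name, value in args.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"{name}: unknown argument")
        elif isinstance(value, bool) or not isinstance(value, _TYPES[rule["type"]]):
            errors.append(f"{name}: expected {rule['type']}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: must be one of {rule['enum']}")
        elif "pattern" in rule and not re.search(rule["pattern"], value):
            errors.append(f"{name}: must match {rule['pattern']}")
        elif "minimum" in rule and value < rule["minimum"]:
            errors.append(f"{name}: must be >= {rule['minimum']}")
        elif "maximum" in rule and value > rule["maximum"]:
            errors.append(f"{name}: must be <= {rule['maximum']}")
    return errors


problems = validate_args(SEARCH_PRODUCTS_TOOL["parameters"], {"limit": 500})
```

Every constraint that lives in the schema is one the model sees up front and the runtime can reject before the tool body runs.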
Rule 3: idempotent writes with explicit keys
Every write tool takes an idempotency key. Two calls with the same key produce one effect. This is the single most reliable defense against retry-induced duplication.
{
"name": "send_invoice",
"args": {
"customer_id": "C-1234",
"amount_cents": 50000,
"idempotency_key": "invoice-2026-02-Q1-c1234"
}
}
The agent derives the key from the task's identity, so a replay during failure recovery reuses the same key and has no additional effect.
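One way the server side of this could look, as a sketch: an in-memory dict stands in for a durable store, and the key is derived by hashing a hypothetical task ID together with the action and customer. The helper name `make_idempotency_key` and the `replayed` flag are illustrative assumptions, not part of any particular framework.

```python
import hashlib

_effects = {}  # idempotency_key -> stored result; stand-in for a durable store
sent = []      # records each real side effect, for demonstration


def make_idempotency_key(task_id, action, customer_id):
    """Derive the key from task identity so a replayed step yields the same key."""
    digest = hashlib.sha256(f"{task_id}|{action}|{customer_id}".encode()).hexdigest()
    return f"{action}-{digest[:16]}"


def send_invoice(customer_id, amount_cents, idempotency_key):
    """Perform the write once per key; later calls return the stored result."""
    if idempotency_key in _effects:
        return {**_effects[idempotency_key], "replayed": True}
    sent.append((customer_id, amount_cents))  # the actual side effect
    result = {"status": "sent", "customer_id": customer_id, "amount_cents": amount_cents}
    _effects[idempotency_key] = result
    return {**result, "replayed": False}


key = make_idempotency_key("task-42", "send_invoice", "C-1234")
first = send_invoice("C-1234", 50000, key)
retry = send_invoice("C-1234", 50000, key)  # e.g. a retry after a timeout
```

In a real system the check-and-store step would need to be atomic (for example, a unique constraint on the key), otherwise two concurrent retries could both pass the check.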
Rule 4: loud failures
A tool that returns an empty list silently when its filter is malformed is a tool that will quietly destroy your evening. Tools should return structured errors that the agent can act on:
{
"error": "validation_failed",
"details": "customer_id 'C-1234' not found",
"suggestion": "use search_customers to find the correct id"
}
The suggestion field is underrated. It turns failures into recoveries.
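A small sketch of the difference between "no results" and "bad input", reusing the error shape above. The `CUSTOMERS` data, the `list_orders` tool, and the `C-` prefix check are made up for illustration.

```python
CUSTOMERS = {"C-1001": [{"order_id": 1}], "C-1002": []}  # made-up data


def tool_error(code, details, suggestion):
    """Build the structured error shape the agent can act on."""
    return {"error": code, "details": details, "suggestion": suggestion}


def list_orders(customer_id):
    """Return orders, or a structured error; never a silent empty list on bad input."""
    if not isinstance(customer_id, str) or not customer_id.startswith("C-"):
        return tool_error(
            "validation_failed",
            f"customer_id {customer_id!r} is not of the form C-<digits>",
            "pass an id such as 'C-1001'",
        )
    if customer_id not in CUSTOMERS:
        return tool_error(
            "validation_failed",
            f"customer_id '{customer_id}' not found",
            "use search_customers to find the correct id",
        )
    return {"orders": CUSTOMERS[customer_id]}


empty_but_valid = list_orders("C-1002")  # a real customer with no orders
unknown = list_orders("C-1234")          # an id that does not exist
```

An empty `orders` list now means exactly one thing: the customer exists and has no orders. Every other case tells the agent what went wrong and what to try next.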
Rule 5: read tools are cheap, write tools are expensive
In most tasks, roughly 80% of tool calls should be reads and 20% writes. Give the agent rich read tools so it can verify state before writing. Limit write tools to the specific actions the workflow needs.
A common pattern: every write tool has a “preview” sibling that returns what the write would do without doing it. The agent calls preview first, then write. This is the cheapest way to add a safety net.
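The preview pattern might be sketched like this: both tools share one planning function, so the preview cannot drift from what the write actually does. The names `_plan_invoice` and `preview_send_invoice` and the `INVOICES` list are hypothetical, and the idempotency key from Rule 3 is left out to keep the sketch short.

```python
INVOICES = []  # stand-in for the real system of record


def _plan_invoice(customer_id, amount_cents):
    """Shared validation and planning used by both the preview and the write."""
    if amount_cents <= 0:
        raise ValueError("amount_cents must be positive")
    return {"action": "send_invoice", "customer_id": customer_id, "amount_cents": amount_cents}


def preview_send_invoice(customer_id, amount_cents):
    """Read-only sibling: report what the write would do without doing it."""
    return {"would_do": _plan_invoice(customer_id, amount_cents), "committed": False}


def send_invoice(customer_id, amount_cents):
    """Write tool: build the same plan, then commit it."""
    plan = _plan_invoice(customer_id, amount_cents)
    INVOICES.append(plan)
    return {"did": plan, "committed": True}


preview = preview_send_invoice("C-1234", 50000)
count_after_preview = len(INVOICES)
written = send_invoice("C-1234", 50000)
```

Because the preview runs the same validation as the write, an invalid call fails at the preview step, before anything has been committed.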
What goes in production
Before any agent touches production via our AI & LLM integration service, we verify:
- Every tool has a typed schema with constraints
- Every write tool is idempotent
- Every tool’s failure paths are exercised in evals
- No tool is more general than the task requires
- Read/write ratio matches the workflow
These are mundane checks. They prevent the dramatic incidents.
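Some of these checks can be automated. The sketch below covers only the first two items (typed schemas and idempotent writes) and assumes a hypothetical tool registry in which each entry carries a `writes` flag and a `parameters.properties` map; the other items still need evals and human review.

```python
def audit_tools(tools):
    """Return findings for two checklist items: typed schemas and idempotent writes."""
    findings = []
    for tool in tools:
        name = tool["name"]
        props = tool.get("parameters", {}).get("properties", {})
        if not props:
            findings.append(f"{name}: no typed schema")
        for arg, rule in props.items():
            if "type" not in rule:
                findings.append(f"{name}.{arg}: missing type")
        # "writes" is an assumed registry flag marking tools with side effects.
        if tool.get("writes") and "idempotency_key" not in props:
            findings.append(f"{name}: write tool without idempotency_key")
    return findings


TOOLS = [
    {"name": "get_customer_orders", "writes": False,
     "parameters": {"properties": {"customer_id": {"type": "string"}}}},
    {"name": "send_invoice", "writes": True,
     "parameters": {"properties": {"customer_id": {"type": "string"},
                                   "amount_cents": {"type": "integer"}}}},
]
findings = audit_tools(TOOLS)
```

Run as part of CI, a check like this would flag the `send_invoice` entry above for lacking an `idempotency_key` argument.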
The framework question
LangChain, LlamaIndex, the Anthropic SDK, and OpenAI’s function calling all let you define tools. The framework choice is mostly cosmetic. What matters is your tool design discipline, your eval coverage, and your audit logging. Pick the framework your team already uses; spend the time on tools.
Tools are the most undervalued lever in agent reliability. Our team audits tool surfaces and ships production agents with hardened action layers.