LLM Observability: LangSmith, Helicone, and What to Actually Log

Per-inference observability is non-negotiable for production AI. A field guide to LangSmith, Helicone, and the OSS options — what to log day one.

LLM Observability: LangSmith, Helicone, and What to Actually Log

When a customer says “the AI got it wrong,” the first thing you do is open the trace for that specific request. If you can’t, you can’t operate the AI system. Everything else — evals, prompt tuning, model upgrades — depends on per-inference observability being in place.

Most teams ship AI features without it, then add it after the first incident. We’ve stopped recommending that order. Here’s a practical guide to the LLM observability landscape in 2026 — what to log, what to use, and the decisions that age well.

What “LLM observability” actually means#

It’s three overlapping things:

  1. Per-inference tracing. For any specific call: what prompt went in, what came out, which model, which tools called, how long, how much.
  2. Aggregate metrics. Cost per feature per day. Latency p95. Error rate. Token usage trends.
  3. Eval results. For a fixed test set: how is the system scoring over time as prompts and models change.

Production AI needs all three. The tools differ in which they’re best at.

The landscape in 2026#

The category has consolidated meaningfully. Five tools cover most real deployments:

ToolBest atSharp edges
LangSmithTracing + evals, especially for LangChain usersMost ergonomic if you’re already in LangChain
HeliconeProxy-based logging, fast setup, cost trackingHosted product; self-host exists but less polished
Phoenix (Arize)OpenTelemetry-native, self-hostableNewer; ecosystem still maturing
LangfuseOpen-source LangSmith alternativeActive project; reasonable docs
Datadog / New Relic LLM ObservabilityDrop-in if you’re already thereLess depth than LLM-native tools

Honorable mentions: Weights & Biases Weave (good for ML-engineering teams already on W&B), OpenLLMetry (the OpenTelemetry semantic conventions for LLMs — works with any OTel backend), Pillar / Lakera (more on security/red-teaming).

How they fit#

LangSmith#

Best fit: teams using LangChain, or willing to use the LangSmith SDK directly.

Setup is plug-and-play if you’re on LangChain — set LANGCHAIN_TRACING_V2=true and trace appears automatically. For non-LangChain code, the @traceable decorator wraps any function. Datasets, eval runs, and prompt versioning are tightly integrated.

The killer feature is the eval surface. You define a dataset, define evaluators, run them on every commit (or on a schedule), and see whether your prompts/models are getting better. The UX for comparing two runs side-by-side is the best in the category.

The catch: LangSmith is a hosted product (Cloud or Enterprise). For projects with strict data-residency or air-gapped requirements, you’re on the Enterprise self-host path, which is more friction than Cloud.

Helicone#

Best fit: teams who want one-line observability and don’t want to refactor application code.

Helicone works as a proxy — change the base URL from https://api.openai.com/v1 to https://oai.helicone.ai/v1 and you have logging. That’s it. Cost tracking, request logs, rate limiting, caching — all enabled by the proxy.

The proxy model is the upside (easy) and the downside (you’re now routing your AI traffic through another service). For Helicone Cloud, that means trust in their availability and data handling. Helicone has an open-source self-host option if you’d rather run the proxy yourself.

For mixed-provider workloads (OpenAI + Anthropic + Bedrock + Together), Helicone’s universal proxy handles all of them with a consistent log shape.

Phoenix (Arize)#

Best fit: teams who want OpenTelemetry-native observability and the option to self-host without licensing friction.

Phoenix uses OTel semantic conventions for LLM calls, which means it interoperates with the broader observability ecosystem. You can ship traces to Phoenix and also to Datadog or Grafana from the same instrumentation.

It’s also a real eval and dataset tool, with notebooks for evaluating retrieval pipelines specifically. For RAG-heavy systems, Phoenix’s retrieval eval views are excellent.

Trade-off: smaller ecosystem than LangSmith. Fewer pre-built integrations. The “newer player” energy is real even though the product is solid.

Langfuse#

Best fit: teams who want an open-source self-hostable platform with most of LangSmith’s features.

Langfuse is Apache 2.0, self-hostable on your own Postgres, with a generous SaaS free tier. It covers tracing, datasets, evals, prompt management. The UX is good and getting better.

For projects where the AI workload itself isn’t huge but you need observability and don’t want to take on a SaaS dependency, Langfuse is the right answer.

Datadog / New Relic / etc.#

Best fit: teams who already have a strong APM tool and want to extend it.

The LLM-specific features in these tools are improving but lag the specialists. Trace correlation with the rest of your application (DB calls, HTTP requests) is the upside — you can see “this LLM call was inside this HTTP request which hit this DB query.” Real value if your AI is one feature of a larger system.

Use this if you’re already invested in the APM and want one less tool. Use a specialist if AI is a meaningful part of the product.

What to log on day one#

Whichever tool you pick, log these for every inference:

The prompt, in full. System message. User message. All prior turns. Any retrieved context. Any tool messages. The complete payload sent to the model. Hash it for indexing, store it verbatim for review.

The response, in full. Final assistant message. Any tool calls with their arguments. The reasoning trace if the model returned one. Finish reason (stop, length, tool_calls, etc.).

Model metadata. Which model (gpt-4o-2024-08-06, not just gpt-4). Temperature. Max tokens. Top-p. Tools available. Any other generation parameters.

Per-call metrics. Input tokens, output tokens, total cost. Latency. Provider. Any cache hit indicator.

Context. Which user, tenant, feature, A/B variant. Which request, which session, which workflow step.

Outcome signal. If the user accepted, regenerated, edited, abandoned. If a downstream check passed or failed. If the response was flagged. The “did this work” signal is what closes the loop on evals.

That’s the day-one minimum. It’s not exotic. The mistake is logging less than this and finding out you can’t debug a production incident.

The cost dimension#

LLM costs blow up in non-obvious ways. The dashboards we always wire up:

  1. Cost per feature per day. Tag every inference with feature_name. See which features are spending the money. A regression in one feature shouldn’t be invisible in a monthly aggregate.
  2. Tokens per request, distribution. Not the average — the distribution. p99 outliers are usually a retrieval that pulled in too much context, or a prompt template that grew over time.
  3. Cost per user. Cohort it. A small number of users often drive disproportionate cost; knowing who lets you decide what to do about it (rate-limit, upsell, cache).
  4. Cost vs revenue, if you can. For AI features that drive revenue, cost as a percentage of revenue from that feature is the right framing. “It costs $X” tells you nothing; “it costs 12% of the revenue it generates” tells you whether to optimize.

LangSmith and Helicone both have decent cost views. For richer cost analysis (cost per cohort, cost as % of revenue), exporting to your data warehouse and building a custom view is usually the right path.

The eval dimension#

Per-inference traces tell you what happened on one call. Evals tell you whether the system is getting better or worse over time.

The minimum viable eval setup:

  1. A golden dataset. 50-200 (input, expected outcome) pairs that represent real production traffic. Sampled from production logs once you have them; hand-curated before that.
  2. An evaluator. For each pair, score the system’s output. Can be: exact match (for structured outputs), LLM-as-judge (for free text), retrieval-specific metrics (recall@K, MRR for RAG), or a hand-written check.
  3. A CI hook. Run evals on every PR that touches prompts or model selection. Fail the build if scores drop more than X%.
  4. A dashboard of scores over time. So you can see model upgrades and prompt changes as inflection points.

LangSmith and Phoenix are best-in-class for this. Langfuse covers it. Helicone is weaker on evals (more focused on tracing and cost). For evals-heavy projects, weight that.

What we wire up by default#

For a new production AI feature, our default observability stack:

  • LangSmith if the project is LangChain-heavy and the team’s OK with SaaS.
  • Langfuse self-host if the data needs to stay in-house or the team prefers OSS.
  • Helicone as a proxy layer in front of providers, for cost and request logging — sometimes in combination with one of the above.
  • OpenLLMetry instrumentation with traces flowing to whichever backend, so we’re not locked into one vendor.
  • A few custom dashboards in Grafana for cost per feature and tokens-per-request distributions.
  • Evals running in CI against a golden dataset, with a slack alert on regression.

For the hospital and banking AI work where data residency is non-negotiable, self-hosted Langfuse + custom Postgres logging is the typical shape. For consumer-facing or internal-tooling AI features where SaaS is fine, LangSmith + Helicone is the path of least resistance.

The thing none of them solve#

Tooling logs the inference. It doesn’t tell you whether the answer was correct. That’s the eval problem, and the eval problem is still mostly your work — building the dataset, writing the judges, deciding what “correct” means for your domain.

The observability tools make the eval workflow tractable. They don’t replace the work.

The pattern of patterns#

The teams that operate production AI well treat it like any production system: traces for individual requests, metrics for aggregate behavior, alerts for regressions, evals as the gate that lets you change things confidently.

The teams that operate production AI badly treat it like magic — and find out from customers that the magic stopped working.

Pick a tool. Wire it up before you ship, not after the incident. The choice between LangSmith and Helicone matters less than the fact that you have observability at all.


You can’t operate what you can’t see. If your AI is shipped but blind, our AI & LLM integration team wires observability and evals into the systems we deploy. Tell us what you’re flying without.