OpenTelemetry in Production: Tracing, Metrics, Logs Unified in 2026

OpenTelemetry has become the default observability standard. By 2026, every credible observability vendor supports OTLP (OpenTelemetry Protocol) as the wire format, instrumentation libraries cover essentially every language and framework, and the operational patterns are well-established. The question is no longer “should we use OpenTelemetry” but “how do we deploy it well.”

This post walks through the production architecture for unified telemetry — tracing, metrics, and logs — across services.

The three signals

OpenTelemetry standardizes three primary telemetry signals.

Traces capture the path of a request through the system. Each trace consists of spans, and every span except the root records its parent span, so the spans of a trace form a tree. Distributed tracing is the most distinctive OTel capability: it works across services, languages, and infrastructure components.
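To make the parent-child structure concrete, here is a small Python sketch that rebuilds a trace tree from flat span records. A tracing UI has to do roughly this before it can draw a waterfall view. The `Span` dataclass, its field names, and the service names are hypothetical stand-ins, not the OpenTelemetry SDK's types. Real trace and span IDs are 16-byte and 8-byte hex values rather than short strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """Toy span record for illustration; not the OpenTelemetry SDK's Span type."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    service: str


def build_tree(spans: List[Span]) -> Tuple[Optional[Span], Dict[str, List[Span]]]:
    """Index spans by parent ID so the trace can be walked from its root."""
    root: Optional[Span] = None
    children: Dict[str, List[Span]] = {}
    for span in spans:
        if span.parent_id is None:
            root = span
        else:
            children.setdefault(span.parent_id, []).append(span)
    return root, children


def render(span: Span, children: Dict[str, List[Span]], depth: int = 0) -> List[str]:
    """Depth-first walk producing one indented line per span."""
    lines = [f"{'  ' * depth}{span.service}: {span.name}"]
    for child in children.get(span.span_id, []):
        lines.extend(render(child, children, depth + 1))
    return lines


# One hypothetical trace crossing three services; every span shares trace_id "t1".
trace = [
    Span("t1", "a", None, "GET /checkout", "frontend"),
    Span("t1", "b", "a", "POST /charge", "payments"),
    Span("t1", "c", "b", "INSERT charges", "payments"),
    Span("t1", "d", "a", "GET /cart", "cart"),
]

root, children = build_tree(trace)
if root is not None:
    print("\n".join(render(root, children)))
```

The database call shows up as a child of the payments span, which is what lets a trace view attribute slow queries to the request that issued them.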

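Metrics capture aggregated measurements: counters, gauges, and histograms, plus summaries, which the data model keeps mainly for compatibility with other formats such as Prometheus. OTel’s metrics model is mature, but cardinality control matters.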
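Histograms show why aggregated metrics stay cheap. Each one stores a fixed set of bucket counts plus a count and a sum, however many values are recorded. The sketch below is a simplified explicit-bucket histogram. The class name and the latency bounds are illustrative. It omits details an SDK handles, such as min/max tracking, aggregation temporality, and keeping one histogram per distinct attribute set.

```python
import bisect
from typing import List


class ExplicitBucketHistogram:
    """Simplified histogram aggregation: count, sum, and per-bucket counts.

    Bucket i counts values in (bounds[i-1], bounds[i]]; the final bucket holds
    everything above the last bound. Memory depends on len(bounds), not on how
    many values are recorded.
    """

    def __init__(self, bounds: List[float]):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def record(self, value: float) -> None:
        # bisect_left returns the first bound >= value, giving upper-inclusive buckets.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value


# Hypothetical request latencies in milliseconds.
latency = ExplicitBucketHistogram([10, 50, 100, 500])
for ms in [3, 12, 48, 50, 75, 220, 900]:
    latency.record(ms)

print(latency.counts)              # [1, 3, 1, 1, 1]
print(latency.count, latency.total)  # 7 1308.0
```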

Logs capture structured log events. Logging was the last of the three signals to stabilize in OTel; by 2026 it is production-ready, though its ecosystem is younger than those for traces and metrics.

The signals are complementary. Traces tell you what happened across a request. Metrics tell you what’s happening at the system level. Logs tell you what happened at the event level. Sophisticated debugging crosses signals — a metric anomaly leads to specific traces, which reference specific log events.
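Cross-signal correlation works because a log record can carry the same trace and span IDs as the span it was emitted under. Here is a minimal sketch, assuming incoming requests carry a W3C Trace Context `traceparent` header. It parses the header and stamps those IDs onto a JSON log line. The function names and log field names are hypothetical. An OTel logging bridge would typically attach the active trace context for you.

```python
import json
import re
from typing import Optional

# W3C traceparent layout: version-traceid(32 hex)-parentid(16 hex)-flags(2 hex).
# This sketch does not reject the all-zero IDs the spec treats as invalid.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    match = TRACEPARENT.match(header)
    if not match:
        return None
    _version, trace_id, span_id, flags = match.groups()
    # Bit 0 of the flags byte is the "sampled" flag.
    return {"trace_id": trace_id, "span_id": span_id,
            "sampled": (int(flags, 16) & 0x01) == 1}


def log_event(message: str, ctx: Optional[dict], **attrs) -> str:
    """Emit a structured log line carrying the active trace context, if any."""
    record = {"body": message, **attrs}
    if ctx:
        record["trace_id"] = ctx["trace_id"]
        record["span_id"] = ctx["span_id"]
    return json.dumps(record, sort_keys=True)


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(log_event("payment declined", ctx, customer_tier="gold"))
```

A backend can then jump from this log line to the trace with the same `trace_id`, which is the metric-to-trace-to-log path described above.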

The architecture

A typical OTel production deployment has four layers.

Instrumentation layer. OpenTelemetry SDKs in each service. Auto-instrumentation libraries cover the standard frameworks (HTTP servers, database clients, message queues); manual instrumentation adds business-specific spans.

Collector layer. The OpenTelemetry Collector, typically deployed as a DaemonSet on Kubernetes, or as a sidecar or host agent in other environments. Collectors receive telemetry from instrumented services, apply processing (filtering, sampling, redaction), and export to backends.

Processing layer. Within the collector pipeline, processors transform telemetry. Common operations: dropping high-cardinality attributes, sampling traces according to configured policies, scrubbing PII, and batching for efficient transmission.
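These processing steps can be pictured as a chain of stream transformations. The Python sketch below mirrors the shape of a Collector pipeline: filter, drop attributes, redact, batch. It assumes spans are flat dictionaries of strings. It is not the Collector's API, which is configured in YAML and implemented in Go. The processor names, the `session.id` attribute, and the email pattern are all illustrative.

```python
import re
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List

Span = Dict[str, str]  # hypothetical flat span: name plus string attributes
Processor = Callable[[Iterable[Span]], Iterator[Span]]

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # rough pattern, illustration only


def drop_spans(predicate: Callable[[Span], bool]) -> Processor:
    """Filtering: discard spans the predicate matches (e.g. health checks)."""
    def proc(spans: Iterable[Span]) -> Iterator[Span]:
        return (s for s in spans if not predicate(s))
    return proc


def drop_attributes(keys: set) -> Processor:
    """Remove high-cardinality attributes before they reach the backend."""
    def proc(spans: Iterable[Span]) -> Iterator[Span]:
        return ({k: v for k, v in s.items() if k not in keys} for s in spans)
    return proc


def redact_emails(spans: Iterable[Span]) -> Iterator[Span]:
    """PII scrubbing: mask anything that looks like an email address."""
    return ({k: EMAIL.sub("[REDACTED]", v) for k, v in s.items()} for s in spans)


def run_pipeline(spans: Iterable[Span], processors: List[Processor],
                 batch_size: int) -> List[List[Span]]:
    """Apply processors in order, then group the survivors into export batches."""
    stream: Iterator[Span] = iter(spans)
    for proc in processors:
        stream = proc(stream)
    batches = []
    while batch := list(islice(stream, batch_size)):
        batches.append(batch)
    return batches


spans = [
    {"name": "GET /health", "session.id": "s-0"},
    {"name": "GET /cart", "user.email": "ana@example.com", "session.id": "s-1"},
    {"name": "POST /charge", "note": "receipt sent to bo@example.org", "session.id": "s-2"},
    {"name": "GET /cart", "session.id": "s-3"},
]
pipeline = [
    drop_spans(lambda s: s["name"] == "GET /health"),
    drop_attributes({"session.id"}),
    redact_emails,
]
for batch in run_pipeline(spans, pipeline, batch_size=2):
    print(batch)
```

Processor order matters. Filtering first means later steps never spend work on spans that will be discarded anyway.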

Backend layer. Where the telemetry actually goes. Datadog, New Relic, Honeycomb, Dynatrace, Splunk, plus the open-source alternatives (Grafana stack, Jaeger, Prometheus, OpenSearch). OpenTelemetry’s vendor-neutral design means you can switch backends without re-instrumenting.

The sampling decision

Traces are expensive. A high-traffic service produces millions of spans per minute. Storing all of them is operationally prohibitive; sampling is mandatory at scale.

Three sampling approaches:

Head-based sampling decides at the start of the trace whether to keep it. It is simple to implement, and because the decision travels with the trace context, every service in the same trace makes the same choice. The downside is that you can’t sample based on what happened: you can’t say “keep all traces with errors” because the decision is made before the error occurs.
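One way to make head sampling consistent is to derive the decision from the trace ID itself. Any service holding the same ID then computes the same answer. The minimal sketch below takes that approach and treats the low 64 bits of the trace ID as a uniform random number. The idea is similar to OTel's trace-ID-ratio sampler, but the function name and the exact bit arithmetic are assumptions for illustration, not the SDK's code. In a real deployment the root service's decision is also carried downstream in the `traceparent` sampled flag.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-sampling decision computed only from the trace ID.

    The low 64 bits of a random 128-bit trace ID act as a uniform random
    number, so every service that sees this ID reaches the same verdict.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
frontend_keeps = should_sample(trace_id, 0.1)
payments_keeps = should_sample(trace_id, 0.1)
print(frontend_keeps == payments_keeps)  # True: same ID, same decision

# Over many random trace IDs, roughly 10% are kept.
rng = random.Random(7)
ids = [f"{rng.getrandbits(128):032x}" for _ in range(20_000)]
kept = sum(should_sample(t, 0.1) for t in ids)
print(f"kept {kept} of {len(ids)}")
```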

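Tail-based sampling decides after the trace completes, which allows error-based or latency-based sampling. The downside is that the collector must buffer spans until each trace completes, which adds memory and complexity. All spans of a trace must also reach the same collector instance, so multi-collector deployments need to route spans by trace ID.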
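A tail sampler trades memory for hindsight. This sketch buffers spans per trace ID. It applies a keep/drop policy once a fixed wait has elapsed since the trace's first span, which approximates "the trace is complete" because a collector cannot know for certain that no more spans are coming. The class name, the `decision_wait` parameter, and the injected `now` clock are illustrative assumptions. In this sketch, a span that arrives after its trace was decided starts a fresh buffer entry. That is one of the edge cases a production implementation has to handle.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Span = dict  # hypothetical span shape: needs at least a "trace_id" key


class TailSampler:
    """Buffers spans per trace and decides once the trace has had time to finish.

    A trace is treated as complete `decision_wait` seconds after its first span
    arrives. Memory grows with the number of in-flight traces.
    """

    def __init__(self, policy: Callable[[List[Span]], bool], decision_wait: float = 10.0):
        self.policy = policy
        self.decision_wait = decision_wait
        self.buffer: Dict[str, List[Span]] = defaultdict(list)
        self.first_seen: Dict[str, float] = {}

    def add(self, span: Span, now: float) -> None:
        trace_id = span["trace_id"]
        self.buffer[trace_id].append(span)
        self.first_seen.setdefault(trace_id, now)

    def flush(self, now: float) -> List[Span]:
        """Decide every trace whose wait has elapsed; return spans of kept traces."""
        ready = [t for t, t0 in self.first_seen.items() if now - t0 >= self.decision_wait]
        kept: List[Span] = []
        for trace_id in ready:
            spans = self.buffer.pop(trace_id)
            del self.first_seen[trace_id]
            if self.policy(spans):
                kept.extend(spans)
        return kept


def has_error(spans: List[Span]) -> bool:
    return any(s.get("error") for s in spans)


sampler = TailSampler(policy=has_error, decision_wait=10.0)
sampler.add({"trace_id": "t1", "name": "GET /cart"}, now=0.0)
sampler.add({"trace_id": "t2", "name": "POST /charge"}, now=1.0)
sampler.add({"trace_id": "t2", "name": "charge card", "error": True}, now=2.0)
print(sampler.flush(now=5.0))                         # [] (nothing has waited 10 s yet)
print([s["name"] for s in sampler.flush(now=12.0)])  # ['POST /charge', 'charge card']
```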

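Probabilistic + always-on for errors. A hybrid: sample most traces at a low rate, but always keep traces with errors or high latency. Because errors and latency are only known once a trace finishes, this is usually implemented as a tail-sampling policy. It is the pragmatic default for most teams.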
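The hybrid policy can be written as a single keep/drop function over a completed trace. This sketch assumes each span dict carries `trace_id`, `start_ms`, `end_ms`, and an optional `error` flag; all of these are hypothetical field names. The 5% base rate and 2-second latency threshold are examples only. The Collector's tail sampling processor expresses similar rules declaratively through its status-code, latency, and probabilistic policies.

```python
def keep_trace(spans: list, base_ratio: float = 0.05,
               latency_threshold_ms: float = 2000.0) -> bool:
    """Hybrid tail policy: always keep errors and slow traces, sample the rest."""
    # Rule 1: any span flagged as an error keeps the whole trace.
    if any(s.get("error") for s in spans):
        return True
    # Rule 2: end-to-end duration, from the earliest start to the latest end.
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= latency_threshold_ms:
        return True
    # Rule 3: everything else is kept at the base rate, decided from the trace ID.
    low64 = int(spans[0]["trace_id"], 16) & ((1 << 64) - 1)
    return low64 < int(base_ratio * (1 << 64))


tid = "f" * 32  # low 64 bits at their maximum, so rule 3 alone would drop it
failed = [{"trace_id": tid, "start_ms": 0, "end_ms": 40, "error": True}]
slow = [{"trace_id": tid, "start_ms": 0, "end_ms": 100},
        {"trace_id": tid, "start_ms": 50, "end_ms": 2500}]
fast_ok = [{"trace_id": tid, "start_ms": 0, "end_ms": 40}]
print(keep_trace(failed), keep_trace(slow), keep_trace(fast_ok))  # True True False
```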

For most production workloads in 2026, tail-based sampling through the OpenTelemetry Collector is the right architecture.

The cardinality discipline

Metrics with high cardinality (many distinct combinations of dimension values) consume substantial storage and query cost. Common offenders are per-customer, per-session, and per-request dimensions. Each such dimension is informative, but every distinct combination of values is its own time series, so these dimensions can multiply into hundreds of millions of series.

The discipline: declare cardinality budgets per metric. High-cardinality data goes to logs or traces, not metrics. Aggregate before recording: instead of one time series per customer, aggregate to customer tier so the metric carries one series per tier.
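Because series counts multiply across labels, the payoff from aggregation is large. Take hypothetical figures of 100,000 customers, 20 routes, and 5 status codes. A per-customer request counter could then reach 100,000 × 20 × 5 = 10,000,000 series, while the same counter keyed by 3 tiers tops out at 3 × 20 × 5 = 300. The sketch below counts series, enforces a per-metric budget, and applies the customer-to-tier aggregation. The tier mapping, label names, metric name, and budget value are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical lookup; in practice this would come from account data.
CUSTOMER_TIER = {"c1": "free", "c2": "free", "c3": "pro", "c4": "enterprise"}


def series_count(points: List[dict], labels: List[str]) -> int:
    """Distinct time series = distinct combinations of the given label values."""
    return len({tuple(p[label] for label in labels) for p in points})


def check_budget(metric: str, n_series: int, budget: int) -> None:
    """Fail fast (for example in instrumentation review or CI) on overshoot."""
    if n_series > budget:
        raise ValueError(f"{metric}: {n_series} series exceeds budget of {budget}")


def aggregate_to_tier(points: List[dict]) -> Dict[Tuple[str, str], int]:
    """Swap the unbounded customer_id label for the bounded tier label."""
    totals: Counter = Counter()
    for p in points:
        totals[(CUSTOMER_TIER[p["customer_id"]], p["route"])] += p["count"]
    return dict(totals)


points = [
    {"customer_id": "c1", "route": "/cart", "count": 3},
    {"customer_id": "c2", "route": "/cart", "count": 5},
    {"customer_id": "c3", "route": "/cart", "count": 2},
    {"customer_id": "c4", "route": "/checkout", "count": 1},
    {"customer_id": "c1", "route": "/checkout", "count": 4},
]
per_customer = series_count(points, ["customer_id", "route"])
per_tier = aggregate_to_tier(points)
print(per_customer, len(per_tier))  # 5 4
check_budget("http.requests", len(per_tier), budget=300)
```

With four customers the saving is small. The point is that the per-tier count stays bounded as the customer base grows, while the per-customer count does not.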

Cardinality problems tend to surprise teams because the bill arrives months after the instrumentation ships. Building cardinality checks into the instrumentation review process catches them before they reach production.

The backend choice

The major observability backends in 2026 all support OTel ingestion.

Datadog — substantial scale and feature breadth; pricing is the primary trade-off.

New Relic — mature, with broad coverage.

Dynatrace — strong APM heritage, particularly for enterprise.

Honeycomb — observability-focused with strong high-cardinality support.

Grafana Cloud — open-source-friendly with Grafana, Loki, Tempo, Mimir, Pyroscope.

Self-hosted alternatives — Prometheus + Grafana + Loki + Tempo + Jaeger covers the basics; SigNoz and Uptrace are unified alternatives.

The choice depends on workload, organizational preference, and cost. OpenTelemetry’s vendor neutrality is genuinely useful — teams can switch backends without re-instrumenting, which produces real competitive pressure on backend pricing.

The cost reality

Observability cost has been a continuing concern. The patterns that help:

Sampling at the collector layer.
Cardinality discipline for metrics.
Retention tiering — short retention for high-volume data, longer for aggregates.
Vendor competition — the multi-vendor support OpenTelemetry enables produces real cost pressure.
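Retention tiering usually pairs short-lived raw data with longer-lived rollups. This minimal sketch assumes raw points arrive as `(unix_seconds, value)` pairs. It collapses them into per-minute count/sum/min/max aggregates and checks data age against per-tier retention. The 7-day and 90-day windows are hypothetical placeholders, not recommendations. As a point of arithmetic, a series sampled once per second shrinks 60-fold at one-minute resolution.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Hypothetical retention windows per tier, in days.
RETENTION_DAYS = {"raw": 7, "rollup_1m": 90}


@dataclass
class MinuteRollup:
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


def rollup_per_minute(points: Iterable[Tuple[int, float]]) -> Dict[int, MinuteRollup]:
    """Collapse raw points into one aggregate per minute, keyed by minute start."""
    buckets: Dict[int, MinuteRollup] = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % 60, MinuteRollup()).add(value)
    return buckets


def is_expired(age_days: float, tier: str) -> bool:
    """True once data in this tier is older than its retention window."""
    return age_days > RETENTION_DAYS[tier]


raw = [(0, 1.0), (15, 3.0), (59, 2.0), (60, 10.0), (125, 4.0)]
rollups = rollup_per_minute(raw)
for minute, r in sorted(rollups.items()):
    print(minute, r.count, r.total, r.minimum, r.maximum)
print(is_expired(10, "raw"), is_expired(10, "rollup_1m"))  # True False
```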

Teams that don’t engineer cost into the observability stack discover it through the monthly bill.

Where pdpspectra fits

Our DevOps practice deploys observability stacks across diverse client contexts. OpenTelemetry is the default starting point; the backend choice is workload-driven.

OpenTelemetry is now the default. Talk to our team about your observability stack.
