Apache Kafka in Production: The Patterns That Avoid Operational Pain
Kafka is easy to demo and hard to operate. Six patterns we use on every production Kafka deployment — and the failure modes that justify each.
Kafka has a particular failure mode in industry: teams adopt it because it’s “the streaming standard,” set up a three-broker cluster from the docs, and then discover six months later that they’re running a distributed system they don’t understand. The cluster works fine — until it doesn’t, and the on-call rotation finds out that “Kafka best practices” is a six-thousand-page surface.
We’ve operated Kafka for hospital event pipelines, banking transaction streams, and logistics telemetry. Here are the six patterns we apply on every production Kafka deployment — and the failure modes that explain why each one exists.
Pattern 1: managed Kafka unless you have a reason#
For 90% of production Kafka deployments, we recommend Confluent Cloud, AWS MSK Serverless, or Aiven over self-hosted. The reason is unsentimental: operating Kafka well requires a small team that understands JVM tuning, KRaft consensus (or ZooKeeper, if you’re still there), partition reassignment, ISR shrinkage, and the rebalancing protocol intimately. Most data teams don’t have that. The managed providers do.
The “we’ll save money self-hosting” math almost never works out once you include:
- One engineer’s time learning Kafka operations (months)
- Pager duty for cluster incidents
- Capacity planning for the next 2 years
- Disaster recovery testing
- Version upgrades
- Security patching
Self-host Kafka if you’re doing something genuinely unusual — extremely high throughput (multi-GB/s), regulatory requirements that prohibit managed services, or you already have a strong Kafka-ops team. Otherwise, pay the managed-service premium and spend the team time on your actual product.
Pattern 2: choose your partition count carefully — and once#
The number of partitions on a topic is a decision you’ll live with. You can add partitions later, but you can’t remove them, and adding them breaks any consumer that relies on key-ordering (which is most of them).
Our default sizing:
- High-throughput topics (events/s in the 10k+ range): start with 24 partitions. Enough headroom to scale consumers to ~24 instances later without re-partitioning.
- Medium-throughput topics (under 10k events/s): 6-12 partitions.
- Low-throughput config / control topics: 3 partitions. Don’t pay the metadata overhead for topics that see 5 events/day.
The mistake we see most often is “we’ll just use 100 partitions for everything.” A cluster with 50 topics × 100 partitions × 3 replicas = 15,000 partitions, which is a real metadata load and slows down recovery and rebalance dramatically.
The other mistake: 1 partition per topic. Now you can never scale consumers past one instance and you’ve lost the main benefit of Kafka.
Pattern 3: producers that don’t lose data#
Every production producer config should set:
acks=all # wait for all in-sync replicas
enable.idempotence=true # exactly-once semantics within a producer session
max.in.flight.requests.per.connection=5 # default is fine; idempotence requires ≤5
retries=2147483647 # let the client retry indefinitely
delivery.timeout.ms=300000 # 5 min, then give up
compression.type=zstd # zstd > snappy > gzip for most workloads
linger.ms=20 # batch up writes for efficiency
batch.size=131072 # 128KB
The defaults Kafka ships with optimize for throughput, not durability. The defaults above flip that — they accept slightly higher latency per produce call in exchange for “if my producer returned success, the message is durably written across replicas.”
For the hospital and banking workloads where we deploy Kafka, the durability tradeoff is non-negotiable. For analytics-style workloads where a few lost events don’t matter, you can relax some of these (especially acks=1). Be explicit about it — and write down why.
Pattern 4: schemas, not strings#
A Kafka topic without a schema is a CSV file that’s harder to read.
We use Avro with the Confluent Schema Registry (or Protobuf, fine substitute) for every production topic. The schema is checked into git, registered before the producer ships, and evolution rules are enforced — backward-compatible by default.
The reasons are operational:
- Consumers can be written months after producers and not break when the producer adds a new field.
- Schema evolution is checked at registration time, not at runtime when your consumer crashes on an unknown field.
- Tooling works. Schema-aware tools (
kcat, kafka-ui, ksqlDB) can decode and display events. Without a schema, you’re reading bytes. - Avro/Protobuf is smaller on the wire than JSON — by 3-10x for typical payloads.
The bonus: schemas are the closest thing you have to documentation for your event streams. The schema for OrderShipped tells a future engineer what’s in the event without reading producer code.
Pattern 5: consumer group hygiene#
Kafka’s consumer group protocol is the source of most “Kafka is acting weird” tickets. Three operational disciplines fix 90% of them.
Static membership. Set group.instance.id per consumer instance. Without this, every consumer restart triggers a full rebalance — your throughput craters for 30-90 seconds while partitions are reassigned. With it, restarts of less than session.timeout.ms don’t trigger rebalance at all.
Cooperative rebalancing. Set partition.assignment.strategy to org.apache.kafka.clients.consumer.CooperativeStickyAssignor. This switches from “stop the world and reassign everything” to “incremental rebalancing where only affected partitions move.” Massive improvement for any consumer group with more than a few members.
Explicit offset commits. Set enable.auto.commit=false and commit offsets explicitly after processing succeeds. Auto-commit on a 5-second interval means a crash can replay up to 5 seconds of messages. Explicit commit-on-success means at-least-once delivery with bounded duplication, which is the right tradeoff for most workloads.
For consumers writing to a destination that supports idempotency (Postgres with ON CONFLICT, ClickHouse with ReplacingMergeTree, anything with a natural primary key), this gives you effective exactly-once semantics without the operational cost of Kafka transactions.
Pattern 6: monitor the things that actually break#
The four metrics we alarm on for every Kafka deployment:
-
Consumer lag, per group per partition. The single most important metric. Alarm when a consumer falls more than N minutes behind on a topic (we default to 5 min for operational topics, 30 min for analytics). This catches stalled consumers, slow downstream systems, and broker issues.
-
Under-replicated partitions. Should be zero. Anything above zero means a broker is sick or a partition is rebalancing — investigate.
-
Producer error rate. Buffer-pool exhaustion, retries-exhausted errors, network errors. These often precede a bigger incident by 10-30 minutes.
-
Broker disk usage. Kafka uses disk as a real component, not a fallback. Disk-full = cluster-down. Alarm at 70%; page at 85%.
Skip the deep JVM metrics (heap usage, GC time) for first-tier alarms unless you self-host. Managed Kafka providers handle the JVM internals; you don’t need to.
Patterns we no longer recommend#
A few patterns we used to use and have stopped:
-
Kafka Connect for arbitrary integrations. It works for the canonical sources (Postgres CDC via Debezium, S3 sink), but for anything custom, write a real consumer in your language of choice. Connect’s operational model — JVM-based, separate cluster, plugin uploads — is more pain than the convenience is worth.
-
Avro without a Schema Registry. Tried “just check the schemas into the repo.” Doesn’t work — version mismatch between producer and consumer becomes a runtime issue. Use the registry.
-
Tiered storage as a primary design. The feature is real and the price savings on cold data are nice, but tiered storage adds operational complexity. Use it on the topics where it pays off, not by default.
-
Self-hosting because “Kafka is open source.” See pattern 1.
What we use Kafka for, and what we don’t#
Kafka shines for:
- High-throughput event streams where ordering within a key matters (transactions per customer, events per device).
- Decoupling services that don’t share a database. Producer writes; consumer reads when it can.
- Replay-capable history. Retention of 7-30 days means a new consumer can backfill from history.
- Multi-consumer fanout. One producer, many consumers, each at their own pace.
Kafka is wrong for:
- Request-response. Use REST or gRPC.
- Long-term storage. It can technically hold years of data, but you’re paying disk costs for a log when an object store would be cheaper.
- Cross-team async messaging at small scale. A managed queue (SQS, Pub/Sub, Service Bus) is simpler and cheaper if you don’t need ordering or replay.
- The “we’ll add Kafka for future flexibility” pattern. If you don’t have a current workload that justifies it, you’re paying operational cost for hypothetical benefits.
The pattern of patterns#
Kafka rewards teams that treat it as a real database — with schemas, capacity planning, monitoring, and operational rigor — and punishes teams that treat it as “a queue.” Most production Kafka pain we get called in to debug comes from the latter mindset.
The teams that ship Kafka well aren’t the ones with the cleverest stream-processing topology. They’re the ones with discipline about partition counts, producer configs, consumer group hygiene, and what they’re actually paging on.
Kafka rewards rigor and punishes vibes. If you’re standing up streaming infrastructure and want a sanity check on the topology, our data engineering team has shipped Kafka in production for hospitals, banks, and logistics. Tell us about the workload.