Building a Production-Grade Operator with Kubebuilder
Kubernetes Operators encode operational expertise. These are the patterns that ship reliable Operators and avoid the common traps.
Kubernetes Operators encode operational expertise as code. The pattern is powerful: continuously reconciling controllers automate significant operational tasks. But Operator development has real pitfalls. Kubebuilder is a widely used framework for Go-based Operators; it handles the boilerplate but doesn’t prevent the design and reliability mistakes that bring Operators down in production. This post walks through what we’ve learned shipping Operators for clients.
What an Operator actually is
An Operator is a controller that watches custom resources and reconciles cluster state to match. The model:
Custom Resource Definitions (CRDs). Define new resource types in Kubernetes — for example, PostgresCluster or BackupSchedule.
Custom Resources (CRs). Specific instances of CRDs that users create.
Controller. Watches CRs and takes actions to bring actual state into agreement with desired state.
Reconciliation loop. The core pattern: observe state, determine drift, take action, repeat continuously.
The real value is encoding operational expertise. Instead of operators (humans) running playbooks, the Operator (software) runs them continuously.
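The observe, diff, act cycle can be sketched in a few lines of Go. This is a minimal illustration using only the standard library, not Kubebuilder-generated code: `Spec`, `Cluster`, `reconcileOnce`, and `reconcileUntilStable` are hypothetical names invented for the example, and a real controller would read and write state through the Kubernetes API instead of a struct.

```go
package main

// Spec is a hypothetical desired state, as a user would declare it in a custom resource.
type Spec struct {
	Replicas int
}

// Cluster is a stand-in for the actual state that the controller observes and changes.
type Cluster struct {
	Running int
}

// Action names the single step a reconcile pass decided to take.
type Action string

const (
	ActionNone      Action = "none"
	ActionScaleUp   Action = "scale-up"
	ActionScaleDown Action = "scale-down"
)

// reconcileOnce observes actual state, compares it to desired state,
// and takes at most one corrective step.
func reconcileOnce(desired Spec, c *Cluster) Action {
	switch {
	case c.Running < desired.Replicas:
		c.Running++
		return ActionScaleUp
	case c.Running > desired.Replicas:
		c.Running--
		return ActionScaleDown
	default:
		return ActionNone
	}
}

// reconcileUntilStable repeats reconcileOnce until no drift remains or
// maxPasses is reached. It returns the number of passes that changed something.
func reconcileUntilStable(desired Spec, c *Cluster, maxPasses int) int {
	changes := 0
	for i := 0; i < maxPasses; i++ {
		if reconcileOnce(desired, c) == ActionNone {
			break
		}
		changes++
	}
	return changes
}
```

Each pass decides what to do from the current observation alone, so a pass that is interrupted or repeated does no harm: the next one simply picks up from whatever state it finds.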
What Kubebuilder provides
Kubebuilder is a Go-based scaffolding tool for Operator development. It generates:
- CRD scaffolding
- Controller-runtime integration
- Webhook scaffolding (validation, defaulting, conversion)
- Test framework setup
- Manifests for deployment
This is a significant productivity gain over building Operators from scratch.
The common pitfalls
Several patterns produce Operator failure in production:
Excessive scope. “Let’s build an Operator that manages our entire database lifecycle, including provisioning, monitoring, backup, restore, scaling, upgrades, and everything else.” Broad scope produces complexity that is difficult to deliver.
Too much state in CR status. Stuffing operational state into the CR status field is fragile: status updates conflict, status grows beyond a reasonable size, and debugging gets harder.
Reconciliation that doesn’t converge. A reconciliation loop that takes a different action each time it runs, producing oscillation rather than convergence.
Missing idempotency. Reconciliation that is not idempotent: running it multiple times produces different results than running it once.
Blocking operations in the reconciler. A reconciler that blocks on long-running operations rather than returning and retrying.
Inadequate error handling. Errors that propagate to CR status without specific, actionable information.
Unhandled edge cases. Operations that work on the happy path but fail when underlying systems are in unusual states.
No backwards compatibility planning. CRD versions that break when changed, causing customer pain on upgrade.
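The blocking and idempotency pitfalls are easiest to see in code. The sketch below shows one way to avoid both, under stated assumptions: `Result` is a local type that loosely mirrors controller-runtime's `reconcile.Result` (redefined so the example compiles with the standard library alone), and `BackupAPI`, `reconcileBackup`, and `fakeBackups` are hypothetical names, not part of any real library. The 30-second interval is an arbitrary illustrative value.

```go
package main

import "time"

// Result loosely mirrors controller-runtime's reconcile.Result. It is
// redefined here so the sketch needs only the standard library.
type Result struct {
	RequeueAfter time.Duration
}

// BackupAPI is a hypothetical client for a long-running external operation.
type BackupAPI interface {
	Status(id string) (started, done bool)
	Start(id string)
}

// reconcileBackup never waits for the backup to finish. Each call checks the
// current state, takes at most one step, and asks to be called again later.
// Repeated calls are safe because Start only runs when nothing has started.
func reconcileBackup(api BackupAPI, id string) Result {
	started, done := api.Status(id)
	switch {
	case done:
		return Result{} // converged, nothing left to do
	case started:
		return Result{RequeueAfter: 30 * time.Second} // still running, check back later
	default:
		api.Start(id)
		return Result{RequeueAfter: 30 * time.Second}
	}
}

// fakeBackups is an in-memory BackupAPI used to exercise the sketch.
type fakeBackups struct {
	starts int
	done   bool
}

func (f *fakeBackups) Status(string) (bool, bool) { return f.starts > 0, f.done }
func (f *fakeBackups) Start(string)               { f.starts++ }
```

Calling `reconcileBackup` twice in a row against the fake starts the backup only once, which is the property a reconciler needs when the same object is queued repeatedly.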
The production patterns
Several patterns consistently produce reliable Operators:
Scope discipline. Start small. An Operator that does one thing well beats an Operator that attempts a broad scope and fails.
Reconciliation discipline. Each reconciliation step is observable, idempotent, and bounded in execution time.
State externalization. Use ConfigMaps, Secrets, or external databases for operational state. CR status is for high-level state only.
Conditions and events. CR status uses Conditions (the standard Kubernetes pattern); the Operator emits Kubernetes Events for significant actions.
Finalizers. Use finalizers for cleanup. They are the standard pattern for resources that need external teardown.
Webhooks. Validating webhooks prevent invalid CRs from being accepted, which is much better than the reconciler discovering invalid configuration later.
Version strategy. Serve multiple CRD versions with conversion webhooks. This is painful to retrofit, so build it in early.
Owner references. Resources created by the Operator carry owner references so that cascade deletion works properly.
Metrics and observability. The Operator exposes Prometheus metrics for reconciliation success, errors, and duration.
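To make the finalizer and Conditions patterns concrete, here is a simplified sketch of how the two interact during an object's lifecycle. It is an illustration under assumptions, not controller-runtime code: `Object` and `Condition` are trimmed-down stand-ins for the real Kubernetes types, the finalizer name `example.com/external-cleanup` is hypothetical, and `Deleting` stands in for a set deletion timestamp. In a real Operator you would likely use the finalizer helpers in controller-runtime's `controllerutil` package and the condition helpers in apimachinery's `meta` package instead of hand-rolled ones.

```go
package main

// Condition is a trimmed-down version of the standard Kubernetes condition shape.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// Object is a simplified stand-in for a custom resource.
type Object struct {
	Finalizers []string
	Deleting   bool // stands in for a set deletion timestamp
	Conditions []Condition
}

// cleanupFinalizer is a hypothetical finalizer name.
const cleanupFinalizer = "example.com/external-cleanup"

func hasFinalizer(o *Object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *Object, name string) {
	var kept []string
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// setCondition replaces the condition with the same Type, or appends it.
func setCondition(o *Object, c Condition) {
	for i := range o.Conditions {
		if o.Conditions[i].Type == c.Type {
			o.Conditions[i] = c
			return
		}
	}
	o.Conditions = append(o.Conditions, c)
}

// reconcileLifecycle adds the finalizer while the object is live, and on
// deletion runs external cleanup before releasing the finalizer.
func reconcileLifecycle(o *Object, cleanup func() error) error {
	if o.Deleting {
		if !hasFinalizer(o, cleanupFinalizer) {
			return nil // cleanup already done, nothing to hold the object back
		}
		if err := cleanup(); err != nil {
			setCondition(o, Condition{Type: "Ready", Status: "False", Reason: "CleanupFailed"})
			return err // finalizer stays, so cleanup is retried on the next pass
		}
		removeFinalizer(o, cleanupFinalizer)
		return nil
	}
	if !hasFinalizer(o, cleanupFinalizer) {
		o.Finalizers = append(o.Finalizers, cleanupFinalizer)
	}
	setCondition(o, Condition{Type: "Ready", Status: "True", Reason: "Reconciled"})
	return nil
}
```

The key design choice is that the finalizer is only released after cleanup succeeds. A failed teardown leaves the object in place with a condition explaining why, rather than orphaning the external resource.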
The testing dimension
Operator testing has several layers:
Unit tests for reconciliation logic with a fake client.
Integration tests with envtest (a real API server and etcd, but no built-in controllers or kubelet).
E2E tests in a real Kubernetes cluster (kind, minikube, or a real cluster).
Upgrade tests verify that new Operator versions handle existing CRs from previous versions.
Chaos tests verify that the Operator handles unexpected states (deleted resources, network failures, and so on).
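Two of these checks, idempotency and recovery from a deleted resource, can be expressed as small reusable helpers. The sketch below is a standard-library illustration of the idea only: `Store` is an in-memory map standing in for the API server, and `ReconcileFunc`, `ensureWorkers`, `checkIdempotent`, and `checkSelfHeals` are hypothetical names. In a real project the same assertions would live in `_test.go` files and run against a fake client or envtest.

```go
package main

import (
	"fmt"
	"reflect"
)

// Store is an in-memory stand-in for the API server: resource name -> replica count.
type Store map[string]int

// ReconcileFunc is the hypothetical shape of the logic under test.
type ReconcileFunc func(s Store)

// ensureWorkers is a sample reconciler: it makes sure "workers" exists with 3 replicas.
func ensureWorkers(s Store) { s["workers"] = 3 }

// snapshot copies the store so later mutations can be compared against it.
func snapshot(s Store) Store {
	c := make(Store, len(s))
	for k, v := range s {
		c[k] = v
	}
	return c
}

// checkIdempotent runs reconcile twice and reports an error if the
// second run changed anything.
func checkIdempotent(r ReconcileFunc, s Store) error {
	r(s)
	first := snapshot(s)
	r(s)
	if !reflect.DeepEqual(first, s) {
		return fmt.Errorf("second reconcile changed state: %v -> %v", first, s)
	}
	return nil
}

// checkSelfHeals deletes a managed resource behind the reconciler's back and
// reports an error if the next reconcile does not restore the earlier state.
func checkSelfHeals(r ReconcileFunc, s Store, name string) error {
	r(s)
	want := snapshot(s)
	delete(s, name)
	r(s)
	if !reflect.DeepEqual(want, s) {
		return fmt.Errorf("state not restored after deleting %q", name)
	}
	return nil
}
```

A reconciler that increments a counter on every call would fail `checkIdempotent`, which is the kind of bug that otherwise only shows up as oscillation in production.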
The operational concerns
Beyond development, operating Operators in production:
RBAC scoping. Operators frequently need broad cluster permissions; it takes discipline to keep them minimal.
Multi-tenancy. An Operator that handles many CRs from many users needs deliberate isolation design.
Upgrade strategy. New Operator versions need careful rollout: an Operator that breaks can break the customer workloads it manages.
Observability. Operator state needs to be inspectable; reconciliation history needs to be auditable.
Documentation. Custom resources need thorough documentation so users understand what they’re configuring.
When to build vs use existing
Several cases to consider for custom Operators:
Build when you have unique operational patterns that off-the-shelf Operators don’t address.
Use existing when established Operators (Postgres operators, Kafka operators, Cassandra operators, and so on) cover your needs.
Extend existing when established Operators cover roughly 80% of your needs.
Don’t build trivial Operators. An Operator that just deploys a Helm chart isn’t worth the machinery.
What we typically see at clients
Common patterns:
No custom Operators. Most enterprises use off-the-shelf Operators (Postgres, monitoring, and so on) but don’t build their own.
Custom Operators at scale. Larger organizations build custom Operators for organization-specific patterns.
Operator-driven platforms. Platform-engineering teams use Operators to automate platform capabilities.
Operator anti-patterns. Operators that are over-engineered for what they accomplish.
Where pdpspectra fits
Our DevOps practice builds production Kubernetes platforms, including custom Operator development when the automation justifies it.
Related reading: the GitOps multi-cluster post, the K8s network policies post, and the Karpenter vs Cluster Autoscaler post.
Custom Operators encode operational expertise. Talk to our team about your Kubernetes platform.