Temporal: The Workflow Engine That Replaces Cron + Queues + State Machines
Temporal collapses three patterns into one: durable workflows that survive restarts, first-class retries, and an answer to homegrown orchestration.
Most production systems have an “orchestration mess” — a set of cron jobs, message queues, state tables, and retry policies that, together, run the long-lived business processes. Bank onboarding, hospital claim adjudication, logistics shipment lifecycle, multi-step AI agents. The pattern is so common that every team writes some version of it. Most teams write it twice — once badly, then once again after the first version causes an outage.
Temporal is what you write the third time, except someone already wrote it.
What Temporal actually is
Strip the marketing: Temporal is a server, plus SDKs, that durably executes your application code. You write workflows as regular functions in Go, Python, TypeScript, Java, or .NET. Temporal handles:
- Durable execution. If your worker process dies mid-workflow, Temporal resumes the workflow exactly where it left off — same variables, same call stack — on a different worker.
- Timers as first-class. sleep(72 hours) in a workflow is a real thing. The workflow goes to sleep for 72 hours; no resources consumed; resumes on schedule.
- Activities with retries. Calls to external systems (databases, APIs, third parties) are activities, which Temporal automatically retries with configurable policies.
- Versioning. Workflows that started on v1 of your code can keep running on v1 logic even after you deploy v2.
- Visibility. Every workflow execution is queryable: status, history, current step, retry state. Built in.
The mental model: a workflow is a deterministic program; activities are everything non-deterministic. Temporal records every effect of a workflow run, so replaying the recording on a different machine reproduces the same state.
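The record-and-replay idea can be sketched in plain Python. This is a toy model, not Temporal's implementation: `Recorder`, `charge_card`, `send_receipt`, and `order_workflow` are hypothetical names invented for illustration, and the "history" is just a list of activity results.

```python
from typing import Any, Callable, List, Optional


class Recorder:
    """Toy event history: records activity results once, replays them afterwards."""

    def __init__(self, history: Optional[List[Any]] = None) -> None:
        self.history: List[Any] = list(history or [])
        self._cursor = 0
        self.activity_calls = 0  # how many real side effects ran in this process

    def activity(self, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result, do not repeat the side effect.
            result = self.history[self._cursor]
        else:
            # First execution: run the side effect and record its result.
            result = fn(*args)
            self.activity_calls += 1
            self.history.append(result)
        self._cursor += 1
        return result


def charge_card(amount: int) -> str:
    return f"charge-{amount}"


def send_receipt(charge_id: str) -> str:
    return f"receipt-for-{charge_id}"


def order_workflow(ctx: Recorder, amount: int) -> str:
    # Deterministic orchestration: every side effect goes through ctx.activity.
    charge_id = ctx.activity(charge_card, amount)
    return ctx.activity(send_receipt, charge_id)


# Worker 1 runs the workflow from scratch.
first = Recorder()
result_1 = order_workflow(first, 42)

# Worker 2 replays the full history: same result, no side effects repeated.
second = Recorder(first.history)
result_2 = order_workflow(second, 42)

# Worker 3 picks up a history that stopped after the first activity.
resumed = Recorder(first.history[:1])
result_3 = order_workflow(resumed, 42)
```

The third run is the interesting one: the card is not charged again, and only the receipt activity executes. That is the property the workflow/activity split buys you.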
The problems Temporal solves
Look at the patterns in a typical production codebase that Temporal collapses into one abstraction:
| Pattern in code | What Temporal replaces it with |
|---|---|
| Cron job that runs every N hours, looks for “stuck” records, retries | Workflow with a timer |
| Message queue + handler with custom retry logic | Activity with built-in retries |
| State table tracking “what step is this order on” | Workflow state variables |
| sleep and setTimeout patterns for delayed actions | Native workflow.sleep() |
| Distributed lock for “only one of these can run at a time” | Workflow ID uniqueness constraint |
| Custom code to handle “API was down, retry tomorrow” | Activity retry policy with exponential backoff |
| Saga / compensating transaction code for rollback | Workflow with explicit compensation activities |
| Cron + email for “if status hasn’t changed in 48h, page someone” | Workflow with timer + signal |
These are the same patterns implemented many times in many companies. Temporal is the abstraction.
A concrete example
Consider a hospital intake workflow: a patient arrives, intake form is captured, insurance is verified, an appointment is scheduled, reminders are sent, the visit happens, billing fires. Each step might take minutes (insurance verification API) or days (the actual appointment). Steps can fail (insurance down), require human input (front desk corrects an error), or time out (patient doesn’t confirm).
In a traditional system, this is maybe 6 services, a state table, 3 cron jobs for reminders, custom retry logic per integration, and a dashboard nobody trusts.
In Temporal, it’s one workflow:
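
    import asyncio
    from datetime import timedelta

    from temporalio import workflow
    from temporalio.common import RetryPolicy

    # Assumption: the activities and IntakeOutcome live in your own `activities` module.
    with workflow.unsafe.imports_passed_through():
        from activities import (
            IntakeOutcome, handle_no_show, schedule_appointment, send_reminder,
            trigger_billing, verify_insurance, verify_intake_form,
        )

    QUICK = timedelta(minutes=5)  # timeout for the short activities

    @workflow.defn
    class PatientIntake:
        def __init__(self) -> None:
            self.visit_completed = False

        @workflow.signal
        def complete_visit(self) -> None:
            # Sent by the front-desk system once the visit is over.
            self.visit_completed = True

        @workflow.run
        async def run(self, intake_id: str) -> IntakeOutcome:
            patient = await workflow.execute_activity(
                verify_intake_form, intake_id,
                schedule_to_close_timeout=QUICK,
            )
            insurance = await workflow.execute_activity(
                verify_insurance, patient,
                retry_policy=RetryPolicy(maximum_attempts=10, initial_interval=timedelta(minutes=1)),
                schedule_to_close_timeout=timedelta(hours=24),
            )
            # Multi-argument activities take args=[...]; every activity call needs a timeout.
            appointment = await workflow.execute_activity(
                schedule_appointment, args=[patient, insurance],
                schedule_to_close_timeout=QUICK,
            )
            # Wait until 24h before appointment, then send reminder.
            await workflow.sleep(appointment.start_time - timedelta(hours=24) - workflow.now())
            await workflow.execute_activity(
                send_reminder, args=[patient, appointment], schedule_to_close_timeout=QUICK)
            # Wait for the visit completion signal, or fall through after 7 days.
            try:
                await workflow.wait_condition(lambda: self.visit_completed, timeout=timedelta(days=7))
            except asyncio.TimeoutError:
                return await workflow.execute_activity(
                    handle_no_show, intake_id, schedule_to_close_timeout=QUICK)
            return await workflow.execute_activity(
                trigger_billing, args=[patient, appointment], schedule_to_close_timeout=QUICK)
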
That one function survives worker restarts, retries insurance verification up to ten times inside a 24-hour window, sleeps for actual days waiting for the appointment time, and handles no-shows after a week. Without Temporal, this would be a small subsystem.
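How long those ten insurance attempts actually span depends on the backoff settings. The sketch below estimates it by hand. It assumes Temporal's documented retry defaults (backoff coefficient 2.0, maximum interval of 100 times the initial interval) and ignores the time each attempt itself takes; `retry_waits` is a hypothetical helper, not an SDK function, so check the defaults against the SDK version you run.

```python
from datetime import timedelta
from typing import List, Optional


def retry_waits(
    initial_interval: timedelta,
    maximum_attempts: int,
    backoff_coefficient: float = 2.0,
    maximum_interval: Optional[timedelta] = None,
) -> List[timedelta]:
    """Waits between consecutive attempts under capped exponential backoff."""
    # Assumed default cap: 100x the initial interval.
    cap = maximum_interval or initial_interval * 100
    waits = []
    for retry in range(maximum_attempts - 1):  # N attempts means N-1 waits
        wait = initial_interval * (backoff_coefficient ** retry)
        waits.append(min(wait, cap))
    return waits


waits = retry_waits(timedelta(minutes=1), maximum_attempts=10)
wait_minutes = [int(w.total_seconds() // 60) for w in waits]
total_wait = sum(waits, timedelta())

print(wait_minutes)
print(total_wait)
```

Under these assumptions the nine waits add up to 327 minutes, roughly five and a half hours, so the attempt limit is reached well before the 24-hour schedule_to_close_timeout. If the intent is to keep trying for the full day, raise maximum_attempts or set an explicit maximum interval.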
When Temporal earns its keep
Temporal is genuinely transformative for workloads that have any of these properties:
- Long-running. Anything over a few minutes. Workflows that wait hours, days, or months. Most domain processes — onboarding, claims, contracts, fulfillment, training pipelines.
- Multi-step with state. “First do A, then if A returns X do B, else do C, then wait for human approval, then do D.” Imperative orchestration without the state-machine ceremony.
- Retry-heavy. Workflows that touch flaky third parties. Insurance APIs that go down for an hour. Payment gateways with rate limits. Email providers with delivery delays.
- Compensation / saga pattern. Multi-step writes that need rollback if any step fails.
- Human-in-the-loop. Pause until a human signals (approval, correction, manual override). Resume from there.
- Scheduled work that’s more complex than cron. “Every Tuesday at 9 AM, but only if the previous run completed, and only for tenants in the active tier.”
Almost every business has at least one workflow that matches several of these. The bank onboarding flow. The hospital claim lifecycle. The logistics shipment status machine. The customer trial → conversion → renewal pipeline. The AI training pipeline that takes 18 hours and fails 10% of the time.
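The compensation pattern from the list above has a simple shape: remember an undo step for each action that succeeded, and run the undo steps in reverse when a later action fails. Here is a minimal plain-asyncio sketch of that shape. The step names and the `run_saga` helper are hypothetical; in a real Temporal workflow each action and each compensation would be an activity call.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Step = Callable[[], Awaitable[None]]
log: List[str] = []  # order in which steps actually ran


def step(name: str, fail: bool = False) -> Step:
    """Build a fake step that logs its name, or raises if fail=True."""
    async def run() -> None:
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return run


async def run_saga(steps: List[Tuple[Step, Step]]) -> bool:
    compensations: List[Step] = []
    try:
        for action, compensate in steps:
            await action()
            # Register the undo step only after the action succeeded.
            compensations.append(compensate)
        return True
    except Exception:
        # Roll back completed steps, newest first.
        for compensate in reversed(compensations):
            await compensate()
        return False


ok = asyncio.run(run_saga([
    (step("reserve_inventory"), step("release_inventory")),
    (step("charge_card"), step("refund_card")),
    (step("book_shipment", fail=True), step("cancel_shipment")),
]))
```

Because the shipment step fails, the refund runs before the inventory release, and `cancel_shipment` never runs since there is nothing to cancel. What Temporal adds on top of this shape is durability: the list of pending compensations survives a worker crash because it is rebuilt from history.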
When Temporal is overkill
Temporal is real infrastructure. Self-hosted, it’s a stateful service backed by Cassandra/MySQL/Postgres. Managed (Temporal Cloud), it’s a per-action SaaS bill. Either way, you’re adding a serious component.
Don’t reach for Temporal when:
- The workflow fits in one HTTP request. Sub-second, no retries, no waiting. Just write the function.
- The “workflow” is a single Kafka consumer. Stream processing has its own patterns (we wrote about Kafka in production here).
- You have one cron job. A single cron job + a status table + good logging is fine. Don’t add Temporal for one process.
- Your team has zero capacity for new infrastructure. Temporal Cloud is the way to avoid this — start there, not with self-hosted.
The threshold we use: if you have three or more distinct long-running workflows in the system, Temporal pays off. Fewer than that, you can probably get away with simpler patterns.
The operational shape
A few things to know about running Temporal in production.
Workers, not the server, do the work. The Temporal server is the scheduler and history store. Your worker processes (regular Python / Go / TS processes) actually execute workflow and activity code. You scale workers separately from the server.
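Workflow code must be deterministic. No random.random(), no datetime.now(), no direct file or network access; those go in activities, or you use the SDK's replay-safe equivalents (in the Python SDK, workflow.random() and workflow.now()). Temporal checks this at runtime via replay: if replayed code issues different commands than the recorded history, you’ll see “non-deterministic workflow detected” errors, most often right after a deploy that changed workflow logic. The discipline is real but learnable.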
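To see why wall-clock reads break replay, here is a toy detector in plain Python. It is a simplified model, not the SDK's actual check: `Tracer`, `NonDeterminismError`, and `make_workflow` are hypothetical names, and the "history" is only the ordered list of activity names the first run scheduled.

```python
from typing import Callable, List, Optional


class NonDeterminismError(Exception):
    pass


class Tracer:
    """Records scheduled commands, or checks them against a prior history."""

    def __init__(self, history: Optional[List[str]] = None) -> None:
        self.history = history
        self.commands: List[str] = []

    def schedule(self, activity_name: str) -> None:
        index = len(self.commands)
        self.commands.append(activity_name)
        if self.history is not None and index < len(self.history):
            expected = self.history[index]
            if expected != activity_name:
                raise NonDeterminismError(
                    f"history has {expected!r} at step {index}, "
                    f"code scheduled {activity_name!r}"
                )


def make_workflow(read_hour: Callable[[], int]) -> Callable[[Tracer], None]:
    def workflow_fn(tracer: Tracer) -> None:
        tracer.schedule("load_order")
        # Bug: branching on wall-clock time inside workflow code.
        if read_hour() < 12:
            tracer.schedule("send_morning_email")
        else:
            tracer.schedule("send_evening_email")
    return workflow_fn


# First execution happens at 9 AM.
original = Tracer()
make_workflow(lambda: 9)(original)

# Replay on another worker happens at 3 PM and takes the other branch.
replay = Tracer(history=original.commands)
try:
    make_workflow(lambda: 15)(replay)
    detected = False
except NonDeterminismError:
    detected = True
```

The fix in this toy model is the same as in real workflow code: the hour has to come from something recorded in history (an activity result or the SDK's replay-safe clock), so the replay sees the value the first run saw.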
Versioning is the hard part. A workflow started on v1 of your code can run for weeks. If you deploy v2 and the workflow logic changed, you need explicit versioning (workflow.patched() API) to keep old executions running on old logic. This is the thing people underestimate about Temporal — get the versioning pattern right early.
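The branching rule behind a patch marker can be modelled in a few lines. This is a simplified sketch of the semantics, not the SDK's implementation (the real check is tied to a position in the event history): `Execution` and `reminder_steps` are hypothetical names, and "sms-reminder" is a made-up patch ID.

```python
from typing import List, Set


class Execution:
    """Toy model of one workflow execution and its recorded patch markers."""

    def __init__(self, recorded_markers: Set[str], replaying: bool) -> None:
        self.markers = set(recorded_markers)
        self.replaying = replaying

    def patched(self, patch_id: str) -> bool:
        if self.replaying:
            # Replaying old history: take the new path only if the marker was recorded.
            return patch_id in self.markers
        # Running fresh code for the first time: record the marker, take the new path.
        self.markers.add(patch_id)
        return True


def reminder_steps(execution: Execution) -> List[str]:
    if execution.patched("sms-reminder"):
        return ["send_sms_reminder"]    # v2 logic
    return ["send_email_reminder"]      # v1 logic, kept for in-flight executions


# Started on v1, replayed after the v2 deploy: no marker in history, stays on v1.
old_run = reminder_steps(Execution(set(), replaying=True))

# Started after the v2 deploy: marker is recorded, v2 logic runs.
new_execution = Execution(set(), replaying=False)
new_run = reminder_steps(new_execution)

# The same execution replayed later still takes the v2 path.
new_run_replayed = reminder_steps(Execution(new_execution.markers, replaying=True))
```

The cost of this pattern is that both branches stay in the codebase until every execution that started on v1 has finished, which is why it pays to plan for it before the first long-running workflow ships.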
Visibility is a separate read store. Advanced workflow search is backed by Elasticsearch or, in newer server versions, a SQL database such as PostgreSQL or MySQL. Plan for it.
Temporal Cloud vs self-host. For most teams, Temporal Cloud is the right starting point. Self-hosting Temporal at scale is non-trivial (Cassandra ops, history store sizing, the visibility ES cluster). The Cloud price is reasonable for the operational burden it removes. Self-host if you’ve outgrown Cloud’s cost or have hard data-residency requirements.
How Temporal compares
A few tools in the same general space:
- Airflow / Dagster / Prefect. Data-pipeline orchestrators. Different shape — DAGs of batch tasks, scheduled runs, mostly Python-native. Good at “run this DAG nightly” workloads. Temporal is better at long-lived stateful workflows with human-in-the-loop and external API integration.
- AWS Step Functions. Closest comparable in terms of capability. JSON-based state machine definitions; AWS-only; managed. Step Functions is fine for AWS-native workloads where you’re OK with the JSON DSL and the AWS ecosystem lock-in. Temporal wins on: code-as-workflow (no DSL), multi-cloud, richer SDKs.
- Cadence. Temporal’s predecessor; the team that built Cadence at Uber forked it into Temporal. Cadence still exists. Temporal is the more active project.
- Conductor (Netflix). Earlier orchestrator; less momentum than Temporal these days.
What we ship by default
For new clients with non-trivial workflow needs, we recommend:
- Temporal Cloud for the workflow engine (skip self-host until volume justifies it).
- Python or TypeScript SDK based on team language.
- One workflow per business process, with activities encapsulating side effects.
- Explicit versioning patterns from day one. Don’t wait until v2 to figure out how to evolve workflows.
- A small set of monitoring dashboards: workflows started, completed, failed, by type. Long-running counts. Activity retry rates.
For the hospital management systems and banking workflows where we deploy Temporal, the value is loud within the first month: incidents that would have required custom retry code, manual state-table fixes, or “let me write a one-off script to find stuck items” become non-events.
The pattern of patterns
Temporal is the abstraction your team would build for the fifth time if you let them. It’s not magic; it’s a well-engineered version of code most teams have already written.
The teams that get the most out of Temporal aren’t the ones who use the most features. They’re the ones who recognize a long-running, stateful, retry-heavy workflow when they see one and resist the urge to build it from scratch one more time.
Most “orchestration” code is Temporal you wrote without knowing. If you’re building long-running workflows and finding yourself with a state-table-plus-cron-job pattern, our DevOps and platform team can help shape it. Tell us about the workflow.