Spot Instance Strategies for ML Training
Spot instances cut ML training costs by 60–80% when used with the right discipline. This post covers the patterns that make spot work for serious training jobs.
Spot (preemptible) instances offer 60–80% cost savings for ML training, with the caveat that the cloud can reclaim them on short notice: about two minutes on AWS, and as little as 30 seconds on GCP and Azure. The discipline needed to use them productively is well understood; teams that don’t apply it run on-demand and overspend.
These are the patterns that make spot work for serious training.
Where spot earns its place
Multi-day training jobs. The savings amortize over the duration.
Hyperparameter sweeps. Many parallel jobs; loss of one matters less.
Distillation and fine-tuning. Often shorter; preemption tolerable.
Batch inference. Latency-tolerant; spot fits cleanly.
Data preprocessing. Often the heaviest cost in a pipeline; spot capacity is well-suited.
Where it doesn’t
Latency-critical inference serving. Preemption mid-request is a customer event.
Training jobs that don’t checkpoint. A multi-day run that loses progress on preemption costs more than it saves.
Workloads with cluster-wide synchronization. Tightly coupled distributed training where one preempted node stalls the entire cluster.
The disciplines
Frequent checkpointing. Checkpoint every 10–30 minutes for long-running training, so the work lost to any single preemption is bounded by the checkpoint interval.
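A minimal sketch of time-based checkpointing, assuming the training state fits in a JSON-serializable dict. The class name `PeriodicCheckpointer` and its parameters are hypothetical; a real trainer would swap the JSON write for its framework's own save call and would usually upload the file to object storage afterwards.

```python
import json
import os
import tempfile
import time


class PeriodicCheckpointer:
    """Saves training state at most once per `interval_s` seconds."""

    def __init__(self, path, interval_s=15 * 60, clock=time.monotonic):
        self.path = path
        self.interval_s = interval_s
        self._clock = clock  # injectable so the timing logic can be tested
        self._last_save = clock()

    def maybe_save(self, state, force=False):
        """Write `state` if the interval has elapsed or `force` is set."""
        now = self._clock()
        if not force and now - self._last_save < self.interval_s:
            return False
        self._atomic_write(state)
        self._last_save = now
        return True

    def _atomic_write(self, state):
        # Write to a temp file in the same directory, then rename it into
        # place, so a preemption mid-write cannot leave a truncated file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)

    def load(self, default):
        """Return the last saved state, or `default` if none exists."""
        if not os.path.exists(self.path):
            return default
        with open(self.path) as f:
            return json.load(f)
```

Calling `maybe_save` once per training step keeps the loop simple: most calls return immediately, and the interval alone decides how much work a preemption can cost.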
Graceful preemption handling. When the spot warning arrives (about two minutes of notice on AWS, roughly 30 seconds on GCP and Azure), the trainer saves state and releases resources cleanly.
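One way to sketch this, assuming the platform delivers the preemption notice to the training process as `SIGTERM`. That is a common arrangement but not a universal one; some setups need to poll the cloud's metadata endpoint instead. `PreemptionFlag`, `train`, `run_step`, and `save_checkpoint` are hypothetical names used for illustration.

```python
import signal


class PreemptionFlag:
    """Turns a termination signal into a flag the training loop can poll."""

    def __init__(self, signals=(signal.SIGTERM,)):
        self.requested = False
        for sig in signals:
            signal.signal(sig, self._handle)

    def _handle(self, signum, frame):
        # Keep the handler trivial: record the request and return.
        self.requested = True


def train(total_steps, run_step, save_checkpoint, flag, start_step=0):
    """Run steps until finished or preempted; return the next step to run."""
    step = start_step
    while step < total_steps:
        if flag.requested:
            # Final save inside the notice window, then exit cleanly so the
            # orchestrator can re-queue the job from this step.
            save_checkpoint(step)
            return step
        run_step(step)
        step += 1
    save_checkpoint(step)
    return step
```

The flag is checked between steps, so a single step has to finish well inside the notice window for the final save to land. With a 30-second notice that constrains step duration more than a two-minute notice does.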
Job orchestration that re-queues preempted jobs. Kubernetes with Argo, AWS Batch, and Vertex AI Pipelines all handle this. Don’t build it yourself.
Diversified instance type bidding. Use multiple instance types and zones. A single instance type going scarce won’t kill your job.
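As an illustration of the idea, the sketch below picks a handful of capacity pools (instance type plus zone) while capping how many can share one instance type. The `Pool` fields, the thresholds, and all names and numbers are made up for the example; in practice the cloud's own fleet or node-pool configuration does this selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    instance_type: str
    zone: str
    price_per_hour: float     # illustrative spot price
    interruption_rate: float  # illustrative fraction reclaimed per hour


def pick_pools(pools, max_pools=4, max_per_type=2, max_interruption=0.2):
    """Choose the cheapest eligible pools without leaning on one type."""
    chosen = []
    per_type = {}
    eligible = [p for p in pools if p.interruption_rate <= max_interruption]
    for pool in sorted(eligible, key=lambda p: p.price_per_hour):
        if per_type.get(pool.instance_type, 0) >= max_per_type:
            continue  # this instance type is already well represented
        chosen.append(pool)
        per_type[pool.instance_type] = per_type.get(pool.instance_type, 0) + 1
        if len(chosen) == max_pools:
            break
    return chosen
```

The cap per instance type is the point: taking only the cheapest pools tends to concentrate on one type, which is exactly the exposure diversification is meant to remove.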
On-demand fallback. When spot is unavailable for extended periods, fall back to on-demand. Set a budget cap.
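The fallback rule can be sketched as a small decision function. Everything here is an assumption for illustration: the 30-minute wait threshold, the budget figure, and the names `choose_capacity` and `Decision` are hypothetical, and a managed scheduler would normally express the same policy in its own configuration.

```python
from enum import Enum


class Decision(Enum):
    SPOT = "spot"
    ON_DEMAND = "on_demand"
    KEEP_WAITING = "keep_waiting"


def choose_capacity(
    spot_available,
    waited_s,
    on_demand_spent,
    est_on_demand_cost,
    max_wait_s=30 * 60,
    on_demand_budget=1000.0,
):
    """Decide where to run a queued job under a capped on-demand budget."""
    if spot_available:
        return Decision.SPOT
    if waited_s < max_wait_s:
        # A short spot outage is not worth paying on-demand rates for.
        return Decision.KEEP_WAITING
    if on_demand_spent + est_on_demand_cost > on_demand_budget:
        # Budget cap reached: stay in the queue rather than overspend.
        return Decision.KEEP_WAITING
    return Decision.ON_DEMAND
```

For example, a job that has waited 45 minutes with no spot capacity falls back to on-demand only if its estimated cost still fits under the remaining budget.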
The architecture
For a credible ML training stack on spot:
- Job scheduler (Argo, Kubeflow, Vertex AI Pipelines, SageMaker training jobs) handling spot/on-demand transitions
- Object storage for checkpoints with versioning
- Distributed training framework (Ray, PyTorch DDP, DeepSpeed) with checkpoint save and resume wired in
- Monitoring to track preemption rates and adjust strategy
- Cost telemetry showing actual vs theoretical savings
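For the cost telemetry item, a rough sketch of how actual and theoretical savings might be computed from usage records. The `UsageRecord` schema is hypothetical, and it assumes the pipeline can attribute billed hours whose work was lost to preemption; the numbers in the usage note are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    capacity: str               # "spot" or "on_demand"
    hours: float                # billed instance-hours
    rate: float                 # price actually paid per hour
    on_demand_rate: float       # list price per hour for the same type
    wasted_hours: float = 0.0   # billed hours whose work was redone


def savings_report(records):
    """Compare the headline spot discount with what was actually saved."""
    spot = [r for r in records if r.capacity == "spot"]
    spot_list_cost = sum(r.hours * r.on_demand_rate for r in spot)
    spot_paid = sum(r.hours * r.rate for r in spot)
    paid = sum(r.hours * r.rate for r in records)
    # Baseline: the same useful work bought entirely at on-demand prices.
    baseline = sum((r.hours - r.wasted_hours) * r.on_demand_rate for r in records)
    theoretical = 1.0 - spot_paid / spot_list_cost if spot_list_cost else 0.0
    actual = 1.0 - paid / baseline if baseline else 0.0
    return {"paid": paid, "theoretical": theoretical, "actual": actual}
```

With 100 spot hours at 0.30 against a 1.00 list price, 10 of them wasted, plus 20 on-demand fallback hours, the theoretical saving is 70% but the actual saving is about 55%. The gap comes from redone work and fallback hours, which is the number worth tracking over time.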
What the savings actually look like
Reported numbers vary; real numbers we’ve measured:
- 60–75% savings on multi-day training on common GPU types
- 30–50% savings on shorter jobs (preemption tax higher)
- 50–70% savings on data preprocessing
- 70–85% savings on hyperparameter sweeps
Savings run deepest when the job is long or the workload tolerates preemption well.
Multi-cloud spot
For teams running on multiple clouds:
- AWS Spot, GCP Preemptible/Spot, and Azure Spot have different price dynamics
- Cross-cloud spot orchestration is non-trivial but available (commercial schedulers, Kubernetes-based tooling)
- The marginal cost saving rarely justifies the operational complexity
For most teams, going deep on one cloud’s spot offering beats spreading across several.
What we ship for ML teams
For ML infrastructure engagements via our DevOps automation service:
- Spot-first training architecture
- Checkpointing discipline in trainer code
- Graceful preemption handling
- Cost monitoring with spot-vs-on-demand attribution
- Documented runbooks for the team
The cost-discipline context
Spot instances are one lever in a broader practice of FinOps for AI workloads. Used in isolation, they save money. Combined with the other disciplines (caching, model routing, batch sizing), the compound effect is much larger.
The honest tradeoff
Spot adds engineering complexity. For teams without ML infrastructure maturity, on-demand is fine for the first 6–12 months. The shift to spot earns its place when:
- Training costs become material to the budget
- The team has built reliable checkpointing
- The workloads are predictable enough to plan spot vs on-demand mix
Moving to spot before the team has that discipline produces flaky training pipelines that eventually get rolled back.
Spot for ML training works with the right discipline. Without it, the savings turn into reliability problems. Our team builds ML training infrastructure on spot for production teams. Tell us about the workload.