GPU Cost Optimization in 2026: Spot, Reserved, On-Demand Tradeoffs
GPU costs dominate ML budgets at scale. The three pricing modes, the workloads they fit, and the rebalancing patterns.
GPU costs dominate ML budgets once workloads reach meaningful scale. A single H100 hour on AWS p5 runs ~$8 on-demand; B100/B200 hours run higher. For teams running large training or inference workloads, the GPU bill becomes the largest single line item. The three pricing modes (on-demand, reserved, and spot) produce dramatically different unit economics. Getting the mix right can cut costs 40-70% without changing what the workloads do.
This post walks through the three modes, the workloads they fit, and the rebalancing patterns we apply at clients.
On-demand
On-demand is the default. Pay-per-second pricing, no commitment, no risk of preemption. Easy to use; easy to overspend.
For unpredictable workloads where forecast confidence is low, on-demand is reasonable. For everything else, on-demand is the most expensive option.
The on-demand bill should be a fallback for capacity beyond your reservations and spot, not the primary spend.
Reserved capacity
Reserved capacity (AWS Reserved Instances, AWS Compute Savings Plans, GCP Committed Use Discounts, Azure Reserved Instances) trades commitment for a substantial discount. One-year commitments typically produce 30-40% savings; three-year commitments typically 50-60%.
For predictable inference workloads at scale, reserved capacity is the right backbone. Determine the baseline load that runs reliably; reserve that capacity; let on-demand or spot handle the variance.
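One way to turn "determine the baseline load" into a number is to take a low percentile of observed hourly GPU usage and commit to a fraction of it. The sketch below is illustrative only: the function names, the 10th-percentile choice, and the sample usage data are assumptions made for the example, not a prescribed method.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def reservation_target(hourly_gpus, baseline_pct=10, reserve_fraction=0.75):
    """Estimate how many GPUs to reserve from hourly usage samples.

    baseline_pct: percentile treated as the reliably present load (assumption).
    reserve_fraction: share of that baseline to commit to.
    """
    baseline = percentile(hourly_gpus, baseline_pct)
    return baseline, math.floor(baseline * reserve_fraction)

# Hypothetical day of hourly GPU counts for an inference fleet.
usage = [40, 38, 36, 36, 37, 41, 52, 64, 70, 72, 75, 74,
         73, 76, 78, 74, 70, 66, 60, 55, 50, 46, 43, 41]

baseline, reserved = reservation_target(usage)
print(f"baseline={baseline} GPUs, reserve={reserved} GPUs")
```

In practice the input would span weeks of usage rather than one day, so that weekly cycles and growth trends show up in the percentile.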
The discipline that matters:
Forecast carefully. Over-committed reservations are wasted money. We typically recommend reserving 70-80% of the baseline rather than 100% to maintain flexibility.
Layer commitment types. AWS Compute Savings Plans apply across instance families and regions, while Reserved Instances are tied to a specific instance configuration; some workloads benefit from mixing the two.
Review regularly. Workloads change; reservations should follow. Annual review of reservation mix is the minimum.
Use the marketplace where allowed. The AWS Reserved Instance Marketplace lets you sell Standard Reserved Instances you no longer need. The market is thin but useful.
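The cost of over-committing can be made concrete with a break-even calculation. A reserved GPU is paid for every hour whether it is used or not, so at a discount d it only beats on-demand when it is busy more than (1 - d) of the time. The sketch below assumes a flat hourly rate and ignores upfront-payment options; the function names and the $8 rate used in the example are illustrative.

```python
def break_even_utilization(discount):
    """Fraction of hours a reserved GPU must be busy to beat on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount

def annual_costs(on_demand_rate, discount, utilization, hours=8760):
    """Return (on_demand_cost, reserved_cost) for one GPU over a year."""
    on_demand = on_demand_rate * hours * utilization
    # The reservation is billed for every hour, used or not.
    reserved = on_demand_rate * (1 - discount) * hours
    return on_demand, reserved

for d in (0.35, 0.55):
    print(f"{d:.0%} discount breaks even at "
          f"{break_even_utilization(d):.0%} utilization")

# Example: a GPU at an assumed $8/hour that is busy only half the time.
od, rs = annual_costs(on_demand_rate=8.0, discount=0.35, utilization=0.5)
print(f"on-demand ${od:,.0f} vs reserved ${rs:,.0f}")
```

This is why reserving below the full baseline is a reasonable hedge: capacity that sits idle more often than the break-even point costs more reserved than it would on-demand.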
Spot capacity
Spot capacity (AWS Spot Instances, GCP Preemptible/Spot VMs, Azure Spot VMs) trades preemption risk for a substantial discount. Discounts are typically 60-90% off on-demand. The trade-off: instances can be reclaimed on short notice, about 2 minutes on AWS and as little as 30 seconds on GCP and Azure.
For workloads that tolerate interruption, spot is dramatically cheaper. The key applications:
Training workloads — checkpoint regularly, resume on preemption, accept some wasted work. Total cost typically 50-70% lower than equivalent on-demand training.
Batch inference — process queued work, retry on preemption.
Development and experimentation — for non-time-critical work.
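To see how "accept some wasted work" interacts with the discount, the following sketch estimates the billed cost of a spot training run. It is a first-order model under stated assumptions: each preemption loses, on average, half a checkpoint interval plus a fixed restart overhead, and preemptions during re-done work are ignored. All parameter names and example values are hypothetical.

```python
def spot_training_cost(useful_hours, on_demand_rate, spot_discount,
                       preemptions_per_hour, checkpoint_interval_h,
                       restart_overhead_h):
    """Approximate billed cost of a spot training run, in dollars."""
    # Expected work lost per preemption: half an interval plus restart time.
    lost_per_preemption = checkpoint_interval_h / 2 + restart_overhead_h
    billed_hours = useful_hours * (1 + preemptions_per_hour * lost_per_preemption)
    return billed_hours * on_demand_rate * (1 - spot_discount)

# Hypothetical run: 1,000 useful GPU-hours, $8/hour on-demand, 70% spot
# discount, one preemption every 20 hours, 30-minute checkpoints,
# 15 minutes to restart.
spot = spot_training_cost(1000, 8.0, 0.70, 0.05, 0.5, 0.25)
on_demand = 1000 * 8.0
print(f"spot ${spot:,.0f} vs on-demand ${on_demand:,.0f} "
      f"({1 - spot / on_demand:.0%} saved)")
```

With these inputs the wasted work adds only a few percent to billed hours, so the discount dominates. The picture changes if preemptions are frequent or checkpoints are far apart.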
The discipline that matters:
Checkpoint frequency. Training jobs should checkpoint at intervals where re-running the lost work is acceptable. Typically every 30-60 minutes for substantial training runs.
Resilient training framework. The training code needs to handle preemption gracefully. Modern frameworks (TorchTitan, DeepSpeed, Megatron) handle this if configured.
Capacity diversification. Don’t request a single instance type in a single AZ; diversify across instance types and AZs to reduce preemption likelihood.
Avoid spot for real-time inference. The preemption risk is incompatible with user-facing latency SLOs.
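For the checkpoint-frequency point above, a common starting estimate is Young's approximation, which balances time spent writing checkpoints against expected lost work: interval ≈ sqrt(2 × checkpoint time × mean time between interruptions). This formula is brought in here for illustration and is not from the original discussion; the input values are assumptions.

```python
import math

def young_checkpoint_interval(checkpoint_minutes, mtbf_minutes):
    """Young's approximation for the checkpoint interval, in minutes.

    checkpoint_minutes: time to write one checkpoint.
    mtbf_minutes: mean time between preemptions for the whole job.
    """
    if checkpoint_minutes <= 0 or mtbf_minutes <= 0:
        raise ValueError("both inputs must be positive")
    return math.sqrt(2 * checkpoint_minutes * mtbf_minutes)

# Hypothetical job: 2-minute checkpoint writes, one preemption every 8 hours.
interval = young_checkpoint_interval(checkpoint_minutes=2, mtbf_minutes=8 * 60)
print(f"checkpoint roughly every {interval:.0f} minutes")
```

With these inputs the estimate falls inside the 30-60 minute range mentioned above. Faster checkpoint writes or more frequent preemptions both push the interval shorter.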
The hybrid pattern
Sophisticated teams combine all three.
Baseline inference on reserved capacity. Predictable load, substantial discount.
Variance inference on on-demand. Pay for the unpredictable portion.
Training and batch inference on spot. Massive savings, acceptable preemption risk.
Development on spot. Cheap experimentation.
The mix shifts over time as workload patterns change. We typically audit the mix quarterly at clients with substantial GPU spend.
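A quarterly audit usually starts from a blended-rate calculation like the one sketched here. The slice names, GPU-hour volumes, discounts, and the $8 on-demand rate are hypothetical values chosen to illustrate the arithmetic, not client data.

```python
from dataclasses import dataclass

@dataclass
class Slice:
    name: str
    gpu_hours: float
    discount: float  # fraction off the on-demand rate

def blended_cost(slices, on_demand_rate):
    """Return (total_cost, savings_vs_all_on_demand) for a pricing mix."""
    total_hours = sum(s.gpu_hours for s in slices)
    if total_hours <= 0:
        raise ValueError("mix must contain some GPU hours")
    cost = sum(s.gpu_hours * on_demand_rate * (1 - s.discount) for s in slices)
    all_on_demand = total_hours * on_demand_rate
    return cost, 1 - cost / all_on_demand

mix = [
    Slice("baseline inference (reserved)", 50_000, 0.35),
    Slice("variance inference (on-demand)", 10_000, 0.00),
    Slice("training and batch (spot)", 35_000, 0.70),
    Slice("development (spot)", 5_000, 0.70),
]

cost, savings = blended_cost(mix, on_demand_rate=8.0)
print(f"blended cost ${cost:,.0f}, {savings:.1%} below all on-demand")
```

Re-running the same calculation with actual billed hours per slice each quarter shows whether the mix has drifted toward on-demand.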
The cloud provider differences
A few patterns specific to providers in 2026:
AWS — broadest GPU instance variety; substantial spot capacity across types; Capacity Blocks for reserved-style training capacity.
GCP — A3 Mega for training; competitive spot pricing; growing Trillium TPU alternative.
Azure — ND-series for training; competitive enterprise pricing.
Specialized providers (Lambda Labs, CoreWeave, Crusoe, Together, Vast.ai) are sometimes substantially cheaper than hyperscalers for specific workloads. The operational trade-off is weaker integration with the surrounding cloud services.
Cerebras — for specialized workloads where CS systems make sense.
The reservation timing
A specific consideration in 2024-2026: GPU capacity has been scarce. Reserved capacity often requires advance booking with the provider. AWS Capacity Blocks, for example, are booked for a fixed future window, so substantial training runs need the reservation placed well ahead of the start date.
The implication: capacity planning has to start before the work starts. Teams that procrastinate find capacity unavailable.
What we typically see at clients
Common patterns at GPU-heavy clients:
Over-reliance on on-demand. Teams that don’t invest in reservations or spot pay 2-3x what they should.
Spot for the wrong workloads. Real-time inference on spot produces customer-visible degradation.
Reservation mismatches. The reservations cover one instance type while the workload runs on another.
No regular review. Workload changed; reservation mix didn’t follow.
The fixes are usually straightforward — audit, rebalance, instrument, repeat.
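The audit step for reservation mismatches can be as simple as comparing reserved and in-use counts per instance type. The sketch below assumes both inventories have already been exported into plain dictionaries; the function name and the counts are hypothetical, and pulling the data from a billing API is outside its scope.

```python
def audit_reservations(reserved, used):
    """Compare reserved vs. in-use instance counts by instance type.

    Returns (idle, uncovered):
      idle      - reserved instances with no matching usage
      uncovered - running instances with no matching reservation
    """
    idle, uncovered = {}, {}
    for itype in sorted(set(reserved) | set(used)):
        diff = reserved.get(itype, 0) - used.get(itype, 0)
        if diff > 0:
            idle[itype] = diff
        elif diff < 0:
            uncovered[itype] = -diff
    return idle, uncovered

# Hypothetical fleet: reservations were bought for p4d, but training
# has since moved to p5.
reserved = {"p4d.24xlarge": 10, "g5.12xlarge": 4}
used = {"p5.48xlarge": 6, "p4d.24xlarge": 3, "g5.12xlarge": 4}

idle, uncovered = audit_reservations(reserved, used)
print("idle reservations:", idle)
print("uncovered usage:", uncovered)
```

Idle entries are candidates for resale or for moving workloads back onto them; uncovered entries are where on-demand spend is accumulating.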
Where pdpspectra fits
Our MLOps practice includes GPU cost optimization as part of broader engagements. The discipline is straightforward; the implementation requires platform awareness.
Related reading: the FinOps cloud cost post, the quantization post, and the LLM cost optimization post.
GPU cost discipline pays substantial returns. Talk to our team about your AI infrastructure.