GPU Scheduling on Kubernetes
GPU scheduling on Kubernetes went from “research-grade” tooling in 2021 to “production standard” by 2025. NVIDIA’s GPU Operator, MIG (Multi-Instance GPU), time-slicing, and a maturing ecosystem of schedulers (Volcano, Kueue, Run:ai, custom solutions) make it credible at scale.
Here are the patterns that actually work in production.
The core scheduling primitives
Node selectors + taints. GPU-equipped nodes are tainted so that only pods requesting GPUs tolerate them; node selectors steer those pods to the right nodes. Basic isolation.
Resource requests. Pods request nvidia.com/gpu: N. The device plugin handles allocation.
Time-slicing. Multiple pods share one physical GPU with time-multiplexed access. Useful for low-utilization workloads (notebooks, light inference).
MIG (Multi-Instance GPU). Hardware partitioning on MIG-capable GPUs such as A100 and H100. Each pod gets a slice with dedicated memory and compute. Better isolation than time-slicing.
Topology awareness. Some workloads need GPUs with NVLink or specific PCIe topology. Scheduler awareness matters.
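The first two primitives combine into a single pod spec. A minimal sketch in Python that builds the manifest as a plain dict. The node label key `example.com/gpu-type`, the taint key `nvidia.com/gpu`, and the image name are assumptions for illustration; use whatever labels and taints your node pools actually carry.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, gpu_type="a100",
                     resource_name="nvidia.com/gpu"):
    """Build a Pod manifest (plain dict) that targets tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical node label; replace with your node pool's label.
            "nodeSelector": {"example.com/gpu-type": gpu_type},
            # Assumed taint key; tolerating it lets this pod land on GPU nodes.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "main",
                "image": image,
                # GPUs are extended resources and are set under limits.
                "resources": {"limits": {resource_name: gpus}},
            }],
            "restartPolicy": "Never",
        },
    }


if __name__ == "__main__":
    manifest = gpu_pod_manifest("train-demo", "example.com/trainer:latest", gpus=2)
    print(json.dumps(manifest, indent=2))
```

The `resource_name` parameter is there because a shared GPU may be advertised under a different resource name than a whole one, depending on how the device plugin is configured. Check what your nodes report before relying on any particular name.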
The schedulers
Default Kubernetes scheduler with NVIDIA device plugin. Fine for simple cases.
Volcano. Designed for batch workloads; fair-share, gang scheduling, preemption. Good for training jobs.
Kueue. Kubernetes-native job queueing with quotas. Becoming the de facto standard for ML training on K8s.
Run:ai (NVIDIA-owned). Commercial; mature features around fractional GPU sharing and team quotas.
Karpenter for node provisioning. Not a scheduler itself, but it pairs well with any GPU scheduler: it provisions the right GPU node type for the workload.
Choose by workload pattern. Notebooks and ad-hoc dev → fractional sharing. Training → batch scheduler.
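For the training side of that split, here is a sketch of a Job handed to Kueue, again built as a Python dict. It assumes Kueue's convention, as we understand it, of a `kueue.x-k8s.io/queue-name` label plus a Job created suspended so the queue controller decides when it starts. The queue name `team-a-queue` and the image are hypothetical. Verify the convention against the Kueue version you deploy.

```python
def kueue_training_job(name, image, queue, gpus_per_worker, workers=1):
    """Build a batch/v1 Job (plain dict) intended to be admitted by Kueue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Assumed Kueue convention: this label names the target queue.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Created suspended; the queueing layer unsuspends it on admission.
            "suspend": True,
            "parallelism": workers,
            "completions": workers,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Whole GPUs per worker; training does not share.
                        "resources": {
                            "limits": {"nvidia.com/gpu": gpus_per_worker},
                        },
                    }],
                },
            },
        },
    }


job = kueue_training_job(
    "finetune-demo", "example.com/trainer:latest",
    queue="team-a-queue", gpus_per_worker=4, workers=2,
)
```

Notebooks and ad-hoc dev pods would skip the queue and request a fractional resource instead.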
The patterns that work
Separate node pools by GPU type. A10G, A100, H100, MI300 each have different cost-performance characteristics. Pods request the type they need.
Fractional GPUs for development. A team of data scientists shares physical GPUs via MIG or time-slicing instead of holding one whole GPU each. Costs drop substantially.
Whole GPUs for training. Training jobs get exclusive access; no sharing.
Job queues with priority. Critical training jobs preempt low-priority work. Priorities and preemption rules are defined explicitly, not left to defaults.
Spot-aware scheduling. Schedulers that understand spot/preemptible behavior and re-queue on preemption.
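The priority-and-preemption idea can be shown with a toy admission function. This is an illustrative sketch only, not how Kueue or Volcano implement preemption; the `Job` type and the `admit` function are hypothetical names, and real schedulers also weigh fairness, quotas, and checkpoint cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    priority: int  # higher number = more important


def admit(job, running, capacity):
    """Decide whether `job` can start on a pool with `capacity` GPUs.

    Returns (admitted, victims), where victims are running jobs to
    preempt and re-queue. Only strictly lower-priority jobs are evicted.
    """
    free = capacity - sum(j.gpus for j in running)
    if job.gpus <= free:
        return True, []
    victims = []
    candidates = sorted(
        (j for j in running if j.priority < job.priority),
        key=lambda j: j.priority,  # least important first
    )
    for cand in candidates:
        victims.append(cand)
        free += cand.gpus
        if job.gpus <= free:
            return True, victims
    # Not enough preemptible capacity: the job stays queued, nobody is evicted.
    return False, []


running = [Job("notebook", 1, 0), Job("sweep", 4, 1), Job("prod-train", 3, 5)]
ok, victims = admit(Job("critical", 4, 10), running, capacity=8)
# ok is True; victims are "notebook" and "sweep", "prod-train" keeps running.
```

The greedy loop can evict more than strictly necessary (here the notebook did not have to go). That is exactly the kind of rule worth defining explicitly before production.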
What goes wrong
Underutilized whole GPUs. Pod requests nvidia.com/gpu: 1 but only uses 20%. Common in development workloads. Fix with MIG or time-slicing.
Out-of-memory failures. Time-slicing does not isolate memory: two pods sharing one physical GPU can together exceed its memory, and both can OOM. Use MIG instead of time-slicing when you need memory isolation.
Topology misalignment. Multi-GPU training lands on GPUs that aren’t NVLink-connected, and training can run as much as 5x slower than it should. Fix with topology-aware scheduling.
Driver-version mismatches. Cluster upgrades break CUDA compatibility. Pin driver versions; test before rolling.
Reservation thrashing. Quotas and queues are poorly tuned, so jobs sit in queue while capacity is actually available. Tune against real production load, not synthetic tests.
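The first two failure modes are easy to detect from monitoring data. A minimal sketch, assuming you already export per-pod average GPU utilization (as a 0 to 1 fraction) and peak GPU memory in GiB from your monitoring stack. The function names, the 30% threshold, and the sample numbers are hypothetical.

```python
def underutilized(avg_util_by_pod, threshold=0.3):
    """Names of whole-GPU pods whose average utilization is below threshold.

    These are candidates for MIG or time-slicing.
    """
    return sorted(
        name for name, util in avg_util_by_pod.items() if util < threshold
    )


def memory_oversubscribed(gpu_mem_gib, peak_mem_gib_by_pod):
    """True if pods sharing one time-sliced GPU can jointly exceed its memory."""
    return sum(peak_mem_gib_by_pod.values()) > gpu_mem_gib


# Made-up sample data for illustration.
candidates = underutilized({"nb-1": 0.20, "train-1": 0.90, "nb-2": 0.05})
at_risk = memory_oversubscribed(40, {"nb-1": 30, "nb-2": 15})
```

With these sample inputs, `candidates` holds the two notebook pods and `at_risk` is true, because 30 GiB plus 15 GiB exceeds a 40 GiB device.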
What we ship for ML platform teams
For ML platform engagements via our DevOps automation service:
- GPU node-pool architecture matched to workload mix
- Kueue or Volcano deployment for batch scheduling
- MIG configuration for shared workloads
- Cost attribution per team and per workload type
- Monitoring of GPU utilization with right-sizing recommendations
- Spot integration where applicable
The cost reality
GPU cost dominates ML budgets at scale. Modest scheduling improvements have outsized impact:
- Cutting GPU spend by 30% through better utilization on a $200k/month GPU budget = $60k/month saved
- MIG-based dev environments often reduce GPU need by 50%
- Batch scheduling with preemption can improve overall throughput by 20–40%
The investment in scheduling discipline pays back fast.
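The arithmetic behind the first bullet is worth making explicit, because "30% better utilization" can be read two ways. The sketch below uses hypothetical function names and assumes cost scales linearly with GPU-hours. One reading is a straight 30% cut in spend. The other is a 30% relative rise in utilization for the same work, which needs proportionally fewer GPU-hours and saves somewhat less.

```python
def monthly_savings(budget, spend_reduction):
    """Dollars saved per month for a given fractional cut in GPU spend."""
    return budget * spend_reduction


def spend_reduction_from_utilization(util_before, util_after):
    """Fraction of spend saved when the same work runs at higher utilization.

    Assumes required GPU-hours scale as util_before / util_after.
    """
    return 1 - util_before / util_after


budget = 200_000

# Reading 1: spend itself drops by 30%.
direct = monthly_savings(budget, 0.30)

# Reading 2: utilization rises 30% in relative terms (e.g. 40% to 52%).
relative = monthly_savings(
    budget, spend_reduction_from_utilization(0.40, 0.52)
)
```

Under the first reading the saving is $60k/month. Under the second it is roughly $46k/month, since 1 - 1/1.3 is about 23%. Either way the payback is large relative to the engineering effort, but it pays to know which number you are quoting.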
The Kubernetes-vs-Slurm question
For traditional HPC workloads, Slurm remains the dominant scheduler. For ML workloads in cloud-native environments, Kubernetes has won.
Teams running both:
- Slurm for HPC simulation work
- Kubernetes for ML training and serving
- Sometimes a hybrid, with Kubeflow on Slurm or the reverse
For most ML-only environments, native Kubernetes scheduling is sufficient.
The 2026 maturity
GPU on Kubernetes is past the bleeding-edge phase. The patterns are documented; the tooling is solid; the failure modes are understood.
For teams building ML platforms in 2026, this should be table stakes, not a research project.
GPU scheduling on Kubernetes is production-mature in 2026. The discipline determines whether you capture the cost savings. Our team builds GPU platforms for ML teams. Tell us about the workload.