GPU Scheduling on Kubernetes

GPU scheduling on Kubernetes went from “research-grade” in 2021 to “production standard” by 2025. NVIDIA’s GPU Operator, MIG (Multi-Instance GPU), time-slicing, and a maturing ecosystem of schedulers (Volcano, Kueue, Run:AI, custom solutions) make running GPU workloads on Kubernetes credible at scale.

This post covers the patterns that actually work in production.

The core scheduling primitives

Node selectors + taints. GPU-equipped nodes are tainted; only GPU-requesting pods tolerate the taint. This gives basic isolation by keeping non-GPU pods off GPU nodes.

Resource requests. Pods request nvidia.com/gpu: N. The device plugin handles allocation.
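As a rough sketch of how taints, node selectors, and GPU resource requests combine, the Python below builds a Pod manifest as a plain dict. The taint key `nvidia.com/gpu`, the `nvidia.com/gpu.product` label value, and the image name are assumptions for illustration; check which taints and labels your cluster actually uses.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int, gpu_product: str) -> dict:
    """Build a Pod manifest that requests whole GPUs on a tainted GPU node pool."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumed label: steer the pod to nodes with a specific GPU model.
            "nodeSelector": {"nvidia.com/gpu.product": gpu_product},
            # Assumed taint key: tolerate the taint placed on GPU nodes.
            "tolerations": [
                {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Extended resources such as GPUs are set under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            "restartPolicy": "Never",
        },
    }


if __name__ == "__main__":
    # Hypothetical image and label value, for illustration only.
    manifest = gpu_pod_manifest(
        "train-job", "example.com/train:latest", 2, "NVIDIA-A100-SXM4-40GB"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.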

Time-slicing. Multiple pods share one physical GPU with time-multiplexed access. Useful for low-utilization workloads (notebooks, light inference).

MIG (Multi-Instance GPU). Hardware partitioning on MIG-capable GPUs such as A100 and H100. Each pod gets a slice with dedicated memory and compute. Better isolation than time-slicing.
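To make the slicing concrete, here is a minimal sketch that picks the smallest MIG profile fitting a workload. It assumes an A100 40GB and the device plugin's "mixed" MIG strategy, under which slices are exposed as `nvidia.com/mig-<profile>` resources; the helper name and selection rule are illustrative, not part of any NVIDIA tooling.

```python
# (profile, memory in GB, compute slices out of 7) for an A100 40GB,
# ordered from smallest to largest.
A100_40GB_PROFILES = [
    ("1g.5gb", 5, 1),
    ("2g.10gb", 10, 2),
    ("3g.20gb", 20, 3),
    ("4g.20gb", 20, 4),
    ("7g.40gb", 40, 7),
]


def smallest_mig_resource(mem_gb_needed: float, slices_needed: int = 1) -> str:
    """Return the resource name of the smallest profile that satisfies both needs."""
    for profile, mem_gb, slices in A100_40GB_PROFILES:
        if mem_gb >= mem_gb_needed and slices >= slices_needed:
            return f"nvidia.com/mig-{profile}"
    raise ValueError("no single MIG profile fits; request a whole GPU instead")


if __name__ == "__main__":
    print(smallest_mig_resource(8))      # a notebook needing about 8 GB
    print(smallest_mig_resource(15, 4))  # more compute-hungry inference
```

A pod would then request the returned name under `resources.limits` in place of `nvidia.com/gpu`.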

Topology awareness. Some workloads need GPUs connected by NVLink or sharing a specific PCIe topology, so the scheduler has to know how the GPUs on a node are interconnected.

The schedulers

Default Kubernetes scheduler with NVIDIA device plugin. Fine for simple cases.

Volcano. Designed for batch workloads; fair-share, gang scheduling, preemption. Good for training jobs.
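Gang scheduling is the least self-explanatory of these features: a distributed job starts only if all of its workers can be placed at once, so a half-placed job never holds GPUs while it waits. The toy sketch below illustrates the all-or-nothing rule. It is not Volcano's algorithm, and the node names and greedy placement are assumptions.

```python
from typing import Optional


def gang_admit(
    free_gpus_per_node: dict[str, int], workers: int, gpus_per_worker: int
) -> Optional[dict[str, int]]:
    """Return {node: workers placed} if every worker fits, else None (place nothing)."""
    if workers < 1 or gpus_per_worker < 1:
        raise ValueError("workers and gpus_per_worker must be positive")
    placement: dict[str, int] = {}
    remaining = workers
    # Greedy: fill the nodes with the most free GPUs first.
    for node, free in sorted(free_gpus_per_node.items(), key=lambda kv: -kv[1]):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    return None  # Not all workers fit, so the whole gang stays queued.


if __name__ == "__main__":
    print(gang_admit({"node-a": 4, "node-b": 2}, workers=3, gpus_per_worker=2))
    print(gang_admit({"node-a": 4, "node-b": 1}, workers=3, gpus_per_worker=2))
```

In the second call two of the three workers would fit, but the function still returns `None`; that refusal to start partially is the behavior gang scheduling exists to guarantee.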

Kueue. Kubernetes-native job queueing that works alongside the default scheduler. Becoming the de facto standard for ML training on K8s.

Run:AI (NVIDIA-owned). Commercial; mature features around fractional GPU sharing and team quotas.

Karpenter for node provisioning. Not a scheduler itself, but it pairs well with any of the schedulers above by provisioning the right GPU node type for the workload.

Choose by workload pattern. Notebooks and ad-hoc dev → fractional sharing. Training → batch scheduler.

The patterns that work

Separate node pools by GPU type. A10G, A100, H100, MI300 each have different cost-performance characteristics. Pods request the type they need.

Fractional GPUs for development. A team of data scientists shares physical GPUs via MIG or time-slicing, which cuts cost dramatically.

Whole GPUs for training. Training jobs get exclusive access; no sharing.

Job queues with priority. Critical training jobs preempt low-priority work. Priorities are defined explicitly (for example, with priority classes) rather than left to whoever submitted first.
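The preemption rule can be sketched as a victim-selection function: evict the lowest-priority running jobs until the incoming job fits, and never evict anything of equal or higher priority. This is a simplified illustration with hypothetical job names, not the logic of any particular scheduler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    name: str
    priority: int  # higher number means more important
    gpus: int


def pick_victims(
    running: list[Job], incoming: Job, free_gpus: int
) -> Optional[list[Job]]:
    """Return jobs to preempt so `incoming` fits, [] if it already fits, None if it cannot."""
    needed = incoming.gpus - free_gpus
    if needed <= 0:
        return []
    victims: list[Job] = []
    for job in sorted(running, key=lambda j: j.priority):
        if job.priority >= incoming.priority:
            break  # Never preempt equal or higher priority work.
        victims.append(job)
        needed -= job.gpus
        if needed <= 0:
            return victims
    return None  # Not enough lower-priority capacity; the job waits in the queue.


if __name__ == "__main__":
    running = [Job("notebook", 1, 1), Job("sweep", 2, 2), Job("prod-train", 10, 4)]
    print(pick_victims(running, Job("critical-train", 5, 4), free_gpus=1))
```

Pairing this with the spot-aware pattern means preempted jobs are re-queued rather than dropped.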

Spot-aware scheduling. Schedulers that understand spot/preemptible behavior and re-queue on preemption.

What goes wrong

Underutilized whole GPUs. A pod requests nvidia.com/gpu: 1 but uses only 20% of it. Common in development workloads. Fix with MIG or time-slicing.

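Out-of-memory failures. Time-slicing shares GPU memory without isolating it, so when two pods land on the same time-sliced GPU and their combined footprint exceeds device memory, either can OOM. Use MIG instead of time-slicing when you need memory isolation.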
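The arithmetic behind this failure is simple enough to check before co-locating pods. The sketch below assumes you already know each pod's peak GPU memory (for example, from monitoring); the function name is illustrative.

```python
def timeslice_headroom_gb(gpu_mem_gb: float, pod_peak_mem_gb: list[float]) -> float:
    """Memory left if every pod peaks at once; a negative value means OOM risk."""
    return gpu_mem_gb - sum(pod_peak_mem_gb)


if __name__ == "__main__":
    # Two notebooks on a 24 GB GPU shared by time-slicing.
    print(timeslice_headroom_gb(24, [10, 10]))  # fits with room to spare
    print(timeslice_headroom_gb(24, [14, 14]))  # oversubscribed
```

With MIG the check becomes per-slice instead: each pod only has to fit within its own partition's memory.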

Topology misalignment. Multi-GPU training lands on GPUs that aren’t NVLink-connected, and training can be 5x slower than it should be. Fix with topology-aware scheduling.
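One way to picture the check a topology-aware scheduler performs: given which GPU pairs on a node share an NVLink connection, verify that every pair in a candidate allocation is linked. The link map below is hypothetical; on a real node the interconnect matrix can be inspected with `nvidia-smi topo -m`.

```python
from itertools import combinations


def fully_nvlinked(gpus: list[int], nvlink_pairs: list[tuple[int, int]]) -> bool:
    """True if every pair of GPUs in the allocation has a direct NVLink connection."""
    links = {frozenset(pair) for pair in nvlink_pairs}
    return all(frozenset(pair) in links for pair in combinations(gpus, 2))


if __name__ == "__main__":
    # Hypothetical 4-GPU node where GPUs 0-1 and 2-3 are linked in pairs.
    pairs = [(0, 1), (2, 3)]
    print(fully_nvlinked([0, 1], pairs))  # good allocation
    print(fully_nvlinked([1, 2], pairs))  # traffic falls back to PCIe
```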

Driver-version mismatches. Cluster upgrades can change the NVIDIA driver version and break CUDA compatibility for existing images. Pin driver versions; test before rolling out.

Reservation thrashing. Poorly tuned quotas and queues leave jobs sitting in the queue while capacity is actually available. Tune them against real production load, not synthetic tests.

What we ship for ML platform teams

For ML platform engagements via our DevOps automation service:

GPU node-pool architecture matched to workload mix
Kueue or Volcano deployment for batch scheduling
MIG configuration for shared workloads
Cost attribution per team and per workload type
Monitoring of GPU utilization with right-sizing recommendations
Spot integration where applicable

The cost reality

GPU cost dominates ML budgets at scale. Modest scheduling improvements have outsized impact:

A utilization improvement that cuts GPU spend by 30% on a $200k/month GPU budget = $60k/month saved
MIG-based dev environments often reduce GPU need by 50%
Batch scheduling with preemption improves overall throughput 20–40%

The investment in scheduling discipline pays back fast.
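The savings arithmetic is worth spelling out, because "30% better utilization" and "30% less spend" are not the same thing. The sketch below assumes the workload stays fixed and that spend scales with the GPU-hours provisioned; the utilization figures are made-up examples.

```python
def monthly_savings(budget: float, spend_reduction: float) -> float:
    """Dollars saved per month for a fractional reduction in GPU spend."""
    return budget * spend_reduction


def spend_reduction_from_utilization(old_util: float, new_util: float) -> float:
    """Fraction of spend saved when the same work runs at higher utilization.

    The same useful GPU-hours at utilization u need 1/u provisioned hours,
    so spend scales by old_util / new_util.
    """
    if not 0 < old_util <= new_util <= 1:
        raise ValueError("expected 0 < old_util <= new_util <= 1")
    return 1 - old_util / new_util


if __name__ == "__main__":
    # Cutting spend by 30% on a $200k/month budget.
    print(round(monthly_savings(200_000, 0.30)))
    # Raising utilization from 40% to 52% (a 30% relative gain) cuts spend by less.
    reduction = spend_reduction_from_utilization(0.40, 0.52)
    print(round(monthly_savings(200_000, reduction)))
```

Under these assumptions, a 30% relative utilization gain saves roughly 23% of spend; a full 30% spend cut needs utilization to rise from 40% to about 57%.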

The Kubernetes-vs-Slurm question

For traditional HPC workloads, Slurm remains the dominant scheduler. For ML workloads in cloud-native environments, Kubernetes has won.

Teams running both:

Slurm for HPC simulation work
Kubernetes for ML training and serving
Sometimes a hybrid: Kubeflow-on-Slurm, or the reverse

For most ML-only environments, native Kubernetes scheduling is sufficient.

The 2026 maturity

GPU on Kubernetes is past the bleeding-edge phase. The patterns are documented; the tooling is solid; the failure modes are understood.

For teams building ML platforms in 2026, this should be table stakes, not a research project.

GPU scheduling on Kubernetes is production-mature in 2026. The discipline determines whether you capture the cost savings. Our team builds GPU platforms for ML teams. Tell us about the workload.
