GPU Scheduling on Kubernetes

GPU scheduling on Kubernetes went from research project to production standard. Here are the patterns that work: node selectors, time-slicing, MIG, and schedulers.

GPU scheduling on Kubernetes is one of those areas where the tooling went from “research-grade” in 2021 to “production standard” by 2025. NVIDIA’s GPU Operator, MIG (Multi-Instance GPU), time-slicing, and a maturing ecosystem of schedulers (Volcano, Kueue, Run:AI, custom solutions) make this credible at scale.

These are the patterns that actually work in production.

The core scheduling primitives

Node selectors + taints. GPU-equipped nodes are tainted so that only pods with a matching toleration can land on them, and node selectors steer GPU pods onto those nodes. This gives basic isolation.

Resource requests. Pods request nvidia.com/gpu: N under resource limits. The device plugin advertises the GPUs on each node and handles allocation of whole devices to containers.
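As a rough sketch of how these two primitives combine, the snippet below builds a pod manifest as a plain Python dict and prints it as JSON, which kubectl accepts as well as YAML. The taint key, the gpu-pool node label, the pod name, and the image are assumptions for illustration; substitute whatever your cluster actually uses.

```python
import json


def gpu_pod(name: str, image: str, gpus: int) -> dict:
    """Build a Pod manifest that targets tainted GPU nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Assumed node label; steers the pod toward the GPU pool.
            "nodeSelector": {"gpu-pool": "a100"},
            # Assumed taint on GPU nodes: nvidia.com/gpu with effect NoSchedule.
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": "main",
                "image": image,
                # Extended resources such as GPUs are set under limits.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


manifest = gpu_pod("train-demo", "example.com/trainer:latest", 2)
print(json.dumps(manifest, indent=2))
```

The toleration only permits the pod onto tainted nodes; it is the node selector that actually sends it there, which is why the two are used together.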

Time-slicing. Multiple pods share one physical GPU with time-multiplexed access, but without memory isolation between them. Useful for low-utilization workloads (notebooks, light inference).
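One way to picture time-slicing: the device plugin simply advertises more schedulable GPUs than physically exist. The sketch below builds a sharing config as a Python dict and prints it as JSON. The field layout follows the time-slicing config format of the NVIDIA device plugin as we understand it, so check it against the plugin version you deploy; the replica count of 4 and the 8-GPU node are assumptions.

```python
import json


def time_slicing_config(replicas: int) -> dict:
    """Config asking the device plugin to advertise each GPU `replicas` times."""
    if replicas < 2:
        # Our own sanity check: fewer than 2 replicas would share nothing.
        raise ValueError("replicas must be at least 2 for sharing to happen")
    return {
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [
                    {"name": "nvidia.com/gpu", "replicas": replicas},
                ]
            }
        },
    }


def advertised_gpus(physical_gpus: int, replicas: int) -> int:
    """Number of nvidia.com/gpu resources the node would report."""
    return physical_gpus * replicas


config = time_slicing_config(4)
print(json.dumps(config, indent=2))
print(advertised_gpus(8, 4))  # an 8-GPU node reports 32 schedulable GPUs
```

Note that each of those 32 "GPUs" still sees the full device memory, which is the root of the memory problems time-slicing can cause.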

MIG (Multi-Instance GPU). Hardware partitioning on GPUs that support it, such as A100 and H100. Each pod gets a slice with dedicated memory and compute. Better isolation than time-slicing.
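To make the slicing concrete, here is a minimal sketch that picks the smallest MIG slice able to hold a workload's memory footprint. It assumes an A100 40GB and the device plugin's "mixed" MIG strategy, where each profile shows up as its own resource name; the profile table lists nominal memory sizes, and the helper name is ours.

```python
# (profile, nominal memory in GB) for an A100 40GB, smallest first.
A100_40GB_PROFILES = [
    ("1g.5gb", 5),
    ("2g.10gb", 10),
    ("3g.20gb", 20),
    ("7g.40gb", 40),
]


def pick_mig_resource(needed_gb: float) -> str:
    """Return the resource name of the smallest slice that fits `needed_gb`."""
    for profile, mem_gb in A100_40GB_PROFILES:
        if needed_gb <= mem_gb:
            return f"nvidia.com/mig-{profile}"
    raise ValueError(f"{needed_gb} GB does not fit on a single A100 40GB")


print(pick_mig_resource(4))    # nvidia.com/mig-1g.5gb
print(pick_mig_resource(12))   # nvidia.com/mig-3g.20gb
```

A pod would then request that resource name under its limits instead of nvidia.com/gpu. Usable memory in a slice is somewhat below the nominal figure, so leave headroom when sizing.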

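Topology awareness. Some workloads need GPUs connected by NVLink or sharing a specific PCIe topology. The scheduler has to know how GPUs are interconnected, or it may place a job on GPUs that communicate over a slow path.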

The schedulers

Default Kubernetes scheduler with NVIDIA device plugin. Fine for simple cases.

Volcano. Designed for batch workloads; fair-share, gang scheduling, preemption. Good for training jobs.
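Gang scheduling is the feature that matters most for distributed training: either every worker of a job gets placed, or none does, so a half-started job never sits on GPUs waiting for peers. The toy function below illustrates that all-or-nothing rule. It is not Volcano's algorithm, just a sketch of the idea, and the node names and GPU counts are made up.

```python
from typing import Dict, Optional


def gang_place(free_gpus: Dict[str, int], workers: int,
               gpus_per_worker: int) -> Optional[Dict[str, int]]:
    """Return {node: workers placed} if ALL workers fit, otherwise None."""
    if workers <= 0 or gpus_per_worker <= 0:
        raise ValueError("workers and gpus_per_worker must be positive")
    placement: Dict[str, int] = {}
    remaining = workers
    # Try nodes with the most free GPUs first.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        fit = min(free // gpus_per_worker, remaining)
        if fit > 0:
            placement[node] = fit
            remaining -= fit
        if remaining == 0:
            return placement
    # Not every worker fits: place nothing rather than a partial gang.
    return None


free = {"node-a": 4, "node-b": 2, "node-c": 1}
print(gang_place(free, workers=3, gpus_per_worker=2))  # {'node-a': 2, 'node-b': 1}
print(gang_place(free, workers=4, gpus_per_worker=2))  # None
```

In the second call three of the four workers would fit, but the gang rule rejects the whole job so those GPUs stay available for work that can actually run.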

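Kueue. Kubernetes-native job queueing that works alongside the default scheduler: it decides when a job is admitted against quota, and the scheduler then places the pods. Becoming the de-facto standard for ML training on K8s.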
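For a feel of what submitting to Kueue looks like, the sketch below builds a standard batch/v1 Job that names a queue through the kueue.x-k8s.io/queue-name label and starts suspended, so it only runs once admitted. The queue name team-a-queue, the job name, and the image are hypothetical, and the queue itself (plus its quota) must already exist in the cluster.

```python
import json


def kueue_job(name: str, queue: str, image: str, gpus: int,
              parallelism: int = 1) -> dict:
    """Build a Job manifest that is submitted to a Kueue queue."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": name,
            # Label that tells Kueue which queue manages this Job.
            "labels": {"kueue.x-k8s.io/queue-name": queue},
        },
        "spec": {
            # Created suspended; unsuspended when quota is available.
            "suspend": True,
            "parallelism": parallelism,
            "completions": parallelism,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        "resources": {
                            "requests": {"nvidia.com/gpu": gpus},
                            "limits": {"nvidia.com/gpu": gpus},
                        },
                    }],
                }
            },
        },
    }


job = kueue_job("finetune-demo", "team-a-queue", "example.com/trainer:latest", gpus=1)
print(json.dumps(job, indent=2))
```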

Run:AI (NVIDIA-owned). Commercial; mature features around fractional GPU sharing and team quotas.

Karpenter for node provisioning. Not a scheduler itself, but it pairs well with any GPU scheduler by provisioning the right GPU node type for the workload.

Choose by workload pattern. Notebooks and ad-hoc dev → fractional sharing. Training → batch scheduler.

The patterns that work

Separate node pools by GPU type. A10G, A100, H100, MI300 each have different cost-performance characteristics. Pods select the type they need, typically with a node selector or node affinity on a GPU-type label.

Fractional GPUs for development. A team of data scientists shares physical GPUs via MIG or time-slicing. Costs drop dramatically because idle capacity is pooled instead of reserved per person.

Whole GPUs for training. Training jobs get exclusive access; no sharing.

Job queues with priority. Critical training jobs preempt low-priority work. Priorities are defined explicitly, not left to submission order.
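The preemption rule can be sketched in a few lines. The function below picks which running jobs to evict so a higher-priority job fits, taking the lowest-priority victims first. It is a deliberately simple greedy illustration, not how any particular scheduler chooses victims; the job names, priorities, and GPU counts are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunningJob:
    name: str
    priority: int
    gpus: int


def pick_victims(running: List[RunningJob], free_gpus: int,
                 needed_gpus: int, new_priority: int) -> Optional[List[str]]:
    """Names of jobs to preempt so the new job fits, or None if impossible."""
    shortfall = needed_gpus - free_gpus
    if shortfall <= 0:
        return []  # fits without preempting anything
    victims: List[str] = []
    # Only strictly lower-priority jobs are eligible, lowest first.
    eligible = sorted((j for j in running if j.priority < new_priority),
                      key=lambda j: j.priority)
    for job in eligible:
        victims.append(job.name)
        shortfall -= job.gpus
        if shortfall <= 0:
            return victims
    return None  # even evicting every eligible job is not enough


running = [
    RunningJob("notebook", priority=10, gpus=1),
    RunningJob("sweep", priority=50, gpus=4),
    RunningJob("prod-train", priority=100, gpus=8),
]
print(pick_victims(running, free_gpus=1, needed_gpus=4, new_priority=80))
# ['notebook', 'sweep']
print(pick_victims(running, free_gpus=0, needed_gpus=16, new_priority=80))
# None
```

The first result shows the cost of the greedy choice: evicting "sweep" alone would have been enough. Real schedulers weigh that trade-off, which is one reason to define priorities and preemption policy explicitly.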

Spot-aware scheduling. Schedulers that understand spot/preemptible behavior and re-queue on preemption.

What goes wrong

Underutilized whole GPUs. A pod requests nvidia.com/gpu: 1 but uses only 20% of it. Common in development workloads. Fix with MIG or time-slicing.
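A right-sizing check for this case can be very small. The sketch below flags a workload as a sharing candidate when its average utilization is low and its peak memory would fit in half a GPU. The 30% threshold, the half-GPU memory rule, and the 40 GB default are assumptions to tune, and the utilization samples would come from whatever GPU monitoring you already run.

```python
from statistics import mean
from typing import Sequence


def right_size(util_samples: Sequence[float], peak_mem_gb: float,
               gpu_mem_gb: float = 40.0, util_threshold: float = 30.0) -> str:
    """Return 'share' or 'whole-gpu' for a workload currently on a whole GPU."""
    if mean(util_samples) >= util_threshold:
        return "whole-gpu"   # busy enough to justify exclusive access
    if peak_mem_gb <= gpu_mem_gb / 2:
        return "share"       # idle most of the time and fits in a slice
    return "whole-gpu"       # idle, but too memory-hungry to share safely


print(right_size([15, 25, 20], peak_mem_gb=6.0))    # share
print(right_size([85, 90], peak_mem_gb=30.0))       # whole-gpu
print(right_size([10, 20], peak_mem_gb=35.0))       # whole-gpu
```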

Out-of-memory failures. Pod requests 1 GPU; another pod time-slices the same GPU; their combined memory use exceeds what the GPU has, and both OOM. Time-slicing does not partition memory, so use MIG instead where memory isolation matters.
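The failure is plain arithmetic: pods on a time-sliced GPU draw from one shared memory pool. A minimal check, assuming you know each pod's peak memory footprint (the pod names and the 24 GB card below are made up):

```python
from typing import Dict


def oversubscribed(pod_peak_mem_gb: Dict[str, float], gpu_mem_gb: float) -> bool:
    """True if pods sharing one time-sliced GPU can exceed its memory."""
    return sum(pod_peak_mem_gb.values()) > gpu_mem_gb


pods = {"notebook-a": 14.0, "notebook-b": 12.0}
print(oversubscribed(pods, gpu_mem_gb=24.0))  # True: 26 GB of demand on 24 GB
print(oversubscribed({"notebook-a": 14.0}, gpu_mem_gb=24.0))  # False
```

Each pod fits on its own, which is why the problem only appears once the second one is scheduled next to it.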

Topology misalignment. Multi-GPU training where the GPUs aren’t NVLink-connected can run as much as 5x slower than it should. The fix is topology-aware scheduling.
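What "topology-aware" means in practice can be shown with a toy allocator: given the free GPU indices on a node and the pairs that share an NVLink connection, return a group in which every pair is linked. The link map below is invented; on a real node you would read it from something like the matrix printed by nvidia-smi topo -m. Requiring every pair to be directly linked is a simplification.

```python
from itertools import combinations
from typing import Iterable, Optional, Set, Tuple


def nvlink_group(free: Set[int], links: Iterable[Tuple[int, int]],
                 k: int) -> Optional[Tuple[int, ...]]:
    """First group of k free GPUs that are pairwise NVLink-connected, or None."""
    linked = {frozenset(pair) for pair in links}
    for group in combinations(sorted(free), k):
        if all(frozenset(pair) in linked for pair in combinations(group, 2)):
            return group
    return None


links = [(0, 1), (2, 3)]              # assumed: GPUs 0-1 and 2-3 are linked
print(nvlink_group({1, 2, 3}, links, k=2))   # (2, 3)
print(nvlink_group({0, 2}, links, k=2))      # None
```

In the second call two GPUs are free, so a topology-blind allocator would happily hand them out; the job would run, just slowly.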

Driver-version mismatches. Cluster upgrades can break CUDA compatibility. Pin driver versions; test before rolling out.

Reservation thrashing. Quotas and queues are poorly tuned, so jobs sit in the queue while capacity is actually available. Tune them against queue wait times and utilization observed in production.

What we ship for ML platform teams

For ML platform engagements via our DevOps automation service:

  • GPU node-pool architecture matched to workload mix
  • Kueue or Volcano deployment for batch scheduling
  • MIG configuration for shared workloads
  • Cost attribution per team and per workload type
  • Monitoring of GPU utilization with right-sizing recommendations
  • Spot integration where applicable

The cost reality

GPU cost dominates ML budgets at scale. Modest scheduling improvements have outsized impact:

  • A 30% cut in GPU spend from better utilization on a $200k/month GPU budget = $60k/month saved
  • MIG-based dev environments often reduce GPU need by 50%
  • Batch scheduling with preemption improves overall throughput 20–40%
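How a utilization gain turns into dollars depends on what is being measured. The sketch below uses only the $200k/month and 30% figures from the first bullet and contrasts two readings: needing 30% less GPU capacity for the same work, and getting 30% more work out of each GPU-hour. The function names are illustrative.

```python
def savings_from_capacity_cut(budget: float, cut: float) -> float:
    """Same work done on `cut` (fraction) fewer GPU-hours."""
    return budget * cut


def savings_from_throughput_gain(budget: float, gain: float) -> float:
    """Each GPU-hour does (1 + gain) times the work, so fewer hours are needed."""
    return budget * (1 - 1 / (1 + gain))


budget = 200_000  # dollars per month
print(round(savings_from_capacity_cut(budget, 0.30)))     # 60000
print(round(savings_from_throughput_gain(budget, 0.30)))  # 46154
```

Either way the savings are large relative to the effort, but it is worth agreeing on which definition a utilization target refers to before reporting against it.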

The investment in scheduling discipline pays back fast.

The Kubernetes-vs-Slurm question

For traditional HPC workloads, Slurm remains the dominant scheduler. For ML workloads in cloud-native environments, Kubernetes has won.

Teams running both:

  • Slurm for HPC simulation work
  • Kubernetes for ML training and serving
  • Sometimes a hybrid of the two, such as Kubeflow-on-Slurm or vice versa

For most ML-only environments, native Kubernetes scheduling is sufficient.

The 2026 maturity

GPU on Kubernetes is past the bleeding-edge phase. The patterns are documented; the tooling is solid; the failure modes are understood.

For teams building ML platforms in 2026, this should be table stakes, not a research project.


GPU scheduling on Kubernetes is production-mature in 2026. The discipline determines whether you capture the cost savings. Our team builds GPU platforms for ML teams. Tell us about the workload.