PyTorch in Production

PyTorch dominates the research and training side of deep learning. The production side is a different problem with a different set of patterns. The model that converges beautifully in a Jupyter notebook is the easy part. The model that serves 50 requests/second with consistent p99 latency and doesn’t leak memory after 12 hours is the hard part.

We’ve shipped PyTorch in production for hospital imaging triage, banking NLP for compliance, and supply chain forecasting. Here are the patterns we apply on every PyTorch model that has to survive real traffic.

Pattern 1: TorchScript or ONNX for inference, not raw PyTorch

The PyTorch you train with isn’t the PyTorch you should serve with. Python’s GIL, the autograd overhead, and the dynamic graph nature of PyTorch are great for research and terrible for inference latency.

Two paths:

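TorchScript (torch.jit.script or torch.jit.trace) converts your model to a graph representation that runs in the C++ runtime. torch.jit.trace records a single execution path, so models with data-dependent control flow need torch.jit.script. Serializes to a .pt file. Faster inference, no Python dependency for serving.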

ONNX exports to the cross-framework standard. Run inference via ONNX Runtime, which is often faster than TorchScript and works in more deployment environments (mobile, edge, non-Python serving).

We default to ONNX for new production deployments. Better runtime perf, broader tooling ecosystem, no PyTorch-version lock between training and serving.

The catch: not all PyTorch ops convert cleanly. Custom layers, dynamic control flow, and certain operations need refactoring. Test the export early — don’t discover this two weeks before launch.
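
One way to test the export early is a parity check in CI: export the model, reload the artifact, and compare its output with the eager model on a fixed input. The sketch below uses TorchScript tracing because it needs nothing beyond PyTorch itself; the same comparison applies to an ONNX export run through ONNX Runtime. TinyClassifier and export_and_verify are hypothetical names for illustration, not part of any library.

```python
import io

import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for a real production model."""

    def __init__(self) -> None:
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(8, 16),
            torch.nn.ReLU(),
            torch.nn.Linear(16, 3),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def export_and_verify(model: torch.nn.Module, example: torch.Tensor, atol: float = 1e-5) -> bool:
    """Trace the model, round-trip it through serialization, and compare outputs."""
    model.eval()
    with torch.no_grad():
        expected = model(example)
        traced = torch.jit.trace(model, example)

    buffer = io.BytesIO()  # in-memory here; a real deployment would write a .pt file
    torch.jit.save(traced, buffer)
    buffer.seek(0)
    reloaded = torch.jit.load(buffer)

    with torch.no_grad():
        actual = reloaded(example)
    return torch.allclose(expected, actual, atol=atol)


torch.manual_seed(0)
parity_ok = export_and_verify(TinyClassifier(), torch.randn(4, 8))
```

A failing parity check, or an exception during tracing, is the early warning that a layer needs refactoring before the export path can be trusted.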

Pattern 2: Half-precision and mixed precision

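For most GPU inference workloads, FP16 (half precision) is often close to 2x faster than FP32 with negligible accuracy loss. INT8 quantization can add a further 2-4x, usually with a small (often under 1%) accuracy hit. Both numbers depend on the hardware and the model, so treat them as starting estimates.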

Three options:

torch.autocast (torch.cuda.amp.autocast in older releases) for automatic mixed precision during inference (and training).
torch.quantization (now under torch.ao.quantization) for INT8 post-training quantization.
TensorRT for NVIDIA GPUs — auto-optimizes including precision selection. Often the fastest path for production inference on NVIDIA.

We always benchmark FP16 vs FP32 on the actual production model + workload before shipping. The 2x speedup is real money in serving costs.
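
A minimal latency harness for that comparison might look like the sketch below. It times any zero-argument callable and reports percentiles, so the same function can wrap an FP32 and an FP16 forward pass. The function names and the stand-in workloads are assumptions for illustration; on a GPU the callable would also need to call torch.cuda.synchronize() before returning, otherwise only the kernel launch is timed.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], warmup: int = 10, iters: int = 100) -> Dict[str, float]:
    """Time repeated calls to fn and return latency statistics in milliseconds."""
    for _ in range(warmup):  # warm caches and allocators before measuring
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p99_index = min(len(samples) - 1, int(0.99 * len(samples)))
    return {
        "p50_ms": statistics.median(samples),
        "p99_ms": samples[p99_index],
        "mean_ms": statistics.fmean(samples),
    }


def compare(baseline: Callable[[], object], candidate: Callable[[], object]) -> float:
    """Return the speedup of candidate over baseline, based on median latency."""
    base = benchmark(baseline)
    cand = benchmark(candidate)
    return base["p50_ms"] / cand["p50_ms"]


# Stand-in workloads; in practice these would be the FP32 and FP16 forward passes
# of the real model on production-shaped inputs.
def baseline_workload() -> int:
    return sum(range(50_000))


def candidate_workload() -> int:
    return sum(range(5_000))


speedup = compare(baseline_workload, candidate_workload)
```

Reporting p99 alongside the median matters here: a precision change that improves the median but widens the tail may not be a win for a latency-bound service.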

Pattern 3: Batch inference even in “real-time”

A request-per-inference pattern wastes GPU. Even “real-time” workloads benefit from dynamic batching.

Pattern: a request comes in, gets enqueued. A batcher waits up to N ms (configurable: 5-50ms depending on latency budget) to collect more requests, then runs them as a single batch through the model.

This trades ~10ms latency for ~5-10x throughput. For most workloads, the throughput gain matters more than the latency cost.

Tools: NVIDIA Triton Inference Server has dynamic batching built in. BentoML’s Runners do similar. If you’re rolling your own, the asyncio.Queue + timed batcher pattern is ~50 lines.
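
For the roll-your-own route, a minimal sketch of the asyncio.Queue + timed batcher pattern follows. DynamicBatcher, its parameters, and fake_model are hypothetical names. The sketch assumes run_batch is a synchronous function that maps a list of inputs to a list of outputs in the same order; a real model call would normally be pushed to an executor so it does not block the event loop.

```python
import asyncio
from typing import Any, Callable, List


class DynamicBatcher:
    """Collects requests for up to max_wait_ms, then runs them as one batch."""

    def __init__(
        self,
        run_batch: Callable[[List[Any]], List[Any]],
        max_batch_size: int = 8,
        max_wait_ms: float = 10.0,
    ) -> None:
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait_ms / 1000.0
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self) -> None:
        """Worker loop: wait for a first request, then fill the batch until the deadline."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                # Synchronous call for brevity; this blocks the event loop.
                outputs = self._run_batch([item for item, _ in batch])
                for (_, future), output in zip(batch, outputs):
                    if not future.done():
                        future.set_result(output)
            except Exception as exc:  # fail every request in the batch
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)


async def demo():
    batch_sizes = []

    def fake_model(inputs):  # stand-in for a batched forward pass
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = DynamicBatcher(fake_model, max_batch_size=8, max_wait_ms=20)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(5)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
```

Each caller awaits only its own future, so request handlers stay unaware of batching. The two knobs, max_batch_size and max_wait_ms, are where the latency-for-throughput trade is tuned.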

Pattern 4: Memory management is not automatic

Long-running PyTorch services have a tendency to grow memory. Causes:

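Tensors held by references that should have been released. Autograd state kept alive during inference is the classic case; wrap inference in a with torch.no_grad(): block.
The CUDA caching allocator holding on to freed memory instead of returning it to the driver. torch.cuda.empty_cache() releases only cached blocks that are completely unused, so reported usage can stay high even after calling it.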
Variable-shape inputs causing the CUDA allocator to fragment

Defensive patterns:

torch.no_grad() around every inference call. Non-negotiable.
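torch.inference_mode() (PyTorch 1.9+) is even better: it goes further than no_grad() by also skipping view tracking and version counter updates. The trade-off is that tensors created inside it cannot be used in autograd later, which is fine for serving. Use it.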
Static input shapes where possible. Padding to fixed sizes is often worth the wasted compute.
torch.cuda.empty_cache() on a schedule (every N requests) if you see memory creep.
Process recycling: serve N requests per worker, then restart. Crude but effective.

For long-running services that absolutely cannot tolerate restarts, you’ll spend real time on memory profiling. PyTorch Profiler + nvidia-smi are your tools.
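
Two of the defensive patterns above, inference mode and static input shapes, can be combined in a few lines. This is a sketch under assumptions: MAX_LEN, pad_to_fixed, and the Linear layer standing in for a real model are all hypothetical, and zero-padding is only appropriate when the model treats padded positions as neutral.

```python
import torch
import torch.nn.functional as F

MAX_LEN = 16  # hypothetical fixed input width chosen for this service


def pad_to_fixed(x: torch.Tensor, max_len: int = MAX_LEN) -> torch.Tensor:
    """Truncate or right-pad the last dimension with zeros so it is always max_len."""
    x = x[..., :max_len]
    return F.pad(x, (0, max_len - x.shape[-1]))


# Stand-in for a real model that expects inputs of width MAX_LEN.
model = torch.nn.Linear(MAX_LEN, 2).eval()


@torch.inference_mode()
def predict(x: torch.Tensor) -> torch.Tensor:
    """Run a forward pass with autograd fully disabled and a fixed input shape."""
    return model(pad_to_fixed(x))


short_out = predict(torch.randn(3, 10))  # padded up to MAX_LEN
long_out = predict(torch.randn(3, 40))   # truncated down to MAX_LEN
```

Because every batch reaches the model with the same trailing shape, the allocator sees a small, repeating set of allocation sizes instead of a new size per request.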

Pattern 5: Reproducible model artifacts

Every production model deployment needs:

The model weights (state_dict or scripted .pt)
The model architecture code at the version that produced those weights
The preprocessing code (tokenizer, image transforms, etc.)
The metadata (training data version, hyperparameters, eval scores)

The serialization that works:

Save with torch.save(model.state_dict(), 'model.pt') plus a separate file for architecture.
Or save the scripted model (torch.jit.save(scripted_model, 'model.pt')) — includes architecture + weights together.
For HuggingFace models, model.save_pretrained() and tokenizer.save_pretrained() to a directory; load with from_pretrained().

Track which model version is serving which traffic. MLflow / Weights & Biases / SageMaker Model Registry all do this. See our MLflow vs W&B piece for the tooling choice.
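
If no registry is in place yet, a manifest file next to the weights covers the basics. The sketch below hashes the weights file and stores the hash with the metadata, so a deploy can verify it is serving the artifact it thinks it is. The manifest layout, field names, and placeholder values are assumptions for illustration, not a standard format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large weight files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(weights_path: Path, metadata: dict) -> Path:
    """Write <weights file>.manifest.json containing the weights hash plus metadata."""
    manifest = {
        "weights_file": weights_path.name,
        "weights_sha256": sha256_of(weights_path),
        **metadata,
    }
    manifest_path = weights_path.parent / (weights_path.name + ".manifest.json")
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest_path


def verify_manifest(manifest_path: Path) -> bool:
    """Return True if the weights file on disk still matches the recorded hash."""
    manifest = json.loads(manifest_path.read_text())
    weights_path = manifest_path.parent / manifest["weights_file"]
    return sha256_of(weights_path) == manifest["weights_sha256"]


with tempfile.TemporaryDirectory() as tmp:
    weights = Path(tmp) / "model.pt"
    weights.write_bytes(b"placeholder bytes standing in for real weights")
    manifest_file = write_manifest(
        weights,
        {
            # Hypothetical metadata values; fill these from the training run.
            "code_version": "<git commit of the architecture code>",
            "training_data_version": "<dataset version>",
            "eval_scores": {"accuracy": None},
        },
    )
    intact = verify_manifest(manifest_file)
    weights.write_bytes(b"corrupted")
    corruption_detected = not verify_manifest(manifest_file)
```

Running verify_manifest at service startup is a cheap guard against a truncated or mismatched weights file.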

Pattern 6: Health checks that actually test inference

A health endpoint that returns “OK” because the HTTP server is alive tells you nothing about whether the model works. We always include:

Liveness: server process is responsive (lightweight HTTP check).
Readiness: model loaded, GPU memory available, one canary inference succeeds (heavier check, runs less often).
Periodic synthetic inference: every 60s, run a fixed test input through the model, alarm if output drifts from expected or latency exceeds threshold.

This catches: CUDA errors, model file corruption after deploy, OOM conditions, model output drift from a model swap.
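
The canary and synthetic checks share one core routine: run a fixed input, compare against a stored expected output, and time it. A minimal sketch follows, assuming predict takes and returns plain lists of floats; run_canary, CanaryResult, and the thresholds are hypothetical names and values.

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CanaryResult:
    healthy: bool
    reason: str
    latency_ms: float


def run_canary(
    predict: Callable[[Sequence[float]], Sequence[float]],
    canary_input: Sequence[float],
    expected: Sequence[float],
    abs_tol: float = 1e-3,
    max_latency_ms: float = 200.0,
) -> CanaryResult:
    """Run one fixed inference and check for errors, output drift, and latency."""
    start = time.perf_counter()
    try:
        output = predict(canary_input)
    except Exception as exc:  # CUDA errors, OOM, corrupted weights, and so on
        latency_ms = (time.perf_counter() - start) * 1000.0
        return CanaryResult(False, f"inference raised {type(exc).__name__}", latency_ms)
    latency_ms = (time.perf_counter() - start) * 1000.0

    drifted = len(output) != len(expected) or any(
        not math.isclose(o, e, abs_tol=abs_tol) for o, e in zip(output, expected)
    )
    if drifted:
        return CanaryResult(False, "output drifted from expected", latency_ms)
    if latency_ms > max_latency_ms:
        return CanaryResult(False, "latency above threshold", latency_ms)
    return CanaryResult(True, "ok", latency_ms)


# Stand-in models to show the three outcomes.
def good_model(x):
    return [0.1, 0.9]


def swapped_model(x):
    return [0.6, 0.4]


def broken_model(x):
    raise RuntimeError("simulated device failure")


expected_output = [0.1, 0.9]
ok = run_canary(good_model, [1.0, 2.0], expected_output)
drift = run_canary(swapped_model, [1.0, 2.0], expected_output)
crash = run_canary(broken_model, [1.0, 2.0], expected_output)
```

The readiness probe can call this once at startup, and the periodic synthetic check can call it on a timer and alarm on any unhealthy result.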

Pattern 7: Observability per-prediction

Every production inference should log:

Input shape + content hash (don’t log full input for privacy)
Output (full response, including confidence/scores)
Latency, broken down: queue wait, batching wait, model forward pass, post-processing
Model version
Request context (user, feature, tenant)

This is the same observability discipline as LLM observability applied to non-LLM models. You can’t debug “the model got it wrong” without the per-prediction trace.
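
As a sketch of what one such log line could contain, the record below mirrors the list above: shape and hash instead of raw input, the full output, a latency breakdown, the model version, and request context. The field names and example values are assumptions; adapt them to whatever log store is in use.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PredictionRecord:
    model_version: str
    input_shape: Tuple[int, ...]
    input_sha256: str             # hash only; the raw input is never logged
    output: List[float]
    latency_ms: Dict[str, float]  # queue wait, batching wait, forward, post-processing
    context: Dict[str, str]
    timestamp: float = field(default_factory=time.time)


def build_log_line(
    raw_input: bytes,
    input_shape: Tuple[int, ...],
    output: List[float],
    latency_ms: Dict[str, float],
    model_version: str,
    context: Dict[str, str],
) -> str:
    """Serialize one prediction as a single JSON line."""
    record = PredictionRecord(
        model_version=model_version,
        input_shape=input_shape,
        input_sha256=hashlib.sha256(raw_input).hexdigest(),
        output=output,
        latency_ms=latency_ms,
        context=context,
    )
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example values.
log_line = build_log_line(
    raw_input=b"sensitive image bytes",
    input_shape=(1, 3, 224, 224),
    output=[0.08, 0.92],
    latency_ms={"queue": 1.2, "batching": 7.5, "forward": 4.1, "postprocess": 0.3},
    model_version="example-model-v1",
    context={"tenant": "example-tenant", "feature": "triage"},
)
```

The content hash still lets you find repeated or known-bad inputs across requests without storing the input itself.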

What we deploy by default

For a new PyTorch production deployment:

Training: PyTorch on whatever’s available (cloud GPU, on-prem cluster). Experiment tracking via MLflow.
Export: ONNX with FP16 conversion. TorchScript as a fallback if ONNX conversion has issues.
Serving: NVIDIA Triton Inference Server for high-throughput GPU workloads, FastAPI + ONNX Runtime for CPU workloads. See our model serving piece for the broader serving stack choice.
Container: minimal CUDA base image, model weights baked in (or pulled from object store on startup).
Deployment: Kubernetes Deployment with HPA, dedicated GPU node pool, taints/tolerations to keep general workloads off GPU nodes.
Monitoring: per-prediction logs to ClickHouse, aggregated metrics to Prometheus, drift detection sidecar.

For models running hospital imaging triage or banking fraud scoring, this stack handles meaningful production volume reliably.

What we strip out of every PyTorch demo

A few patterns we cut from impressive demos before they go to production:

torch.cuda.is_available() checks scattered through code. Make it a config / dependency injection concern. Application code shouldn’t care about device.
Loading the model on every request. Load once at startup, reuse for every inference.
Returning raw tensors from API endpoints. Convert to lists/floats/etc. at the boundary. Don’t leak PyTorch types.
Training loop code paths in inference services. Optimizer initialization, loss computation, gradient functions — none of these should exist in serving code.
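Multiprocessing for inference. PyTorch + multiprocessing + CUDA is a footgun: a forked worker cannot re-initialize CUDA, and every spawned worker holds its own copy of the model in GPU memory. Use asyncio + GPU batching instead.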
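
Several of these cleanups fit in one small wrapper: the device comes from config, the model is loaded once, and only plain Python types cross the API boundary. ServingConfig and Predictor are hypothetical names, and the Linear layer stands in for a real model.

```python
from dataclasses import dataclass
from typing import List

import torch


@dataclass(frozen=True)
class ServingConfig:
    device: str = "cpu"  # hypothetical config field, e.g. "cuda:0" in production


class Predictor:
    """Holds one model instance and hides tensors and devices from callers."""

    def __init__(self, model: torch.nn.Module, config: ServingConfig) -> None:
        self._device = torch.device(config.device)
        self._model = model.to(self._device).eval()  # done once, at startup

    @torch.inference_mode()
    def predict(self, features: List[List[float]]) -> List[List[float]]:
        batch = torch.tensor(features, dtype=torch.float32, device=self._device)
        scores = torch.softmax(self._model(batch), dim=-1)
        return scores.cpu().tolist()  # plain floats, no tensors, leave this method


# Created once when the service starts and reused for every request.
predictor = Predictor(torch.nn.Linear(4, 2), ServingConfig(device="cpu"))
scores = predictor.predict([[0.1, 0.2, 0.3, 0.4]])
```

Request handlers then depend only on Predictor.predict, which keeps device selection and PyTorch types out of application code.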

The thing that’s still hard

The patterns above cover the engineering side. They don’t solve:

Concept drift detection: the model was great on last quarter’s data; is it still great?
Performance regression after retraining: the new model has better aggregate metrics but worse performance on the segment that matters most.
Calibration: model says 0.7 confidence but is actually right 50% of the time.
Adversarial robustness: the model breaks on inputs that are imperceptibly different from training data.

These are ML-discipline problems, not engineering problems. The patterns above keep the model serving; the discipline is what keeps it serving well.
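
Calibration is the easiest of these to put a number on. One common measure is expected calibration error (ECE): bucket predictions by confidence, then take the weighted average gap between mean confidence and accuracy in each bucket. The sketch below is a plain-Python version with equal-width bins; the function name and bin count are illustrative choices.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, hit in members if hit) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# The case from the list above: the model says 0.7 but is right half the time.
overconfident = expected_calibration_error([0.7] * 10, [True] * 5 + [False] * 5)

# A calibrated model at the same confidence: right 7 times out of 10.
calibrated = expected_calibration_error([0.7] * 10, [True] * 7 + [False] * 3)
```

The overconfident case comes out at roughly 0.2, the gap between 0.7 and 0.5, while the calibrated case is close to zero. Tracking this number per model version is one way to notice calibration slipping after a retrain.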

The pattern of patterns

PyTorch in production is mostly normal production engineering, with a layer of model-specific concerns (memory, batching, quantization) bolted on. The teams that ship PyTorch reliably aren’t the ones with the cleverest model architectures. They’re the ones who treated the serving infrastructure with the same discipline they’d apply to any production system.

The model is the interesting part. The serving infrastructure is the boring part. The boring part is what determines whether your AI feature works at 3am.

Production AI is mostly production engineering. PyTorch doesn’t get a pass. If you’re building PyTorch services that need to serve real traffic, our ML & MLOps team has shipped this pattern across healthcare, finance, and logistics. Tell us about the workload.