Quantization in Production: GPTQ vs AWQ vs Bitsandbytes in 2026
Done right, 4-bit quantization cuts weight memory roughly 4x relative to FP16 with little quality loss. Here are the three techniques worth running in production, and when each fits.
Quantization is the technique that lets a 70B-parameter model fit on a single 48GB GPU instead of a multi-GPU node: at 4 bits per weight, the weights take roughly 35GB rather than the 140GB they need in FP16. The math is straightforward (represent weights with fewer bits), but the engineering choices are not. Different quantization techniques make different trade-offs among quality, speed, and memory, and the right choice depends on the hardware you’re targeting and the quality bar you need to clear.
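To make the memory arithmetic concrete, here is a back-of-the-envelope sketch. The function name `weight_memory_gb` and the overhead model (one 16-bit scale per group of weights) are assumptions for illustration; real formats may also store zero-points, and the KV cache and activations come on top of the weights.

```python
from typing import Optional


def weight_memory_gb(params_billions: float, bits: float,
                     group_size: Optional[int] = None) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    If group_size is given, add one 16-bit scale per group of weights
    (an assumed overhead model, not any specific file format).
    """
    effective_bits = bits + (16 / group_size if group_size else 0)
    # billions of parameters * bits per parameter / 8 bits per byte = GB
    return params_billions * effective_bits / 8


for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_memory_gb(70, bits):.1f} GB")
print(f"70B at 4-bit, group size 128: {weight_memory_gb(70, 4, 128):.1f} GB")
```

By this estimate, the weights of a 70B model take about 140 GB at 16 bits and about 35 GB at 4 bits, before any runtime overhead.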
This post walks through the three quantization techniques that matter in production in 2026.
GPTQ
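GPTQ (Generative Pre-trained Transformer Quantization) is a post-training quantization technique that has been widely used since its release in late 2022. It quantizes each layer’s weights a few at a time and uses approximate second-order (Hessian) information, estimated from calibration inputs, to adjust the weights that have not been quantized yet so that the layer’s output error stays small.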
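The error-compensation idea can be sketched on a toy example. This is a simplified illustration of the second-order update that GPTQ builds on, not the production algorithm: real implementations work on blocks of columns, use a Cholesky factorization with damping, and fit per-group scales. The helper names, the fixed step size, and the 2x2 Hessian are all made up for illustration.

```python
def invert(matrix):
    """Gauss-Jordan inverse for a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def round_to_grid(value, step):
    """Plain round-to-nearest onto a uniform grid."""
    return round(value / step) * step


def quantize_row_with_compensation(weights, hessian, step):
    """Quantize one weight at a time and push its error onto the remaining weights."""
    w = list(weights)
    h_inv = invert(hessian)
    n = len(w)
    quantized = [0.0] * n
    for i in range(n):
        quantized[i] = round_to_grid(w[i], step)
        err = (w[i] - quantized[i]) / h_inv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * h_inv[i][j]
        # Remove index i from the inverse Hessian of the remaining weights.
        d = h_inv[i][i]
        col_i = [h_inv[r][i] for r in range(n)]
        row_i = list(h_inv[i])
        for r in range(i + 1, n):
            for c in range(i + 1, n):
                h_inv[r][c] -= col_i[r] * row_i[c] / d
    return quantized


def layer_error(weights, quantized, hessian):
    """Quadratic proxy for the layer's output error: delta^T H delta."""
    delta = [w - q for w, q in zip(weights, quantized)]
    n = len(delta)
    return sum(delta[i] * hessian[i][j] * delta[j]
               for i in range(n) for j in range(n))


weights = [0.3, 0.8]
hessian = [[2.0, 1.0], [1.0, 2.0]]  # stands in for X^T X from calibration inputs
step = 0.5

rtn = [round_to_grid(w, step) for w in weights]
compensated = quantize_row_with_compensation(weights, hessian, step)
print("round-to-nearest:", rtn, layer_error(weights, rtn, hessian))
print("compensated:     ", compensated, layer_error(weights, compensated, hessian))
```

In this toy case the compensated result rounds the second weight the "wrong" way on purpose, because doing so cancels part of the first weight's error, and the error proxy drops from roughly 0.24 to roughly 0.14. When the inputs are uncorrelated (a diagonal Hessian), the update reduces to plain round-to-nearest.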
The practical characteristics:
- Quality preservation is generally excellent at 4-bit. The drop relative to the FP16 baseline is typically small, often under 1% on standard benchmarks for larger models, though smaller models tend to lose more.
- Calibration data required — GPTQ uses a small calibration dataset (typically a few hundred to a few thousand examples) to optimize the quantization. The choice of calibration data affects the result; using representative samples for your actual workload helps.
- Inference speed is good with optimized kernels (especially through vLLM, TGI, and similar high-performance inference engines).
- Wide tooling support — most quantization-aware inference frameworks support GPTQ natively.
GPTQ is the right default for most teams quantizing LLMs for production. It’s well-supported, well-understood, and produces consistently good results.
AWQ
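AWQ (Activation-aware Weight Quantization) is the newer alternative. The key insight: not all weights matter equally for output quality. AWQ identifies the salient weight channels (the ones that multiply the largest activations, so their quantization error affects the output most) and protects them by scaling them up before quantization, with the inverse scale folded into the preceding operation. All weights stay at the same low bit-width, which keeps the inference kernels simple.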
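A minimal sketch of the activation-aware idea, under stated assumptions: one weight row forms a single quantization group, the per-channel scale is the mean absolute activation raised to a power `alpha`, and `alpha` is picked by a small grid search on output error. The function names and the toy numbers are hypothetical; real AWQ implementations add details such as scale normalization and weight clipping.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest over one group sharing a single step size."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / step))) * step for v in values]


def output_error(row, scales, sample_activations, bits=4):
    """Mean squared output error after quantizing the scaled weights."""
    scaled = [w * s for w, s in zip(row, scales)]
    # Dividing the quantized weight by s is equivalent to dividing the activation by s.
    restored = [q / s for q, s in zip(quantize_group(scaled, bits), scales)]
    total = 0.0
    for x in sample_activations:
        exact = sum(w * xi for w, xi in zip(row, x))
        approx = sum(w * xi for w, xi in zip(restored, x))
        total += (exact - approx) ** 2
    return total / len(sample_activations)


def search_scales(row, sample_activations, bits=4,
                  alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Try scales = mean|activation| ** alpha and keep the alpha with the lowest error."""
    count = len(sample_activations)
    mean_abs = [sum(abs(x[j]) for x in sample_activations) / count
                for j in range(len(row))]
    best = None
    for alpha in alphas:
        scales = [max(m, 1e-8) ** alpha for m in mean_abs]
        err = output_error(row, scales, sample_activations, bits)
        if best is None or err < best[1]:
            best = (alpha, err, scales)
    return best


# Toy data: channel 0 sees much larger activations than the others.
row = [0.2, -0.45, 0.9, 0.05]
sample_activations = [
    [8.0, 0.5, 0.3, 0.4],
    [7.5, -0.4, 0.2, -0.6],
    [9.0, 0.6, -0.5, 0.3],
]

baseline_err = output_error(row, [1.0] * len(row), sample_activations)
best_alpha, best_err, best_scales = search_scales(row, sample_activations)
print("no scaling:", baseline_err)
print("best alpha:", best_alpha, "error:", best_err)
```

Because `alpha = 0` (no scaling) is part of the grid, the search can never do worse than plain round-to-nearest on the recorded activations; whether it does better depends on how uneven the activation magnitudes are.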
The practical characteristics:
- Quality preservation is comparable to GPTQ on most workloads, sometimes slightly better at very low bit-widths.
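- Lightweight calibration: AWQ still needs a small set of sample inputs to collect activation statistics from a forward pass, but it involves no backpropagation or weight reconstruction, so it is less sensitive to the choice of calibration data than GPTQ. This simplifies the workflow.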
- Inference speed is excellent with the AWQ-specific kernels (AutoAWQ and the vLLM integration).
- Memory efficiency is comparable to GPTQ.
AWQ has been gaining share since 2024 because of the simpler workflow. For teams quantizing many models, the lighter calibration step means materially less operational work.
Bitsandbytes (LLM.int8 and NF4)
Bitsandbytes is a quantization library, distributed as the bitsandbytes Python package and tightly integrated with Hugging Face Transformers. It provides two main quantization modes:
- LLM.int8() for 8-bit quantization
- NF4 (NormalFloat 4-bit) for 4-bit quantization
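The NF4 idea can be sketched in a few lines: each block of weights is divided by its absolute maximum, and every normalized value is snapped to the nearest entry of a fixed 16-value codebook whose levels are spaced for roughly normally distributed weights. The codebook below uses the NF4 values rounded to four decimals, and the function names are hypothetical; this is a sketch of the concept, not the bitsandbytes kernels, which work on larger blocks and pack two 4-bit indices per byte.

```python
# NF4 code values rounded to four decimals; treat as illustrative.
NF4_CODES = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(block):
    """Return the block's absmax and one 4-bit codebook index per value."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero
    indices = [
        min(range(len(NF4_CODES)), key=lambda i: abs(NF4_CODES[i] - v / absmax))
        for v in block
    ]
    return absmax, indices


def dequantize_block(absmax, indices):
    """Look up each code value and rescale by the stored absmax."""
    return [NF4_CODES[i] * absmax for i in indices]


block = [0.1, -0.5, 0.25, 0.0]
absmax, indices = quantize_block(block)
restored = dequantize_block(absmax, indices)
print(absmax, indices)
print(restored)
```

Only the indices and one absmax per block need to be stored, which is why no calibration pass is involved: everything is computed from the weights themselves.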
The practical characteristics:
- Quality preservation is excellent for 8-bit, very good for 4-bit (NF4 typically slightly worse than well-tuned GPTQ or AWQ).
- No calibration required — the library handles quantization on the fly.
- Inference speed is slower than GPTQ/AWQ with their specialized kernels. The bitsandbytes path runs through PyTorch and is less optimized for inference throughput.
- Excellent for training — particularly QLoRA workflows, where bitsandbytes is the standard quantization layer underneath PEFT fine-tuning.
Bitsandbytes is the right choice for development workflows, fine-tuning (QLoRA), and inference scenarios where simplicity matters more than maximum throughput. For dedicated inference servers serving high traffic, GPTQ or AWQ with vLLM is typically faster.
The decision framework
For most production teams in 2026:
Use AWQ for new inference deployments. The lightweight calibration workflow, combined with quality comparable to GPTQ, makes it the operationally simpler choice.
Use GPTQ if you have workload-specific calibration data and want maximum quality preservation, or if you’re working with a model family that’s better supported in the GPTQ ecosystem.
Use Bitsandbytes for development, training, and lower-traffic inference where simplicity dominates. It is a particularly natural fit for QLoRA fine-tuning workflows.
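The rules of thumb above can be written down as a small helper, which is handy as a starting point for a team checklist. The function name, the workload labels, and the flag names are hypothetical; the logic only restates the guidance in this section.

```python
def pick_quantization(workload: str,
                      has_workload_calibration_data: bool = False,
                      needs_max_quality: bool = False,
                      gptq_better_supported: bool = False) -> str:
    """Map a workload description to one of the three techniques."""
    if workload in {"development", "training", "low_traffic_inference"}:
        return "bitsandbytes"
    if workload == "inference":
        if gptq_better_supported:
            return "GPTQ"
        if has_workload_calibration_data and needs_max_quality:
            return "GPTQ"
        return "AWQ"
    raise ValueError(f"unknown workload: {workload!r}")


print(pick_quantization("inference"))
print(pick_quantization("inference", has_workload_calibration_data=True,
                        needs_max_quality=True))
print(pick_quantization("training"))
```

Treat the output as a default to challenge with measurements, not as a verdict.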
The hardware story
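Quantization choice interacts with hardware. NVIDIA’s Hopper GPUs (H100, H200) add FP8 tensor cores, and the Blackwell GPUs (B100, B200) add FP4 on top of that. Weight-only 4-bit formats such as GPTQ and AWQ mainly save memory and bandwidth: their kernels typically dequantize weights to FP16 or BF16 before the matrix multiply, so how much each technique gains depends on the kernels available for your GPU. AMD’s MI300 series has similar but distinct hardware support.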
For most teams running on rented GPU instances, the abstraction layer (vLLM, TGI, TensorRT-LLM) handles this. For teams running on owned hardware with specific accelerator characteristics, the quantization-hardware fit matters more.
The quality measurement
Whatever technique you pick, measure the quality cost. Run your actual evaluation suite against the quantized model. Don’t trust headline benchmark numbers from papers; they may not match your workload.
Specific things to check: instruction following, factual accuracy on domain-relevant questions, output format consistency, refusal behavior, tool use correctness. Different quantization techniques produce different failure modes; your evaluation should catch the modes that matter for your application.
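One way to structure such a check is a side-by-side harness that runs the same cases through the baseline and the quantized model and reports regressions per category. The sketch below assumes each model is exposed as a plain `prompt -> text` callable; `compare_models`, the stub models, and the sample cases are hypothetical stand-ins for your own serving clients and evaluation suite.

```python
import json


def is_valid_json(text):
    """Example check for output format consistency."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def compare_models(baseline, quantized, cases):
    """cases is a list of (category, prompt, check) tuples.

    Returns per-category pass counts plus the prompts that pass on the
    baseline but fail on the quantized model.
    """
    report = {}
    for category, prompt, check in cases:
        stats = report.setdefault(
            category, {"baseline_pass": 0, "quantized_pass": 0, "regressions": []}
        )
        base_ok = check(baseline(prompt))
        quant_ok = check(quantized(prompt))
        stats["baseline_pass"] += int(base_ok)
        stats["quantized_pass"] += int(quant_ok)
        if base_ok and not quant_ok:
            stats["regressions"].append(prompt)
    return report


# Stub models standing in for real inference endpoints.
def baseline_model(prompt):
    return '{"answer": "4"}'


def quantized_model(prompt):
    # Simulates a format regression on one kind of prompt.
    return '{"answer": "4"' if "table" in prompt else '{"answer": "4"}'


cases = [
    ("output_format", "Return 2+2 as JSON", is_valid_json),
    ("output_format", "Return the table as JSON", is_valid_json),
]
report = compare_models(baseline_model, quantized_model, cases)
print(report)
```

Tracking regressions per category, instead of a single aggregate score, makes it easier to spot a failure mode that only shows up in, say, tool calls or structured output.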
Where pdpspectra fits
Our MLOps practice builds production LLM inference platforms with appropriate quantization for the workload. The technique choice is one of many engineering choices we make on behalf of clients.
Related reading: the sub-100ms inference post, the GPU cost spot vs reserved post, and the PEFT LoRA fine-tuning post.
Quantization is now standard production practice. Talk to our team about your inference platform.