PEFT and LoRA: Fine-Tuning Without the GPU Bill
Parameter-efficient fine-tuning lets you customize 70B models on a single A100. The patterns, the libraries, and the math.
Parameter-efficient fine-tuning (PEFT) changed the economics of model customization. A 70B model that would need a multi-node GPU cluster for full fine-tuning can be adapted on a single A100 with LoRA (Low-Rank Adaptation) or one of its variants. As the tooling matured over 2024-2026 (Hugging Face PEFT, Unsloth, Axolotl, and others), the approach became production-ready. This post walks through the patterns, the libraries, and the math.
What PEFT does
The PEFT approach has four steps:
Freeze the base model weights. The original weights stay unchanged.
Add small trainable adapter modules. LoRA adapters typically amount to 0.1-1% of the model's parameters.
Train only the adapters. Training is faster and far more memory-efficient because gradients and optimizer state exist only for the adapter parameters.
Deploy the adapter alongside the base model. At inference, the adapter's output is combined with the base model's.
The result: fine-tuning that performs comparably to full fine-tuning on many tasks, at a fraction of the compute and storage.
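The freeze, add, and train steps can be illustrated with a toy layer in plain Python. This is a minimal sketch, not how any PEFT library implements it: the helper names (`lora_forward`, `train_step`), the 3-dimensional layer, the rank of 1, and the learning rate are all assumptions chosen to keep the example small. The point it shows is that the base matrix `W` is only ever read, while the adapter matrices `A` and `B` receive the gradient updates.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def lora_forward(W, A, B, x):
    """y = W x + B (A x). W is the frozen base weight; A and B form the adapter."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


def loss(W, A, B, x, target):
    """Squared error 0.5 * ||y - target||^2."""
    y = lora_forward(W, A, B, x)
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def train_step(W, A, B, x, target, lr=0.1):
    """One gradient-descent step that updates A and B only; W is never modified."""
    y = lora_forward(W, A, B, x)
    err = [yi - ti for yi, ti in zip(y, target)]       # dL/dy
    low = matvec(A, x)                                  # r-dimensional projection of x
    grad_B = [[e * l for l in low] for e in err]        # dL/dB = err low^T
    bt_err = [sum(B[i][k] * err[i] for i in range(len(B))) for k in range(len(A))]
    grad_A = [[g * xj for xj in x] for g in bt_err]     # dL/dA = (B^T err) x^T
    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B


# Toy setup (illustrative values): d = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen base weights
W_before = [row[:] for row in W]
A = [[0.1, 0.2, -0.1]]       # r x d, small non-zero init
B = [[0.0], [0.0], [0.0]]    # d x r, zero init so the adapter starts as a no-op
x = [1.0, 0.5, -0.5]
target = [1.5, 0.5, -1.0]

loss_before = loss(W, A, B, x, target)
for _ in range(200):
    A, B = train_step(W, A, B, x, target)
loss_after = loss(W, A, B, x, target)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
print("base weights unchanged:", W == W_before)
```

Initializing `B` to zero means the adapted layer starts out identical to the base layer, a common choice for LoRA-style adapters; real libraries add details such as scaling factors and dropout that are left out here.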
The LoRA math
LoRA approximates the weight update as a product of two low-rank matrices.
Weight update. ΔW = BA, where B is d×r, A is r×d, and r is a small rank (typically 8-64).
Parameter count. The original layer has d×d parameters; the LoRA adapter has 2dr, which is far smaller when r is much less than d.
Training cost. Gradient and optimizer-state memory scale with the adapter parameter count, not the base model's. The forward and backward passes still run through the frozen base weights, so per-step compute shrinks less than memory does.
Inference cost. There are two deployment options:
- Keep the adapter separate; inference computes the extra BA term at each adapted layer (a small overhead).
- Merge the adapter into the base weights (W + BA); no inference overhead.
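The parameter arithmetic and the two deployment options above can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the function names are hypothetical, d = 4096 and r = 16 are example values rather than figures from any particular model, and the 2×2 matrices are toy integers picked so the results are easy to verify by hand.

```python
def lora_param_counts(d, r):
    """Return (full layer params, adapter params, adapter/full ratio) for a d x d layer."""
    full = d * d
    adapter = 2 * d * r      # B is d x r, A is r x d
    return full, adapter, adapter / full


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matvec(M, x):
    """Multiply matrix M by vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def merge(W, A, B):
    """Fold the adapter into the base weights: W' = W + B A."""
    BA = matmul(B, A)
    return [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W, BA)]


# Parameter count for an example layer size (assumed values).
full, adapter, ratio = lora_param_counts(d=4096, r=16)
print(full, adapter, f"{ratio:.2%}")   # 16777216 131072 0.78%

# Toy check that the separate and merged paths give the same output.
W = [[1, 2], [3, 4]]     # base weights, d = 2
B = [[1], [2]]           # d x r with r = 1
A = [[3, -1]]            # r x d
x = [2, 5]

y_separate = [b + d for b, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]
y_merged = matvec(merge(W, A, B), x)
print(y_separate, y_merged)            # [13, 28] [13, 28]
```

Keeping the adapter separate costs two small matrix-vector products per adapted layer but lets one base model serve many adapters; merging removes that overhead at the price of producing a separate full-size weight matrix per adapter.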
QLoRA: quantization plus LoRA
QLoRA combines 4-bit quantization with LoRA.
Quantize the base model to 4 bits. This cuts weight memory roughly 4x relative to 16-bit weights (8x relative to 32-bit).
Train a LoRA adapter on top of the quantized model. The quantized base stays frozen; only the adapter is trained.
Result: a 70B model becomes fine-tunable on a single A100 (80GB).
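The memory arithmetic behind the single-A100 claim can be sketched as follows. This is a back-of-the-envelope estimate, not a measurement: it counts weight storage only and ignores quantization metadata, activations, the adapter, and optimizer state, all of which consume part of the remaining headroom. The function name and constants are illustrative.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 70_000_000_000   # 70B parameters
A100_GB = 80                # memory on a single A100 (80GB variant)

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
headroom_gb = A100_GB - int4_gb

print(f"16-bit weights: {fp16_gb:.0f} GB (fits on one A100: {fp16_gb < A100_GB})")
print(f"4-bit weights:  {int4_gb:.0f} GB (fits on one A100: {int4_gb < A100_GB})")
print(f"headroom left for activations and adapter training: {headroom_gb:.0f} GB")
```

At 16 bits the weights alone exceed a single 80GB card, while at 4 bits they take under half of it, which is what leaves room for the adapter and the training overhead.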
The economics are dramatic: fine-tuning runs that used to cost $10K+ in compute can now cost in the hundreds of dollars.
The libraries
The PEFT library ecosystem:
Hugging Face PEFT. The reference library, with broad support.
Unsloth. Faster training with lower memory use.
Axolotl. A training framework with a config-driven approach.
LLaMA-Factory. An alternative training framework.
torchtune. PyTorch-native fine-tuning.
OpenAI fine-tuning API. Commercial, hosted fine-tuning.
Anthropic fine-tuning (enterprise tier).
AWS Bedrock fine-tuning.
The production patterns
Several production patterns:
Adapter library. Deploy multiple LoRA adapters against a single base model and swap adapters per request.
Multi-tenant fine-tuning. Per-tenant adapters on a shared base.
Domain adapters. Different adapters for different domains, with requests routed to the matching adapter.
Inference frameworks (vLLM, TGI, and others) increasingly support LoRA adapter serving.
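The per-request adapter swap described above comes down to a lookup from a tenant or domain key to an adapter, with the shared base model as the fallback. The sketch below shows only that routing logic in plain Python. `AdapterRegistry`, `build_request`, the model name, and the adapter names are all hypothetical, and the request dictionary is not the request format of vLLM, TGI, or any other serving framework.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AdapterRegistry:
    """Maps a routing key (tenant ID or domain) to a LoRA adapter name."""
    base_model: str
    adapters: Dict[str, str] = field(default_factory=dict)

    def register(self, key: str, adapter_name: str) -> None:
        self.adapters[key] = adapter_name

    def resolve(self, key: str) -> Optional[str]:
        # None means "no adapter": serve the request with the base model alone.
        return self.adapters.get(key)


def build_request(registry: AdapterRegistry, key: str, prompt: str) -> dict:
    """Assemble a serving request that names the shared base and the chosen adapter."""
    return {
        "base_model": registry.base_model,
        "adapter": registry.resolve(key),
        "prompt": prompt,
    }


# Hypothetical model and adapter names, for illustration only.
registry = AdapterRegistry(base_model="example-70b-base")
registry.register("tenant-a", "tenant-a-support-style")
registry.register("legal", "legal-domain-v2")

req_tenant = build_request(registry, "tenant-a", "Summarize this ticket.")
req_unknown = build_request(registry, "tenant-z", "Summarize this ticket.")
print(req_tenant["adapter"])    # tenant-a-support-style
print(req_unknown["adapter"])   # None
```

Falling back to the base model for unknown keys is one design choice; rejecting the request instead may be safer in a multi-tenant setting where every tenant is expected to have an adapter.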
The decision framework
For most teams in 2026:
Use PEFT/LoRA for cost-effective fine-tuning. It is the default modern choice.
Use QLoRA when memory-constrained.
Use full fine-tuning for the highest-quality requirements when budget allows; it is rarely necessary.
Use commercial fine-tuning APIs when convenience matters more than cost.
Use prompt engineering / RAG first; move to fine-tuning only when those prove insufficient.
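The rules above can be written down as a small function, which makes the order of the checks explicit. This is one possible encoding, not a definitive policy: the parameter names are hypothetical, and the precedence (prompting first, then managed APIs, then full fine-tuning, then QLoRA, with LoRA as the default) is an assumption about how to resolve cases where several rules apply at once.

```python
def choose_strategy(
    prompting_or_rag_sufficient: bool,
    convenience_over_cost: bool,
    needs_highest_quality: bool,
    budget_allows_full_ft: bool,
    memory_constrained: bool,
) -> str:
    """Return a customization strategy; the check order is an assumed precedence."""
    if prompting_or_rag_sufficient:
        return "prompt engineering / RAG"
    if convenience_over_cost:
        return "commercial fine-tuning API"
    if needs_highest_quality and budget_allows_full_ft:
        return "full fine-tuning"
    if memory_constrained:
        return "QLoRA"
    return "LoRA"


# A team whose prompts fall short, that self-hosts, and that has limited GPU memory.
choice = choose_strategy(
    prompting_or_rag_sufficient=False,
    convenience_over_cost=False,
    needs_highest_quality=False,
    budget_allows_full_ft=False,
    memory_constrained=True,
)
print(choice)   # QLoRA
```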
What we typically see at clients
Common patterns:
RAG-first deployments with no fine-tuning.
LoRA adapters for specific format or style customization.
QLoRA in cost-conscious deployments.
Commercial fine-tuning where convenience is the priority.
Where pdpspectra fits
Our MLOps practice builds production ML systems with fine-tuning strategies matched to each use case.
Related reading: the continual pre-training vs fine-tuning post, the quantization post, and the GPU cost post.
PEFT and LoRA make model customization affordable. Talk to our team about your AI customization strategy.