Open-Source LLMs in Production: Llama, Mistral, Gemma — When and Why
Open-source LLMs caught up enough to ship to production. The honest tradeoffs vs hosted APIs and the patterns that make self-hosted work.
The “open-source LLMs aren’t ready for production” line stopped being true around mid-2025. Llama 3.3 70B, Mistral Large 2, Qwen 2.5 72B, and Gemma 3 27B all clear the bar for serious enterprise workloads — not for every use case, but for plenty. The conversation has shifted from “can OSS models be used in production?” to “for which workloads is OSS the right call, and what does serving them actually cost?”
We’ve shipped both hosted-API and self-hosted-OSS deployments across hospital, banking, and SaaS clients. Here’s the honest picture in 2026.
What “production-ready” actually means for OSS LLMs#
A production-grade OSS deployment requires four things hosted APIs give you for free:
- Capable model weights (Llama, Mistral, Gemma, Qwen, etc.)
- Serving infrastructure (vLLM, TGI, SGLang, TensorRT-LLM)
- Operational discipline (monitoring, capacity planning, model lifecycle)
- Cost and latency at acceptable levels
The model weights checkbox is solved. The other three are where teams underestimate the work.
The model landscape in 2026#
Here are the OSS models we’d actually consider shipping:
| Model family | Sizes | Strengths | Operational fit |
|---|---|---|---|
| Llama 3.3 / 4 | 8B, 70B, 405B | Strong all-rounder; biggest ecosystem; widest tooling | Llama 3.3 70B fits on 2× H100 (FP8); 405B is multi-GPU only |
| Mistral Large 2 / Codestral | 123B (Large), 22B (Codestral) | Strong reasoning; great for code workloads | Large is 2-4 H100s; Codestral fits single H100 |
| Gemma 3 | 1B, 4B, 12B, 27B | Best of the smaller-model tier; mobile/edge friendly | Gemma 27B fits single H100 |
| Qwen 2.5 / 3 | 7B, 14B, 32B, 72B | Strong multilingual; underrated for non-English workloads | 72B is 2 H100s |
| DeepSeek V3 / R1 | 671B MoE | Reasoning-tuned; very strong on code + math | MoE means smaller active params (~37B) but big VRAM needs |
| Phi-4 | 14B | Microsoft’s small-model bet; strong for size | Single GPU, light footprint |
The Llama-3.3-70B and Gemma-3-27B class is what we recommend for most enterprise workloads. Frontier models (Claude Opus, GPT-5, o3) still outclass them on the hardest reasoning tasks, but for the bulk of production AI work — classification, extraction, summarization, RAG-grounded Q&A, structured output — OSS is genuinely competitive.
When OSS makes sense#
Data residency requirements. Healthcare and finance clients in regulated jurisdictions often can’t send patient or transaction data to OpenAI/Anthropic’s US servers. OSS hosted on your own infrastructure (or in-country data center) solves this cleanly.
High-volume, predictable workloads. Inference cost economics flip when you have steady high volume. Self-hosted Llama 70B on dedicated H100s starts beating hosted-API costs at roughly 10M+ tokens/day for that model class.
No-egress-to-third-parties constraints. Some compliance regimes require all data processing to stay within your network. Hosted APIs (even with PrivateLink) technically egress data to the provider; self-hosted doesn’t.
Specialized fine-tuned models. If you’ve fine-tuned a model for your domain, you need to host it. Hosted APIs don’t take arbitrary custom weights (with rare exceptions like OpenAI’s fine-tuning, which is limited to a couple of base models).
Cost certainty. Self-hosted has a predictable monthly bill (the GPU cost). Hosted APIs scale with usage which can spike unexpectedly. For workloads where cost variance matters, fixed cost is easier to plan around.
Vendor independence and longevity. OSS weights don’t get sunset. The Llama 3.3 you deploy today will be running unchanged in 5 years if you want. Hosted models get deprecated on the provider’s timeline.
When OSS doesn’t make sense#
You need the frontier. GPT-5, Claude Opus 4.7, o3 — the very top of capability is still proprietary. If your workload genuinely needs that ceiling (complex reasoning, multi-step agent work, hardest code tasks), OSS will disappoint.
Low-volume sporadic usage. Running an H100 24/7 for 100k tokens/day is wasteful. Hosted APIs amortize their compute across millions of customers; you pay per use.
Multimodal workloads. OSS vision-language and audio models exist (Llama 4 vision, Whisper for STT) but ecosystem is thinner. For unified text + vision + audio in one provider, hosted APIs win.
Small team without infrastructure capacity. Self-hosting a 70B model means: GPU procurement, vLLM/SGLang ops, monitoring, capacity planning, security patching, model lifecycle. If your team can’t sustain that, OSS is more pain than benefit.
Spiky traffic patterns. Bursting from 100 RPM to 10,000 RPM in 30 seconds is what hosted APIs are designed for. Self-hosted needs careful capacity planning, autoscaling, and queue management.
The serving stack: vLLM, TGI, SGLang, TensorRT-LLM#
What you actually run the model on:
- vLLM is the default we recommend. PagedAttention for memory efficiency, continuous batching for throughput, strong community, supports almost every OSS model within days of release. Easy to deploy, easy to operate. Works on NVIDIA, AMD (with caveats), and is being ported to other accelerators.
- TGI (Hugging Face Text Generation Inference) — solid alternative, tightly integrated with HF ecosystem. Less aggressive on perf optimization than vLLM but easier if you’re already in HF.
- SGLang — newer, faster than vLLM for some workloads (especially structured generation), smaller community. Worth evaluating for high-throughput specialized cases.
- TensorRT-LLM — NVIDIA’s optimized stack. Fastest for NVIDIA-only deployments where the operational complexity is worth it. Real engineering work to set up.
Our default: vLLM for most deployments, SGLang when structured-output throughput matters, TensorRT-LLM only for unusually large deployments.
Where to run it#
Self-managed Kubernetes with GPU node pool. Most flexibility, most ops work. We deploy this for clients with existing K8s platforms and the team to operate it.
Managed inference (Modal, Replicate, Together, Fireworks, Anyscale). Pay-per-use for OSS models with serverless or low-touch hosting. Closes the gap between “fully self-hosted” and “hosted API” — gives you OSS-model choice without all the infrastructure work. Strong fit for teams that want OSS without GPU ops.
Cloud-managed (SageMaker Real-Time Inference, GCP Vertex AI Endpoints, Azure ML Online Endpoints). Cloud-native serving for OSS models. Tighter cloud integration; less flexibility than self-managed.
Provider-hosted OSS (Bedrock Llama, Bedrock Mistral). AWS Bedrock hosts Llama 3 and Mistral models as if they were first-party. You get OSS-model choice with AWS billing and PrivateLink. Best of both worlds for AWS-native clients.
On-prem GPU. Air-gapped or strictly-on-prem deployments. Real engineering investment. We do this for the most regulated clients (e.g., banks where even cloud-hosted-in-country isn’t acceptable).
The cost math#
Rough numbers for a Llama 3.3 70B deployment serving ~5M tokens/day:
- AWS Bedrock (Llama 70B): ~$1.50–$2.50/day at on-demand pricing
- Hosted (Together / Fireworks): ~$1.00–$2.00/day at standard tiers
- Self-managed on 2× H100 spot:
$15/day in GPU cost ($0.30/hr × 2 × 24 — though spot pricing varies) - Self-managed on 2× H100 reserved/owned: ~$20-50/day amortized
- Bedrock Provisioned Throughput (Llama 70B): ~$20-40/day for committed capacity
Self-hosted is more expensive than hosted APIs at low volume. The crossover point varies, but rough rule: above ~50M tokens/day, self-hosted starts to win on cost. Below that, hosted is cheaper and operationally simpler.
The non-obvious cost driver: GPU utilization. A self-hosted instance running at 5% capacity costs the same as one at 95%. Hosted APIs only charge for what you use.
Quantization, distillation, and the small-model trade#
Two trends matter:
Quantization (FP8, INT8, AWQ, GPTQ) lets you run bigger models on smaller hardware. Llama 70B in FP16 needs ~140GB VRAM (2 H100s); in FP8 needs ~70GB (1 H100). Quality loss is usually small (1-3% on benchmarks) and acceptable for most production workloads.
Distillation and small specialized models. Increasingly, the right answer isn’t “use the biggest model” — it’s “fine-tune a smaller specialized model for your specific task.” A fine-tuned Llama 8B on your domain often beats a generic Llama 70B for the specific task it was trained on, at 1/10th the compute cost.
For hospital management systems we deploy, we often run a small fine-tuned Llama 3 8B for clinical-NLP extraction tasks (specific, high-volume, well-defined) and a larger model (Claude via Bedrock, or Llama 70B self-hosted) for the open-ended reasoning tasks. Right tool for the right job.
What we deploy by default#
For new client engagements with serious AI workloads:
- First evaluate hosted APIs (Bedrock, OpenAI, Anthropic). Cheaper and easier 80% of the time.
- Switch to self-hosted OSS when one of the triggers fires: data residency hard requirement, high steady volume, fine-tuned model needed, vendor independence requirement.
- Default OSS serving stack: vLLM on Kubernetes with H100 / H200 node pool, autoscaling on queue depth.
- Default model: Llama 3.3 70B (FP8 quantized) or Gemma 3 27B for most workloads. Llama 8B fine-tuned for high-volume narrow tasks.
- Managed inference (Modal, Together, Fireworks) as the middle path when full self-hosting is overkill but hosted APIs don’t fit.
For our AI & LLM integration service, we let the workload pick the approach. Hosted-first, OSS when the case is strong, self-hosted when the case is overwhelming.
The thing OSS hosting doesn’t solve#
Self-hosting OSS models doesn’t make you faster, cheaper, or safer by default. It makes you:
- Responsible for capacity planning (no auto-scale to infinity)
- Responsible for model upgrades (no automatic improvements)
- Responsible for monitoring (no provider dashboard)
- Responsible for security patching
- Responsible for cost optimization (idle GPUs cost real money)
For teams without this capacity, OSS is a step backwards. For teams with it, OSS opens up workloads hosted APIs can’t address.
The right call isn’t ideological. It’s about whether your specific workload, team, and compliance constraints align with what self-hosting gives you.
The pattern of patterns#
OSS LLMs in 2026 are real production infrastructure — but they’re infrastructure, not a shortcut. The teams that get the most value out of them treat them like databases: deliberate capacity planning, monitoring, security, upgrade paths.
The teams that struggle treat OSS as “free LLMs” and find out at month 6 that they’ve built a moderate-quality service at significant operational cost while missing every model improvement the hosted providers shipped. For the enterprise-level decision of where OSS fits in a multi-team rollout, see our enterprise AI rollout roadmap.
OSS LLMs in production is infrastructure work, not a shortcut. If you’re evaluating whether self-hosted fits your workload, our AI & LLM integration team has shipped both paths. Tell us about the workload.