RAG in Production | pdpspectra

Every team building production RAG hits the same fork in the road: where do the embeddings live? It looks like a tooling choice. It isn’t. The vector store you pick determines your operational model for the next two years — your latency budget, your reindex pain, your blast radius when the model team wants to swap embeddings.

We’ve shipped RAG on all three of pgvector, Pinecone, and Weaviate across hospital data platforms, banking knowledge bases, and internal search at logistics operators. Here’s how we actually pick.

The honest comparison

Property	pgvector	Pinecone	Weaviate
Hosting	Your Postgres	Managed SaaS	Managed SaaS or self-host
Best at	Joining vectors with relational data	Pure ANN at huge scale	Hybrid search + schema-driven retrieval
Reindex pain	`REINDEX` blocks writes unless run `CONCURRENTLY` (HNSW, pgvector 0.5.0+, helps)	Reindexes in background	Background, schema-aware
Filtering	Native SQL `WHERE`	Metadata filters (limited operators)	GraphQL filters (rich)
Hybrid search (BM25 + vector)	Bring your own	Sparse-dense in same index	Native (BlockMax WAND)
Latency p50 at 10M vectors	15-40 ms	30-80 ms	30-90 ms
Ops surface	Whatever Postgres you already run	None (you own data egress)	Real (cluster, sharding)
Pricing pattern	Postgres compute + storage	Per-namespace + read units	Self-host = infra cost; managed = per-million

There’s no “best.” There’s “best for your constraints.” Three frames make the call obvious.

Frame 1: are your vectors joined to relational data?

If your RAG is over data that already lives in Postgres — patient records, transactions, user accounts, ERP entities — stop looking. Use pgvector.

The reason is dull and load-bearing: every interesting filter in production RAG is a WHERE clause over relational columns. “Find the 10 most similar incident reports, but only for this hospital, in the last 90 days, where status is open.” With pgvector that’s one SQL query. With Pinecone or Weaviate that’s an upstream filter, possibly an external join, and you pay round-trip latency on a query that Postgres would have done with one index scan.

-- Native to Postgres + pgvector — one query, one trip
SELECT
  id, summary, created_at,
  embedding <=> $1 AS distance
FROM incident_reports
WHERE hospital_id = $2
  AND created_at > now() - interval '90 days'
  AND status = 'open'
ORDER BY embedding <=> $1
LIMIT 10;

The same query against Pinecone needs you to either: (a) maintain Pinecone metadata that mirrors hospital_id/status/created_at and trust it’s in sync, (b) overfetch + filter in app code, or (c) hit Postgres first to get the candidate IDs and then Pinecone for the rank. All three are worse than the SQL above.
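
To make option (b) concrete, here is a minimal sketch of over-fetching and filtering in app code. The row shape, the `keep` predicate, and the `overfetch_factor` parameter are illustrative assumptions, not any vendor’s API. The point it shows: when the filter is selective, even a 5x over-fetch can come back with fewer than the k results you asked for.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]


def overfetch_and_filter(
    ranked: List[Row],
    keep: Callable[[Row], bool],
    k: int = 10,
    overfetch_factor: int = 5,
) -> List[Row]:
    """Take the top k * overfetch_factor hits, filter in app code, return up to k."""
    candidates = ranked[: k * overfetch_factor]
    return [row for row in candidates if keep(row)][:k]


if __name__ == "__main__":
    # Hypothetical ranked hits, nearest first; only 1 in 20 belongs to hospital 7.
    ranked = [{"id": i, "hospital_id": 7 if i % 20 == 0 else 1} for i in range(100)]
    hits = overfetch_and_filter(ranked, lambda r: r["hospital_id"] == 7)
    print(f"asked for 10, got {len(hits)}")
```

Raising the over-fetch factor narrows the gap but never closes it for rare filter values, which is why pushing the filter into the index scan is the cleaner option.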

This is why every hospital management system we’ve built has its vector index inside the same Postgres that holds patients and visits. The vectors are just another column on a table that already exists. No second source of truth, no sync pipeline, no separate auth model.

Frame 2: are you at a scale where the math actually changes?

Pinecone earns its keep when one of two things is true:

(a) You have hundreds of millions of vectors and pgvector’s HNSW index is now larger than RAM. At that point you’re managing a Postgres cluster purely to host vectors, and you might as well use a system designed for it.
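
A back-of-envelope sketch for the “larger than RAM” check. It counts only the float32 payload (4 bytes per dimension), so treat the result as a lower bound: HNSW graph links and Postgres page overhead come on top. The function names are illustrative.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Lower bound on index size: vector payload only, no graph or page overhead."""
    return num_vectors * dims * bytes_per_dim


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 1024**3


if __name__ == "__main__":
    # Corpus sizes and dimensions chosen for illustration only.
    for n, d in [(10_000_000, 384), (10_000_000, 1536), (300_000_000, 1536)]:
        print(f"{n:>11,} vectors x {d:>4} dims ~ {to_gib(raw_vector_bytes(n, d)):,.1f} GiB")
```

Compare the output with the memory on your largest Postgres instance before concluding you need a dedicated store.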

(b) You need true multi-tenancy at scale — thousands of customer namespaces, each isolated. Pinecone’s namespace model is purpose-built for this. Doing it in Postgres means thousands of partitions or row-level security, and the index management gets ugly.

Below those thresholds, “we need Pinecone because it’s a vector database” is cargo-culting. Pinecone is excellent at the thing it does — but the thing it does is not “be a database.” It’s an index. If your data isn’t already in the shape Pinecone wants, you’re now running two systems with a sync problem.

Frame 3: how much hybrid search do you need?

Pure semantic search is wrong about 25-40% of the time on long-tail technical queries. “Print the K-RECEIPT-2025-Q3 form” doesn’t embed well: the embedding model has never seen that identifier, so its vector carries little signal. BM25 nails it.

The right answer for most knowledge bases is hybrid: BM25 for lexical recall + vector for semantic recall + rerank. The question is who owns the merge.

Weaviate does hybrid in one call. You set the alpha (lexical-vs-vector weight) and get a merged ranked list. This is genuinely good — and if your data is unstructured documents with no obvious relational home, Weaviate’s schema-first approach is the cleanest.
pgvector + Postgres FTS is hybrid you assemble yourself. A `WITH lexical AS (...), semantic AS (...) SELECT ...` CTE that merges the two with reciprocal rank fusion is about 30 lines of SQL and works fine.
Pinecone hybrid (sparse-dense) requires building sparse vectors yourself (via SPLADE or a BM25 sparse encoder), which is more work than it sounds, and the server-side merge is opaque.

If hybrid is a hard requirement and you don’t already have Postgres, Weaviate is the path of least resistance. If you have Postgres, you can do it yourself for the cost of an afternoon.
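
If you own the merge, reciprocal rank fusion is small enough to read in one sitting. The sketch below does it in application code over two ranked lists of document IDs. The constant `k = 60` is the conventional default rather than a tuned value, and the document IDs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


if __name__ == "__main__":
    lexical = ["form-k", "doc-a", "doc-b"]   # e.g. BM25 / full-text order
    semantic = ["doc-a", "doc-c", "form-k"]  # e.g. vector distance order
    print(reciprocal_rank_fusion([lexical, semantic]))
```

Documents that appear in both lists float to the top, and RRF needs only ranks, so you never have to reconcile BM25 scores with vector distances.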

What we ship by default

For new clients building RAG, our default starting stack is:

pgvector on Postgres 16+ with HNSW indexes.
pgvector-python or pgvector-go in the app.
Reranking via Cohere Rerank or a cross-encoder in a second step (the rerank does more for accuracy than any vector store choice).
OpenAI text-embedding-3-small or local bge-large-en-v1.5 for embeddings, depending on data sensitivity.
Metric: cosine distance (<=>), unless you have a reason not to.
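
For reference, cosine distance is one minus cosine similarity, so 0 means same direction and 2 means opposite. A plain-Python sketch of the definition follows. It is meant for sanity-checking values in tests, not as a substitute for the index operator.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cos(a, b). Ignores vector length; only direction matters."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
    print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```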

The reranker matters more than people expect. A top-30 retrieve from pgvector → top-5 rerank pipeline outperforms a top-5 retrieve from any vector store you can name. Spend your complexity budget there, not on the index.
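
The shape of that pipeline, as a sketch: `retrieve` and `score` are placeholders you would wire to your vector store and your reranker (a hosted rerank API or a cross-encoder). The `token_overlap` scorer is a toy stand-in so the example runs on its own. It is not a real reranker.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (doc_id, text)


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[Candidate]],
    score: Callable[[str, str], float],
    retrieve_k: int = 30,
    final_k: int = 5,
) -> List[str]:
    """Cast a wide net with the index, then let the reranker pick the best few."""
    candidates = retrieve(query, retrieve_k)
    scored = [(score(query, text), doc_id) for doc_id, text in candidates]
    # Stable sort: ties keep the retriever's original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:final_k]]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer for the demo: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


if __name__ == "__main__":
    docs = [
        ("a", "reset your password"),
        ("b", "print the receipt form"),
        ("c", "receipt form printing guide for q3"),
    ]
    fake_retrieve = lambda query, k: docs[:k]  # stands in for the vector query
    print(retrieve_then_rerank("print receipt form", fake_retrieve, token_overlap, 3, 2))
```

Keeping `retrieve` and `score` as injected callables means you can swap the vector store or the reranker without touching the pipeline, and measure each change separately.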

We move off pgvector when one of the three frames above flips, not before. About one in five projects ends up on a dedicated vector store. The rest stay on Postgres for the lifetime of the project.

The mistakes we keep seeing

A few patterns we audit our way out of:

Reindexing the entire corpus on every embedding model swap. If you’re swapping models often, design for it: keep both old and new embeddings in parallel columns during the cutover, A/B the rank quality, then drop the old. Don’t take downtime.
Storing 1536-dimension vectors when 384 would do. Smaller embeddings (bge-small, all-MiniLM-L6-v2) often retrieve as well as the big ones on narrow corpora, at a quarter of the storage and with faster queries. Test on your data before defaulting to a larger OpenAI model.
No eval set. You can’t compare pgvector vs Pinecone without a golden set of (query, expected document) pairs. Build the eval first. It’s the only thing that lets you change vector stores without superstition.
Treating chunking as solved. The single biggest accuracy lever in RAG is chunk boundaries. Try four chunking strategies on your eval set before you tune anything else.
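
Two of the simpler chunking strategies, sketched so you have something to put in front of the eval set. Sizes are counted in whitespace-separated words for readability. A real pipeline would more likely count tokens from the embedding model’s tokenizer, and the default sizes here are arbitrary starting points.

```python
from typing import List


def chunk_by_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Fixed-size sliding window; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


def chunk_by_paragraphs(text: str, max_words: int = 200) -> List[str]:
    """Greedily pack whole paragraphs; an oversized paragraph becomes its own chunk."""
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Run each strategy through the same retrieval and the same golden set, and keep the one that scores best on your data rather than the one that sounds most principled.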

When you’ll regret each choice

Regret pgvector when: your Postgres team is already screaming about lock contention, and the vector workload pushes them over. The HNSW build is CPU-heavy; the queries are I/O-heavy; the two patterns compete with your OLTP. At that point you’ve outgrown the “vectors as a column” mental model and need an isolated system.

Regret Pinecone when: the bill arrives. Pinecone is fairly priced for its target customer (millions of vectors, hot search). For a 200k-vector corpus serving 50 queries/min, you’re paying for capacity you’ll never use. Also, every relational filter you can’t push down becomes an over-fetch.

Regret Weaviate when: the cluster needs care. Self-hosted Weaviate is a real database — replication, sharding, GraphQL schema migrations. Managed Weaviate Cloud is fine, but you’ve now got vendor lock-in on a smaller player than Pinecone.

The meta-point

RAG is a data engineering problem with a model bolted on. The vector store is one decision out of many: chunking, embedding model, metadata schema, reranking, eval set, observability. Picking pgvector vs Pinecone in isolation, without those other decisions made, is putting the cart before the horse.

The teams that ship RAG that holds up in production don’t agonize over the vector store. They build the eval set first, pick the boring default, and rebuild the parts that don’t survive contact with real traffic.
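
An eval harness does not need to be elaborate. The sketch below scores any search function against a golden set of (query, expected document ID) pairs, reporting recall@k and mean reciprocal rank. The `search` callable is a placeholder for whatever retrieval stack you are testing, and the metric choice is an assumption. Use whatever your team already trusts.

```python
from typing import Callable, Dict, Iterable, List, Tuple

GoldenPair = Tuple[str, str]  # (query, expected_doc_id)


def evaluate(
    golden: Iterable[GoldenPair],
    search: Callable[[str, int], List[str]],
    k: int = 5,
) -> Dict[str, float]:
    """Return recall@k and MRR for `search` over the golden set."""
    total = 0
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, expected in golden:
        total += 1
        results = list(search(query, k))[:k]
        if expected in results:
            hits += 1
            reciprocal_rank_sum += 1.0 / (results.index(expected) + 1)
    if total == 0:
        raise ValueError("golden set is empty")
    return {"recall_at_k": hits / total, "mrr": reciprocal_rank_sum / total}


if __name__ == "__main__":
    # Hypothetical golden set and canned results, standing in for a real index.
    golden = [("q1", "d1"), ("q2", "d2"), ("q3", "d9")]
    canned = {"q1": ["d1", "d3"], "q2": ["d4", "d2"], "q3": ["d5", "d6"]}
    print(evaluate(golden, lambda q, k: canned[q][:k], k=5))
```

Because the harness only sees a function from query to ranked IDs, the same golden set can compare vector stores, chunking strategies, and rerankers on equal terms.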

For the hospital management systems and banking knowledge bases we deploy, that boring default is pgvector. Yours might differ. But please, run an eval first.

RAG that survives a quarter of real users is mostly plumbing. If you’re stuck between a demo that wows and a system that doesn’t degrade gracefully, our AI & LLM integration service is built around that exact gap. Or tell us what’s breaking and we’ll see what fits.
