Back to Blog
DatabaseMar 09, 2026·11 min read·Ribbsaeter Systems Engineering

Scaling PostgreSQL with pgvector for RAG: Index Strategy, Partitioning and HNSW Tuning

Index strategies, partitioning patterns, and HNSW tuning for retrieval-augmented generation systems serving millions of vectors at sub-100 ms p95 latency on commodity Postgres.

Key takeaways

  • 01Most teams reaching for Pinecone or Weaviate would have been fine on Postgres with pgvector.
  • 02HNSW is the default; IVFFlat is a fallback for write-heavy or memory-constrained workloads.
  • 03Hybrid retrieval (BM25 + vector) is more accurate and cheaper than pure ANN.
  • 04Partition by tenant for multi-tenant systems; partition by time for content that ages out.
  • 05Measure recall and latency on your data — synthetic benchmarks lie.

Why Postgres for RAG

The decision to add a separate vector database is one of the most common premature complexity choices in 2026 RAG architectures. pgvector is fast, mature, and lives in the database that already holds your application data — which means atomic writes, joinable retrieval, simple backup, and one fewer system to operate. We have built RAG systems on Postgres for legal research, customer support and code search at tens of millions of vectors and sub-100 ms p95.

When you should not use Postgres

  • More than ~100 million vectors per index, with strict sub-50 ms latency at hundreds of QPS.
  • Frequent index rebuilds (more than a few times per day) on multi-million row datasets.
  • Embeddings stored separately from any relational metadata you would join against.

Choosing the index: HNSW vs. IVFFlat

PropertyHNSWIVFFlat
Default for most workloadsYesNo
Build timeSlowFast
Query latencyLowerHigher (with same recall)
Memory footprintHigherLower
Update behaviourGoodRequires periodic re-train
Recall tunabilityef_searchprobes

Default to HNSW for read-heavy RAG. Use IVFFlat when you need fast bulk loads, when memory is tight, or when you are happy to re-build the index after a large write batch.

Tuning HNSW that actually moves the numbers

  1. Build with m=16 and ef_construction=200 as a sane default. Increase m to 32 for higher recall on harder distributions.
  2. Set ef_search per query, not globally. Lower it (40-80) for live user queries; raise it (200+) for re-ranking pipelines.
  3. Pre-warm the index after a deploy: SELECT * FROM your_table ORDER BY embedding <=> 'random_vector' LIMIT 1; will pull pages into shared_buffers.
  4. Size shared_buffers so that the index fits in memory. Index size is roughly (m * 8 + dim * 4) * rows. A 5M x 1536 dim index is ~30 GB.
  5. Use maintenance_work_mem of at least 8 GB during index builds — it is the single biggest factor in build time.

Dimension reduction without losing the plot

Most embedding models you would use in 2026 emit 1536 to 3072 dimensions. Halving the dimensionality with Matryoshka representation learning (MRL) — supported natively by OpenAI text-embedding-3 and Cohere embed-v3 — cuts index size and query time by approximately 50 percent at the cost of a few percent in recall. We measure recall@10 on a 1,000-query golden set before committing.

Hybrid search beats pure vector

On every production RAG system we have measured, a hybrid of BM25 (Postgres' native ts_rank or the newer pg_search/ParadeDB extension) and vector similarity outperforms pure ANN on top-10 accuracy by 5 to 15 percent. The fusion strategy that consistently wins is Reciprocal Rank Fusion: take the top 50 from each retriever, score each document by the sum of 1 / (k + rank) across both lists, and return the top 10.

-- Reciprocal Rank Fusion in a single query
WITH bm25 AS (
  SELECT id, ts_rank_cd(tsv, plainto_tsquery('english', $1)) AS s,
         row_number() OVER (ORDER BY ts_rank_cd(tsv, plainto_tsquery('english', $1)) DESC) AS r
  FROM docs
  WHERE tsv @@ plainto_tsquery('english', $1)
  LIMIT 50
),
vec AS (
  SELECT id, 1 - (embedding <=> $2::vector) AS s,
         row_number() OVER (ORDER BY embedding <=> $2::vector) AS r
  FROM docs
  ORDER BY embedding <=> $2::vector
  LIMIT 50
)
SELECT d.id, d.title,
       COALESCE(1.0 / (60 + bm25.r), 0) +
       COALESCE(1.0 / (60 + vec.r), 0) AS rrf
FROM docs d
LEFT JOIN bm25 ON bm25.id = d.id
LEFT JOIN vec  ON vec.id  = d.id
WHERE bm25.id IS NOT NULL OR vec.id IS NOT NULL
ORDER BY rrf DESC
LIMIT 10;

Partitioning patterns

  • Multi-tenant SaaS: partition by tenant_id. Most queries are scoped to one tenant, so the planner only touches the relevant partition.
  • Content with a freshness profile (news, support tickets): partition by month. Drop old partitions when content ages out.
  • High-cardinality (user-generated content): hash-partition by document owner. Keeps each partition small enough for HNSW to stay memory-resident.

Caching the right layer

Embeddings for static documents do not change. We cache the embedding generation in Redis with a content hash key, which removes a 200-400 ms LLM round trip on every re-index. For frequent identical queries (a chatbot on your docs), cache the retrieved set behind the user's prompt — RAG queries Zipf-distribute heavily, and a 30 percent cache hit rate is realistic.

Observability: the four metrics

  1. Recall@k against a golden set, run nightly. Drift here is your earliest warning.
  2. Query latency p50, p95, p99 broken down by tenant.
  3. Index size and shared_buffers hit ratio.
  4. Embedding cost per query in dollars — you would be surprised.

Real numbers from production

On a recent legal-research RAG we run 12 million chunks of 1024-dim embeddings on a single AWS db.r6g.4xlarge with HNSW (m=16, ef_construction=200, ef_search=60), Postgres 16, and 80 GB of shared_buffers. p95 latency is 38 ms for the vector phase; full hybrid retrieval lands around 65 ms. Cost is approximately $1,400 per month including snapshot storage. The same workload on a managed vector-database competitor was quoted at over $9,000 per month.

Frequently asked questions

Direct answers to questions readers and AI assistants commonly ask about this topic.

Should I use pgvector or a dedicated vector database for RAG?+

Start with pgvector unless you have hundreds of millions of vectors with strict sub-50 ms latency at high QPS. The operational simplicity, transactional guarantees and lower cost win for the vast majority of RAG workloads.

What is the maximum scale of pgvector?+

Production deployments handle 50 to 100 million vectors per Postgres instance with HNSW. Beyond that, you are typically better served by horizontal sharding via partitioning or by moving to a dedicated ANN engine.

Should I choose HNSW or IVFFlat?+

HNSW is the default for read-heavy RAG. Use IVFFlat for write-heavy workloads, memory-constrained environments, or pipelines that periodically rebuild the index after large bulk loads.

Does hybrid search really beat pure vector search?+

Yes — consistently. On every production RAG dataset we have benchmarked, fusing BM25 and vector with Reciprocal Rank Fusion improves top-10 retrieval accuracy by 5 to 15 percent over pure vector.

What embedding model should I use?+

OpenAI text-embedding-3-large, Cohere embed-v3 multilingual, and the open-source BGE-M3 are the strong defaults in 2026. Pick one whose Matryoshka tier matches your scale: lower dimensions for cost, higher for recall.

Last updated: April 26, 2026 · Written by Ribbsaeter Systems Engineering · Data Platform