3. Vector Indexing & Storage

"Embeddings are the bridge between text and vector search. Storage optimization is the key to production RAG systems." — RAG Infrastructure Principle

This chapter covers embedding model selection, batch generation strategies, caching techniques, index algorithm principles, advanced indexing strategies, vector database architecture, and production optimization for RAG systems.


3.1 Understanding Embeddings in RAG

The Role of Embeddings

Embeddings are the bridge between text and vector search. They convert semantic meaning into numerical vectors that can be compared mathematically.

Key concepts:

  1. Vector Space: Similar concepts are close together in high-dimensional space
  2. Dimensionality: Higher dimensions capture more nuance but cost more
  3. Model Selection: Different models optimized for different use cases
  4. Batch Processing: Generate embeddings in batches to reduce API calls
  5. Caching: Cache embeddings to avoid recomputation

Why Vector Indexing Matters

The Indexing Advantage:

| Approach | Search Time | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Linear Scan | 10-30s | 100% | Low | < 1K documents |
| IVF (Inverted File) | 100-500ms | 95% | Medium | 1K-100K docs |
| HNSW (Hierarchical Navigable Small World) | 10-50ms | 98% | High | 100K-10M docs |
| Quantization | 5-20ms | 90% | Very Low | 10M+ docs |

3.2 Index Fundamentals

3.2.1 What is an Index?

An index is a data structure that accelerates data retrieval by avoiding full table scans.

Book Index Analogy:

  • Without index: Read every page to find "machine learning"
  • With index: Look up "machine learning" in index → jump to pages 42, 87, 134
  • Result: 100x faster lookup

Database Index vs. Vector Index:

| Aspect | Traditional Database (B-tree) | Vector Index (HNSW/IVF) |
|---|---|---|
| Query type | Exact match (WHERE id = 42) | Similarity (closest vectors) |
| Structure | Balanced tree | Graph or clustering |
| Complexity | O(log n) lookup | O(log n) approximate search |
| Use case | Structured data | Unstructured semantic search |

3.2.2 The Vector Search Challenge

The Curse of Dimensionality:

As dimensions increase, distance metrics lose meaning and search becomes computationally intractable.

Why Traditional Indexes Fail for Vectors:

  1. B-tree doesn't support similarity search: B-tree organizes data by sorted keys, not spatial proximity
  2. High-dimensional space is sparse: Tree structures become inefficient as dimensionality increases
  3. Distance computation is expensive: Calculating cosine similarity across 1536 dimensions is costly

Solution: Approximate Nearest Neighbor (ANN) algorithms trade small accuracy loss for massive speed improvement.

3.2.3 Index Types for Vectors

Exact vs. Approximate Search:

| Index Type | Search Method | Time Complexity | Accuracy | When to Use |
|---|---|---|---|---|
| Flat (Linear Scan) | Compare query with every vector | O(n × d) | 100% | < 1K documents |
| IVF (Inverted File) | Search only relevant Voronoi cells | O(√n × d) | 92-95% | 1K-100K documents |
| HNSW (Small World Graph) | Navigate probabilistic graph | O(log n × d) | 97-99% | 100K-10M documents |
| Quantized Index | Compressed vectors + ANN | O(log n × d/4) | 90-95% | 10M+ documents |

Accuracy vs. Speed Trade-offs:


3.3 Index Algorithm Principles

3.3.1 Linear Scan (Flat Index)

How It Works:

Compare query vector with every vector in the dataset, return top-K closest.

import numpy as np

def linear_scan(query, doc_embeddings, k=5):
    # cosine similarity between the query and every stored vector
    sims = doc_embeddings @ query / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query))
    # indices of the top-K most similar documents
    return np.argsort(-sims)[:k]

Characteristics:

| Property | Value |
|---|---|
| Build time | 0s (no index structure) |
| Search time | O(n × d) where n = docs, d = dimensions |
| Memory overhead | 0% (just store vectors) |
| Accuracy | 100% (exact search) |
| Best for | < 1K documents |

When Linear Scan is Best:

  1. Small datasets: < 1K documents, index overhead isn't worth it
  2. Accuracy-critical applications: Legal, medical, financial where 2% error is unacceptable
  3. Dynamic data: Frequent updates make index maintenance expensive
  4. Batch processing: Overnight indexing where speed isn't critical

Performance Example:

  • 1K documents × 1536 dimensions
  • Search time: ~10ms
  • Memory: 6 MB for vectors only
  • No build time

3.3.2 IVF (Inverted File)

How IVF Works:

IVF partitions vector space into Voronoi cells using clustering (typically k-means).

Voronoi Cell Diagram:

Cluster 1 (centroid C1)     Cluster 2 (centroid C2)     Cluster 3 (centroid C3)
[points nearest C1]         [points nearest C2]         [points nearest C3]

Query Q is closest to C2 → Search only Cluster 2

IVF Parameters:

| Parameter | Description | Default | Effect |
|---|---|---|---|
| nlist | Number of clusters | 1000 | Higher = better precision, slower search |
| nprobe | Clusters to search | 10 | Higher = better recall, slower search |
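
A minimal sketch of these parameters in code, assuming the FAISS library and stand-in float32 NumPy vectors (not part of the original text):

import faiss
import numpy as np

d = 1536                                          # embedding dimensionality
vectors = np.random.rand(100_000, d).astype("float32")   # stand-in for real embeddings

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer holding the centroids
index = faiss.IndexIVFFlat(quantizer, d, 1000)    # nlist = 1000 Voronoi cells
index.train(vectors)                              # k-means clustering step
index.add(vectors)

index.nprobe = 10                                 # search 10 cells per query
distances, ids = index.search(vectors[:1], 5)     # top-5 approximate neighbors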

Characteristics:

| Property | Value |
|---|---|
| Build time | Minutes (requires clustering) |
| Search time | O(√n × d) with nprobe = 10 |
| Memory overhead | Medium (cluster assignments) |
| Accuracy | 92-95% recall |
| Best for | 1K-100K documents |

When IVF is Best:

  1. Medium datasets: 1K-100K documents
  2. Limited memory: Lower memory overhead than HNSW
  3. Batch updates: Can rebuild index periodically

Limitations:

  1. Requires training: Need to run k-means before indexing
  2. Poor for dynamic data: Updating index is expensive
  3. Cluster boundary issues: Documents near cluster edges may be missed

3.3.3 HNSW (Hierarchical Navigable Small World)

How HNSW Works:

HNSW builds a probabilistic skip list (layered graph) for logarithmic-time search.

Graph Construction Algorithm:

For each new point:
1. Determine point's level (probabilistic)
- Level 0: 100% of points
- Level 1: 10% of points
- Level 2: 1% of points
- Level 3+: 0.1% of points

2. Find entry point (top-level point)

3. For each level from top down:
a. Greedy search for nearest neighbors in that level
b. Keep the best ef_construction candidates and select M of them as neighbors
c. Create bidirectional connections between the new point and its neighbors

4. In base layer (Level 0):
a. Find M nearest neighbors
b. Create bidirectional edges

HNSW Parameters:

| Parameter | Description | Default | Range | Effect |
|---|---|---|---|---|
| M | Max connections per node | 16 | 8-64 | Higher = better recall, more memory |
| ef_construction | Index build quality | 100 | 50-200 | Higher = better index, slower build |
| ef_search | Search candidates | 50 | 10-100 | Higher = better recall, slower search |
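
A minimal sketch of these parameters in code, assuming the hnswlib library and stand-in float32 NumPy vectors (library choice and data are illustrative):

import hnswlib
import numpy as np

dim = 1536
vectors = np.random.rand(100_000, dim).astype("float32")   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(vectors), M=16, ef_construction=100)
index.add_items(vectors, ids=np.arange(len(vectors)))       # incremental inserts are supported

index.set_ef(50)                                            # ef_search: recall vs. latency knob
labels, distances = index.knn_query(vectors[:1], k=10)      # top-10 approximate neighbors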

Characteristics:

| Property | Value |
|---|---|
| Build time | Minutes (incremental) |
| Search time | O(log n × d) |
| Memory overhead | High (graph structure) |
| Accuracy | 97-99% recall |
| Best for | 100K-10M documents |

When HNSW is Best:

  1. Large datasets: 100K-10M documents
  2. Real-time search: Sub-50ms query latency
  3. Dynamic data: Incremental updates supported
  4. High recall required: 97-99% accuracy acceptable

Why HNSW is Fast:

  1. Logarithmic search: Skip list structure enables O(log n) search
  2. Greedy navigation: Always move closer to target
  3. Layered approach: Coarse-to-fine search reduces distance computations
  4. Probabilistic sampling: Only small fraction of points in higher layers

3.3.4 Algorithm Comparison

Comprehensive Comparison Table:

| Algorithm | Build Time | Search Time | Memory | Accuracy | Best For | Limitations |
|---|---|---|---|---|---|---|
| Flat | 0s | O(n×d) = 10-30s | Low | 100% | < 1K docs, accuracy-critical | Doesn't scale |
| IVF | Minutes | O(√n) = 100-500ms | Medium | 92-95% | 1K-100K docs | Poor for dynamic data |
| HNSW | Minutes | O(log n) = 10-50ms | High | 97-99% | 100K-10M docs | High memory overhead |
| PQ-HNSW | Hours | O(log n) = 5-20ms | Medium | 90-95% | 10M+ docs | Accuracy loss, complex setup |

Decision Tree:

Dataset size?
├─ < 1K → Flat (no index overhead, 100% accuracy)
├─ 1K-100K → IVF (balanced speed/accuracy, simpler than HNSW)
├─ 100K-10M → HNSW (best performance, industry standard)
└─ 10M+ → PQ-HNSW (quantization for memory efficiency)

Memory constrained?
├─ Yes → IVF or PQ-HNSW
└─ No → HNSW (best performance)

Dynamic data (frequent updates)?
├─ Yes → HNSW (supports incremental updates)
└─ No → IVF (simpler, good for batch workloads)

Accuracy critical (legal/medical)?
├─ Yes → Flat (100% accuracy) or high-ef HNSW
└─ No → IVF or HNSW (trade accuracy for speed)
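
The same decision logic can be written as a small helper; the function name and thresholds below are illustrative, taken from the trees above:

def choose_index(n_docs, memory_constrained=False, dynamic_data=False):
    # Thresholds follow the chapter's rules of thumb from the decision trees above.
    if n_docs < 1_000:
        return "flat"                        # no index overhead, 100% accuracy
    if n_docs < 100_000:
        return "hnsw" if dynamic_data else "ivf"
    if n_docs < 10_000_000:
        return "ivf" if memory_constrained else "hnsw"
    return "pq-hnsw"                         # quantize for memory efficiency

print(choose_index(500_000))                 # "hnsw"
print(choose_index(50_000_000))              # "pq-hnsw"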

3.4 Embedding Model Selection

Model Comparison

| Model | Dimensions | Speed | Quality | Cost | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Fast | ⭐⭐⭐⭐⭐ | $0.02/1M tokens | General purpose, cost-sensitive |
| OpenAI text-embedding-3-large | 3072 | Medium | ⭐⭐⭐⭐⭐ | $0.13/1M tokens | High accuracy required |
| BGE-M3 | 1024 | Fast | ⭐⭐⭐⭐ | Free (self-hosted) | Chinese, multilingual, cost-sensitive |
| BGE-Large-EN | 1024 | Fast | ⭐⭐⭐⭐ | Free (self-hosted) | English-only, cost-sensitive |
| Cohere embed-v3 | 1024 | Fast | ⭐⭐⭐⭐⭐ | $0.10/1M tokens | Hybrid retrieval, reranking |

Selection Guide

Decision Tree:

Is cost a major concern?
├─ Yes → Use BGE-M3 (free, self-hosted)
└─ No → Is accuracy critical?
├─ Yes → Use OpenAI text-embedding-3-large
└─ No → Use OpenAI text-embedding-3-small (best value)

Is Chinese/multilingual content?
├─ Yes → Use BGE-M3 (optimized for multilingual)
└─ No → Use OpenAI models (better English performance)

Need hybrid retrieval (dense + sparse)?
├─ Yes → Use Cohere embed-v3 (built-in sparse embeddings)
└─ No → Use OpenAI or BGE models

Dimensionality Trade-offs

Cost Analysis (for 1M documents, 500 tokens each):

| Dimensions | Storage (GB) | Memory (GB) | Search Time | Cost |
|---|---|---|---|---|
| 256 | 1 GB | 2 GB | ~5ms | $2.50 |
| 768 | 3 GB | 6 GB | ~10ms | $5.00 |
| 1536 | 6 GB | 12 GB | ~20ms | $10.00 |
| 3072 | 12 GB | 24 GB | ~40ms | $65.00 |
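
The storage column follows directly from the raw vector size (float32 uses 4 bytes per dimension); a quick back-of-the-envelope check, with the helper name being illustrative:

def raw_vector_storage_gb(n_docs, dims, bytes_per_dim=4):
    # float32 vectors only; index structures and metadata add overhead on top
    return n_docs * dims * bytes_per_dim / 1e9

print(raw_vector_storage_gb(1_000_000, 1536))   # ≈ 6.1 GB, matching the 1536-dim row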

3.5 Advanced Indexing Strategies

3.5.1 Product Quantization (PQ)

What is Product Quantization?

PQ compresses high-dimensional vectors into short codes by partitioning dimensions and quantizing each partition separately.

How PQ Works:

1. Partition vector into sub-vectors (e.g., 1536 dims → 8 sub-vectors of 192 dims)

2. Train k-means on each sub-vector (typically 256 clusters per sub-vector)

3. For each vector:
- Assign each sub-vector to nearest cluster center
- Store cluster ID (8 bits) instead of full sub-vector

4. Result: 1536-dim vector (6 KB) → 8-byte code (99.9% compression)
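
A minimal sketch of these steps with FAISS (an assumed library choice), using M = 8 sub-vectors at 8 bits each and stand-in data:

import faiss
import numpy as np

d = 1536
vectors = np.random.rand(100_000, d).astype("float32")   # stand-in for real embeddings

index = faiss.IndexPQ(d, 8, 8)        # 8 sub-vectors × 8 bits = 8 bytes per vector
index.train(vectors)                  # learns 256 centroids per sub-vector (k-means)
index.add(vectors)

distances, ids = index.search(vectors[:1], 5)   # search on the compressed codes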

PQ Parameters:

| Parameter | Description | Typical Value | Effect |
|---|---|---|---|
| M | Number of sub-vectors | 8-64 | Higher = better accuracy, slower search |
| nbits | Bits per sub-vector | 8 | Higher = more clusters, better accuracy |

Characteristics:

| Property | Value |
|---|---|
| Compression ratio | 90-99% |
| Memory savings | 10-100x |
| Accuracy loss | 2-5% |
| Build time | Hours (requires training) |
| Best for | 10M+ documents, memory-constrained |

When to Use PQ:

  1. Very large datasets: 10M+ documents where memory is limiting
  2. Memory-constrained environments: Edge devices, small servers
  3. Cost-sensitive: Reduce cloud storage costs

Trade-offs:

| Pro | Con |
|---|---|
| 10-100x memory reduction | 2-5% accuracy loss |
| Faster search (less data to load) | Complex training process |
| Lower storage costs | Longer build time |

3.5.2 DiskANN

What is DiskANN?

DiskANN enables vector search on datasets larger than memory by storing the index on disk and loading pages on-demand.

How DiskANN Works:

Characteristics:

| Property | Value |
|---|---|
| Max dataset size | Billions of vectors (limited by disk, not RAM) |
| Search time | 100-500ms (disk I/O bottleneck) |
| Memory requirement | 10-20% of dataset size (cache) |
| Accuracy | 90-95% |
| Best for | 100M+ document collections |

When to Use DiskANN:

  1. Massive datasets: 100M+ documents that don't fit in memory
  2. Archival search: Latency tolerance (100-500ms acceptable)
  3. Cost-sensitive: SSD is cheaper than RAM

Trade-offs:

| Pro | Con |
|---|---|
| Scale to billions of vectors | 10-100x slower than in-memory |
| Lower hardware costs (SSD vs RAM) | Complex setup and optimization |
| Scales horizontally (distributed disk) | SSD wear and tear |

3.5.3 Composite Indexes

What are Composite Indexes?

Composite indexes combine multiple indexes for different aspects of documents (semantic, keyword, metadata).

Architecture:

Fusion Strategies:

| Strategy | Description | When to Use |
|---|---|---|
| Reciprocal Rank Fusion (RRF) | Combine ranked lists | General purpose |
| Weighted fusion | Weight semantic + keyword | Known optimal weights |
| Cascade | Keyword filter → semantic rerank | Keyword-heavy queries |
| Learned fusion | ML model to combine results | Large-scale production |

Benefits:

  1. Better recall: Capture semantic and lexical matches
  2. Filtered search: Metadata filtering with vector search
  3. Flexibility: Optimize each index independently

When to Use Composite Indexes:

  1. Hybrid retrieval: Need both semantic and keyword search
  2. Metadata filtering: Category, date, author filters
  3. Multi-modal: Text, image, audio search
  4. Reranking: Cheap pre-filter → expensive rerank

3.5.4 Multi-Vector Indexing

What is Multi-Vector Indexing?

Store multiple vector representations per document for different query types.

Example Architecture:

Document: "Machine Learning with TensorFlow"

Vectors stored:
├─ Dense embedding (1536 dims): Semantic meaning
├─ Sparse embedding (10K dims): Keywords, entities
├─ Summary embedding (1536 dims): Document overview
├─ Title embedding (1536 dims): Title-specific
└─ Section embeddings (N × 1536): Per-section

Query Processing:

Query: "How to use TensorFlow for ML?"

1. Generate query embedding
2. Search all vector types in parallel
3. Fuse results:
- Title match: 2.0× weight (high relevance)
- Summary match: 1.5× weight (good overview)
- Section match: 1.2× weight (specific detail)
- Dense match: 1.0× weight (semantic)
- Sparse match: 0.8× weight (keyword)
4. Return fused ranked results
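
A minimal sketch of the weighted fusion step above; the function name and the shape of results_by_index are illustrative, and the per-index searches are assumed to have already produced document-to-score maps:

def fuse_weighted(results_by_index, weights):
    # results_by_index: {"title": {doc_id: score, ...}, "summary": {...}, ...}
    fused = {}
    for index_name, results in results_by_index.items():
        w = weights.get(index_name, 1.0)           # default weight for unlisted indexes
        for doc_id, score in results.items():
            fused[doc_id] = fused.get(doc_id, 0.0) + w * score
    return sorted(fused, key=fused.get, reverse=True)

weights = {"title": 2.0, "summary": 1.5, "section": 1.2, "dense": 1.0, "sparse": 0.8}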

Benefits:

| Benefit | Description |
|---|---|
| Better recall | Capture different aspects of relevance |
| Query-type awareness | Optimize for different query patterns |
| Granular search | Title vs. content vs. section search |

Trade-offs:

| Pro | Con |
|---|---|
| 2-5x better recall | 2-5x storage cost |
| Flexible retrieval | Complex query orchestration |
| Better user experience | Slower queries (multiple indexes) |

When to Use Multi-Vector Indexing:

  1. Long documents: Single embedding insufficient
  2. Structured documents: Title, abstract, sections
  3. Multiple query types: Navigational vs. informational
  4. High recall required: Enterprise search, e-commerce

3.6 Index Optimization Techniques

3.6.1 Quantization Methods

Scalar Quantization

Convert float32 vectors to int8 by dividing each dimension by a scale factor and rounding to the nearest integer.

Float32 vector: [0.234, -0.567, 0.891, ...]
Quantization factor: 0.01
Int8 vector: [23, -57, 89, ...]

Memory reduction: 75% (4 bytes → 1 byte per dimension)
Accuracy loss: 1-3%
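
A minimal NumPy sketch of this scalar quantization; the 0.01 factor matches the example above, while real systems derive the scale from the observed value range:

import numpy as np

def quantize_int8(vectors, scale=0.01):
    # divide by the scale factor, round, and clip to the int8 range
    return np.clip(np.round(vectors / scale), -128, 127).astype(np.int8)

def dequantize(codes, scale=0.01):
    return codes.astype(np.float32) * scale   # approximate reconstruction

v = np.array([0.234, -0.567, 0.891], dtype=np.float32)
print(quantize_int8(v))    # [ 23 -57  89]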

Scalar Quantization Parameters:

| Parameter | Description | Effect |
|---|---|---|
| Quantization range | Min/max values to quantize | Larger range = more precision loss |
| Uniform vs. non-uniform | Spacing of quantization levels | Non-uniform better for skewed data |

Product Quantization (PQ)

Compress vectors by partitioning and clustering each partition (see 3.5.1).

Comparison:

| Method | Memory Reduction | Accuracy Loss | Speed | Complexity |
|---|---|---|---|---|
| Scalar (int8) | 75% | 1-3% | 2x faster | Low |
| Product (PQ) | 90%+ | 2-5% | 4x faster | High |
| Binary | 97% | 5-10% | 10x faster | Medium |

When to Use Each:

Dataset size?
├─ < 10M → Scalar quantization (good balance)
├─ 10M-100M → Product quantization (memory savings)
└─ 100M+ → Binary + PQ (extreme compression)

Accuracy sensitivity?
├─ High → Scalar (minimal loss)
├─ Medium → PQ (acceptable loss)
└─ Low → Binary (max compression)

3.6.2 Index Partitioning Strategies

Category-Based Partitioning

Create separate indexes per document category.

Benefits:

  • 3-10x faster filtered search (search only relevant partition)
  • Parallel indexing (build partitions independently)
  • Easier maintenance (rebuild single partition)

When to Use:

  1. Clear category boundaries: Tech, Legal, Medical, etc.
  2. Category filters in queries: Most queries specify category
  3. Uneven query distribution: 80% of queries target 20% of categories

Time-Based Partitioning

Create indexes per time period (day, week, month).

Index Structure:
├─ 2024-01 Index (Jan 2024 documents)
├─ 2024-02 Index (Feb 2024 documents)
├─ ...
└─ 2024-12 Index (Dec 2024 documents)

Query: "Recent news about AI"
→ Search only last 3 partitions (3x faster)

Benefits:

  • Faster recent-document searches (most queries target recent content)
  • Time travel queries (search specific time periods)
  • Easier archival (delete old partitions)

When to Use:

  1. Temporal queries: Most queries target recent documents
  2. Data retention policies: Delete old data regularly
  3. Time-series analysis: Trend detection over time

Sharding

Distribute index across multiple servers.

Sharding Strategies:

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Hash-based | Hash document ID to shard | Even distribution | Cross-shard queries hard |
| Range-based | Document ID ranges to shard | Range queries easy | Uneven load |
| Category-based | Category to shard mapping | Targeted queries | Load imbalance |
| Geographic | User location to shard | Low latency | Complex routing |

When to Use Sharding:

  1. Single-server limit: Dataset doesn't fit on one machine
  2. Query throughput: Need to handle 1000+ QPS
  3. Fault tolerance: Replicate shards across servers

3.6.3 Index Maintenance

Incremental Updates

Add new documents to existing index without full rebuild.

| Approach | Description | When to Use |
|---|---|---|
| HNSW incremental | Add points to graph | Dynamic data, < 10% daily growth |
| IVF append | Add to clusters | Batch updates, < 5% daily growth |
| Full rebuild | Rebuild from scratch | High churn, > 10% daily growth |

Trade-offs:

| Approach | Speed | Accuracy | Complexity |
|---|---|---|---|
| Incremental | Fast (ms per doc) | Degrades over time | Medium |
| Periodic rebuild | Slow (hours) | Always optimal | Low |

Index Rebuilding

Periodic full rebuild to maintain index quality.

Rebuild Frequency Guidelines:

| Update Rate | Rebuild Frequency | Rationale |
|---|---|---|
| < 1% daily | Monthly | Negligible degradation |
| 1-5% daily | Weekly | Balance cost and quality |
| 5-10% daily | Daily | Maintain optimal performance |
| > 10% daily | Continuous streaming | Real-time indexing |

Deletion Strategies

| Strategy | Description | Speed | Consistency | Best For |
|---|---|---|---|---|
| Lazy deletion | Mark deleted, rebuild later | Fast | Eventually consistent | High-churn datasets |
| Immediate deletion | Remove from index | Slow | Strongly consistent | Low-churn datasets |
| Soft deletion | Keep tombstone, filter queries | Fast | Query-time filtering | Compliance, audit trails |

3.7 Vector Database Architecture

3.7.1 Core Components

3.7.2 Storage Engine Comparison

| Database | Storage Engine | Index Support | ACID | Metadata | Best For |
|---|---|---|---|---|---|
| PgVector | PostgreSQL | HNSW, IVF | ✅ | Native SQL | < 10M docs, SQL-heavy workloads |
| Milvus | etcd + MinIO/S3 | HNSW, IVF, DiskANN | ❌ | Built-in | 1M+ docs, scalable |
| Pinecone | Proprietary | HNSW, PQ | ❌ | Limited | Production, managed service |
| Qdrant | Rust + etcd | HNSW, hybrid | ✅ | Native | Hybrid search, real-time |
| Weaviate | Go + BoltDB | HNSW | — | Graph-based | GraphQL APIs |

PgVector Internals:

Storage Format:
├─ Vectors stored in a dedicated vector(n) column type
├─ HNSW: Graph edges in separate table
└─ Metadata: Native PostgreSQL columns

Query Execution:
SELECT id, content, embedding <-> query_vector AS distance
FROM documents
WHERE category = 'tech'
ORDER BY distance
LIMIT 10;

Flow:
1. Filter by category (B-tree index)
2. Compute distance for filtered docs (HNSW index)
3. Sort by distance
4. Return top-10

Milvus Internals:

Storage Format:
├─ etcd: Metadata (schema, index stats)
├─ MinIO/S3: Vector data (persistent storage)
└─ Memory: Loaded index segments

Query Execution:
1. Proxy receives query
2. Query router identifies relevant shards
3. Query nodes search local index segments
4. Results aggregated and returned

3.7.3 Distributed Indexing

Problem: Single-server indexing doesn't scale

Solution: Distributed indexing across cluster

Architecture Patterns:

1. Data Parallelism (Most Common):

Documents split across nodes:
├─ Node 1: 25% of documents (independent index)
├─ Node 2: 25% of documents (independent index)
├─ Node 3: 25% of documents (independent index)
└─ Node 4: 25% of documents (independent index)

Query flow:
1. Broadcast query to all nodes
2. Each node searches its local index
3. Merge results from all nodes
4. Return global top-K

Trade-off: Network overhead, but scales horizontally

2. Index Parallelism:

All nodes see all documents:
├─ Node 1: HNSW index (all docs)
├─ Node 2: IVF index (all docs)
├─ Node 3: PQ index (all docs)
└─ Node 4: Flat index (all docs)

Query flow:
1. Route to appropriate node (by index type)
2. Search single index
3. Return results

Trade-off: Complex coordination, high memory per node

3. Hybrid Approach (Production Standard):

Shard by metadata category:
├─ Node 1: Tech category index (all tech docs)
├─ Node 2: Legal category index (all legal docs)
├─ Node 3: Medical category index (all medical docs)
└─ Node 4: Finance category index (all finance docs)

Query flow:
1. Classify query category
2. Route to relevant node(s)
3. Parallel search on targeted nodes
4. Merge results

Trade-off: Best balance, requires category classification

3.7.4 Query Execution Flow

User Query: "How do I configure Redis for production?"

Query Parser: Parse DSL/SQL
├─ Extract: "configure", "Redis", "production"
├─ Classify: Tech category
└─ Detect: Keyword-heavy query

Query Planner: Optimize search strategy
├─ Metadata filtering: category=tech, year>=2023
├─ Index selection:
│ ├─ Keyword index (BM25): "configure", "Redis"
│ ├─ Semantic index (HNSW): Query embedding
│ └─ Metadata index (B-tree): category=tech
└─ Parallel execution plan:
├─ Thread 1: Search keyword index
├─ Thread 2: Search semantic index
└─ Thread 3: Filter by metadata

Executor: Parallel search
├─ Thread 1 (Keyword): Returns [doc42, doc87, doc153, ...]
├─ Thread 2 (Semantic): Returns [doc87, doc153, doc291, ...]
└─ Thread 3 (Metadata): Filters to tech docs published 2023+

Result Fusion: Reciprocal Rank Fusion (RRF) over the two ranked lists (the metadata thread acts as a filter)
├─ doc87: 1/2 (keyword rank 2) + 1/1 (semantic rank 1) = 1.50 (top rank)
├─ doc42: 1/1 (keyword rank 1) + 0 (absent from semantic list) = 1.00 (second rank)
└─ doc153: 1/3 (keyword rank 3) + 1/2 (semantic rank 2) = 0.83 (third rank)

Return Top-10 documents to user
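
A minimal sketch of reciprocal rank fusion; production systems usually add a smoothing constant k (commonly 60) in the denominator, while the worked example above uses the simplified 1/rank form (k = 0):

from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=0):
    # ranked_lists: list of ranked document-ID lists, best first
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["doc42", "doc87", "doc153"]
semantic = ["doc87", "doc153", "doc291"]
print(reciprocal_rank_fusion([keyword, semantic]))   # ['doc87', 'doc42', 'doc153', 'doc291']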

3.7.5 Update Mechanisms

| Database | Insert | Update | Delete | Impact |
|---|---|---|---|---|
| PgVector | Fast | Slow (rebuild needed) | Slow (mark dirty) | Best for append-only |
| Milvus | Fast | Fast (in-place) | Fast (in-place) | Best for dynamic data |
| Pinecone | Fast | Fast (upsert) | Fast (upsert) | Best for production |
| Qdrant | Fast | Fast (in-place) | Fast (in-place) | Best for real-time |

Update Strategies:

PgVector (Append-Only):
1. Insert new vectors: Fast (add to table, index incrementally)
2. Update existing: Slow (need to rebuild index)
3. Delete: Slow (mark deleted, rebuild later)

Milvus (Dynamic):
1. Insert new vectors: Fast (add to segment)
2. Update existing: Fast (in-place update in segment)
3. Delete: Fast (mark deleted in segment metadata)

Pinecone (Upsert-Optimized):
1. Insert/Upsert: Fast (upsert API)
2. Delete: Fast (delete API by ID)
3. Background: Automatic index optimization

3.7.6 Consistency and ACID

| Database | ACID Support | Transaction Support | Implications |
|---|---|---|---|
| PgVector | ✅ Full ACID | BEGIN/COMMIT | Strong consistency, slower writes |
| Milvus | ❌ No ACID | Batch operations | Eventually consistent, faster writes |
| Pinecone | ❌ No ACID | Atomic operations | Eventual consistency |
| Qdrant | ✅ Full ACID | Transactional API | Strong consistency with real-time features |

When ACID Matters:

Financial/Legal applications:
├─ Need strong consistency
├─ Use PgVector or Qdrant
└─ Accept slower writes for correctness

Real-time applications:
├─ Need fast writes, eventual consistency OK
├─ Use Milvus or Pinecone
└─ Accept brief inconsistency for speed

3.7.7 Scaling Strategies

Horizontal Scaling (Adding more nodes):

Replication (Copying data across nodes):

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Leader-follower | One leader writes, followers read | Simple, consistent | Leader bottleneck |
| Multi-leader | Any node can write | No bottleneck | Conflict resolution |
| Sharding + replica | Each shard replicated | Balanced | Complex setup |

Failover:

Node failure detection:
1. Health check: Ping all nodes every 5 seconds
2. Failure detection: Node unresponsive for 15 seconds
3. Re-routing: Redirect queries to replica nodes
4. Recovery: Rebuild failed node from replica
5. Re-balance: Redistribute data across cluster

3.8 Batch Embedding Generation

The Batch Processing Problem

Problem: Generating embeddings one-by-one is expensive and slow

| Approach | API Calls | Time | Cost (100K docs) |
|---|---|---|---|
| Individual | 100,000 | ~14 hours | $10.00 |
| Batch (100) | 1,000 | ~10 minutes | $10.00 |
| Batch (1000) | 100 | ~2 minutes | $10.00 |

Key Insight: Batch processing doesn't reduce cost (API pricing is per token), but provides 50-100x speed improvement.

Batch Size Optimization

Batch Size Guidelines:

| Document Count | Optimal Batch | Reason | Speed Improvement |
|---|---|---|---|
| < 1,000 | 50-100 | Balance efficiency vs memory | 50x |
| 1,000-10,000 | 100-200 | Maximize throughput | 100x |
| 10,000-100,000 | 200-500 | Minimize API overhead | 150x |
| 100,000+ | 500-1000 | Maximum efficiency | 200x |

Batch Processing Best Practices

| Practice | Why | Implementation |
|---|---|---|
| Automatic retry | API failures are common | Exponential backoff (3-5 retries) |
| Progress monitoring | Track long-running jobs | Log batch progress, ETA calculation |
| Cost tracking | Avoid surprise bills | Estimate cost before processing |
| Error resilience | One bad batch shouldn't fail all | Skip failed documents, continue processing |
| Rate limiting | Avoid API throttling | Control concurrent requests |
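
A minimal sketch tying these practices together, assuming the OpenAI Python SDK; the batch size, retry count, and model name are illustrative:

import time
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def embed_in_batches(texts, model="text-embedding-3-small", batch_size=200, max_retries=3):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                resp = client.embeddings.create(model=model, input=batch)
                vectors.extend(item.embedding for item in resp.data)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise                    # or skip the batch and continue, per the table above
                time.sleep(2 ** attempt)     # exponential backoff
        print(f"Embedded {min(start + batch_size, len(texts))}/{len(texts)} documents")
    return vectors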

3.9 Embedding Caching

Cache Architecture

Cache embeddings to avoid recomputing for identical or similar text:

Cache Strategy Comparison

| Strategy | Hit Rate | Memory | Complexity | Use Case |
|---|---|---|---|---|
| No Cache | 0% | 0 MB | Low | One-time indexing |
| LRU Cache (100K) | 20-40% | 6 GB | Low | General purpose |
| TTL Cache (7 days) | 30-50% | 6 GB | Low | Time-sensitive content |
| Persistent Cache | 50-80% | Disk | High | Large-scale, repetitive |
| Semantic Cache | 40-60% | 8 GB | High | Similar queries |

Cache Best Practices

| Practice | Implementation | Impact |
|---|---|---|
| Cache query embeddings | Hash query text, check cache first | 80% cost reduction for repeated queries |
| Cache document embeddings | Hash document content, check before generating | 20-40% cost reduction during re-indexing |
| Monitor hit rate | Track cache hits vs misses | Optimize cache size and TTL |
| Set appropriate TTL | 7 days for most content | Balance freshness and hit rate |
| Persistent cache | Store cache on disk for restarts | Avoid re-generation after restart |
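
A minimal in-memory sketch of the hash-then-check pattern above; embed_fn is a stand-in for whatever embedding call you use, and a production cache would add LRU eviction, a TTL, and persistence:

import hashlib

_cache = {}

def cached_embedding(text, embed_fn):
    # hash the exact text so identical strings hit the cache
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)      # only pay for the API call on a miss
    return _cache[key]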

3.10 Production Optimization

Performance Benchmarks

Index Size vs Document Count:

| Documents | Dimensions | Index Size | Memory | Build Time | Search Time |
|---|---|---|---|---|---|
| 10K | 1536 | 100 MB | 200 MB | 30s | 5ms |
| 100K | 1536 | 1 GB | 2 GB | 5min | 10ms |
| 1M | 1536 | 10 GB | 20 GB | 45min | 20ms |
| 10M | 1536 | 100 GB | 200 GB | 6hr | 40ms |

HNSW Parameter Tuning

Parameter Impact Visualization:

Tuning Guidelines:

| Use Case | M | ef_construction | ef_search | Expected Recall |
|---|---|---|---|---|
| Real-time (fast queries) | 12 | 80 | 20 | 90-92% |
| Balanced (default) | 16 | 100 | 50 | 95-97% |
| High accuracy | 24 | 150 | 100 | 97-99% |
| Maximum accuracy | 32 | 200 | 150 | 99%+ |

Cost Optimization Strategies

Optimization Strategy Details:

| Optimization | Description | When to Use | Trade-offs |
|---|---|---|---|
| Quantization | Float32 → INT8 vectors | Memory constrained | 2% accuracy loss, 75% memory savings |
| Dimensionality reduction | PCA to reduce dimensions | Very large corpora | 5% accuracy loss, 50% memory savings |
| Index partitioning | Separate indexes per category | Metadata-heavy queries | More complex queries |
| Hybrid search | Combine dense + sparse vectors | Keyword + semantic queries | 50% cost increase, 10% recall improvement |

Monitoring & Alerting

Key Metrics to Track:

  1. Index Health

    • Document count
    • Index size
    • Fragmentation level
  2. Query Performance

    • P50, P95, P99 latency
    • Recall rate (vs ground truth)
    • Timeout rate
  3. Cost Tracking

    • API call volume
    • Token usage
    • Cache hit rate

Alert Thresholds:

| Metric | Warning | Critical | Action |
|---|---|---|---|
| Query latency (P95) | > 100ms | > 500ms | Check ef_search, index health |
| Recall rate | < 90% | < 80% | Increase ef_search, check embedding quality |
| Cache hit rate | < 20% | < 10% | Review cache strategy, preload common queries |
| API error rate | > 5% | > 10% | Check rate limits, retry logic |
| Index fragmentation | > 50% overhead | > 100% overhead | Rebuild index, VACUUM ANALYZE |

3.11 Interview Q&A

Q1: Explain the curse of dimensionality problem

The curse of dimensionality refers to the phenomenon where distance metrics lose meaning in high-dimensional spaces.

Key points:

  1. Distance concentration: In high dimensions, all points become approximately equidistant
  2. Empty space phenomenon: Volume grows exponentially, data becomes sparse
  3. Implication for indexing: Traditional tree-based indexes become ineffective
  4. Solution: Approximate Nearest Neighbor (ANN) algorithms trade small accuracy loss for massive speed improvement

Example:

  • 2D space: Euclidean distance works well, points have clear nearest neighbors
  • 1000D space: Most points are roughly the same distance apart, making "nearest neighbor" less meaningful

Q2: How does HNSW actually work?

HNSW (Hierarchical Navigable Small World) builds a probabilistic skip list for logarithmic-time approximate search.

How it works:

  1. Layered structure:

    • Layer 0: All points (base layer)
    • Layer 1: Top 10% of points
    • Layer 2: Top 1% of points
    • Layer 3+: Top 0.1% of points
  2. Graph construction:

    • Each point connects to M nearest neighbors in its layer
    • Higher layers have fewer points but longer-range connections
    • Built incrementally, point by point
  3. Search algorithm:

    • Start at entry point in top layer
    • Greedy search: Move to closer point in current layer
    • When no closer point, move down to next layer
    • In base layer, exhaustive search of nearest neighbors
    • Return top-K candidates
  4. Why it's fast:

    • Logarithmic search: O(log n) due to layered structure
    • Coarse-to-fine: Higher layers navigate quickly, base layer refines
    • Probabilistic sampling: Only small fraction of points in higher layers

Key parameters:

  • M: Connections per point (16 default)
  • ef_construction: Candidates during build (100 default)
  • ef_search: Candidates during search (50 default)

Q3: When would you use IVF vs HNSW?

IVF (Inverted File):

  • Use when: Dataset is 1K-100K documents
  • Pros: Lower memory overhead, simpler implementation
  • Cons: Poor for dynamic data, requires clustering
  • Mechanism: Partitions space into Voronoi cells, searches only relevant cells

HNSW (Hierarchical Navigable Small World):

  • Use when: Dataset is 100K-10M documents
  • Pros: Faster search, supports incremental updates, higher recall
  • Cons: Higher memory overhead (graph structure)
  • Mechanism: Layered graph structure, logarithmic search

Decision factors:

  1. Dataset size: < 100K → IVF, > 100K → HNSW
  2. Update frequency: High churn → HNSW (supports incremental updates)
  3. Memory constraints: Limited memory → IVF
  4. Query latency: Sub-50ms required → HNSW

Q4: How do vector databases handle updates and deletions?

Update strategies vary by database:

  1. PgVector:

    • Insert: Fast (incremental)
    • Update: Slow (requires index rebuild)
    • Delete: Slow (mark deleted, rebuild later)
    • Best for: Append-only workloads
  2. Milvus:

    • Insert: Fast (add to segment)
    • Update: Fast (in-place update)
    • Delete: Fast (mark deleted in segment metadata)
    • Best for: Dynamic data
  3. Pinecone:

    • Insert/Upsert: Fast (upsert API)
    • Delete: Fast (delete API)
    • Best for: Production, real-time

Why updates are challenging:

  • HNSW: Removing a point requires updating graph edges (expensive)
  • IVF: Changing a point's cluster requires re-clustering
  • PQ: Updating requires recomputing quantization codes

Common strategies:

  • Lazy deletion: Mark deleted, rebuild periodically
  • Soft deletion: Keep tombstone, filter at query time
  • Immediate deletion: Remove and update graph (complex)

Q5: What are the trade-offs between different vector databases?

PgVector:

  • Pros: SQL queries, ACID transactions, metadata filtering, open-source
  • Cons: Slower than dedicated databases, limited scalability
  • Best for: < 10M docs, SQL-heavy workloads, need metadata filtering

Milvus:

  • Pros: Highly scalable, feature-rich, supports multiple indexes
  • Cons: Complex setup, no ACID, eventually consistent
  • Best for: 1M+ docs, self-hosted, scalable architectures

Pinecone:

  • Pros: Fully managed, auto-scaling, simple API
  • Cons: Expensive, vendor lock-in, limited metadata
  • Best for: Production, no ops team, rapid development

Qdrant:

  • Pros: Hybrid search, ACID transactions, fast
  • Cons: Newer ecosystem, less mature
  • Best for: Hybrid retrieval, real-time applications

Decision factors:

  1. Team expertise: SQL skills → PgVector
  2. Scalability needs: < 10M → PgVector, > 10M → Milvus/Pinecone
  3. Consistency requirements: ACID needed → PgVector/Qdrant
  4. Ops capacity: No ops → Pinecone, have ops → Milvus/Qdrant

Q6: Explain product quantization and when to use it

Product Quantization (PQ) compresses vectors by partitioning dimensions and quantizing each partition.

How it works:

  1. Partition vector into M sub-vectors (e.g., 1536 dims → 8 sub-vectors of 192 dims)
  2. Train k-means on each sub-vector (typically 256 clusters)
  3. For each vector, assign each sub-vector to nearest cluster center
  4. Store cluster IDs (8 bits each) instead of full vectors
  5. Result: 1536-dim vector (6 KB) → 8-byte code (99.9% compression)

When to use PQ:

  • Dataset size: 10M+ documents
  • Memory constrained: Can't fit full vectors in RAM
  • Cost-sensitive: Reduce cloud storage costs
  • Accuracy tolerance: Accept 2-5% accuracy loss

Trade-offs:

  • Pros: 10-100x memory reduction, faster search (less data to load)
  • Cons: 2-5% accuracy loss, complex training (hours to days), longer build time

Alternatives:

  • Scalar quantization: Simpler, less compression (75%), less accuracy loss (1-3%)
  • Binary quantization: Extreme compression (97%), higher accuracy loss (5-10%)

Q7: How do you optimize vector search performance?

Optimization strategies:

  1. Index algorithm selection:

    • < 1K docs: Flat (no index)
    • 1K-100K docs: IVF
    • 100K-10M docs: HNSW
    • 10M+ docs: PQ-HNSW
  2. Quantization:

    • Scalar (int8): 75% memory reduction, 1-3% accuracy loss
    • Product (PQ): 90%+ memory reduction, 2-5% accuracy loss
    • Use when: Memory constrained or very large datasets
  3. HNSW parameter tuning:

    • M (connections): Higher = better recall, more memory
    • ef_search (search candidates): Higher = better recall, slower search
    • Real-time: M=16, ef_search=20
    • Balanced: M=16, ef_search=50
    • Accuracy-critical: M=32, ef_search=100
  4. Index partitioning:

    • Category-based: Separate indexes per category
    • Time-based: Partition by creation date
    • Sharding: Distribute across servers
  5. Caching:

    • LRU cache for query embeddings: 20-40% hit rate
    • Preload common queries at startup
    • Persistent cache for restart resilience
  6. Batch processing:

    • Batch size: 100-500 for optimal throughput
    • Reduces API overhead 50-200x

Monitoring:

  • Track query latency (P50, P95, P99)
  • Monitor recall rate vs ground truth
  • Alert on degraded performance
  • Rebuild index periodically (weekly to monthly)

Summary

Key Takeaways

1. Index Fundamentals:

  • Indexes accelerate search by avoiding full scans
  • Vector indexes differ from traditional database indexes (similarity vs. exact match)
  • Curse of dimensionality makes high-dimensional search challenging
  • ANN algorithms trade small accuracy loss for massive speed improvement

2. Index Algorithms:

  • Flat (Linear Scan): 100% accuracy, < 1K docs
  • IVF: 92-95% accuracy, 1K-100K docs
  • HNSW: 97-99% accuracy, 100K-10M docs (industry standard)
  • PQ-HNSW: 90-95% accuracy, 10M+ docs

3. Advanced Strategies:

  • Product Quantization: 90%+ memory reduction, 2-5% accuracy loss
  • DiskANN: Scale to billions of vectors (disk-based)
  • Composite Indexes: Hybrid semantic + keyword + metadata
  • Multi-Vector: Multiple embeddings per document

4. Optimization Techniques:

  • Quantization: Scalar (75% reduction), Product (90%+ reduction)
  • Partitioning: Category-based, time-based, sharding
  • Maintenance: Incremental updates, periodic rebuilds
  • Deletion: Lazy (eventual), immediate (strong consistency)

5. Vector Database Architecture:

  • Components: Ingestion, storage, query execution
  • Storage engines: PgVector (SQL), Milvus (scalable), Pinecone (managed)
  • Distributed indexing: Data parallelism (most common)
  • Consistency: ACID (PgVector, Qdrant) vs. eventual (Milvus, Pinecone)

6. Production Best Practices:

  • HNSW tuning: M=16, ef_search=50 for balanced systems
  • Batch size: 100-500 for optimal throughput
  • Caching: LRU with 7-day TTL, 20-40% hit rate
  • Monitoring: Latency, recall, cost metrics with alerting

Best Practices Checklist

  • Select index algorithm based on dataset size and query patterns
  • Implement HNSW with optimized parameters (M=16, ef_search=50)
  • Add caching layer for query embeddings (LRU, 100K entries, 7-day TTL)
  • Implement batch generation with retry logic (batch size: 100-500)
  • Set up monitoring for index health and query performance
  • Implement alerting for degraded performance
  • Consider quantization for memory-constrained environments
  • Partition indexes by category or time for filtered searches
  • Schedule periodic index rebuilds (weekly to monthly)
  • Use composite indexes for hybrid retrieval needs


Next Steps:

  • 📖 Read Retrieval Strategies for search optimization
  • 📖 Read Data Processing for chunking strategies
  • 💻 Implement batch embedding generation with caching
  • 🔧 Tune HNSW parameters for your use case
  • 📊 Set up monitoring for your vector index