3. Vector Indexing & Storage
"Embeddings are the bridge between text and vector search. Storage optimization is the key to production RAG systems." — RAG Infrastructure Principle
This chapter covers embedding model selection, batch generation strategies, caching techniques, index algorithm principles, advanced indexing strategies, vector database architecture, and production optimization for RAG systems.
3.1 Understanding Embeddings in RAG
The Role of Embeddings
Embeddings are the bridge between text and vector search. They convert semantic meaning into numerical vectors that can be compared mathematically.
Key concepts:
- Vector Space: Similar concepts are close together in high-dimensional space
- Dimensionality: Higher dimensions capture more nuance but cost more
- Model Selection: Different models optimized for different use cases
- Batch Processing: Generate embeddings in batches to reduce API calls
- Caching: Cache embeddings to avoid recomputation
Why Vector Indexing Matters
The Indexing Advantage (illustrative search times on one large corpus, roughly a million 1536-dimensional vectors; the Use Case column gives the range where each approach is the sensible default):
| Approach | Search Time | Accuracy | Memory | Use Case |
|---|---|---|---|---|
| Linear Scan | 10-30s | 100% | Low | < 1K documents |
| IVF (Inverted File) | 100-500ms | 95% | Medium | 1K-100K docs |
| HNSW (Hierarchical Navigable Small World) | 10-50ms | 98% | High | 100K-10M docs |
| Quantization | 5-20ms | 90% | Very Low | 10M+ docs |
3.2 Index Fundamentals
3.2.1 What is an Index?
An index is a data structure that accelerates data retrieval by avoiding full table scans.
Book Index Analogy:
- Without index: Read every page to find "machine learning"
- With index: Look up "machine learning" in index → jump to pages 42, 87, 134
- Result: 100x faster lookup
Database Index vs. Vector Index:
| Aspect | Traditional Database (B-tree) | Vector Index (HNSW/IVF) |
|---|---|---|
| Query type | Exact match (WHERE id = 42) | Similarity (closest vectors) |
| Structure | Balanced tree | Graph or clustering |
| Complexity | O(log n) lookup | O(log n) approximate search |
| Use case | Structured data | Unstructured semantic search |
3.2.2 The Vector Search Challenge
The Curse of Dimensionality:
As dimensions increase, distance metrics lose meaning and search becomes computationally intractable.
Why Traditional Indexes Fail for Vectors:
- B-tree doesn't support similarity search: B-tree organizes data by sorted keys, not spatial proximity
- High-dimensional space is sparse: Tree structures become inefficient as dimensionality increases
- Distance computation is expensive: Calculating cosine similarity across 1536 dimensions is costly
Solution: Approximate Nearest Neighbor (ANN) algorithms trade small accuracy loss for massive speed improvement.
3.2.3 Index Types for Vectors
Exact vs. Approximate Search:
| Index Type | Search Method | Time Complexity | Accuracy | When to Use |
|---|---|---|---|---|
| Flat (Linear Scan) | Compare query with every vector | O(n × d) | 100% | < 1K documents |
| IVF (Inverted File) | Search only relevant Voronoi cells | O(√n × d) | 92-95% | 1K-100K documents |
| HNSW (Navigable Small World Graph) | Navigate layered proximity graph | O(log n × d) | 97-99% | 100K-10M documents |
| Quantized Index | Compressed vectors + ANN | O(log n × d/4) | 90-95% | 10M+ documents |
Accuracy vs. Speed Trade-offs: moving down the table trades a few points of recall for one or more orders of magnitude in query latency. Pick the cheapest index whose recall your application can tolerate.
3.3 Index Algorithm Principles
3.3.1 Linear Scan (Flat Index)
How It Works:
Compare query vector with every vector in the dataset, return top-K closest.
For each document in database:
    score = cosine_similarity(query, document.embedding)
    keep (document, score) in a top-K heap ordered by score
Return the K documents with the highest scores
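A runnable version of this pseudocode is a few lines of NumPy. This is a sketch with illustrative names, using the fact that on unit-normalized vectors the dot product equals cosine similarity.

```python
import numpy as np

def linear_scan_top_k(query, docs, k=5):
    """Exact top-K search: compare the query against every stored vector.
    docs: (n, d) array of embeddings; query: (d,) array."""
    # Normalize so the dot product equals cosine similarity.
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = docs_n @ q_n             # O(n * d): one score per document
    top = np.argsort(-sims)[:k]     # indices of the k highest scores
    return top, sims[top]
```

For a few thousand vectors this single matrix-vector product is typically faster than any index, which is why the flat approach wins at small scale.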
Characteristics:
| Property | Value |
|---|---|
| Build time | 0s (no index structure) |
| Search time | O(n × d) where n=docs, d=dimensions |
| Memory overhead | 0% (just store vectors) |
| Accuracy | 100% (exact search) |
| Best for | < 1K documents |
When Linear Scan is Best:
- Small datasets: < 1K documents, index overhead isn't worth it
- Accuracy-critical applications: Legal, medical, financial where 2% error is unacceptable
- Dynamic data: Frequent updates make index maintenance expensive
- Batch processing: Overnight indexing where speed isn't critical
Performance Example:
- 1K documents × 1536 dimensions
- Search time: ~10ms
- Memory: 6 MB for vectors only
- No build time
3.3.2 IVF (Inverted File)
How IVF Works:
IVF partitions vector space into Voronoi cells using clustering (typically k-means).
Voronoi Cell Diagram:
  Cluster 1        Cluster 2        Cluster 3
(Centroid C1)    (Centroid C2)    (Centroid C3)
   *  *  *          *  *  *          *  *  *
  *   C1  *        *   C2  *        *   C3  *
   *  *  *          *  *  *          *  *  *
Query Q is closest to C2 → Search only Cluster 2
IVF Parameters:
| Parameter | Description | Default | Effect |
|---|---|---|---|
| nlist | Number of clusters | 1000 | Higher = better precision, slower search |
| nprobe | Clusters to search | 10 | Higher = better recall, slower search |
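The build/search split can be sketched end to end. This is a toy illustration only, with a hand-rolled k-means and small nlist; a production system would use a library implementation such as Faiss's IndexIVFFlat.

```python
import numpy as np

def build_ivf(docs, nlist=4, iters=10, seed=0):
    """Toy IVF build: k-means the vectors into nlist Voronoi cells."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), nlist, replace=False)]
    for _ in range(iters):
        assign = np.argmin(
            np.linalg.norm(docs[:, None] - centroids[None], axis=2), axis=1)
        for c in range(nlist):
            if np.any(assign == c):
                centroids[c] = docs[assign == c].mean(axis=0)
    # Final assignment -> one inverted list of document ids per cell.
    assign = np.argmin(
        np.linalg.norm(docs[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}
    return centroids, lists

def ivf_search(query, docs, centroids, lists, nprobe=2, k=3):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in cells])
    dists = np.linalg.norm(docs[cand] - query, axis=1)
    return cand[np.argsort(dists)[:k]]
```

Note how nprobe directly encodes the recall/speed trade-off: searching more cells catches documents that sit near cluster boundaries, at the cost of scanning more candidates.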
Characteristics:
| Property | Value |
|---|---|
| Build time | Minutes (requires clustering) |
| Search time | O(√n × d) with nprobe=10 |
| Memory overhead | Medium (cluster assignments) |
| Accuracy | 92-95% recall |
| Best for | 1K-100K documents |
When IVF is Best:
- Medium datasets: 1K-100K documents
- Limited memory: Lower memory overhead than HNSW
- Batch updates: Can rebuild index periodically
Limitations:
- Requires training: Need to run k-means before indexing
- Poor for dynamic data: Updating index is expensive
- Cluster boundary issues: Documents near cluster edges may be missed
3.3.3 HNSW (Hierarchical Navigable Small World)
How HNSW Works:
HNSW builds a probabilistic skip list (layered graph) for logarithmic-time search.
Graph Construction Algorithm:
For each new point:
1. Determine point's level (probabilistic)
- Level 0: 100% of points
- Level 1: 10% of points
- Level 2: 1% of points
- Level 3+: 0.1% of points
2. Find entry point (top-level point)
3. For each level from top down:
a. Greedy search for the nearest neighbors at that level
b. Keep a candidate pool of ef_construction points; select the best M as neighbors
c. Create bidirectional connections between the new point and its neighbors
4. In base layer (Level 0):
a. Find M nearest neighbors
b. Create bidirectional edges
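The probabilistic level draw in step 1 can be sketched directly. This uses the standard geometric scheme with level-scale mL = 1/ln(M), under which P(level ≥ l) = (1/M)^l; the 10% / 1% / 0.1% figures above correspond to M = 10.

```python
import math
import random

def sample_level(M, rng):
    """Geometric layer draw: P(level >= l) = (1/M)^l, via mL = 1/ln(M).
    (1 - rng.random()) lies in (0, 1], so the log is always defined."""
    mL = 1.0 / math.log(M)
    return int(-math.log(1.0 - rng.random()) * mL)

rng = random.Random(0)
levels = [sample_level(16, rng) for _ in range(100_000)]
```

With M = 16, roughly 1/16 of points reach level 1 or higher, so the upper layers stay tiny and cheap to traverse.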
HNSW Parameters:
| Parameter | Description | Default | Range | Effect |
|---|---|---|---|---|
| M | Max connections per node | 16 | 8-64 | Higher = better recall, more memory |
| ef_construction | Index build quality | 100 | 50-200 | Higher = better index, slower build |
| ef_search | Search candidates | 50 | 10-100 | Higher = better recall, slower search |
Characteristics:
| Property | Value |
|---|---|
| Build time | Minutes (incremental) |
| Search time | O(log n × d) |
| Memory overhead | High (graph structure) |
| Accuracy | 97-99% recall |
| Best for | 100K-10M documents |
When HNSW is Best:
- Large datasets: 100K-10M documents
- Real-time search: Sub-50ms query latency
- Dynamic data: Incremental updates supported
- High recall required: 97-99% accuracy acceptable
Why HNSW is Fast:
- Logarithmic search: Skip list structure enables O(log n) search
- Greedy navigation: Always move closer to target
- Layered approach: Coarse-to-fine search reduces distance computations
- Probabilistic sampling: Only small fraction of points in higher layers
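Greedy navigation can be illustrated on a toy single-layer neighbor graph. This is a sketch with illustrative names, not real HNSW: a full implementation builds the graph incrementally across layers and keeps an ef_search-sized candidate beam instead of a single current node.

```python
import numpy as np

def build_knn_graph(docs, m=4):
    """Toy single-layer graph: connect each vector to its m nearest
    neighbors. docs: (n, d) array."""
    dists = np.linalg.norm(docs[:, None] - docs[None], axis=2)
    np.fill_diagonal(dists, np.inf)
    return {i: list(np.argsort(dists[i])[:m]) for i in range(len(docs))}

def greedy_search(query, docs, graph, entry=0):
    """Hop to whichever neighbor is closer to the query; stop at a
    local minimum, like the navigation inside one HNSW layer."""
    cur = entry
    cur_d = np.linalg.norm(docs[cur] - query)
    while True:
        nbrs = graph[cur]
        ds = [np.linalg.norm(docs[n] - query) for n in nbrs]
        best = int(np.argmin(ds))
        if ds[best] >= cur_d:
            return cur, cur_d
        cur, cur_d = nbrs[best], ds[best]
```

Each hop only examines a node's few neighbors, so the number of distance computations grows with path length (roughly logarithmic in a small-world graph) rather than with dataset size.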
3.3.4 Algorithm Comparison
Comprehensive Comparison Table:
| Algorithm | Build Time | Search Time | Memory | Accuracy | Best For | Limitations |
|---|---|---|---|---|---|---|
| Flat | 0s | O(n×d), ~10-30s | Low | 100% | < 1K docs, accuracy-critical | Doesn't scale |
| IVF | Minutes | O(√n×d), ~100-500ms | Medium | 92-95% | 1K-100K docs | Poor for dynamic data |
| HNSW | Minutes | O(log n×d), ~10-50ms | High | 97-99% | 100K-10M docs | High memory overhead |
| PQ-HNSW | Hours | O(log n×d/4), ~5-20ms | Medium | 90-95% | 10M+ docs | Accuracy loss, complex setup |
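The quantized row deserves one concrete sketch: product quantization compresses each vector by splitting it into sub-vectors and storing only per-subspace centroid ids. This is an illustrative toy (hand-rolled k-means, tiny codebooks); production systems use library implementations such as Faiss's IndexPQ.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Minimal Lloyd's k-means, standing in for a library implementation."""
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        a = np.argmin(np.linalg.norm(x[:, None] - c[None], axis=2), axis=1)
        for j in range(k):
            if np.any(a == j):
                c[j] = x[a == j].mean(axis=0)
    return c

def pq_encode(docs, m=4, k=16):
    """Split each vector into m sub-vectors; store only the index of the
    nearest centroid in each sub-codebook (one byte per code for k <= 256),
    shrinking e.g. 1536 float32 values to m bytes per vector."""
    sub = docs.shape[1] // m
    codebooks, codes = [], []
    for i in range(m):
        chunk = docs[:, i * sub:(i + 1) * sub]
        cb = kmeans(chunk, k)
        codebooks.append(cb)
        codes.append(np.argmin(
            np.linalg.norm(chunk[:, None] - cb[None], axis=2), axis=1))
    return codebooks, np.stack(codes, axis=1).astype(np.uint8)
```

Reconstructing a vector means concatenating the m looked-up centroids; the gap between original and reconstruction is exactly the accuracy loss noted in the table.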
Decision Tree:
Dataset size?
├─ < 1K → Flat (no index overhead, 100% accuracy)
├─ 1K-100K → IVF (balanced speed/accuracy, simpler than HNSW)
├─ 100K-10M → HNSW (best performance, industry standard)
└─ 10M+ → PQ-HNSW (quantization for memory efficiency)
Memory constrained?