
2. Data Processing Pipeline

"Data quality is the foundation of RAG system performance. Garbage in, garbage out." — Fundamental Principle of Machine Learning

This chapter explores the complete data processing pipeline for RAG systems, focusing on processing logic, problem-solving approaches, and architectural decisions rather than implementation details.


2.1 Introduction to Data Processing Pipeline

Why Data Processing Matters

In real-world RAG applications, roughly 80% of development effort goes into data processing and only 20% into retrieval and generation. This is because:

  1. Raw data is messy: Real documents come in various formats, contain noise, and lack structure
  2. Retrieval quality depends on chunking: Poor chunking leads to irrelevant or fragmented context
  3. Metadata enables efficient filtering: Without proper metadata, every query requires expensive vector search
  4. Embedding costs add up: Processing millions of documents requires optimization strategies

The End-to-End Pipeline

Stage Deep Dive

| Stage | Primary Goal | Key Challenges | Impact on Quality |
|---|---|---|---|
| 1. Document Loading | Parse diverse formats into structured text | PDF text extraction, HTML cleaning, encoding issues | Base quality |
| 2. Data Cleaning | Remove noise and assess quality | Duplicate detection, quality scoring, language detection | Critical |
| 3. Chunking | Split into semantically coherent pieces | Context preservation, boundary detection, size optimization | Most Critical |
| 4. Metadata | Extract structured information | Automatic extraction, LLM cost optimization, schema design | Important |
| 5. Embedding | Convert to vector representations | Batch optimization, model selection, cost management | Important |
| 6. Storage | Index for efficient retrieval | Index tuning, quantization, query optimization | Medium |

Real-World Impact

Example: Processing a 10,000-page technical documentation set

| Aspect | Poor Pipeline | Optimized Pipeline |
|---|---|---|
| Cleaning | No cleaning → 30% duplicates | Quality filtering → clean content only |
| Chunking | Fixed 256-token chunks | Semantic chunking → coherent explanations |
| Metadata | No metadata | Rich metadata (category, date, version) |
| Result | 45% retrieval accuracy, slow queries, high costs | 85% retrieval accuracy, 3x faster, 60% cost reduction |

2.2 Document Loading & Parsing

Understanding Document Loading

Document loading is the first critical bottleneck in RAG systems. In production, you'll face:

  1. Format diversity: PDFs from scanning, HTML from web scraping, Word docs from business processes
  2. Encoding issues: Non-UTF8 text, mixed character sets, corrupted files
  3. Structure preservation: Tables, images, footnotes, headers need special handling
  4. Scale requirements: Processing thousands of documents efficiently

Multi-Format Document Readers

| Format | Primary Use Cases | Key Challenges | Complexity |
|---|---|---|---|
| PDF | Academic papers, reports, manuals | Multi-column layout, tables, scanned PDFs (OCR) | High |
| HTML | Web articles, blogs, documentation | Navigation elements, ads, dynamic content | Medium |
| Markdown | README files, technical docs | Frontmatter parsing, code blocks, link references | Low |
| DOCX | Business documents, contracts | Styles, embedded objects, track changes | Medium |
| JSON | API responses, logs, structured data | Nested structures, large files, schema variations | Low |
| TXT | Plain text files, code files | Encoding detection, line endings, character limits | Low |

Format-Specific Problem-Solving

PDF Processing

| Problem | Solution | Approach |
|---|---|---|
| PDFs store text by position, not reading order | Use layout-aware readers | Detect columns, headers, and tables |
| Multi-column layouts | Analyze reading order | Identify column boundaries |
| Scanned PDFs | OCR integration | Tesseract with fallback to image captions |
| Tables lost in extraction | Separate table extraction | Store as structured data with metadata |
| Images/charts not indexed | Vision model captioning | Generate searchable descriptions |

HTML Processing

| Problem | Solution | Approach |
|---|---|---|
| 70% boilerplate (navigation, ads, footers) | Content extraction algorithms | Readability, Mercury Parser |
| Dynamic content | JavaScript rendering | Puppeteer, Playwright |
| Lost structure | Preserve semantic HTML | Keep h1, h2, article tags |

Markdown Processing

| Problem | Solution | Approach |
|---|---|---|
| Frontmatter contains valuable metadata | Parse YAML separately | Merge with document metadata |
| Code blocks break meaning | Extract separately | Index independently |
| Link references | Resolve and track | Build citation graph |

Document Loading Best Practices

| Practice | Why It Matters | Implementation Approach |
|---|---|---|
| Comprehensive metadata | Enables pre-filtering, reduces search cost | Add source, type, size, hash to every document |
| Format detection | Automatic processing of diverse sources | Use file extension + content type sniffing |
| Error resilience | One bad file shouldn't stop batch processing | Catch and log exceptions, continue processing |
| Parallel processing | 10x faster for large directories | Use parallel streams with thread-safe readers |
| Progress monitoring | Track processing status in production | Log file count, success/failure rates |
| Structure preservation | Tables, headings, code blocks need special handling | Format-specific readers with custom logic |

2.3 Data Cleaning & Normalization

Why Data Cleaning is Critical

Real-world data is messy. In practice, uncleaned data can reduce retrieval accuracy by 30-50%. Common issues include:

  1. Noise characters: Control characters, encoding artifacts, gibberish
  2. Duplicates: Same content appearing multiple times
  3. Low-quality content: Spam, boilerplate text, automated messages
  4. Formatting inconsistencies: Extra whitespace, inconsistent line endings

Text Cleaning Pipeline

Effective text cleaning requires a multi-stage approach using the chain-of-responsibility pattern:

Original Text

[Encoding Normalizer] → Fix mojibake, character corruption

[Control Character Remover] → Remove non-printable characters

[Whitespace Normalizer] → Normalize spacing and line breaks

[URL/Email Remover] → Remove personal info and noise

[HTML Tag Remover] → Strip markup, keep content

[Duplicate Line Remover] → Remove repetition

[Boilerplate Remover] → Remove common headers/footers

Cleaned Text
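As a rough illustration, the stages above can be composed as a chain of small cleaner functions, each taking and returning text. The following minimal Python sketch implements only three of the stages; the function names and regexes are illustrative, not a fixed API:

```python
import re

def remove_control_chars(text):
    # Drop non-printable ASCII control characters (keep \t and \n)
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)

def remove_urls_emails(text):
    # Strip URLs and email addresses (privacy + noise reduction)
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "", text)

def normalize_whitespace(text):
    # Collapse runs of spaces/tabs and excessive blank lines
    text = re.sub(r"[ \t]+", " ", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def clean(text, cleaners=(remove_control_chars, remove_urls_emails, normalize_whitespace)):
    # Chain of responsibility: each cleaner transforms the text and passes it on
    for cleaner in cleaners:
        text = cleaner(text)
    return text
```

Because each stage has the same `str -> str` shape, adding or reordering cleaners (e.g. a boilerplate remover) is a one-line change to the chain.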

Common Noise Patterns and Solutions

| Noise Type | Example | Solution | Impact |
|---|---|---|---|
| Encoding corruption | â‚¬ instead of € | Character mapping fixes | Prevents embedding pollution |
| Control characters | ASCII 0-31, 127 | Pattern removal | Reduces vector noise |
| Extra whitespace | Multiple spaces/tabs | Normalize to single space | Improves tokenization |
| Boilerplate | "Copyright 2024..." | Pattern-based removal | Focuses on actual content |
| URLs/Emails | https://..., user@... | Regex removal | Privacy + noise reduction |

Document Quality Assessment

Not all content is worth indexing. Quality scoring filters out low-value content:

| Quality Dimension | Assessment Approach | Threshold | Rationale |
|---|---|---|---|
| Length | Prefer 500-5000 characters | Min: 50, Max: 100,000 | Too short: incomplete; too long: unfocused |
| Meaningful content | Ratio of alphanumeric characters | Min: 30% | Low ratio indicates noise/data dumps |
| Structure | Has sentences/paragraphs | Min: 3 sentences | Single-sentence fragments lack context |
| Vocabulary diversity | Unique word ratio | Min: 30% unique | Low diversity = formulaic/repetitive |
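A minimal sketch of such a scorer, combining the four dimensions above into a single 0-1 score. The equal weighting and partial-credit formulas here are illustrative choices, not a standard:

```python
import re

def quality_score(text):
    """Score a document in [0, 1] across length, meaningful content,
    structure, and vocabulary diversity (thresholds from the table above)."""
    if not text:
        return 0.0
    scores = []
    # Length: prefer 500-5000 chars, hard floor at 50
    n = len(text)
    scores.append(1.0 if 500 <= n <= 5000 else (0.5 if n >= 50 else 0.0))
    # Meaningful content: alphanumeric ratio, full credit at >= 30%
    alnum = sum(c.isalnum() for c in text) / n
    scores.append(min(alnum / 0.30, 1.0))
    # Structure: full credit at >= 3 sentences
    sentences = len(re.findall(r"[.!?]+", text))
    scores.append(min(sentences / 3, 1.0))
    # Vocabulary diversity: unique word ratio, full credit at >= 30%
    words = re.findall(r"\w+", text.lower())
    unique = len(set(words)) / len(words) if words else 0.0
    scores.append(min(unique / 0.30, 1.0))
    return sum(scores) / len(scores)
```

Documents below a chosen cutoff (the chapter uses 0.5) are dropped before the expensive embedding stage.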

Deduplication Strategies

| Type | Description | Precision | Cost | Use Case |
|---|---|---|---|---|
| Exact Duplication | Identical content (hash-based) | 100% | Low | Remove true duplicates |
| Near-Duplication | Similar content (MinHash) | ~85% | Medium | Remove variations |
| Semantic Duplication | Same meaning (embeddings) | ~95% | High | Remove paraphrases |

MinHash Algorithm (Near-Duplication)

Problem: Detect similar but not identical documents (e.g., same document with minor edits)

Solution:

  1. Generate 3-word shingles from document
  2. Compute MinHash signature (100 hash functions)
  3. Estimate Jaccard similarity from signatures
  4. Remove documents with similarity > threshold (typically 0.85)

Why It Works:

  • Fast: O(n × k) where n = docs, k = shingles per doc
  • Accurate: ~85% precision for near-duplicate detection
  • Scalable: Fixed-size signatures (100 integers ~400 bytes)
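The steps above can be sketched in a few lines. This version uses salted MD5 digests to simulate the independent hash functions; a production system would typically use a tuned MinHash/LSH library instead:

```python
import hashlib

def shingles(text, k=3):
    # Step 1: generate k-word shingles
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=100):
    # Step 2: one salted hash per "hash function"; keep the minimum per salt
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    # Step 3: fraction of matching slots approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Step 4 is then a pass over document pairs (or LSH buckets), dropping any pair whose estimated similarity exceeds the 0.85 threshold.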

Data Cleaning Best Practices

| Practice | Implementation | Impact |
|---|---|---|
| Pipeline approach | Chain multiple cleaners in sequence | Modular, easy to customize |
| Preserve original | Keep original text in metadata | Enables debugging/rollback |
| Quality scoring | Multi-dimensional scoring before embedding | Reduces embedding costs by 20-30% |
| Exact deduplication | Hash-based removal (SHA-256) | Fast, eliminates true duplicates |
| Near-deduplication | MinHash with 85% threshold | Removes variations without false positives |
| Parallel processing | Clean documents in parallel | 5-10x faster for large corpora |

Real-World Impact

Case Study: Processing 100,000 web articles

Before cleaning:
- Total documents: 100,000
- Total characters: 500M
- Duplicates: 15,000 (15%)
- Low-quality: 25,000 (25%)

After cleaning:
- Exact duplicates removed: 15,000
- Near-duplicates removed: 8,000
- Low-quality filtered: 25,000
- Final corpus: 52,000 high-quality documents (48% reduction)

Cost savings:
- Embedding API costs: 48% reduction
- Vector storage: 48% reduction
- Query speed: 2x improvement (smaller search space)

2.4 Intelligent Chunking Strategies

Understanding the Chunking Challenge

Chunking is the most critical decision in RAG systems. It determines what information the LLM can access.

| Chunk Size | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Small (128-256 tokens) | Precise, fast search | Fragmented context, misses relationships | FAQ, short answers |
| Medium (512-768 tokens) | Balanced precision and recall | May cut sections | Most content (default) |
| Large (1024+ tokens) | Rich context, complete info | Noisy retrieval, expensive | Long-form narratives |

The Chunking Strategy Spectrum

Strategy Comparison

| Strategy | How It Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, predictable, fast | Ignores structure, breaks context | Uniform documents (logs, data) |
| Recursive | Try delimiters hierarchically | Respects structure, adaptive | May exceed limits | Structured text (docs, articles) |
| Structure-Aware | Document-specific splitting | Preserves logical units | Format-specific implementation | Code, Markdown, PDFs |
| Semantic | Embedding-based boundaries | Preserves meaning | Expensive (requires embeddings) | Complex documents (legal, medical) |
| Hierarchical | Parent-child relationships | Multi-scale retrieval | Complex storage | Long documents (books, reports) |

Delimiter Priority:

  1. \n\n (paragraph breaks) → Preserves sections
  2. . (sentence endings) → Preserves thoughts
  3. " " (single spaces, word boundaries) → Last resort before characters
  4. Character count → Final fallback
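A simplified recursive splitter following this delimiter priority might look like the following sketch. Character lengths stand in for token counts to keep it self-contained; a real implementation would use a tokenizer:

```python
def recursive_split(text, max_len=512, separators=("\n\n", ". ", " ")):
    """Split text by trying coarse delimiters first, recursing to finer
    ones, with a hard character split as the final fallback."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate  # merge small parts up to the limit
                else:
                    if current:
                        chunks.extend(recursive_split(current, max_len, separators))
                    current = part  # oversized parts recurse to finer separators
            if current:
                chunks.extend(recursive_split(current, max_len, separators))
            return chunks
    # Final fallback: hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Note how paragraph boundaries are tried first, so a chunk only gets cut mid-sentence when no coarser delimiter fits within the size limit.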

Why It Works:

  • Structure awareness: Respects paragraphs, sentences, words
  • Adaptive: Tries smart approaches first, falls back to simple
  • No additional cost: No embeddings required
  • Preserves meaning: Maintains sentence and paragraph boundaries

When to Use: Most structured documents (articles, documentation, reports)

Structure-Aware Splitting

| Document Type | Problem | Solution |
|---|---|---|
| Code | Breaks functions/classes | Split by function boundaries, preserve imports |
| Markdown | Ignores headers | Preserve hierarchy (#, ##, ###) |
| PDF | Treats as plain text | Page and section-aware splitting |
| HTML | Ignores DOM structure | Split by semantic elements (article, section) |
| Tables/JSON | Breaks data rows | Row-by-row or entry-by-entry splitting |

Semantic Chunking

Concept: Use embeddings to detect topic shifts in documents

Algorithm:

  1. Split text into sentences
  2. Generate embeddings for each sentence
  3. Compute cosine similarity between adjacent sentences
  4. Create new chunk when similarity < threshold (typically 0.80-0.90)
  5. Enforce min/max size constraints
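A minimal sketch of this algorithm, with the embedding model abstracted behind a callable. The `toy_embed` function is a stand-in for a real sentence-embedding model, and the max-sentences cap is a simplified form of the size constraints in step 5:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, threshold=0.85, max_sentences=10):
    """Group sentences into chunks; start a new chunk whenever adjacent-sentence
    similarity drops below the threshold (step 4) or the chunk is full (step 5)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold or len(current) >= max_sentences:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

def toy_embed(sentence):
    # Stand-in for a real embedding model, used only to demonstrate the flow
    return [1.0, 0.0] if "cat" in sentence else [0.0, 1.0]
```

Swapping `toy_embed` for a real model call is the only change needed; the boundary-detection logic is model-agnostic.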

When Semantic Chunking Shines:

  • Legal Contracts: Clauses on different topics (liability, termination, payment)
  • Medical Records: Different conditions, treatments, medications
  • Academic Papers: Related work vs. methodology vs. results
  • News Articles: Multiple topics in single article

Cost-Benefit:

| Metric | Recursive | Semantic |
|---|---|---|
| API Cost | Free | $0.001-0.01 per page |
| Speed | Fast | Medium (embedding generation) |
| Quality | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |

Hierarchical Chunking

Concept: Create parent-child relationships for multi-scale retrieval

Structure:

Parent Chunk (2048 tokens): "Chapter 5: Machine Learning Basics..."
├── Child 1 (512 tokens): "What is supervised learning?"
├── Child 2 (512 tokens): "What is unsupervised learning?"
├── Child 3 (512 tokens): "Common ML algorithms..."
└── Child 4 (512 tokens): "Evaluating ML models..."

Retrieval Strategy:

  1. Search child chunks (small, focused)
  2. When child chunk matches, also retrieve parent (context)
  3. User gets both precise detail + surrounding context

When to Use: Long documents where you need both broad context and precise details (books, long reports)
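A compact sketch of parent-child chunk creation and the retrieve-child-then-parent lookup described above. Character sizes stand in for token counts, and the record schema (`id`, `parent_id`, `text`) is illustrative:

```python
import itertools

def hierarchical_chunks(text, parent_size=2048, child_size=512):
    """Split into parent chunks, then split each parent into children
    linked back to it via parent_id."""
    records, counter = [], itertools.count()
    for p_start in range(0, len(text), parent_size):
        parent_text = text[p_start:p_start + parent_size]
        parent_id = next(counter)
        records.append({"id": parent_id, "parent_id": None, "text": parent_text})
        for c_start in range(0, len(parent_text), child_size):
            records.append({"id": next(counter), "parent_id": parent_id,
                            "text": parent_text[c_start:c_start + child_size]})
    return records

def retrieve_with_parent(records, match_id):
    # Step 2 of the strategy: return the matched child plus its parent for context
    by_id = {r["id"]: r for r in records}
    child = by_id[match_id]
    parent = by_id.get(child["parent_id"])
    return child, parent
```

In practice only the child chunks are embedded and searched; the parent lookup happens after a match, at negligible cost.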

Decoupled Indexing & Storage

Instead of just splitting text, create multiple representations for different retrieval needs:

| Technique | Description | Benefit | Cost |
|---|---|---|---|
| Summary Indexing | Store dense summaries alongside chunks | Fast overview queries | 1.1x storage |
| Hypothetical Questions | Generate and store questions each chunk answers | Matches user query intent | 1.2x storage |
| Multi-Vector | Store parent, section, sentence vectors | Multi-granularity retrieval | 2-3x storage |

Strategy Selection Framework

| Use Case | Recommended Strategy | Chunk Size | Overlap |
|---|---|---|---|
| FAQ / Short Answers | Fixed-size | 256-384 tokens | 20-50 tokens |
| Technical Documentation | Recursive | 512-768 tokens | 50-100 tokens |
| Legal Contracts | Semantic | Variable | 10-20% |
| Medical Records | Semantic | Variable | 10-20% |
| Long-form Articles | Hierarchical | Parent: 2048, Child: 512 | 10% |
| Code Repositories | Structure-Aware (by function) | Function-level | 0-20 tokens |
| Books / E-books | Hierarchical | Parent: 2048, Child: 512 | 10% |
| Academic Papers | Recursive | 512-768 tokens | 100 tokens |
| Log Files | Fixed-size | 1024 tokens | 0 |

Decision Tree:

Is the document structured (sections, paragraphs)?
├─ Yes → Are sections very long (>2000 tokens)?
│        ├─ Yes → Use Hierarchical
│        └─ No → Use Recursive
└─ No → Is the content uniform (logs, data)?
         ├─ Yes → Use Fixed-size
         └─ No → Is meaning critical (legal, medical)?
                  ├─ Yes → Use Semantic
                  └─ No → Use Fixed-size (simpler)

Chunking Best Practices

| Practice | Implementation | Impact |
|---|---|---|
| Overlap (10-20%) | Add overlapping tokens between chunks | Preserves context at boundaries |
| Respect structure | Use recursive for structured docs | Better semantic coherence |
| Size matters | 512-768 tokens for most use cases | Optimal precision/recall balance |
| Test and iterate | A/B test different chunk sizes | 20-30% retrieval improvement |
| Metadata tracking | Store chunk strategy and index | Enables analysis and optimization |
| Hierarchical for long docs | Parent-child for >5000 token docs | Multi-scale retrieval capability |

2.5 Metadata Enrichment

Understanding Metadata Power

Metadata is the unsung hero of RAG systems. While embeddings get all the attention, metadata often has a bigger impact on real-world performance:

| Metadata Benefit | Example | Impact |
|---|---|---|
| Pre-filtering | Filter by category/year before vector search | 10-100x faster queries |
| Context injection | Add date/source info without retrieval | Reduced hallucinations |
| Result ranking | Sort by relevance (date, views, ratings) | Better user experience |
| Access control | Filter by user permissions | Security compliance |
| Debugging | Track document sources and processing history | Easier troubleshooting |

Real-World Impact Example

Query: "React hooks tutorial"

Without Metadata Filtering:
- Search: All 100K documents
- Time: 2.5 seconds
- Results: Mixed relevance (2020 React, 2024 React Native, 2023 Vue)
- Precision: 65%

With Metadata Filtering (category=react AND year>=2023):
- Search: 5K filtered documents (95% reduction!)
- Time: 0.3 seconds (8x faster!)
- Results: High relevance (2024 React content only)
- Precision: 85%

Result: 8x faster, 20% higher precision, 60% cost reduction
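A pre-filter of this kind is just a metadata scan that narrows the candidate set before any vector math. A minimal sketch, where the document/metadata shape and condition style are illustrative:

```python
def prefilter(docs, **conditions):
    """Keep only docs whose metadata satisfies every condition.
    A callable condition acts as a predicate; anything else is an exact match."""
    def ok(doc):
        meta = doc["metadata"]
        for field, cond in conditions.items():
            value = meta.get(field)
            if callable(cond):
                if value is None or not cond(value):
                    return False
            elif value != cond:
                return False
        return True
    return [d for d in docs if ok(d)]
```

For the query above this would be `prefilter(corpus, category="react", year=lambda y: y >= 2023)`, after which the vector search runs only over the survivors. Real vector stores push this filtering into the index itself, which is where the 10-100x speedups come from.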

Types of Metadata

| Metadata Type | Examples | Use Case | Index? |
|---|---|---|---|
| Source | file path, URL, author | Debugging, citation | No |
| Temporal | created_at, updated_at, year | Time-based filtering | Yes |
| Categorical | category, tags, type | Pre-filtering | Yes |
| Structural | section, heading, page_num | Navigation | Yes |
| Quality | score, grade, reliability | Ranking | Yes |
| Hierarchy | parent_id, chunk_index | Hierarchical retrieval | Yes |
| Access | team, permission, classification | Security filtering | Yes |
| Statistics | view_count, like_count | Popularity ranking | Yes |

Metadata Extraction Strategies

| Strategy | Description | Cost | Examples |
|---|---|---|---|
| Automatic extraction | Rule-based pattern matching | Low (fast) | Dates, file types, languages |
| Statistical extraction | Analysis of text properties | Low | Category detection, audience level |
| LLM-based extraction | AI-powered understanding | High (API calls) | Summaries, topics, sentiment, entities |

Automatic Metadata Extraction

Temporal Metadata:

  • Problem: Extract dates for time-sensitive queries (e.g., "latest React features")
  • Solution: Multiple date pattern matching (ISO 8601, US format, European format, relative dates)
  • Approach:
    1. Match multiple date patterns
    2. Parse with multiple formatters
    3. Validate date range (reject invalid dates like 0001-01-01)
    4. Store earliest (creation) and latest (update) dates
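A minimal sketch of this approach covering three of the date formats; the patterns and the plausibility cutoff year are illustrative:

```python
import re
from datetime import datetime

# (regex, strptime format) pairs, tried in order
DATE_PATTERNS = [
    (r"\b(\d{4}-\d{2}-\d{2})\b", "%Y-%m-%d"),    # ISO 8601
    (r"\b(\d{2}/\d{2}/\d{4})\b", "%m/%d/%Y"),    # US format
    (r"\b(\d{2}\.\d{2}\.\d{4})\b", "%d.%m.%Y"),  # European format
]

def extract_dates(text, min_year=1990):
    """Steps 1-4 above: match patterns, parse, validate, keep min/max."""
    dates = []
    for pattern, fmt in DATE_PATTERNS:
        for match in re.findall(pattern, text):
            try:
                d = datetime.strptime(match, fmt)
            except ValueError:
                continue  # e.g. "2024-13-99" matches the regex but isn't a real date
            if d.year >= min_year:  # reject implausible dates like 0001-01-01
                dates.append(d)
    if not dates:
        return None
    return {"created_at": min(dates), "updated_at": max(dates)}
```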

Categorical Metadata:

  • Problem: Classify documents into categories for pre-filtering
  • Solution: Keyword-based categorization with scoring
  • Approach:
    1. Define keyword sets per category
    2. Score document by keyword matches
    3. Return highest-scoring category
    4. Extract content type (tutorial, reference, news, blog)
    5. Detect audience level (beginner, intermediate, advanced)
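Steps 1-3 of this approach can be sketched as a small keyword-scoring categorizer; the keyword sets below are illustrative placeholders, not a recommended taxonomy:

```python
# Step 1: keyword sets per category (illustrative)
CATEGORY_KEYWORDS = {
    "react": {"react", "jsx", "hooks", "component"},
    "database": {"sql", "index", "query", "table"},
    "devops": {"docker", "kubernetes", "deploy", "ci"},
}

def categorize(text):
    """Step 2: score by keyword matches; step 3: return the highest scorer."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"
```

Content-type and audience-level detection (steps 4-5) follow the same pattern with their own keyword sets.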

LLM-Based Metadata Extraction

Problem: Extract complex metadata requiring understanding (summaries, topics, sentiment, entities)

Solution: Use LLM with structured output prompt

Prompt Strategy:

Extract structured metadata from this text:
- title: Short, descriptive
- summary: One-sentence summary
- topics: 3-5 main topics
- entities: Important names, places, organizations
- sentiment: positive/neutral/negative
- urgency: low/medium/high

Return ONLY valid JSON.

Cost-Benefit:

| Aspect | Automatic Only | + LLM Extraction |
|---|---|---|
| API Cost | Free | $0.01-0.05 per document |
| Quality | Good for structured data | Better for nuanced understanding |
| Best For | High-volume processing | High-value documents |

Metadata Schema Design

| Practice | Why | Implementation |
|---|---|---|
| Use consistent naming | Predictable querying | snake_case for all keys |
| Index filterable fields | Fast pre-filtering | Create indexes on category, date, tags |
| Avoid high-cardinality fields | Reduces index size | Don't index source, document_id |
| Use typed values | Type-safe queries | Numbers for counts, strings for names |
| Document schema | Team collaboration | Maintain schema documentation |
| Version metadata | Enables migrations | Track schema version |

Metadata Best Practices

| Practice | Implementation | Impact |
|---|---|---|
| Extract at load time | Parse dates, categories during loading | No re-processing needed |
| Enrich with LLMs selectively | Use LLM extraction for high-value docs only | Optimized costs |
| Index filterable fields | Create database indexes on metadata fields | 10-100x faster pre-filtering |
| Use consistent naming | snake_case, documented schema | Predictable querying |
| Track metadata quality | Monitor extraction success/failure rates | Continuous improvement |

2.6 Complete Data Processing Pipeline

End-to-End Architecture

The complete pipeline combines all stages into a production-ready system:

┌─────────────────────────────────────────────────────────────────┐
│ INPUT: Raw Documents │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: Document Loading │
│ - Multi-format support (PDF, HTML, MD, DOCX) │
│ - Automatic format detection │
│ - Comprehensive metadata (source, type, size, hash) │
│ Output: List<Document> with basic metadata │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: Data Cleaning │
│ - Remove noise (control chars, encoding issues) │
│ - Normalize whitespace │
│ - Remove URLs/emails │
│ Output: Cleaned<Document> with cleaning_stats │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: Quality Assessment │
│ - Multi-dimensional scoring (length, meaningful, structure) │
│ - Filter below threshold (≥ 0.5) │
│ Output: Filtered<Document> with quality_score │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: Deduplication │
│ - Exact duplicates (hash-based, SHA-256) │
│ - Near-duplicates (MinHash, 85% threshold) │
│ Output: Unique<Document> with dedup_stats │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 5: Intelligent Chunking │
│ - Recursive splitting (512 tokens, 20% overlap) │
│ - Respect document structure │
│ Output: List<Chunk> with chunk_index, chunk_total │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 6: Metadata Enrichment │
│ - Automatic: Dates, categories, language │
│ - LLM-based: Summaries, topics (selective) │
│ Output: Enriched<Chunk> with rich metadata │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 7: Embedding Generation │
│ - Batch processing (reduce API calls) │
│ - Caching (80% cost savings for repetitive content) │
│ Output: EmbeddedChunk with vector[] │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ STAGE 8: Vector Storage │
│ - HNSW indexing (fast approximate search) │
│ - Metadata indexing (enable pre-filtering) │
│ Output: Stored documents ready for retrieval │
└───────────────────────────┬─────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│ OUTPUT: Ready for Retrieval │
└─────────────────────────────────────────────────────────────────┘

Pipeline Optimization Strategies

| Optimization | Technique | Impact | When to Use |
|---|---|---|---|
| Parallel processing | Process documents concurrently | 5-10x faster | Large document sets |
| Embedding caching | Cache repetitive content | 80% cost savings | Re-processing documents |
| Batch embedding | Group multiple embeddings | 3-5x faster, lower cost | Always enable |
| Rate limiting | Control API request rate | Avoids throttling | High-volume processing |
| Incremental updates | Only process changed documents | Faster re-indexing | Frequently updated corpora |
| Quality pre-filtering | Filter early, not late | Reduces downstream processing | Before expensive operations |

Error Handling Strategy

| Stage | Common Failures | Handling Approach | Recovery |
|---|---|---|---|
| Loading | Corrupted files, encoding issues | Log error, skip file | Manual review of failed files |
| Cleaning | Regex errors, memory issues | Continue with next cleaner | Reduce batch size |
| Quality | Empty documents after cleaning | Filter out gracefully | Adjust quality thresholds |
| Chunking | Documents too small/large | Use fallback strategy | Adjust chunk parameters |
| Embedding | API rate limits, network errors | Retry with exponential backoff | Queue for retry |
| Storage | Database connection issues | Transaction rollback | Retry with backoff |
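The retry-with-exponential-backoff handling used for the embedding and storage stages can be sketched as a small wrapper; the delays and attempt count are illustrative defaults:

```python
import time

def with_retry(fn, max_attempts=5, base_delay=0.5, retry_on=(Exception,)):
    """Call fn; on failure, sleep base_delay * 2^attempt and retry,
    re-raising after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the error (or queue for later retry)
            time.sleep(base_delay * (2 ** attempt))
```

For API rate limits, `retry_on` would be narrowed to the client's throttling/network exception types rather than the blanket `Exception` shown here.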

2.7 Performance Optimization

Performance Bottlenecks

| Stage | Typical Bottleneck | Optimization Strategy |
|---|---|---|
| Loading | I/O bound, single-threaded | Parallel file reading |
| Cleaning | CPU-intensive regex | Parallel processing |
| Chunking | Recursive algorithm overhead | Cache embeddings for semantic |
| Embedding | API latency, rate limits | Batch processing, caching |
| Storage | Network I/O, indexing | Bulk inserts, async writes |

Parallel Processing

Problem: Sequential processing is slow for large document sets

Solution: Parallel processing across CPU cores

| Approach | Speedup | Complexity | Best For |
|---|---|---|---|
| Parallel streams | 4-8x (depends on cores) | Low | Document-level parallelization |
| Thread pools | 5-10x | Medium | Fine-grained control |
| Distributed processing | Near-linear | High | Very large corpora (1M+ docs) |

Key Considerations:

  • Thread safety: Ensure document readers are thread-safe
  • Memory usage: More threads = more memory
  • Rate limiting: Control concurrent API requests
  • Error handling: Isolate failures to specific documents
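A document-level thread pool that isolates per-document failures, per the considerations above, might look like this sketch (`process_one` is any per-document callable you supply):

```python
from concurrent.futures import ThreadPoolExecutor

def process_parallel(documents, process_one, max_workers=8):
    """Process documents concurrently. A failure on one document is recorded
    and the rest of the batch continues."""
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, doc): doc for doc in documents}
        for future, doc in futures.items():
            try:
                results.append(future.result())
            except Exception as exc:
                errors.append((doc, exc))  # log and continue, don't abort the batch
    return results, errors
```

Threads suit the I/O-bound stages (loading, embedding API calls); CPU-bound cleaning would use a process pool instead.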

Embedding Cost Optimization

| Technique | Description | Savings | Trade-off |
|---|---|---|---|
| Batch processing | Group multiple embeddings per API call | 50-70% | Slightly higher latency |
| Caching | Store embeddings, reuse for identical text | 80% | Memory usage |
| Model selection | Use cheaper models (MiniLM vs. OpenAI) | 90% | Slightly lower quality |
| Deduplication first | Remove duplicates before embedding | 15-30% | MinHash computation cost |
| Quality filtering | Filter low-quality docs first | 20-30% | Needs quality scoring |
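The caching row above amounts to memoizing the embedding call on a content hash. A minimal in-memory sketch, where the wrapped `embed_fn` is whatever model or API client you use:

```python
import hashlib

class CachedEmbedder:
    """Wrap an embedding function with a content-hash cache so identical
    text is only embedded once."""
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text):
        # SHA-256 of the text is the cache key, so formatting-identical
        # duplicates across documents share one embedding
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[key] = self.embed_fn(text)
        return self.cache[key]
```

A production version would back the dict with a persistent store (and a TTL) so the cache survives re-processing runs; the hit/miss counters feed the cache-hit-rate metric monitored below.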

Monitoring & Metrics

| Metric | Why Monitor | Target | Alert Threshold |
|---|---|---|---|
| Processing throughput | Track pipeline speed | > 100 docs/min | < 50 docs/min |
| Embedding API latency | Detect slowdowns | < 500ms average | > 2000ms |
| Error rate | Catch systematic issues | < 1% | > 5% |
| Cache hit rate | Validate caching effectiveness | > 50% | < 30% |
| Quality score distribution | Ensure filtering works | Mean > 0.6 | Mean < 0.4 |

2.8 Best Practices & Common Pitfalls

Production Checklist

| Aspect | Best Practice | Why | Implementation |
|---|---|---|---|
| Chunk Size | 512-768 tokens | Optimal balance | Default to recursive chunking |
| Overlap | 10-20% of chunk size | Preserves context | Set overlap=100 for 512 tokens |
| Metadata | Index filterable fields | Pre-filtering power | Add indexes on category, year, tags |
| Deduplication | MinHash with 0.85 threshold | Removes near-duplicates | Run after quality filtering |
| Quality Filter | Score threshold ≥ 0.5 | Filters low-quality content | Use multi-dimensional scoring |
| Embedding Model | Use cached, batched requests | Reduces API cost | Enable caching with TTL |
| Vector Storage | HNSW index with M=16 | Fast search | Create indexes after bulk load |
| Error Handling | Skip failed files, log errors | Robust pipeline | try-catch with detailed logging |
| Parallel Processing | Use parallel streams | 5-10x faster | For document-level operations |
| Monitoring | Track processing metrics | Debugging | Log counts, durations, errors |

Common Anti-Patterns

| Anti-Pattern | Problem | Solution |
|---|---|---|
| Chunking by character only | Splits mid-sentence, breaks context | Use recursive splitter with delimiters |
| No metadata filtering | Expensive vector search on entire corpus | Add metadata filters before vector search |
| Re-embedding duplicate texts | Wasted API costs | Cache embeddings with hash key |
| Fixed-size for all docs | Breaks structure, ignores content type | Use structure-aware for code/docs |
| No quality filtering | Garbage in, garbage out | Filter by quality score ≥ 0.5 |
| Sequential processing | Slow, doesn't utilize hardware | Use parallel streams for document ops |
| Ignoring embedding costs | Can exceed budget quickly | Cache, batch, and deduplicate first |
| No error recovery | One bad file stops entire pipeline | Catch exceptions, continue processing |
| No monitoring | Can't detect performance issues | Track metrics and set alerts |

Good vs. Bad Practices Comparison

Real-World Optimization Case Study

Enterprise Knowledge Base (100,000 documents)

Before Optimization:
- Documents: 100,000 (unfiltered)
- Duplicates: 20,000 (20%)
- Low-quality: 15,000 (15%)
- Chunks: 500,000 (256 tokens each)
- Embedding cost: $500/month
- Query latency: 2.5 seconds
- Retrieval precision: 65%

After Optimization:
- Documents: 65,000 (after quality + dedup)
- Chunks: 150,000 (512 tokens, recursive)
- Embedding cost: $150/month (70% reduction)
- Query latency: 0.4 seconds (6x faster)
- Retrieval precision: 82% (26% improvement)

ROI: 6x faster queries, 70% cost reduction, 26% accuracy improvement

2.9 Interview Q&A

Q1: How to choose the optimal chunk size for a RAG system?

Key Considerations:

  1. Document Type:

    • FAQ/short answers: 256-384 tokens (high precision)
    • Technical docs: 512-768 tokens (balance context and precision)
    • Legal/medical: Semantic chunking (preserve meaning over size)
    • Books/reports: Hierarchical (parent 2048, child 512)
  2. Query Type:

    • Factoid queries ("What is X?"): Smaller chunks (256-384)
    • Explanatory queries ("How does X work?"): Medium chunks (512-768)
    • Context-heavy ("Summarize this document"): Large chunks (1024+)
  3. Testing Approach: A/B test different chunk sizes, measure precision and recall

Rule of Thumb: Start with 512 tokens, 20% overlap. Optimize based on retrieval metrics.

Q2: Why is metadata filtering important in RAG systems?

Performance Impact:

  1. Search Space Reduction: 10x faster (100K docs → 10K docs with metadata filter)
  2. Precision Improvement: 20% higher (65% → 85% with year/category filters)
  3. Cost Reduction: 30-60% (fewer vector similarity calculations)

Key Insight: Metadata filtering is the most cost-effective optimization in RAG systems.

Q3: How to handle duplicate documents in RAG corpus?

Three Levels of Deduplication:

| Level | Technique | Precision | Cost | Recommendation |
|---|---|---|---|---|
| 1 | Exact duplicates (hash-based) | 100% | Low | Always use |
| 2 | Near-duplicates (MinHash 0.85) | ~85% | Medium | Production standard |
| 3 | Semantic duplicates (embeddings) | ~95% | High | Small corpora only |

Impact: Typical corpus sees 15-30% duplicates removed, with 15-30% storage savings and query speedup.

Q4: When should I use semantic vs. recursive chunking?

| Factor | Recursive Chunking | Semantic Chunking |
|---|---|---|
| Cost | Free | $0.001-0.01 per page |
| Speed | Fast | Medium (embedding generation) |
| Quality | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Best For | Default choice | Complex, nuanced content |

Use semantic when: Document value justifies cost, semantic boundaries are critical (legal, medical, financial)

Q5: How do you optimize embedding costs for large corpora?

Optimization Strategies (in order of impact):

  1. Deduplication first: Remove 15-30% of content before embedding
  2. Quality filtering: Filter low-quality docs (20-30% reduction)
  3. Batch processing: Group multiple embeddings (50-70% savings)
  4. Caching: Cache repetitive content (80% savings on re-processing)
  5. Model selection: Use cheaper models (90% cost reduction, slight quality trade-off)

Real-world impact: Typical cost reduction of 70-90% with minimal quality loss.


Chapter Summary

Key Takeaways

1. Document Loading & Parsing:

  • Multi-format support (PDF, HTML, Markdown, DOCX)
  • Format-specific challenges and solutions
  • Comprehensive metadata (source, type, size, hash) enables downstream filtering

2. Data Cleaning:

  • Pipeline approach with chain-of-responsibility pattern
  • Multi-dimensional quality scoring (length, meaningful content, structure, diversity)
  • Exact and near-deduplication (MinHash with 85% threshold)

3. Intelligent Chunking:

  • Recursive chunking as default (512-768 tokens, 10-20% overlap)
  • Structure-aware for code, Markdown, PDFs
  • Semantic chunking for complex documents (when cost justifies)
  • Hierarchical for multi-scale retrieval needs

4. Metadata Enrichment:

  • Automatic extraction (dates, categories, language)
  • LLM-based extraction (summaries, topics, sentiment)
  • Rich metadata enables 10-100x faster pre-filtering

5. Embedding & Storage:

  • Batch processing reduces API calls by 50-70%
  • Caching provides 80% cost savings for repetitive content
  • HNSW indexing for fast approximate search

6. Performance Optimization:

  • Parallel processing: 5-10x speedup
  • Quality filtering: 20-30% cost reduction
  • Embedding caching: 80% savings on re-processing

Next Steps


Practice Projects:

  • Build a technical documentation RAG system
  • Implement semantic chunking for legal contracts
  • Create a hierarchical chunking system for books
  • Optimize embedding costs with caching

Production Checklist:

  • Implement quality filtering (score ≥ 0.5)
  • Add metadata extraction (temporal, categorical)
  • Enable deduplication (exact + near)
  • Use recursive chunking (512 tokens, 20% overlap)
  • Cache embeddings with 7-day TTL
  • Create HNSW indexes (M=16, ef=100)
  • Add monitoring (processing metrics, query stats)