7. Advanced RAG Techniques
"The future of RAG is not retrieval alone, but adaptive intelligence that knows when to retrieve, how to reason, and what to optimize." - Advanced RAG Principle
This chapter covers frontier RAG techniques that solve specific production challenges: modular architectures (dynamic routing, iterative retrieval), knowledge graph integration (GraphRAG), agentic systems (Self-RAG, Corrective RAG), fine-tuning fusion (domain adaptation, RAFT), and performance optimization (caching, quantization).
7.1 RAG Challenges & Decision Matrix
7.1.1 The Production Gap
Basic RAG systems suffer from fundamental limitations in production environments:
The Six Fundamental Challenges:
| Challenge | Symptom | Root Cause | Impact |
|---|---|---|---|
| Hallucination | Answer invents facts not in documents | LLM relies on pre-training instead of context | Loss of trust, legal risk |
| Poor Retrieval Accuracy | Retrieved documents miss key information | Single-hop vector search insufficient | Incomplete answers |
| High Latency & Cost | Slow responses, expensive API calls | All queries get full processing | Poor UX, budget overruns |
| Limited Reasoning Depth | Cannot connect multiple facts | No multi-hop inference | Shallow answers |
| Domain Adaptation Gap | General embeddings miss domain jargon | Models trained on general corpus | Poor retrieval in specialized fields |
| Rigid Linear Pipeline | Simple and complex queries treated equally | Fixed retrieve-then-generate flow | Inefficiency, wasted compute |
7.1.2 Decision Matrix
Which technique solves which problem?
Comprehensive Decision Matrix:
| Problem | Primary Solution | Production Readiness | Complexity | Expected Improvement | Use Case |
|---|---|---|---|---|---|
| Static pipeline inefficiency | Modular RAG | High | Medium | 30-40% cost reduction | Mixed query complexity |
| Multi-hop reasoning failure | GraphRAG | Very High | High | 2-3x MRR improvement | Complex relationships |
| Hallucination | Agentic RAG | Medium-High | High | 15-20% accuracy gain | Accuracy-critical apps |
| Domain knowledge gap | RAG + Fine-tuning | High | Very High | 20-30% domain QA gain | Specialized domains |
| Performance/cost issues | Optimization | Very High | Low-Medium | 90% latency reduction | All production systems |
| Low retrieval precision | Hybrid (Modular + Graph) | Medium | Very High | Combined benefits | Enterprise-grade systems |
2025 Insight: Technique Combination
Combining techniques often yields better results than any single approach:
- Modular RAG + GraphRAG → Adaptive + relational reasoning
- Agentic RAG + Fine-tuning → Self-reflective + domain expertise
- Optimization + Any technique → Production-ready performance
7.2 Modular RAG - From Linear to Adaptive
7.2.1 Paradigm Shift: Linear vs Modular
Linear RAG treats all queries identically:
- Query → Embed → Retrieve → Generate
- One-size-fits-all pipeline
- Wastes compute on simple queries
Modular RAG adapts to query characteristics:
- Query → Analyze → Route → Specialized Processing → Generate
- Different paths for different needs
- Efficient resource allocation
Key Differences:
| Aspect | Linear RAG | Modular RAG |
|---|---|---|
| Pipeline | Fixed retrieve-then-generate | Dynamic routing based on query |
| Query Analysis | None | Classification before routing |
| Retrieval Strategy | Always vector search | Vector, Graph, Web, or None |
| Compute Efficiency | Low (all queries get full processing) | High (tailored processing) |
| Latency | Uniform (often slow) | Variable (fast for simple queries) |
| Cost | High (unnecessary operations) | Optimized (minimal operations) |
7.2.2 Dynamic Routing (Semantic Router)
Concept: Classify queries and route to specialized processing paths.
Implementation in Java/Spring Boot:
// Semantic Router Service
@Service
public class SemanticRouter {
private final ChatModel llm;
private final VectorStore vectorStore;
private final GraphStore graphStore;
private final WebSearchService webSearch;
public enum Route {
VECTOR_DB, // Factual questions needing document retrieval
GRAPH_DB, // Relational questions requiring multi-hop reasoning
WEB_SEARCH, // Questions needing recent information
LLM_DIRECT // Common sense or general knowledge
}
public QueryRouteResult routeQuery(String query) {
// Step 1: Classify query using LLM
String classificationPrompt = """
Classify this query into one of these categories:
Query: %s
Categories:
1. VECTOR_DB - Factual questions about specific documents, procedures, or facts
2. GRAPH_DB - Questions requiring multi-hop reasoning, relationships, or connections
3. WEB_SEARCH - Questions about recent events, current prices, or time-sensitive data
4. LLM_DIRECT - Common sense questions, general knowledge, or conversational queries
Output only the category name.
""".formatted(query);
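// Normalize the reply; Route.valueOf throws on whitespace or an unknown label
String category = llm.call(classificationPrompt).trim().toUpperCase();
Route route;
try {
route = Route.valueOf(category);
} catch (IllegalArgumentException e) {
route = Route.VECTOR_DB; // Unknown label: fall back to plain vector retrieval
}
// Step 2: Score routing confidence with a second LLM call
double confidence = extractConfidence(query, route);
return new QueryRouteResult(route, confidence);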
}
public String processQuery(String query) {
QueryRouteResult routing = routeQuery(query);
return switch (routing.route()) {
case VECTOR_DB -> handleVectorRetrieval(query);
case GRAPH_DB -> handleGraphRetrieval(query);
case WEB_SEARCH -> handleWebSearch(query);
case LLM_DIRECT -> handleDirectLLM(query);
};
}
private String handleVectorRetrieval(String query) {
// Vector similarity search
List<Document> docs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
// Generate answer with retrieved context
return generateWithRAG(query, docs);
}
private String handleGraphRetrieval(String query) {
// Extract entities from query
List<String> entities = extractEntities(query);
// Graph traversal for multi-hop reasoning
List<GraphPath> paths = graphStore.findPaths(
entities,
3, // maxDepth
10 // maxPaths
);
// Convert paths to context
String context = formatGraphContext(paths);
// Generate answer with graph context
return generateWithRAG(query, context);
}
private String handleWebSearch(String query) {
// Live web search for recent information
List<WebResult> results = webSearch.search(query, 5); // 5 = maxResults
String context = formatWebResults(results);
return generateWithRAG(query, context);
}
private String handleDirectLLM(String query) {
// Direct LLM call without retrieval
String prompt = "Answer this question directly: %s".formatted(query);
return llm.call(prompt);
}
private double extractConfidence(String query, Route route) {
// Use LLM to score routing confidence
String confidencePrompt = """
Query: %s
Suggested Route: %s
Score your confidence in this routing decision (0.0 to 1.0):
- 1.0: Very confident this is the correct route
- 0.5: Moderately confident
- 0.0: Not confident at all
Output only the score.
""".formatted(query, route);
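String response = llm.call(confidencePrompt);
try {
// Clamp to [0.0, 1.0] in case the model returns an out-of-range value
return Math.max(0.0, Math.min(1.0, Double.parseDouble(response.trim())));
} catch (NumberFormatException e) {
return 0.0; // Unparseable reply: treat as not confident
}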
}
}
// Route Result Record
record QueryRouteResult(SemanticRouter.Route route, double confidence) {
public boolean isConfident() {
return confidence >= 0.7;
}
}
Routing Strategies Table:
| Query Pattern | Route | Confidence Threshold | Processing |
|---|---|---|---|
| "How do I configure..." | VECTOR_DB | > 0.7 | Vector search + RAG |
| "What is the relationship between..." | GRAPH_DB | > 0.7 | Graph traversal + RAG |
| "What is the current price of..." | WEB_SEARCH | > 0.7 | Web search + RAG |
| "Tell me a joke" | LLM_DIRECT | > 0.7 | Direct LLM response |
| Ambiguous / Low confidence | Fallback | < 0.7 | Multiple routes + ensemble |
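The low-confidence fallback in the last row can be sketched in a few lines. This is a minimal Python illustration, not part of the Java service: `keyword_classifier` is a hypothetical stand-in for the LLM classifier, and "multiple routes" is assumed here to mean the predicted route plus plain vector search.

```python
from typing import Callable, List, Tuple

ROUTES = ("VECTOR_DB", "GRAPH_DB", "WEB_SEARCH", "LLM_DIRECT")
CONFIDENCE_THRESHOLD = 0.7  # same cut-off as QueryRouteResult.isConfident()


def keyword_classifier(query: str) -> Tuple[str, float]:
    """Toy stand-in for the LLM classifier (hypothetical, for illustration)."""
    q = query.lower()
    if "relationship between" in q:
        return "GRAPH_DB", 0.9
    if "current" in q or "latest" in q:
        return "WEB_SEARCH", 0.85
    if q.startswith("how do i"):
        return "VECTOR_DB", 0.8
    return "LLM_DIRECT", 0.4  # unsure: low confidence


def select_routes(
    query: str,
    classify: Callable[[str], Tuple[str, float]] = keyword_classifier,
) -> List[str]:
    """Return one route when confident, otherwise several routes to ensemble."""
    route, confidence = classify(query)
    if route not in ROUTES:
        # Unparseable classifier output: fall back to plain vector retrieval.
        return ["VECTOR_DB"]
    if confidence >= CONFIDENCE_THRESHOLD:
        return [route]
    # Low confidence: run the predicted route plus vector search, then merge.
    return list(dict.fromkeys([route, "VECTOR_DB"]))


print(select_routes("Tell me a joke"))
```

How the results of several routes are merged (voting, reranking, or a final LLM pass) is left open here; the table only names the ensemble step.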
Tools & Libraries:
| Tool | Language | Features | Integration |
|---|---|---|---|
| LangRouter | Python | Semantic routing, confidence scoring | LangChain |
| Semantic Router | Python | Fast semantic routing | LlamaIndex, LangChain |
| Spring AI | Java | Native routing support | Spring Boot |
| Custom Implementation | Any | Full control | Any framework |
7.2.3 Iterative Retrieval (ITER-RETGEN)
Problem: Single retrieval is often insufficient for complex, multi-part questions.
Solution: ITER-RETGEN (Iterative Retrieval-Generation): multi-round retrieval with answer refinement.
Algorithm:
- Generate initial answer with available context
- Identify information gaps (what's missing?)
- Generate new query based on gaps
- Retrieve additional documents
- Refine and complete answer
- Repeat until satisfied
Implementation in Java:
@Service
public class IterativeRetrievalService {
private final ChatModel llm;
private final VectorStore vectorStore;
private static final int MAX_ITERATIONS = 3;
public String iterativeRetrieve(String query) {
String answer = "";
String searchQuery = query; // Changes each round; the original question stays in `query`
Set<String> retrievedDocIds = new HashSet<>();
for (int iteration = 0; iteration < MAX_ITERATIONS; iteration++) {
// Step 1: Retrieve documents not seen in earlier rounds
List<Document> docs = retrieveDocuments(searchQuery, retrievedDocIds);
if (docs.isEmpty() && !answer.isEmpty()) {
break; // Nothing new to add; keep the current answer
}
docs.forEach(d -> retrievedDocIds.add(d.getId()));
// Step 2: Generate or refine the answer against the original question
answer = generateAnswer(query, docs, answer);
// Step 3: Check for information gaps
GapAnalysisResult gaps = analyzeGaps(query, answer);
if (!gaps.hasGaps()) {
return answer; // Satisfied with current answer
}
// Step 4: Generate a follow-up search query based on the gaps
searchQuery = generateFollowUpQuery(query, answer, gaps);
}
return answer;
}
private List<Document> retrieveDocuments(String query, Set<String> excludeIds) {
List<Document> allDocs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(10)
);
// Filter out already retrieved documents
return allDocs.stream()
.filter(doc -> !excludeIds.contains(doc.getId()))
.limit(5)
.toList();
}
private String generateAnswer(String query, List<Document> docs, String previousAnswer) {
String context = docs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
String prompt;
if (previousAnswer.isEmpty()) {
// Initial generation
prompt = """
Answer this question using the provided context:
Question: %s
Context:
%s
Provide a comprehensive answer. If the context is insufficient,
explicitly state what information is missing.
""".formatted(query, context);
} else {
// Refinement
prompt = """
Previous Answer:
%s
Additional Context:
%s
Original Question: %s
Refine the answer by incorporating the additional context.
Address any gaps in the previous answer.
""".formatted(previousAnswer, context, query);
}
return llm.call(prompt);
}
private GapAnalysisResult analyzeGaps(String query, String answer) {
String prompt = """
Analyze this answer for information gaps:
Question: %s
Answer: %s
Identify what information is missing or incomplete.
Output format:
HAS_GAPS: true/false
GAPS: [list of missing information]
Be conservative - only mark gaps if critical information is truly missing.
""".formatted(query, answer);
String response = llm.call(prompt);
return parseGapAnalysis(response);
}
private String generateFollowUpQuery(String originalQuery, String currentAnswer, GapAnalysisResult gaps) {
String prompt = """
Original Question: %s
Current Answer: %s
Identified Gaps: %s
Generate a follow-up search query to find the missing information.
Output only the search query.
""".formatted(originalQuery, currentAnswer, gaps.description());
return llm.call(prompt).trim();
}
}
// Gap Analysis Result
record GapAnalysisResult(boolean hasGaps, List<String> missingInfo) {
public String description() {
return String.join(", ", missingInfo);
}
}
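The service above calls `parseGapAnalysis` without defining it. A minimal Python sketch of that parsing step, assuming the model follows the `HAS_GAPS:` / `GAPS: [...]` format requested in the prompt; the `GapAnalysis` class and the choice to treat malformed output as "no gaps" are assumptions made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class GapAnalysis:
    has_gaps: bool
    missing_info: List[str] = field(default_factory=list)


def parse_gap_analysis(response: str) -> GapAnalysis:
    """Parse 'HAS_GAPS: true/false' and 'GAPS: [a, b]' from an LLM reply."""
    flag = re.search(r"HAS_GAPS:\s*(true|false)", response, re.IGNORECASE)
    gaps = re.search(r"GAPS:\s*\[(.*?)\]", response, re.IGNORECASE | re.DOTALL)

    missing: List[str] = []
    if gaps:
        missing = [item.strip() for item in gaps.group(1).split(",") if item.strip()]

    # A missing or malformed flag counts as "no gaps" so the retrieval loop
    # terminates instead of iterating on output it cannot interpret.
    flagged = bool(flag) and flag.group(1).lower() == "true"
    # A gap flag without any listed gap gives nothing to search for.
    return GapAnalysis(has_gaps=flagged and bool(missing), missing_info=missing)


result = parse_gap_analysis("HAS_GAPS: true\nGAPS: [pricing details, release date]")
print(result.has_gaps, result.missing_info)
```

Requesting structured output (for example JSON) from the model would make this step less brittle than regex parsing.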
Use Cases:
- Multi-part questions ("Compare X and Y, then recommend which is better for Z")
- Exploratory research ("Tell me about [topic], starting with basics and going deeper")
- Complex troubleshooting ("Debug this error: First check common causes, then rare ones")
Performance:
- 2-3x improvement in answer completeness
- 40-60% increase in latency (trade-off vs quality)
- Best for low-volume, high-value queries
7.3 GraphRAG - Knowledge Graph Enhanced
7.3.1 Why Graphs Complement Vectors
Vector databases find similarity (semantic closeness):
- "DNS" is similar to "network configuration"
- Great for: Factual retrieval
Knowledge graphs find relationships (structural connections):
- "DNS" → "depends on" → "routing"
- Great for: Multi-hop reasoning
Complementary Strengths:
| Aspect | Vector DB | Knowledge Graph | Combined |
|---|---|---|---|
| Finds | Similar content | Related entities | Both |
| Best for | Factual queries | Relational queries | Complex queries |
| Reasoning | Single-hop | Multi-hop | Adaptive |
| Example | "How to configure DNS" | "What does DNS depend on?" | Both scenarios |
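To make "multi-hop" concrete, the sketch below stores a few triples in an adjacency list and finds a relation path with breadth-first search. It is an illustrative Python example rather than a graph database API; the function names are made up for this example and the triples reuse the DNS examples from this chapter.

```python
from collections import defaultdict, deque
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def build_graph(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency list: subject -> [(predicate, object), ...]."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def find_path(graph, start: str, goal: str, max_depth: int = 3) -> Optional[List[str]]:
    """Breadth-first search returning the shortest relation path, or None."""
    queue = deque([(start, [start])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        hops = len(path) // 2  # path alternates entity, predicate, entity, ...
        if hops >= max_depth:
            continue
        for predicate, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [predicate, neighbor]))
    return None


triples = [
    ("AdGuard", "configures", "DNS"),
    ("DNS", "depends on", "routing"),
    ("DNS", "uses", "TCP port 53"),
]
graph = build_graph(triples)
print(find_path(graph, "AdGuard", "routing"))
```

A vector search for "AdGuard" would not surface "routing" unless some chunk mentions both; the two-hop path makes the connection explicit.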
7.3.2 Graph Construction
Technique: Extract (Entity, Relation, Entity) triples from unstructured text.
Process:
- Parse documents with LLM
- Identify entities (people, concepts, technologies)
- Extract relations (verbs, dependencies)
- Store as knowledge graph
Entity Extraction in Java:
@Service
public class KnowledgeGraphService {
private final ChatModel llm;
private final GraphStore graphStore;
public void buildGraphFromDocuments(List<Document> documents) {
for (Document doc : documents) {
List<Triple> triples = extractTriples(doc);
storeTriples(triples);
}
}
private List<Triple> extractTriples(Document doc) {
String prompt = """
Extract entity-relationship-entity triples from this text:
Text: %s
Extract triples in the format:
SUBJECT | PREDICATE | OBJECT
Examples:
- DNS | uses | TCP port 53
- AdGuard | configures | DNS
- Router | forwards | DNS queries
Only extract clear, factual relationships.
""".formatted(doc.getContent());
String response = llm.call(prompt);
return parseTriples(response);
}
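private List<Triple> parseTriples(String response) {
return response.lines()
.map(line -> line.replaceFirst("^\\s*[-*]\\s*", "")) // Drop list bullets echoed from the prompt
.map(line -> line.split("\\|"))
.filter(parts -> parts.length == 3) // Skip prose and malformed lines
.map(parts -> new Triple(parts[0].trim(), parts[1].trim(), parts[2].trim()))
.filter(t -> !t.subject().equals("SUBJECT")) // Skip an echoed format header
.toList();
}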
private void storeTriples(List<Triple> triples) {
for (Triple triple : triples) {
// Create or get nodes
Node subject = graphStore.getOrCreateNode(triple.subject());
Node object = graphStore.getOrCreateNode(triple.object());
// Create relationship
Relationship rel = subject.createRelationshipTo(
object,
triple.predicate()
);
graphStore.save(rel);
}
}
public List<GraphPath> findPaths(String startEntity, String endEntity, int maxDepth) {
return graphStore.findPaths(
startEntity,
endEntity,
maxDepth,
10 // limit: maximum number of paths to return
);
}
}
// Triple Record
record Triple(String subject, String predicate, String object) {}
Triple Extraction Patterns:
| Pattern | Example | Subject | Predicate | Object |
|---|---|---|---|---|
| Technology Dependency | "DNS uses TCP port 53" | DNS | uses | TCP port 53 |
| Configuration | "AdGuard configures DNS" | AdGuard | configures | DNS |
| Causality | "Misconfigured DNS causes resolution failure" | Misconfigured DNS | causes | resolution failure |
| Part-Of | "TCP is part of the network stack" | TCP | part-of | network stack |
| Location | "Config file is in /etc/dns" | Config file | located-in | /etc/dns |
Tools for Graph Construction:
| Tool | Language | Type | Best For |
|---|---|---|---|
| Neo4j | Cypher, Java | Graph Database | Production systems |
| NebulaGraph | nGQL, Java | Distributed Graph | Large-scale graphs |
| Microsoft GraphRAG | Python | Automated Pipeline | Quick setup |
| NetworkX | Python | Library | Prototyping |
7.3.3 Graph+Vector Hybrid Retrieval
Algorithm:
- Vector Search → Find anchor entities
- Graph Traversal → Get 2-hop neighbor nodes
- Merge → Combine text chunks + relation paths
- Feed to LLM with combined context
Implementation:
@Service
public class HybridRetrievalService {
private final VectorStore vectorStore;
private final GraphStore graphStore;
public HybridRetrievalResult hybridSearch(String query) {
// Phase 1: Vector search for anchor documents
List<Document> anchorDocs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
// Phase 2: Extract entities from anchor docs
Set<String> entities = extractEntities(anchorDocs);
// Phase 3: Graph traversal for related entities
List<GraphPath> paths = new ArrayList<>();
for (String entity : entities) {
List<GraphPath> entityPaths = graphStore.findNeighbors(
entity,
2, // maxDepth: 2-hop neighborhood
5 // limit: paths per entity
);
paths.addAll(entityPaths);
}
// Phase 4: Merge context
String combinedContext = buildCombinedContext(anchorDocs, paths);
return new HybridRetrievalResult(combinedContext, anchorDocs, paths);
}
private Set<String> extractEntities(List<Document> docs) {
Set<String> entities = new HashSet<>();
for (Document doc : docs) {
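// Entities are expected under the "entities" metadata key (populated at ingestion).
// Metadata values are plain Objects, so check the type before iterating (needs java.util.Collection).
Object value = doc.getMetadata().get("entities");
if (value instanceof Collection<?> names) {
names.forEach(name -> entities.add(name.toString()));
}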
}
return entities;
}
private String buildCombinedContext(List<Document> docs, List<GraphPath> paths) {
StringBuilder context = new StringBuilder();
context.append("=== Relevant Documents ===\n");
for (Document doc : docs) {
context.append(doc.getContent()).append("\n\n");
}
context.append("\n=== Related Entity Paths ===\n");
for (GraphPath path : paths) {
context.append(path.format())
.append("\n");
}
return context.toString();
}
}
// Hybrid Retrieval Result
record HybridRetrievalResult(
String combinedContext,
List<Document> documents,
List<GraphPath> graphPaths
) {}
Performance:
- 2-3x improvement on multi-hop queries
- Comparable performance on simple queries
- Best for: "How does X affect Y?" type questions
7.3.4 Community Summary (Microsoft GraphRAG)
Concept: Pre-generate summaries for node communities to handle macro-questions.
Process:
- Detect communities (clusters of related entities)
- Summarize each community
- Index community summaries
- For macro-questions ("What trends appear in this document?"), retrieve summaries
Local vs Global Retrieval:
| Aspect | Local Retrieval | Global Retrieval |
|---|---|---|
| Scope | Specific entities | Community summaries |
| Query Type | "What does X do?" | "What are the trends?" |
| Context Size | Small (specific) | Large (summarized) |
| Token Efficiency | Baseline | 97% reduction |
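The "detect communities" step can be illustrated without a graph library. The Python sketch below groups entities with union-find connected components. This is a deliberately simplified stand-in, not Louvain or any other modularity-based algorithm, and the entity names are illustrative.

```python
from typing import Dict, List, Set, Tuple


def connected_components(edges: List[Tuple[str, str]]) -> List[Set[str]]:
    """Group entities into components with union-find (NOT Louvain)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, Set[str]] = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Sort for a deterministic order of communities.
    return sorted(groups.values(), key=lambda group: sorted(group))


edges = [("DNS", "routing"), ("AdGuard", "DNS"), ("Invoice", "Payment")]
for community in connected_components(edges):
    print(sorted(community))
```

Each resulting group would then be summarized once and indexed, so a global question retrieves a handful of summaries instead of every underlying chunk. Real community detection also splits large connected components into denser sub-clusters, which this sketch does not attempt.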
Implementation:
@Service
public class CommunitySummaryService {
private final GraphStore graphStore;
private final ChatModel llm;
public void buildCommunitySummaries() {
// Step 1: Detect communities
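// Java has no named arguments, so the algorithm name is passed positionally.
// (Microsoft GraphRAG itself uses the Leiden algorithm.)
List<Community> communities = graphStore.detectCommunities("Louvain");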
// Step 2: Summarize each community
for (Community community : communities) {
String summary = summarizeCommunity(community);
community.setSummary(summary);
graphStore.save(community);
}
}
private String summarizeCommunity(Community community) {
String entitiesText = community.getEntities().stream()
.map(Node::getLabel)
.collect(Collectors.joining(", "));
String relationsText = community.getRelations().stream()
.map(Relationship::toString)
.collect(Collectors.joining("\n"));
String prompt = """
Summarize this community of entities and their relationships:
Entities: %s
Relationships:
%s
Provide a concise summary (2-3 sentences) describing:
1. What this community is about
2. Key patterns or themes
3. How entities relate to each other
""".formatted(entitiesText, relationsText);
return llm.call(prompt);
}
public List<String> retrieveCommunitySummaries(String query) {
// Embed query
float[] queryEmbedding = embed(query);
// Find similar community summaries
List<Community> similarCommunities = graphStore.searchSummaries(
queryEmbedding,
3 // topK
);
return similarCommunities.stream()
.map(Community::getSummary)
.toList();
}
}
Use Case: Macro-questions that require understanding themes, not specific facts.
Contextual Retrieval (Anthropic, 2024)
Anthropic's Contextual Retrieval method significantly improves retrieval quality.
Core idea: Prepend document-level context to each chunk so that both the embedding index and the BM25 index represent it more accurately.
# Contextual Chunk Generation
def create_contextual_chunk(document: str, chunk: str) -> str:
prompt = f"""<document>
{document}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""
context = llm.generate(prompt)
return f"{context}\n\n{chunk}" # Prepend context to chunk
Results: Across the benchmarks Anthropic reports, combining contextual embeddings with contextual BM25 reduced the retrieval failure rate by 49%, and by 67% when reranking was added.
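LazyGraphRAG (Microsoft, 2024)
LazyGraphRAG is a lightweight alternative to GraphRAG that substantially reduces cost:
- 700x+ lower cost: No expensive LLM extraction is needed; entities and relations are extracted with NLP techniques
- Query-time graph construction: The relevant subgraph is built only at query time instead of pre-building the full graph
- Suitable for: Small-to-medium document collections and budget-constrained projects
LightRAG
LightRAG is another lightweight graph retrieval approach:
- Dual-level retrieval: Supports both specific factual queries and abstract conceptual queries
- Incremental updates: New documents can be inserted incrementally without rebuilding the index
- Lower resource requirements: Significantly lower compute and storage needs than GraphRAG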
Multimodal RAG
Retrieval and generation are not limited to text; they can also cover images, tables, and document layouts:
| Method | Description | Best For |
|---|---|---|
| Text-based | OCR + text RAG | Scanned documents |
| ColPali | Visual token matching | Document screenshots, PDFs |
| Multimodal Embedding | Image+text joint embeddings | Mixed media collections |
| Native Multimodal | Send images directly to VLM | Complex visual content |
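# ColPali: Vision-based document retrieval (follows the colpali-engine README usage)
import torch
from colpali_engine.models import ColPali, ColPaliProcessor
model = ColPali.from_pretrained("vidore/colpali-v1.2", torch_dtype=torch.bfloat16).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")
# Index document pages as images (load_image is a placeholder that returns a PIL image)
images = [load_image(page) for page in document_pages]
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(["What is the revenue?"]).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)
# Each page and each query is a set of vectors, so score with late interaction (MaxSim)
# rather than a single matrix product
scores = processor.score_multi_vector(query_embeddings, page_embeddings)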
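ColPali-style models represent a page as many patch vectors and a query as many token vectors, and compare them with late interaction (MaxSim): each query token takes its best-matching patch, and those maxima are summed. A pure-Python sketch with made-up 2-D vectors, intended only to show the arithmetic:

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching patch similarity."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Toy embeddings: two query tokens, two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8]]  # one patch matches each query token well
page_b = [[0.5, 0.5]]              # a single, generic patch

scores = {
    "page_a": maxsim_score(query, page_a),
    "page_b": maxsim_score(query, page_b),
}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Because every query token is matched independently, a page can rank highly when different regions answer different parts of the query, which a single pooled vector per page cannot capture.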
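Agentic RAG (Supplement)
Agentic RAG brings an agent's autonomous decision-making into the RAG pipeline.
Core capabilities:
- Adaptive retrieval: The agent selects a retrieval strategy automatically based on the query type
- Iterative retrieval: If the first retrieval is insufficient, the agent can rewrite the query and retrieve again
- Multi-source fusion: Vector search, keyword search, knowledge graphs, and SQL queries are used together
- Self-evaluation: The agent judges whether the retrieved results are sufficient to answer the question
CorpusGraph (arXiv:2604.14572) proposes a new Agentic RAG paradigm in which an agent actively navigates an enterprise knowledge corpus instead of retrieving passively. Traditional RAG systems cannot backtrack over or combine scattered pieces of evidence; CorpusGraph has the agent learn the corpus's organizational structure and move between documents, backtrack, and integrate information the way a human researcher does. This line of work is pushing Agentic RAG from "retrieval augmentation" toward "knowledge navigation".
UniDoc-RL (arXiv:2604.14967), in turn, brings RL into visual RAG: it uses hierarchical actions and dense rewards to handle complex reasoning over fine-grained visual semantics, integrating external visual knowledge into large vision-language models.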
7.4 Agentic RAG - Autonomous Reasoning
7.4.1 From Reader to Researcher
Agentic RAG: LLM with tool use and self-reflection capabilities.
Shift: Passive responder → Active researcher
Key Capabilities:
- Tool Use: Can call external tools (retrieval, calculator, code executor)
- Self-Reflection: Evaluates own outputs for quality
- Iteration: Improves answers through multiple rounds
- Memory: Maintains conversation context
7.4.2 Self-RAG (Self-Reflective RAG)
Research: Self-RAG (Asai et al., 2023, ICLR 2024)
Core Mechanism: Self-scoring during generation with reflection tokens.
Reflection Tokens:
- Need Retrieval? - Should I search for information?
- Is Relevant? - Is retrieved content useful?
- Is Supported? - Is answer grounded in evidence?
Reflection Token Thresholds:
| Reflection Token | Question | Threshold | Action |
|---|---|---|---|
| Need Retrieval? | Do I need external information? | > 0.7 | Trigger retrieval |
| Is Relevant? | Is retrieved context useful? | > 0.6 | Use context |
| Is Supported? | Is claim grounded in evidence? | > 0.8 | Include claim |
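Applied in code, the thresholds are plain comparisons. The Python sketch below assumes each reflection step yields a score in [0, 1]; how that score is produced (token probabilities, a separate grader prompt, and so on) is not specified here, and the names are illustrative.

```python
from typing import List, Tuple

# Thresholds copied from the table above.
THRESHOLDS = {
    "need_retrieval": 0.7,
    "is_relevant": 0.6,
    "is_supported": 0.8,
}


def passes(token: str, score: float) -> bool:
    """True when the score is strictly above the table's threshold."""
    return score > THRESHOLDS[token]


def filter_claims(claims: List[Tuple[str, float]]) -> List[str]:
    """Keep only claims whose support score clears the Is Supported? threshold."""
    return [text for text, score in claims if passes("is_supported", score)]


claims = [
    ("DNS uses TCP port 53", 0.93),
    ("DNS was invented in 1990", 0.41),
]
print(filter_claims(claims))
```

The strictest threshold sits on the last step because an unsupported claim reaching the user is the costliest failure of the three.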
Implementation:
@Service
public class SelfRAGService {
private final ChatModel llm;
private final VectorStore vectorStore;
public String selfRAG(String query) {
// Reflection 1: Need Retrieval?
boolean needRetrieval = checkNeedRetrieval(query);
String context = "";
if (needRetrieval) {
List<Document> docs = retrieveWithReflection(query);
context = formatContext(docs);
}
// Generate with reflection loop
String answer;
int attempts = 0;
do {
answer = generateAnswer(query, context);
// Reflection 3: Is Supported?
if (checkIsSupported(query, answer, context)) {
return answer;
}
// Not supported: regenerate (the prompt is unchanged here, so this relies on sampling variation)
attempts++;
} while (attempts < 3);
return answer;
}
private boolean checkNeedRetrieval(String query) {
String prompt = """
Query: %s
Do you need to retrieve external information to answer this query accurately?
Consider:
- Is this about specific documents or facts not in your training data?
- Is this about recent events or current information?
- Is this a common sense question you can answer directly?
Output: NEED_RETRIEVAL or NO_RETRIEVAL
""".formatted(query);
String response = llm.call(prompt);
return response.contains("NEED_RETRIEVAL");
}
private List<Document> retrieveWithReflection(String query) {
List<Document> candidates = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(10)
);
// Reflection 2: Is Relevant?
List<Document> relevantDocs = new ArrayList<>();
for (Document doc : candidates) {
if (checkIsRelevant(query, doc)) {
relevantDocs.add(doc);
}
}
return relevantDocs;
}
private boolean checkIsRelevant(String query, Document doc) {
String prompt = """
Query: %s
Document: %s
Is this document relevant to answering the query?
Output: RELEVANT or NOT_RELEVANT
""".formatted(query, doc.getContent());
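String label = llm.call(prompt).trim().toUpperCase();
// "NOT_RELEVANT" contains "RELEVANT", so rule out the negative label first
return !label.contains("NOT_RELEVANT") && label.contains("RELEVANT");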
}
private boolean checkIsSupported(String query, String answer, String context) {
String prompt = """
Query: %s
Answer: %s
Context: %s
Is the answer supported by the context?
Check:
- Are all claims in the answer present in the context?
- Does the answer contradict the context?
- Does the answer rely on external knowledge not in context?
Output: SUPPORTED or NOT_SUPPORTED
""".formatted(query, answer, context);
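String label = llm.call(prompt).trim().toUpperCase();
// "NOT_SUPPORTED" contains "SUPPORTED", so rule out the negative label first
return !label.contains("NOT_SUPPORTED") && label.contains("SUPPORTED");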
}
private String generateAnswer(String query, String context) {
String prompt = context.isEmpty()
? "Answer: %s".formatted(query)
: """
Context: %s
Question: %s
Answer the question using only the provided context.
If the context is insufficient, state that clearly.
""".formatted(context, query);
return llm.call(prompt);
}
}
Performance: 15-20% accuracy improvement on complex QA tasks.
7.4.3 Corrective RAG (CRAG)
Research: Corrective RAG (Yan et al., 2024)
Core Mechanism: Lightweight retrieval evaluator + fallback mechanisms.
Algorithm:
- Retrieve documents
- Evaluate retrieval quality (confidence score)
- If Poor → Trigger Web Search or fallback
- If Good → Proceed with generation
- Generate and verify
Implementation:
@Service
public class CorrectiveRAGService {
private final ChatModel llm;
private final VectorStore vectorStore;
private final WebSearchService webSearch;
public String correctiveRAG(String query) {
// Step 1: Initial retrieval
List<Document> docs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
// Step 2: Evaluate retrieval quality
RetrievalQuality quality = evaluateRetrieval(query, docs);
// Step 3: Route based on quality
if (quality.isPoor()) {
// Trigger web search fallback
List<WebResult> webResults = webSearch.search(query);
docs = convertWebResultsToDocs(webResults);
}
// Step 4: Generate answer
String answer = generateAnswer(query, docs);
// Step 5: Verify answer
if (verifyAnswer(answer, docs)) {
return answer;
} else {
// Fallback to web search if verification fails
List<WebResult> webResults = webSearch.search(query);
return generateAnswer(query, convertWebResultsToDocs(webResults));
}
}
private RetrievalQuality evaluateRetrieval(String query, List<Document> docs) {
String prompt = """
Query: %s
Retrieved Documents:
%s
Evaluate the quality of these documents for answering the query.
Criteria:
1. Relevance: Do the documents address the query?
2. Completeness: Is sufficient information present?
3. Accuracy: Is the information consistent and reliable?
Score: 0-100
""".formatted(query, formatDocs(docs));
String response = llm.call(prompt);
// Extract score from response
int score = extractScore(response);
return new RetrievalQuality(score);
}
private boolean verifyAnswer(String answer, List<Document> docs) {
String prompt = """
Answer: %s
Source Documents:
%s
Verify this answer:
1. Is the answer grounded in the documents?
2. Are there any hallucinations?
3. Is the answer complete?
Output: VERIFIED or NOT_VERIFIED
""".formatted(answer, formatDocs(docs));
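String label = llm.call(prompt).trim().toUpperCase();
// "NOT_VERIFIED" contains "VERIFIED", so rule out the negative label first
return !label.contains("NOT_VERIFIED") && label.contains("VERIFIED");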
}
}
record RetrievalQuality(int score) {
public boolean isPoor() {
return score < 50;
}
public boolean isGood() {
return score >= 70;
}
}
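The evaluator calls `extractScore` without defining it, and the two record methods leave scores from 50 to 69 in neither band. A minimal Python sketch of both pieces, assuming the model replies with something like `Score: 85`; the function names and the "ambiguous" label are choices made for this example.

```python
import re


def extract_score(response: str, default: int = 0) -> int:
    """Pull the first integer after 'Score:' and clamp it to 0-100."""
    match = re.search(r"score\s*[:=]\s*(\d{1,3})", response, re.IGNORECASE)
    if not match:
        return default  # unparseable output is treated as poor retrieval
    return max(0, min(100, int(match.group(1))))


def grade(score: int) -> str:
    """Map a score to the bands used by RetrievalQuality."""
    if score < 50:
        return "poor"
    if score >= 70:
        return "good"
    return "ambiguous"  # 50-69: neither isPoor() nor isGood()


reply = "The documents cover the topic but omit pricing.\nScore: 62"
print(extract_score(reply), grade(extract_score(reply)))
```

One reasonable policy for the ambiguous band is to keep the retrieved documents and add web results rather than replace them, but that choice depends on the application.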
Performance: 15-20% accuracy improvement, especially on queries with poor initial retrieval.
7.4.4 Tool Use
Scenario: RAG retrieves "2023 revenue data"; the user asks for the "YoY growth rate"
Problem: LLMs are unreliable at arithmetic on raw numbers
Solution: Agent retrieves data → Calls Python interpreter → Returns computed result
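The computation the agent delegates in this scenario is small. A Python sketch with hypothetical revenue figures standing in for the retrieved data:

```python
def yoy_growth(previous: float, current: float) -> float:
    """Year-over-year growth in percent: (current - previous) / previous * 100."""
    if previous == 0:
        raise ValueError("previous-period value must be non-zero")
    return (current - previous) / previous * 100.0


# Hypothetical figures; a real agent would take these from retrieved documents.
revenue = {2022: 120.0, 2023: 150.0}
growth = yoy_growth(revenue[2022], revenue[2023])
print(f"YoY growth: {growth:.1f}%")
```

Note that the growth rate needs the prior year's figure as well, so the retrieval step has to fetch both years before the calculation can run.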
@Service
public class ToolUseAgent {
private final ChatModel llm;
private final VectorStore vectorStore;
private final CalculatorTool calculator;
private final CodeExecutorTool codeExecutor;
public String executeWithTools(String query) {
// Step 1: Decide which tools to use
ToolPlan plan = decideTools(query);
// Step 2: Execute tools in sequence
StringBuilder context = new StringBuilder();
for (ToolAction action : plan.actions()) {
String result = executeTool(action);
context.append(result).append("\n\n");
}
// Step 3: Generate final answer with tool results
return generateAnswer(query, context.toString());
}
private ToolPlan decideTools(String query) {
String prompt = """
Query: %s
Available tools:
1. RETRIEVE - Search document database
2. CALCULATOR - Perform calculations
3. CODE_EXECUTOR - Execute Python code
Plan which tools to use and in what order.
Output format:
TOOL: tool_name
INPUT: input_for_tool
Repeat for each tool needed.
""".formatted(query);
String response = llm.call(prompt);
return parseToolPlan(response);
}
private String executeTool(ToolAction action) {
return switch (action.tool()) {
case "RETRIEVE" -> {
List<Document> docs = vectorStore.similaritySearch(
SearchRequest.query(action.input()).withTopK(5)
);
yield formatDocs(docs);
}
case "CALCULATOR" -> calculator.calculate(action.input());
case "CODE_EXECUTOR" -> codeExecutor.execute(action.input());
default -> "Unknown tool";
};
}
}
// Tool Records
record ToolPlan(List<ToolAction> actions) {}
record ToolAction(String tool, String input) {}
@Component
class CalculatorTool {
public String calculate(String expression) {
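// The script engine evaluates arbitrary code, so accept only digits,
// whitespace, and arithmetic symbols before passing model output to it
if (!expression.matches("[0-9+\\-*/().\\s]+")) {
return "Error: unsupported characters in expression";
}
// Nashorn was removed from the JDK in Java 15; getEngineByName returns null
// unless a JavaScript engine (e.g. standalone Nashorn or GraalJS) is on the classpath
ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
if (engine == null) {
return "Error: no JavaScript engine available";
}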
try {
Object result = engine.eval(expression);
return "Calculation result: " + result;
} catch (ScriptException e) {
return "Error: " + e.getMessage();
}
}
}
@Component
class CodeExecutorTool {
public String execute(String code) {
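// WARNING: this runs model-generated code directly on the host and is NOT sandboxed.
// Production use requires isolation (container, resource limits, no network) and a timeout.
ProcessBuilder pb = new ProcessBuilder("python3", "-c", code);
pb.redirectErrorStream(true); // Merge stderr so Python errors reach the caller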
try {
Process process = pb.start();
BufferedReader reader = new BufferedReader(
new InputStreamReader(process.getInputStream())
);
StringBuilder output = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
output.append(line).append("\n");
}
return output.toString();
} catch (IOException e) {
return "Error executing code: " + e.getMessage();
}
}
}
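For the calculator specifically, a general-purpose script engine is more power than the task needs. One alternative, sketched in Python under the assumption that only basic arithmetic must be supported, is to parse the expression and evaluate only whitelisted node types, so that function calls, names, and attribute access are rejected outright.

```python
import ast
import operator

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str) -> float:
    """Evaluate +, -, *, / on numeric literals; reject everything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_eval("(150 - 120) / 120 * 100"))
```

Exponentiation is left out on purpose, since an expression such as `9**9**9` could exhaust CPU and memory. Malformed input raises `SyntaxError` from `ast.parse`, and division by zero raises `ZeroDivisionError`; a tool wrapper would catch these and return an error string.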
Common Tools:
| Tool | Use Case | Example |
|---|---|---|
| RETRIEVE | Search documents | "Find revenue data for 2023" |
| CALCULATOR | Perform calculations | "Calculate YoY growth" |
| CODE_EXECUTOR | Run Python/JavaScript | "Plot this data" |
| WEB_SEARCH | Get recent information | "Latest stock price" |
| DATABASE_QUERY | Query SQL database | "Get user count" |
7.5 RAG + Fine-tuning Fusion
7.5.1 Complementary, Not Competing
Concept: RAG + Fine-tuning > Either alone
Complementarity:
| Aspect | RAG | Fine-tuning | Combined |
|---|---|---|---|
| Knowledge | External (documents) | Internal (weights) | Both |
| Updates | Real-time (add docs) | Slow (retrain) | Flexible |
| Cost | Per-query (API calls) | One-time (training) | Balanced |
| Domain Adaptation | Weak | Strong | Optimal |
| Hallucination | Possible (wrong retrieval) | Possible (wrong training) | Reduced |
7.5.2 Embedding Fine-tuning (Domain Adaptation)
Problem: General-purpose embedding models (e.g., OpenAI's) are trained on broad corpora and often represent industry jargon poorly.
Domains: Medical (terminology), Legal (case law), Finance (regulations).
Solution: Train domain-specific embedding models.
Data Construction:
# Sketch: contrastive learning for embedding fine-tuning (PyTorch)
import torch
import torch.nn.functional as F

def construct_training_data(documents):
    """
    Create (anchor, positive, negative) triplets for contrastive learning.
    Positive: same topic, similar content.
    Negative: different topic or dissimilar content.
    find_similar_document / find_dissimilar_document are placeholders
    for your own pair-mining logic.
    """
    training_pairs = []
    for doc in documents:
        positive = find_similar_document(doc, documents)
        negative = find_dissimilar_document(doc, documents)
        training_pairs.append({
            "anchor": doc,
            "positive": positive,
            "negative": negative,
        })
    return training_pairs

def contrastive_loss(anchor_emb, positive_emb, negative_emb, temperature=0.07):
    """
    InfoNCE loss with one negative per anchor.
    Pulls similar items together and pushes dissimilar ones apart.
    Inputs are (batch, dim) tensors; returns the mean loss over the batch.
    """
    # Temperature-scaled similarity scores
    pos_sim = F.cosine_similarity(anchor_emb, positive_emb, dim=-1) / temperature
    neg_sim = F.cosine_similarity(anchor_emb, negative_emb, dim=-1) / temperature
    # -log(exp(pos) / (exp(pos) + exp(neg))), computed stably as cross-entropy
    # with the positive in column 0
    logits = torch.stack([pos_sim, neg_sim], dim=-1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
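To make the loss concrete, here is a small worked example in plain Python (no ML framework) that evaluates the same formula on 2-dimensional toy vectors. The vectors and the helper name `info_nce` are illustrative assumptions. The point is only to show how the loss behaves: near zero when the anchor is much closer to the positive than to the negative, `log(2)` when the two are indistinguishable, and large when they are swapped.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negative, temperature=0.07):
    """-log(exp(pos) / (exp(pos) + exp(neg))) with temperature-scaled cosine scores."""
    pos = cosine_similarity(anchor, positive) / temperature
    neg = cosine_similarity(anchor, negative) / temperature
    # Log-sum-exp trick keeps exp() from overflowing at low temperatures.
    m = max(pos, neg)
    log_denominator = m + math.log(math.exp(pos - m) + math.exp(neg - m))
    return log_denominator - pos


anchor = [1.0, 0.0]
close = [0.9, 0.1]   # nearly parallel to the anchor
far = [0.0, 1.0]     # orthogonal to the anchor

well_separated = info_nce(anchor, positive=close, negative=far)
indistinguishable = info_nce(anchor, positive=close, negative=close)
swapped = info_nce(anchor, positive=far, negative=close)

if __name__ == "__main__":
    print(f"well separated:    {well_separated:.6f}")
    print(f"indistinguishable: {indistinguishable:.6f}")
    print(f"swapped:           {swapped:.6f}")
```

A lower temperature sharpens the contrast: the same geometric gap between positive and negative produces a larger difference in scaled scores, and therefore a steeper gradient.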
Training Loop:
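# Sketch: fine-tuning loop (PyTorch)
import torch

def fine_tune_embedding_model(model, training_data, epochs=10):
    # training_data: a sequence of batches, each a dict with
    # "anchor", "positive", and "negative" entries
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in training_data:
            # Forward pass. model.encode is assumed to return tensors that
            # track gradients; inference-only encode methods that disable
            # gradients will not train the model.
            anchor_emb = model.encode(batch["anchor"])
            positive_emb = model.encode(batch["positive"])
            negative_emb = model.encode(batch["negative"])
            # Compute loss
            loss = contrastive_loss(anchor_emb, positive_emb, negative_emb)
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        # Average over the number of batches
        avg_loss = total_loss / len(training_data)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")
    return model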
Tools:
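- BGE-M3: Multilingual embedding model that can be fine-tuned
- E5: Embedding model family (English and multilingual variants) that can be fine-tuned
- Sentence Transformers: Python library for training and fine-tuning embedding models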
Results: 10-15% retrieval improvement in specialized domains.
7.5.3 RAFT (Retrieval Augmented Fine Tuning)
Research: RAFT (Zhang et al., 2024)
Core Idea: Train LLM to "read RAG context correctly"
Data Format:
- Question
- Distractor Docs (noise documents to ignore)
- Relevant Docs (gold documents to use)
- Chain-of-Thought reasoning
- Answer (with citations)
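As a rough illustration of this data format, the Python sketch below assembles the prompt side of one training record. It shuffles gold and distractor documents together so that document position does not give away which one is relevant, and it records the resulting `[Doc N]` numbers so the reasoning and answer can cite them. The function name, field names, and the shuffling step are assumptions made for this example, not details taken from the RAFT paper or from the Java builder that follows.

```python
import json
import random


def build_raft_prompt(question, gold_docs, distractor_docs, seed=0):
    """Mix gold and distractor documents and number them for citation."""
    docs = [{"text": text, "is_gold": True} for text in gold_docs]
    docs += [{"text": text, "is_gold": False} for text in distractor_docs]
    # Seeded shuffle: reproducible, and gold documents are not always first.
    random.Random(seed).shuffle(docs)

    context = "\n\n".join(f"[Doc {i}] {doc['text']}" for i, doc in enumerate(docs))
    return {
        "question": question,
        "prompt": f"{context}\n\nQuestion: {question}",
        # Citation targets: the reasoning and answer should reference these numbers.
        "gold_doc_ids": [i for i, doc in enumerate(docs) if doc["is_gold"]],
    }


if __name__ == "__main__":
    record = build_raft_prompt(
        "What was 2023 revenue?",
        gold_docs=["2023 revenue was $12M."],
        distractor_docs=["The office moved in 2021.", "Headcount grew in 2022."],
    )
    print(json.dumps(record, indent=2))  # one JSON object per training example
```

Generate the chain-of-thought and answer after this step, so that their `[Doc N]` citations match the shuffled numbering.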
Data Construction:
@Service
public class RaftDataBuilder {
private final ChatModel llm;
private final VectorStore vectorStore;
public RaftTrainingExample buildRaftExample(String question, Document goldDocument) {
// Step 1: Find relevant documents (gold)
List<Document> relevantDocs = List.of(goldDocument);
// Step 2: Find distractor documents (similar but not relevant)
List<Document> distractorDocs = findDistractors(question, goldDocument, 4);
// Step 3: Generate chain-of-thought reasoning
String cot = generateCoT(question, relevantDocs, distractorDocs);
// Step 4: Generate answer with citations
String answer = generateAnswerWithCitations(question, relevantDocs, cot);
return new RaftTrainingExample(
question,
distractorDocs,
relevantDocs,
cot,
answer
);
}
private List<Document> findDistractors(String question, Document goldDoc, int k) {
// Retrieve similar documents
List<Document> candidates = vectorStore.similaritySearch(
SearchRequest.query(question).withTopK(20)
);
// Filter out the gold document
return candidates.stream()
.filter(doc -> !doc.getId().equals(goldDoc.getId()))
.filter(doc -> !isRelevant(doc, question)) // Must be irrelevant
.limit(k)
.toList();
}
private boolean isRelevant(Document doc, String question) {
// Use LLM to check relevance
String prompt = """
Question: %s
Document: %s
Is this document relevant to answering the question?
Output: RELEVANT or NOT_RELEVANT
""".formatted(question, doc.getContent());
String response = llm.call(prompt);
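// "NOT_RELEVANT" contains "RELEVANT", so a contains() check would always be true
return response.trim().toUpperCase().startsWith("RELEVANT");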
}
private String generateCoT(String question, List<Document> relevant, List<Document> distractors) {
String context = buildContext(relevant, distractors);
String prompt = """
Question: %s
Context:
%s
Generate step-by-step reasoning to answer the question.
Instructions:
1. Identify which documents are relevant (ignore distractors)
2. Extract key information from relevant documents
3. Reason through to the answer
4. Cite sources using [Doc N]
Format your reasoning clearly with numbered steps.
""".formatted(question, context);
return llm.call(prompt);
}
private String generateAnswerWithCitations(String question, List<Document> relevant, String cot) {
String prompt = """
Question: %s
Reasoning:
%s
Based on the reasoning above, provide a concise answer.
Include citations using [Doc N] format where:
Doc 0: %s
Doc 1: %s
...etc
""".formatted(
question,
cot,
relevant.get(0).getId(),
relevant.size() > 1 ? relevant.get(1).getId() : "N/A"
);
return llm.call(prompt);
}
private String buildContext(List<Document> relevant, List<Document> distractors) {
StringBuilder context = new StringBuilder();
context.append("=== Relevant Documents ===\n");
for (int i = 0; i < relevant.size(); i++) {
context.append(String.format("Doc %d: %s\n\n", i, relevant.get(i).getContent()));
}
context.append("\n=== Additional Documents ===\n");
for (int i = 0; i < distractors.size(); i++) {
context.append(String.format("Doc %d: %s\n\n",
i + relevant.size(),
distractors.get(i).getContent()
));
}
return context.toString();
}
}
// RAFT Training Example
record RaftTrainingExample(
String question,
List<Document> distractorDocs,
List<Document> relevantDocs,
String chainOfThought,
String answer
) {}
Training Objective:
- Learn to ignore distractor documents
- Focus on relevant documents
- Cite evidence properly
- Synthesize from multiple sources
Results: 20-30% improvement on domain QA tasks.
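7.6 Performance Optimization
7.6.1 Context Caching
Problem: The model re-processes the same long prompt prefix (system prompt, shared documents) on every request.
Solution: Reuse the provider-side KV cache for repeated prefixes (supported by, e.g., DeepSeek and Anthropic Claude).
Technique: Cache long system prompts and common document sets, and add an application-level response cache for repeated queries.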
@Service
public class CachedRAGService {
private final ChatModel llm;
private final VectorStore vectorStore;
private final Cache<String, String> promptCache;
public String cachedQuery(String query) {
// Step 1: Check cache
String cacheKey = generateCacheKey(query);
String cachedResponse = promptCache.getIfPresent(cacheKey);
if (cachedResponse != null) {
return cachedResponse + " (cached)";
}
// Step 2: Build cache-aware prompt
String systemPrompt = loadSystemPrompt(); // Cached by LLM provider
String context = retrieveContext(query);
// Step 3: Generate with caching enabled
String response = llm.call(ChatPrompt.builder()
.system(systemPrompt)
.user(buildUserPrompt(query, context))
.build()
);
// Step 4: Cache the response
promptCache.put(cacheKey, response);
return response;
}
private String retrieveContext(String query) {
List<Document> docs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
return docs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
}
private String buildUserPrompt(String query, String context) {
return """
Context:
%s
Question: %s
Answer the question using the context above.
""".formatted(context, query);
}
private String generateCacheKey(String query) {
// Simple hash-based cache key
return DigestUtils.sha256Hex(query);
}
}
Caching Strategies:
| Strategy | What to Cache | Benefit | Use Case |
|---|---|---|---|
| Prompt Cache | System prompts, instructions | 90% latency reduction for cached content | Repeated prompts |
| Document Cache | Frequently retrieved docs | Skip vector search | FAQ-type queries |
| Semantic Cache | Similar queries (embeddings) | Answer similar questions | High volume |
Results: 90% latency reduction for cached content.
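The semantic cache row in the table above can be sketched as follows: instead of keying on an exact hash of the query, store an embedding per cached answer and return a hit when a new query is similar enough. This Python sketch uses a toy bag-of-words `embed` function as a stand-in for a real embedding model, and the `0.8` threshold is an arbitrary assumption that would need tuning on real traffic.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached answer)

    def get(self, query):
        """Return the answer of the most similar cached query, or None on a miss."""
        query_emb = embed(query)
        best_score, best_answer = 0.0, None
        for emb, answer in self.entries:
            score = cosine(query_emb, emb)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None

    def put(self, query, answer):
        self.entries.append((embed(query), answer))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("what is the refund policy", "Refunds are accepted within 30 days.")
    print(cache.get("What is the refund policy today"))  # similar wording: hit
    print(cache.get("how do I reset my password"))       # unrelated: None
```

The linear scan is fine for a sketch; at volume the lookup would itself go through a vector index. A threshold that is too low returns wrong answers for merely related questions, so it is worth evaluating on logged queries before enabling.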
7.6.2 Speculative RAG
Problem: Large models are slow, small models are inaccurate.
Solution: Small model drafts → quality check → large model refines with retrieved context only when the draft scores poorly.
Implementation:
@Service
public class SpeculativeRAGService {
private final ChatModel smallModel; // Fast (e.g., GPT-4o-mini)
private final ChatModel largeModel; // Accurate (e.g., GPT-4o)
private final VectorStore vectorStore;
public String speculativeRAG(String query) {
// Phase 1: Small model draft
String draftAnswer = smallModel.call(
"Answer this question: %s".formatted(query)
);
// Phase 2: Quality check
double quality = checkQuality(query, draftAnswer);
if (quality > 0.8) {
// Draft is good enough, return it
return draftAnswer;
}
// Phase 3: Large model refinement with retrieval
List<Document> docs = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
String refinedPrompt = """
Draft Answer: %s
Retrieved Context:
%s
Question: %s
Refine the draft answer using the retrieved context.
Improve accuracy and completeness.
""".formatted(
draftAnswer,
formatDocs(docs),
query
);
return largeModel.call(refinedPrompt);
}
private double checkQuality(String query, String answer) {
// Use LLM to score quality
String prompt = """
Question: %s
Answer: %s
Score the quality of this answer (0.0 to 1.0):
- 1.0: Accurate, complete, well-structured
- 0.5: Partially correct, missing information
- 0.0: Incorrect or irrelevant
Output only the score.
""".formatted(query, answer);
String response = smallModel.call(prompt);
try {
return Double.parseDouble(response.trim());
} catch (NumberFormatException e) {
return 0.5; // Default to medium quality
}
}
private String formatDocs(List<Document> docs) {
return docs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
}
}
Trade-off Analysis:
| Approach | Speed | Accuracy | Cost | Best For |
|---|---|---|---|---|
| Small Model Only | Fast | Lower | Low | Simple queries |
| Large Model Only | Slow | Higher | High | Complex queries |
| Speculative RAG | Medium | High | Medium | Mixed workloads |
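The draft-then-escalate flow behind these trade-offs can be sketched in a few lines of Python with the models passed in as plain callables. One practical detail is that LLM judges rarely return a bare number, so the score parser below extracts the first number from the response and clamps it to [0, 1]. The function names, prompt wording, and the `0.8` threshold are assumptions for illustration.

```python
import re


def parse_score(response, default=0.5):
    """Extract the first number from a judge response and clamp it to [0, 1]."""
    match = re.search(r"\d*\.?\d+", response)
    if match is None:
        return default  # unparseable output: treat as medium quality
    return min(1.0, max(0.0, float(match.group())))


def answer_with_cascade(query, small_model, large_model, retrieve, threshold=0.8):
    """Return (answer, model_used). Models are callables: prompt -> text."""
    # Phase 1: cheap draft without retrieval
    draft = small_model(f"Answer this question: {query}")

    # Phase 2: score the draft
    score_prompt = (
        "Score the quality of this answer from 0.0 to 1.0. Output only the score.\n"
        f"Question: {query}\nAnswer: {draft}"
    )
    if parse_score(small_model(score_prompt)) > threshold:
        return draft, "small"

    # Phase 3: escalate to the large model with retrieved context
    context = "\n\n".join(retrieve(query))
    refine_prompt = (
        f"Draft Answer: {draft}\nRetrieved Context:\n{context}\n"
        f"Question: {query}\nRefine the draft answer using the retrieved context."
    )
    return large_model(refine_prompt), "large"


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    small = lambda prompt: "0.4" if prompt.startswith("Score") else "A rough draft."
    large = lambda prompt: "A refined, grounded answer."
    print(answer_with_cascade("What changed in Q3?", small, large, lambda q: ["Q3 report text"]))
```

Having the small model grade its own draft keeps the check cheap but tends to be optimistic; a separate judge model or a lower acceptance rate are common adjustments.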
7.6.3 Binary Quantization
Problem: Float32 vectors consume huge memory.
Solution: Compress to Int1 (0/1 binary representation).
Algorithm: Normalize → binarize each dimension by sign
@Service
public class QuantizationService {
public BinaryVector quantize(float[] vector) {
// Step 1: Normalize vector (signs are unchanged, so this is optional for binarization)
float[] normalized = normalize(vector);
// Step 2: Binarize by sign and pack 8 dimensions into each byte (1 bit per dimension).
// Storing one byte per dimension would only give a 4x reduction, not 32x.
byte[] packed = new byte[(normalized.length + 7) / 8];
for (int i = 0; i < normalized.length; i++) {
if (normalized[i] >= 0) {
packed[i / 8] |= (byte) (1 << (i % 8));
}
}
return new BinaryVector(packed);
}
private float[] normalize(float[] vector) {
// L2 normalization
float norm = 0;
for (float v : vector) {
norm += v * v;
}
norm = (float) Math.sqrt(norm);
if (norm == 0f) {
return vector.clone(); // zero vector: nothing to scale, avoid division by zero
}
float[] normalized = new float[vector.length];
for (int i = 0; i < vector.length; i++) {
normalized[i] = vector[i] / norm;
}
return normalized;
}
public int hammingDistance(BinaryVector a, BinaryVector b) {
// XOR each pair of bytes and count the differing bits
int distance = 0;
byte[] va = a.data();
byte[] vb = b.data();
for (int i = 0; i < va.length; i++) {
distance += Integer.bitCount((va[i] ^ vb[i]) & 0xFF);
}
return distance;
}
}
record BinaryVector(byte[] data) {}
Quantization Levels Comparison:
| Precision | Memory | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| FP32 (baseline) | 100% | 100% | 1x | Benchmark |
| FP16 | 50% | 99% | 2x | Production default |
| INT8 | 25% | 97% | 4x | Cost optimization |
| INT4 | 12.5% | 95% | 8x | Edge deployment |
| INT1 (binary) | 3% | 92% | 10x | Large-scale systems |
Results:
- 32x memory reduction (FP32 → INT1: 32 bits down to 1 bit per dimension)
- 10x speed improvement
- Roughly 8% accuracy loss at INT1 per the table above (about 3% at INT8), often an acceptable trade-off
Use Case: Edge deployment, large-scale systems
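The memory arithmetic and the distance computation can be checked with a short Python sketch. It packs one sign bit per dimension into an integer and compares vectors with XOR plus a bit count. The 1536-dimension figure is only an example size, and the function names are illustrative.

```python
def binarize(vector):
    """Pack the sign bits of a float vector into one integer (bit i = dimension i)."""
    bits = 0
    for i, value in enumerate(vector):
        if value >= 0:
            bits |= 1 << i
    return bits


def hamming_distance(a, b):
    """Number of dimensions whose sign bits differ."""
    return bin(a ^ b).count("1")


def memory_bytes(dimensions, bits_per_dimension):
    """Storage for one vector at the given precision."""
    return dimensions * bits_per_dimension // 8


if __name__ == "__main__":
    query = binarize([0.8, -0.2, 0.1, -0.9])
    doc_near = binarize([0.7, -0.1, 0.3, -0.5])   # same signs as the query
    doc_far = binarize([-0.6, 0.4, -0.2, 0.9])    # every sign flipped
    print(hamming_distance(query, doc_near), hamming_distance(query, doc_far))

    fp32 = memory_bytes(1536, 32)
    int1 = memory_bytes(1536, 1)
    print(fp32, int1, fp32 // int1)  # bytes per vector and the reduction factor
```

Because binary codes discard magnitude, a common pattern is to use Hamming distance to shortlist candidates and then rescore the shortlist with the original float vectors.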
Summary
Key Takeaways
Modular RAG:
- Dynamic routing adapts to query complexity
- Iterative retrieval handles multi-part questions
- 30-40% reduction in unnecessary operations
- Production frameworks: UltraRAG, CyberRAG, LangGraph
GraphRAG:
- Solves multi-hop reasoning through relationship traversal
- Community summaries for macro understanding
- 77.6% MRR improvement on complex queries
- Tools: Microsoft GraphRAG, Neo4j, NebulaGraph
Agentic RAG:
- Self-reflection (Self-RAG) reduces hallucinations
- Corrective feedback (CRAG) improves accuracy
- Tool use enables computation on retrieved data
- 15-20% accuracy improvement
RAG + Fine-tuning:
- Embedding fine-tuning for domain adaptation
- RAFT teaches models to use retrieved context
- 20-30% improvement in specialized domains
- Critical for medical, legal, financial applications
Performance Optimization:
- Context caching: 90% latency reduction for cached content
- Speculative RAG: Balanced speed and accuracy
- Binary quantization: 32x memory savings (FP32 → INT1)
- Essential for production systems
Production Decision Guide
| Scenario | Recommended Technique | Priority | Expected Improvement |
|---|---|---|---|
| High traffic, simple queries | Optimization + Modular RAG | High | 30-40% cost reduction |
| Complex relationships | GraphRAG | Medium | 2-3x MRR improvement |
| Accuracy-critical (medical/legal) | Agentic RAG + Fine-tuning | High | 15-30% accuracy gain |
| Specialized domain | RAG + Embedding fine-tuning | Medium | 10-15% retrieval gain |
| Budget-constrained | Quantization + Caching | High | 90% latency reduction |
| Enterprise-grade | Modular + Graph + Agentic | Medium | Combined benefits |
Further Reading
Papers:
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection (Asai et al., 2024)
- From Local to Global: A Graph RAG Approach to Query-Focused Summarization (Edge et al., 2024)
- RAFT: Adapting Language Model to Domain Specific RAG (Zhang et al., 2024)
- Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity (Jeong et al., 2024)
- Corrective Retrieval Augmented Generation (Yan et al., 2024)
Tools:
- LangGraph - Agentic orchestration
- Microsoft GraphRAG - Knowledge graph construction
- Neo4j - Graph database
- vLLM - High-throughput LLM inference and serving
- UltraRAG, CyberRAG - Modular frameworks
Next Steps:
- Review RAG Fundamentals for system architecture
- Study Evaluation Strategies to measure improvements
- Implement Modular RAG with dynamic routing for your workload
- Add GraphRAG for multi-hop reasoning queries
- Set up performance monitoring with caching and quantization