Context Engineering

Context engineering focuses on the critical challenge of managing limited context windows in Large Language Models (LLMs). It encompasses strategies for selecting, organizing, and optimizing the information that models can access at inference time.

The Context Window Challenge

What is a Context Window?

The context window is the maximum amount of text (measured in tokens) that an LLM can process in a single request. This includes:

  • System prompts
  • Conversation history
  • Retrieved documents
  • User input
  • Expected output
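
Everything in that list competes for the same budget. A rough back-of-the-envelope sketch (all numbers here are hypothetical) shows how quickly a window fills:

# Hypothetical token budget for a 128K-token window
WINDOW = 128_000

budget = {
    "system_prompt": 1_500,
    "conversation_history": 20_000,
    "retrieved_documents": 90_000,
    "user_input": 1_500,
    "reserved_for_output": 8_000,  # must be set aside up front
}

used = sum(budget.values())
print(f"{used:,} / {WINDOW:,} tokens used ({WINDOW - used:,} to spare)")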

Context Window Sizes (2025)

Model               Context Window   Notes
GPT-4 Turbo         128K tokens      Production-proven
Claude 3.5 Sonnet   200K tokens      Excellent for code
Gemini 1.5 Pro      1M tokens        Largest available
Claude 3 Opus       200K tokens      High quality

Key Insight: Token count ≠ Word count. Roughly 1K tokens ≈ 750 words, but code and special characters use more tokens.
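
A quick way to see the gap is to count both directly. The sketch below uses the tiktoken library with the cl100k_base encoding (one common choice; your model's tokenizer may differ):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "Context engineering decides what the model sees at inference time."
code = "def squares(xs): return {k: v**2 for k, v in xs.items()}"

print(len(prose.split()), len(enc.encode(prose)))  # prose: close to 0.75 words per token
print(len(code.split()), len(enc.encode(code)))    # code: noticeably more tokens per "word"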

The Core Problems

1. Limited Capacity

Problem: Important information gets cut off when it exceeds the context window.

Example: A 500-page technical manual (on the order of 300K tokens) exceeds most production context windows.
Impact: The model cannot see all relevant information and may miss critical details.

2. Information Density

Problem: Not all information is equally valuable.

Example: A 100-page document may contain only 5 pages of relevant information.
Impact: Filling context with low-value information wastes limited capacity.

3. Retrieval Precision

Problem: Finding the RIGHT information is harder than finding SOME information.

Example: Searching for "authentication" may return irrelevant mentions.
Impact: Poor retrieval leads to poor responses regardless of model quality.

Context Engineering Strategies

Strategy 1: Retrieval-Augmented Generation (RAG)

Concept: Retrieve only the most relevant documents and inject them into the context.

# Basic RAG flow
user_query = "How do I implement OAuth 2.0 with Spring Security?"

# 1. Search for relevant documents
relevant_docs = vector_store.search(user_query, top_k=5)

# 2. Build context
context = "\n\n".join([doc.content for doc in relevant_docs])

# 3. Inject into prompt
prompt = f"""
Context:
{context}

Question: {user_query}

Answer based on the context above.
"""

Best Practices:

  • Use semantic search (embeddings) not keyword search
  • Implement relevance scoring with thresholds (see the sketch after this list)
  • Include document metadata (source, date, author)
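
A minimal sketch of threshold-based filtering, assuming a hypothetical search_with_scores call that returns (document, similarity) pairs with similarity in [0, 1]:

MIN_SCORE = 0.75  # tune per corpus; too low a threshold admits noise

# search_with_scores is a hypothetical API returning (doc, similarity) pairs
results = vector_store.search_with_scores(user_query, top_k=20)
relevant_docs = [doc for doc, score in results if score >= MIN_SCORE]

# Degrade gracefully: one best-effort match beats a context full of weak ones
if not relevant_docs and results:
    relevant_docs = [results[0][0]]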

Strategy 2: Hierarchical Summarization

Concept: Create multi-level summaries to provide both overview and details.

# Three-tier summary structure
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentSummary:
    executive_summary: str        # ~100 words, high-level concepts
    section_summaries: List[str]  # ~300 words each, key points
    detailed_excerpts: List[str]  # full text for critical sections

# Choose a tier based on query complexity
if query.is_high_level():
    context = doc.executive_summary
elif query.requires_detail():
    context = doc.detailed_excerpts
else:
    context = doc.section_summaries

Strategy 3: Query-Based Routing

Concept: Route queries to specialized contexts based on intent.

# Intent-based context selection
query_intents = analyze_intents(user_query)

if "code" in query_intents:
    context = code_repository_context
elif "documentation" in query_intents:
    context = documentation_context
elif "architecture" in query_intents:
    context = architecture_diagrams_context
else:
    context = general_knowledge_base

Strategy 4: Dynamic Context Pruning

Concept: Continuously remove less relevant information as context fills up.

# Priority-based context management
import time

MAX_ITEMS = 20  # capacity budget for the active context

context_items = [
    {"content": item, "priority": score, "timestamp": time.time()}
    for item, score in retrieved_items
]

# Sort by priority and keep the top-N
context_items.sort(key=lambda x: x["priority"], reverse=True)
active_context = context_items[:MAX_ITEMS]

# Admit a new item only if it outranks the lowest-priority resident
def should_add(new_item, current_context):
    if len(current_context) < MAX_ITEMS:
        return True
    lowest = min(current_context, key=lambda x: x["priority"])
    return new_item["priority"] > lowest["priority"]

Optimization Techniques

1. Token Optimization

Compress prompts without losing meaning:

# Verbose: ~40 tokens
"""
You are a helpful assistant with expertise in Java programming,
specifically the Spring Boot framework. Please help the user by
answering their questions about building web applications.
"""

# Concise: ~15 tokens (same effect)
"Expert Spring Boot developer. Answer questions concisely with code examples."

2. Reusable Context Patterns

Define context templates:

# Define once, reuse everywhere
SYSTEM_PROMPTS = {
    "code_review": """Senior code reviewer. Focus on: security, performance,
        maintainability. Provide specific line references.""",

    "architecture": """Solutions architect. Consider: scalability, reliability,
        cost-efficiency. Compare trade-offs explicitly.""",

    "debugging": """Senior engineer. Debug systematically: identify symptoms,
        analyze causes, propose solutions with verification steps.""",
}

3. Context Caching

Cache expensive context operations:

# Cache embeddings and search results (cache: e.g. a Flask-Caching instance)
@cache.memoize(timeout=3600)
def get_context_for_query(query: str) -> List[Document]:
    # Expensive: embedding + vector search
    return vector_store.search(query, top_k=10)

# Only retrieve new information on each request
cached_context = get_context_for_query(query)
new_information = filter_new(cached_context, recent_docs)
final_context = cached_context + new_information

4. Multi-Turn Context Management

Manage conversation history efficiently:

class ConversationManager:
    def summarize_history(self, messages: List[Message]) -> str:
        """Compress old messages into a summary plus recent verbatim turns."""
        recent = messages[-5:]  # keep last 5 messages verbatim
        old = messages[:-5]     # summarize older messages

        summary = llm.complete(f"""
        Summarize this conversation concisely:
        {format_messages(old)}

        Include: topics discussed, decisions made, key information.
        """)

        return (
            f"Previous conversation summary:\n{summary}\n\n"
            f"Recent messages:\n{format_messages(recent)}"
        )

Evaluation Metrics

Context Quality Metrics

Metric                Description                                      Target
Retrieval Precision   % of retrieved docs that are relevant            > 80%
Retrieval Recall      % of relevant docs that are retrieved            > 70%
Context Utilization   % of the context window actually used            > 60%
Answer Accuracy       % of answers that use retrieved info correctly   > 85%
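
The first two metrics reduce to set arithmetic once you know which documents were actually relevant. A minimal sketch over document IDs (the ground-truth relevant set is assumed to be available, e.g. from labeled evaluation data):

def calculate_precision(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def calculate_recall(retrieved: set, relevant: set) -> float:
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc7"}
print(calculate_precision(retrieved, relevant))  # 0.5
print(calculate_recall(retrieved, relevant))     # 0.667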

Monitoring

# Track context performance
context_metrics = {
    "retrieval_time": time_taken,
    "tokens_used": input_tokens + output_tokens,
    "retrieved_docs": len(retrieved),
    "context_precision": calculate_precision(retrieved, relevant),
    "answer_relevance": score_relevance(answer, query),
}

# Log for analysis
context_logger.log(context_metrics)

Tools and Frameworks

Vector Databases

  • Pinecone: Managed vector database with excellent performance
  • Weaviate: Open-source, supports hybrid search
  • Qdrant: High-performance, easy to self-host
  • pgvector: PostgreSQL extension for vector search

Context Management Libraries

  • LangChain: Context managers, retrievers, and document loaders
  • LlamaIndex: Advanced indexing and retrieval strategies
  • Haystack: Open-source framework for building search and RAG pipelines
  • Chroma: Lightweight vector database for development

Advanced Patterns

The Re-Ranking Pattern

1. Initial Retrieval: Get 50-100 documents (fast, approximate)
2. Re-Ranking: Use a more sophisticated model to rank top 10
3. Context Injection: Use only the top 5 for actual generation
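
A sketch of the two-stage flow using sentence-transformers' CrossEncoder for the re-ranking step (the checkpoint named below is one common public model; the vector-store API shape is an assumption):

from sentence_transformers import CrossEncoder

# Stage 1: fast, approximate retrieval (API shape assumed)
candidates = vector_store.search(user_query, top_k=100)

# Stage 2: re-rank with a slower but more accurate cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(user_query, doc.content) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)

# Stage 3: inject only the top 5 into the prompt
top_docs = [doc for doc, _ in ranked[:5]]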

The Knowledge Graph Pattern

1. Extract entities and relationships from documents
2. Build a graph of connected information
3. Traverse graph to find related context
4. Provide both documents AND relationships
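
A minimal sketch with networkx, assuming a hypothetical extract_triples helper that pulls (subject, relation, object) triples out of the documents:

import networkx as nx

G = nx.DiGraph()

# Build the graph from extracted (subject, relation, object) triples
for subj, rel, obj in extract_triples(documents):  # extract_triples is hypothetical
    G.add_edge(subj, obj, relation=rel)

def related_context(entity: str, hops: int = 2) -> list:
    """Collect triples within `hops` edges of the entity."""
    nearby = nx.single_source_shortest_path_length(G, entity, cutoff=hops)
    return [
        (u, G[u][v]["relation"], v)
        for u, v in G.out_edges(nearby)
        if v in nearby
    ]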

The Mixture of Experts Pattern

1. Classify query type (code, architecture, debugging)
2. Route to specialized retrieval system for that type
3. Use domain-specific context templates
4. Merge results for comprehensive answer
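
A sketch of the routing-and-merging skeleton; the classifier, the per-domain retrievers, the templates, and build_prompt are all assumptions for illustration:

# Hypothetical per-domain retrievers and templates
RETRIEVERS = {
    "code": code_retriever,
    "architecture": architecture_retriever,
    "debugging": debugging_retriever,
}
TEMPLATES = {
    "code": "Expert developer. Answer with concise, working code.",
    "architecture": "Solutions architect. Compare trade-offs explicitly.",
    "debugging": "Senior engineer. Diagnose systematically.",
}

def answer(user_query: str) -> str:
    query_type = classify_query(user_query)  # hypothetical classifier
    retriever = RETRIEVERS.get(query_type, default_retriever)
    docs = retriever.search(user_query, top_k=5)

    # Merge the domain template, retrieved docs, and query into one prompt
    system = TEMPLATES.get(query_type, "")
    return llm.complete(build_prompt(system, docs, user_query))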

Common Pitfalls

Pitfall 1: Over-Retrieval

Problem: Retrieving too many documents drowns out relevant information.

Solution: Focus on precision over recall. Better to miss a document than to have 50 irrelevant ones.

Pitfall 2: Ignoring Metadata

Problem: Not filtering by date, version, or relevance leads to stale information.

Solution: Always include metadata in retrieval and display it to users.
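
A sketch of metadata-aware retrieval; the filter syntax below is illustrative only, since each vector store has its own filter DSL:

from datetime import datetime, timedelta

# Illustrative filter shape; adapt to your vector store's filter syntax
results = vector_store.search(
    user_query,
    top_k=5,
    filter={
        "version": "2.x",
        "updated_after": (datetime.now() - timedelta(days=365)).isoformat(),
    },
)

# Surface the metadata alongside the content so users can judge freshness
for doc in results:
    print(f"{doc.content[:80]}...  [source={doc.source}, date={doc.date}]")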

Pitfall 3: Static Context

Problem: Using the same context for all queries regardless of intent.

Solution: Implement query analysis and dynamic context selection.

Best Practices Summary

  1. Measure everything: Track retrieval quality and context utilization
  2. Iterate constantly: Context engineering requires continuous refinement
  3. Balance breadth and depth: Don't sacrifice depth for breadth or vice versa
  4. Involve users: Let them provide feedback on context quality
  5. Plan for scale: Design context systems that handle growth
