System Design
System design is not a collection of buzzwords. It is the practice of making explicit trade-offs under constraints: latency, throughput, availability, consistency, cost, security, and organizational complexity.
This knowledge base is organized as a layered mental model. Layers are not “mandatory components”; they are a way to reason about where responsibilities live and how failure modes propagate.
Layer Map
| Layer | Primary Goal | Typical Responsibilities | Core Trade-off |
|---|---|---|---|
| 1. Entry Layer | Safe, governed ingress | Load balancing, gateway policies, auth, rate limiting, edge routing | Governance vs latency and blast radius |
| 2. Service Layer | Business capability delivery | Service boundaries, communication patterns, resilience behaviors | Team autonomy vs distributed complexity |
| 3. Storage Layer | Correctness and scale ceiling | Data modeling, replication, sharding, consistency models | Strong guarantees vs availability/throughput |
| 4. Caching Layer | Low latency and origin protection | Cache patterns, invalidation strategy, distributed caches | Speed vs staleness and operational risk |
| 5. Messaging & Analytics | Async collaboration and insight | Queues/logs, events, indexing/search, analytics | Decoupling vs observability and semantics |
| 6. BOE (Back-of-the-Envelope) Estimation | Quick feasibility assessment | Capacity planning, bottleneck identification, cost estimation | Precision vs speed of decision-making |
How To Use This Section
Read in this order for most systems:
- Entry and Service: define request paths and responsibility boundaries.
- Storage and Caching: define correctness, consistency, and latency strategy.
- Messaging and Analytics: define async semantics, workflows, and query/insight needs.
- BOE Estimation: use back-of-the-envelope calculations throughout the design process.
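As a rough illustration of the back-of-the-envelope step, the sketch below derives average and peak request rates and storage growth from a handful of inputs. Every number in it (daily active users, requests per user, write fraction, payload size, peak factor) is a hypothetical assumption chosen for easy arithmetic, and the `Workload` and `estimate` names are illustrative only.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Hypothetical inputs; replace with your own measurements or targets."""
    daily_active_users: int
    requests_per_user_per_day: int
    write_fraction: float        # share of requests that persist data
    bytes_per_write: int
    peak_factor: float           # peak QPS divided by average QPS


def estimate(w: Workload) -> dict:
    """Return order-of-magnitude capacity figures for the workload."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    avg_qps = requests_per_day / SECONDS_PER_DAY
    peak_qps = avg_qps * w.peak_factor
    writes_per_day = requests_per_day * w.write_fraction
    storage_gb_per_day = writes_per_day * w.bytes_per_write / 1e9
    return {
        "avg_qps": avg_qps,
        "peak_qps": peak_qps,
        "storage_gb_per_day": storage_gb_per_day,
        "storage_tb_per_year": storage_gb_per_day * 365 / 1000,
    }


if __name__ == "__main__":
    # Assumed scenario: 1M DAU, 20 requests each, 10% writes of 1 KB, 3x peak.
    result = estimate(Workload(1_000_000, 20, 0.10, 1_000, 3.0))
    for name, value in result.items():
        print(f"{name}: {value:,.2f}")
```

The point of such a sketch is the order of magnitude, not the decimals: a few hundred requests per second and a few gigabytes per day suggest a very different storage and caching design than numbers a thousand times larger.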
A Practical Trade-off Framework
When deciding between designs, answer these questions explicitly:
- Which metrics are protected first (p99 latency, error rate, availability, cost, security)?
- Under peak load or partial failure, what breaks first and how does it degrade?
- Where does complexity move (product teams, platform team, ops/SRE, data team)?
- What is the migration cost in three months when requirements change?
- What is the “escape hatch” when your current assumptions stop holding?
Common Anti-Patterns
- Designing the system before agreeing on SLOs, scale, and constraints.
- Over-optimizing early (sharding on day one, microservices everywhere, “event-driven” without clear semantics).
- Treating “eventual consistency” as a free performance win without defining user-visible behavior.
- Adding layers without owning their operational costs (on-call, monitoring, incident response).
- Relying on retries everywhere without budgets (retry storms create outages).
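The last anti-pattern can be made concrete with a retry budget. The following is a minimal sketch, assuming a budget that caps retries at a fixed fraction of requests and a jittered exponential backoff between attempts; `RetryBudget`, `call_with_retries`, and the 10% ratio are hypothetical names and values, not part of any particular library.

```python
import random
import time


class RetryBudget:
    """Allow retries only while they stay under a fixed ratio of requests.

    Hypothetical helper for illustration; the 10% default is an assumption.
    """

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and count a retry if the budget still has room."""
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def call_with_retries(op, budget, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run op(); retry on exception with jittered backoff while budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # fail fast instead of amplifying load
            # Full jitter: sleep a random time up to the exponential cap.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

This sketch uses lifetime counters for brevity. A production version would more likely track requests and retries over a sliding window so the budget reflects recent traffic.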
Intelligence Layer (AI/ML Serving)
Modern systems increasingly include an AI/ML serving layer:
┌─────────────────────────────────────────┐
│        Client Layer (Web/Mobile)        │
├─────────────────────────────────────────┤
│      Entry Layer (LB/API Gateway)       │
├─────────────────────────────────────────┤
│     Service Layer (Business Logic)      │
├─────────────────────────────────────────┤
│    Intelligence Layer (AI/ML) ← NEW     │
│  ┌─────────┐  ┌────────┐  ┌───────────┐ │
│  │ LLM API │  │  RAG   │  │  Feature  │ │
│  │ Gateway │  │ Engine │  │   Store   │ │
│  └─────────┘  └────────┘  └───────────┘ │
├─────────────────────────────────────────┤
│        Data Layer (DB/Cache/MQ)         │
└─────────────────────────────────────────┘
Key Components:
| Component | Purpose | Examples |
|---|---|---|
| Model Gateway | Route to models, rate limit, cache | LiteLLM, OpenRouter (with an inference server such as vLLM behind the gateway) |
| RAG Engine | Retrieval + generation pipeline | LlamaIndex, LangChain |
| Feature Store | ML feature management | Feast, Tecton |
| Vector DB | Semantic search, embeddings | Milvus, Pinecone, pgvector |
| Model Registry | Version and deploy models | MLflow, Weights & Biases (W&B) |
Design Considerations:
- Latency: LLM inference typically adds 100 ms to 10 s per call; stream tokens to improve perceived responsiveness.
- Cost: Token-based pricing makes response caching and model routing (for example, cheaper models for simpler requests) worth designing up front.
- Observability: Trace the full request path, including model calls (for example, OpenTelemetry plus LangSmith).
- Fallback: Always keep a rule-based fallback for when AI services are unavailable.
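The cost and fallback considerations can be sketched together. The example below is a minimal illustration, assuming an exact-match cache keyed by a hash of the prompt and a deterministic fallback when the model call raises. `call_model` is a hypothetical stand-in for whatever provider SDK or gateway call a system actually uses, and `rule_based_reply` contains made-up rules.

```python
import hashlib
from typing import Callable, Dict


def rule_based_reply(prompt: str) -> str:
    """Hypothetical deterministic fallback; real rules would be domain-specific."""
    if "refund" in prompt.lower():
        return "Please see our refund policy page."
    return "Sorry, the assistant is unavailable. Please try again later."


class ModelClient:
    """Wrap a model call with an exact-match cache and a rule-based fallback."""

    def __init__(self, call_model: Callable[[str], str]):
        # call_model is an assumed stand-in for a provider SDK or gateway call.
        self._call_model = call_model
        self._cache: Dict[str, str] = {}
        self.model_calls = 0

    def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._cache:
            return self._cache[key]  # cache hit: no tokens spent
        try:
            self.model_calls += 1
            answer = self._call_model(prompt)
        except Exception:
            # Do not cache fallbacks, so recovery is picked up on the next call.
            return rule_based_reply(prompt)
        self._cache[key] = answer
        return answer
```

An exact-match cache only helps when identical prompts recur and responses are expected to be stable. A real deployment would likely also need expiry, size limits, and a timeout around the model call.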
Observability Layer
Observability is a cross-cutting concern that spans all layers:
| Pillar | Purpose | Tools |
|---|---|---|
| Logging | Event records | ELK Stack, Loki, Fluentd |
| Metrics | Numeric time-series | Prometheus, Grafana, Datadog |
| Tracing | Request flow across services | Jaeger, Zipkin, OpenTelemetry |
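OpenTelemetry has become the de facto standard for unified observability:
- A common set of APIs and SDKs for traces, metrics, and logs
- Vendor-neutral instrumentation
- Auto-instrumentation for Java, Python, and Node.js, with Go support at an earlier stage of maturity
- Integrates with most major backends through OTLP and the OpenTelemetry Collector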
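To show what tracing a request across services involves underneath, here is a standard-library sketch of the W3C Trace Context `traceparent` header, which OpenTelemetry propagates by default. This is not the OpenTelemetry API; in practice the SDK's propagators do this work. The function names are hypothetical.

```python
import re
import secrets
from typing import Optional, Tuple

# W3C Trace Context "traceparent": version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)   # 32 hex chars
    span_id = secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid per the spec
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Keep the trace ID and mint a new span ID for the outgoing call."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()       # restart the trace on bad input
    trace_id, _parent, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Each service forwards the same trace ID with a fresh span ID, which is what lets a tracing backend stitch the spans from different services into one request timeline.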
Navigation
- 1. Entry Layer
- 2. Service Layer
- 3. Storage Layer
- 4. Caching Layer
- 5. Messaging & Analytics Layer
- 6. Back-of-the-Envelope Estimation
Further Reading (Selected)
- Martin Fowler: Microservices and evolutionary architecture essays.
- Martin Kleppmann: Designing Data-Intensive Applications (replication, consistency, streams).
- Amazon Builders' Library: reliability, timeouts/retries, and operational excellence patterns.
- Elastic documentation: shard sizing, replicas, and lifecycle management concepts.