System Design

System design is not a collection of buzzwords. It is the practice of making explicit trade-offs under constraints: latency, throughput, availability, consistency, cost, security, and organizational complexity.

This knowledge base is organized as a layered mental model. Layers are not “mandatory components”; they are a way to reason about where responsibilities live and how failure modes propagate.

Layer Map

Layer	Primary Goal	Typical Responsibilities	Core Trade-off
1. Entry Layer	Safe, governed ingress	Load balancing, gateway policies, auth, rate limiting, edge routing	Governance vs latency and blast radius
2. Service Layer	Business capability delivery	Service boundaries, communication patterns, resilience behaviors	Team autonomy vs distributed complexity
3. Storage Layer	Correctness and scale ceiling	Data modeling, replication, sharding, consistency models	Strong guarantees vs availability/throughput
4. Caching Layer	Low latency and origin protection	Cache patterns, invalidation strategy, distributed caches	Speed vs staleness and operational risk
5. Messaging & Analytics	Async collaboration and insight	Queues/logs, events, indexing/search, analytics	Decoupling vs observability and semantics
6. BOE Estimation	Quick feasibility assessment	Capacity planning, bottleneck identification, cost estimation	Precision vs speed of decision-making

How To Use This Section

Read in this order for most systems:

Entry and Service: define request paths and responsibility boundaries.
Storage and Caching: define correctness, consistency, and latency strategy.
Messaging and Analytics: define async semantics, workflows, and query/insight needs.
BOE Estimation: use back-of-the-envelope calculations throughout the design process.
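
As a rough illustration of the back-of-the-envelope step, the sketch below turns a handful of workload assumptions into request rate and storage figures. The `Workload` fields, the `estimate` helper, and every input number are hypothetical and chosen only to show the arithmetic; substitute your own measurements.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    # All fields are illustrative assumptions, not measurements.
    daily_active_users: int
    requests_per_user_per_day: int
    peak_to_average_ratio: float
    bytes_written_per_request: int


def estimate(w: Workload) -> dict:
    """Return rough traffic and storage figures for a workload."""
    requests_per_day = w.daily_active_users * w.requests_per_user_per_day
    average_qps = requests_per_day / SECONDS_PER_DAY
    # Peak traffic is modeled as a simple multiple of the average.
    peak_qps = average_qps * w.peak_to_average_ratio
    # Decimal gigabytes (1e9 bytes), before replication or compression.
    storage_gb_per_year = requests_per_day * w.bytes_written_per_request * 365 / 1e9
    return {
        "average_qps": average_qps,
        "peak_qps": peak_qps,
        "storage_gb_per_year": storage_gb_per_year,
    }


if __name__ == "__main__":
    example = Workload(
        daily_active_users=1_000_000,
        requests_per_user_per_day=20,
        peak_to_average_ratio=3.0,
        bytes_written_per_request=1_000,
    )
    for name, value in estimate(example).items():
        print(f"{name}: {value:,.0f}")
```

With these assumed inputs the estimate comes to roughly 230 requests per second on average, about 700 at peak, and about 7.3 TB of new data per year before replication. Numbers at this precision are enough to tell whether a single database node is plausible or whether the storage layer needs more thought.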

A Practical Trade-off Framework

When deciding between designs, answer these questions explicitly:

Which metrics are protected first (p99 latency, error rate, availability, cost, security)?
Under peak load or partial failure, what breaks first and how does it degrade?
Where does complexity move (product teams, platform team, ops/SRE, data team)?
What is the migration cost in three months when requirements change?
What is the “escape hatch” when your current assumptions stop holding?

Common Anti-Patterns

Designing the system before agreeing on SLOs, scale, and constraints.
Over-optimizing early (sharding on day one, microservices everywhere, “event-driven” without clear semantics).
Treating “eventual consistency” as a free performance win without defining user-visible behavior.
Adding layers without owning their operational costs (on-call, monitoring, incident response).
Relying on retries everywhere without budgets (unbounded retries amplify load during partial failures, and the resulting retry storms create outages).
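
To make the retry anti-pattern concrete, here is a minimal sketch of a retry budget: retries are allowed only while they stay under a fixed fraction of the requests seen so far. `RetryBudget` and `call_with_budget` are hypothetical names, not from any particular library, and the counters never reset; a production version would typically track a sliding time window instead.

```python
import random
import time


class RetryBudget:
    """Permit retries only while they stay under a fraction of requests."""

    def __init__(self, ratio: float = 0.1, min_retries: int = 1):
        self.ratio = ratio              # e.g. 0.1 = at most ~10% extra load
        self.min_retries = min_retries  # floor so low-traffic callers can retry
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        allowed = max(self.min_retries, int(self.requests * self.ratio))
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def call_with_budget(operation, budget, max_attempts=3, base_delay=0.05,
                     sleep=time.sleep):
    """Run operation, retrying on failure only if the shared budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            is_last_attempt = attempt == max_attempts - 1
            if is_last_attempt or not budget.try_spend():
                raise  # fail fast instead of adding load to a sick dependency
            # Exponential backoff with full jitter to avoid synchronized retries.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```

The budget is shared across calls on purpose: when a dependency is down, the first few failures use up the allowance and later calls fail on their first attempt, so total load stays close to the original request rate instead of multiplying by `max_attempts`.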

Intelligence Layer (AI/ML Serving)

Modern systems increasingly include an AI/ML serving layer:

┌─────────────────────────────────────────┐
│         Client Layer (Web/Mobile)        │
├─────────────────────────────────────────┤
│         Entry Layer (LB/API Gateway)     │
├─────────────────────────────────────────┤
│         Service Layer (Business Logic)   │
├─────────────────────────────────────────┤
│      Intelligence Layer (AI/ML)    ← NEW │
│  ┌─────────┐ ┌────────┐ ┌───────────┐  │
│  │ LLM API │ │ RAG    │ │ Feature   │  │
│  │ Gateway │ │ Engine │ │ Store     │  │
│  └─────────┘ └────────┘ └───────────┘  │
├─────────────────────────────────────────┤
│         Data Layer (DB/Cache/MQ)         │
└─────────────────────────────────────────┘

Key Components:

Component	Purpose	Examples
Model Gateway	Route to models, rate limit, cache	LiteLLM, OpenRouter (vLLM as a self-hosted inference server behind the gateway)
RAG Engine	Retrieval + generation pipeline	LlamaIndex, LangChain
Feature Store	ML feature management	Feast, Tecton
Vector DB	Semantic search, embeddings	Milvus, Pinecone, pgvector
Model Registry	Version and deploy models	MLflow, Weights & Biases (W&B)

Design Considerations:

Latency: LLM inference typically adds 100 ms to 10 s per call; use streaming for perceived responsiveness
Cost: Token-based pricing requires careful caching and routing strategies
Observability: Trace full request path including model calls (OpenTelemetry + LangSmith)
Fallback: Always have a rule-based fallback when AI services are unavailable
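
The fallback consideration can be sketched as a wrapper that gives the model call a deadline and switches to deterministic rules when the call fails or runs late. Everything here is a hypothetical illustration: `answer`, `rule_based_reply`, the canned replies, and the 2-second default timeout are assumptions, and `model_call` stands in for whatever client your model gateway exposes.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared worker pool so a slow model call does not block the request thread.
_executor = ThreadPoolExecutor(max_workers=4)


def rule_based_reply(message: str) -> str:
    """Deterministic fallback; the rules and wording are placeholders."""
    if "refund" in message.lower():
        return "Refund requests are handled by the support team."
    return "The assistant is temporarily unavailable. Please try again later."


def answer(message, model_call, timeout_s: float = 2.0):
    """Return (reply, source), where source is 'model' or 'fallback'."""
    future = _executor.submit(model_call, message)
    try:
        return future.result(timeout=timeout_s), "model"
    except Exception:
        # Covers both a timeout while waiting and an error raised by the call.
        # cancel() only helps if the task has not started; a running call
        # keeps its worker thread until it finishes on its own.
        future.cancel()
        return rule_based_reply(message), "fallback"
```

Returning the source alongside the reply makes the fallback rate easy to record as a metric, which is usually the first signal that the model dependency is degrading.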

Observability Layer

Observability is a cross-cutting concern that spans all layers:

Pillar	Purpose	Tools
Logging	Event records	ELK Stack, Loki, Fluentd
Metrics	Numeric time-series	Prometheus, Grafana, Datadog
Tracing	Request flow across services	Jaeger, Zipkin, OpenTelemetry

OpenTelemetry has become the de facto standard for unified observability:

Single API for traces, metrics, and logs
Vendor-neutral instrumentation
Auto-instrumentation for Java, Python, Node.js, Go
Integrates with most major backends through OTLP and vendor exporters
