System Design

System design is not a collection of buzzwords. It is the practice of making explicit trade-offs under constraints: latency, throughput, availability, consistency, cost, security, and organizational complexity.

This knowledge base is organized as a layered mental model. Layers are not “mandatory components”; they are a way to reason about where responsibilities live and how failure modes propagate.

Layer Map

| Layer | Primary Goal | Typical Responsibilities | Core Trade-off |
|---|---|---|---|
| 1. Entry Layer | Safe, governed ingress | Load balancing, gateway policies, auth, rate limiting, edge routing | Governance vs latency and blast radius |
| 2. Service Layer | Business capability delivery | Service boundaries, communication patterns, resilience behaviors | Team autonomy vs distributed complexity |
| 3. Storage Layer | Correctness and scale ceiling | Data modeling, replication, sharding, consistency models | Strong guarantees vs availability/throughput |
| 4. Caching Layer | Low latency and origin protection | Cache patterns, invalidation strategy, distributed caches | Speed vs staleness and operational risk |
| 5. Messaging & Analytics | Async collaboration and insight | Queues/logs, events, indexing/search, analytics | Decoupling vs observability and semantics |
| 6. BOE Estimation | Quick feasibility assessment | Capacity planning, bottleneck identification, cost estimation | Precision vs speed of decision-making |

How To Use This Section

Read in this order for most systems:

  • Entry and Service: define request paths and responsibility boundaries.
  • Storage and Caching: define correctness, consistency, and latency strategy.
  • Messaging and Analytics: define async semantics, workflows, and query/insight needs.
  • BOE Estimation: use back-of-the-envelope (BOE) calculations throughout the design process, not only at the end.
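
As a rough illustration of what a back-of-the-envelope pass can look like, the sketch below turns a few traffic assumptions into average QPS, peak QPS, and daily ingest volume. The function name `estimate` and every input value (10 million daily active users, 20 requests per user per day, a 3x peak factor, 1 KB per request) are hypothetical placeholders, not figures from this knowledge base; substitute your own.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def estimate(daily_active_users, requests_per_user_per_day,
             peak_factor, bytes_per_request):
    """Return rough traffic and ingest numbers from a handful of assumptions.

    All inputs are guesses supplied by the caller; the goal is an
    order-of-magnitude answer, not a precise forecast.
    """
    requests_per_day = daily_active_users * requests_per_user_per_day
    avg_qps = requests_per_day / SECONDS_PER_DAY
    return {
        "requests_per_day": requests_per_day,
        "avg_qps": avg_qps,
        # Peak traffic is modeled as a simple multiple of the average.
        "peak_qps": avg_qps * peak_factor,
        # Decimal gigabytes (1 GB = 1e9 bytes) keep the mental math simple.
        "ingest_gb_per_day": requests_per_day * bytes_per_request / 1e9,
    }


if __name__ == "__main__":
    # Hypothetical inputs: 10M DAU, 20 requests each, 3x peak, 1 KB payloads.
    result = estimate(10_000_000, 20, 3, 1_000)
    for name, value in result.items():
        print(f"{name}: {value:,.0f}")
```

With these assumed inputs the average comes out near 2,300 QPS and the peak near 7,000 QPS, with about 200 GB of ingest per day. Numbers at that level of precision are usually enough to tell whether a single node, a replica set, or a sharded design is even in the right range.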

A Practical Trade-off Framework

When deciding between designs, answer these questions explicitly:

  1. Which metrics are protected first (p99 latency, error rate, availability, cost, security)?
  2. Under peak load or partial failure, what breaks first and how does it degrade?
  3. Where does complexity move (product teams, platform team, ops/SRE, data team)?
  4. What is the migration cost in three months when requirements change?
  5. What is the “escape hatch” when your current assumptions stop holding?

Common Anti-Patterns

  • Designing the system before agreeing on SLOs, scale, and constraints.
  • Over-optimizing early (sharding on day one, microservices everywhere, “event-driven” without clear semantics).
  • Treating “eventual consistency” as a free performance win without defining user-visible behavior.
  • Adding layers without owning their operational costs (on-call, monitoring, incident response).
  • Relying on retries everywhere without budgets (retry storms create outages).
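
To make the last anti-pattern concrete, here is a minimal sketch of a retry budget: retries are allowed only while they stay under a fixed percentage of the first-attempt requests seen so far. The names `RetryBudget` and `call_with_budget`, and the 10% default, are assumptions made for illustration. A production version would normally use a sliding time window and add backoff with jitter between attempts, both omitted here for brevity.

```python
class RetryBudget:
    """Caps retries at a percentage of first-attempt requests (hypothetical helper)."""

    def __init__(self, retry_percent=10):
        self.retry_percent = retry_percent
        self.requests = 0  # first attempts observed
        self.retries = 0   # retries granted so far

    def record_request(self):
        self.requests += 1

    def try_acquire_retry(self):
        # Integer arithmetic avoids floating-point drift in the comparison.
        if (self.retries + 1) * 100 <= self.requests * self.retry_percent:
            self.retries += 1
            return True
        return False


def call_with_budget(operation, budget, max_attempts=3):
    """Call operation(), retrying on failure only while the budget allows it."""
    budget.record_request()
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0 and not budget.try_acquire_retry():
            break  # budget exhausted: fail fast instead of adding load
        try:
            return operation()
        except Exception as exc:  # broad catch keeps the sketch short
            last_error = exc
    raise last_error


if __name__ == "__main__":
    budget = RetryBudget(retry_percent=10)
    calls = 0

    def always_fails():
        global calls
        calls += 1
        raise RuntimeError("downstream unavailable")

    for _ in range(10):
        try:
            call_with_budget(always_fails, budget)
        except RuntimeError:
            pass
    print(f"downstream calls: {calls}, retries granted: {budget.retries}")
```

In this run a fully failing dependency receives 11 calls for 10 requests. Unbudgeted retries with three attempts each would have sent 30, which is the amplification that turns a partial failure into a retry storm.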

Further Reading (Selected)

  • Martin Fowler: Microservices and evolutionary architecture essays.
  • Martin Kleppmann: Designing Data-Intensive Applications (replication, consistency, streams).
  • The Amazon Builders' Library: reliability, timeouts/retries, and operational excellence patterns.
  • Elastic documentation: shard sizing, replicas, and lifecycle management concepts.