6. Back-of-the-Envelope Estimation

Back-of-the-envelope (BOE) estimation is a critical skill for system design interviews and real-world architecture decisions. It allows you to quickly evaluate whether a proposed design is feasible, identify bottlenecks, and make informed trade-offs without detailed calculations.

What is Back-of-the-Envelope Estimation?

BOE estimation is the practice of making quick, rough calculations to:

  • Validate feasibility: Can this design handle the required scale?
  • Identify bottlenecks: Which component will fail first?
  • Guide architecture decisions: What trade-offs make sense?
  • Estimate costs: Rough order-of-magnitude cost projections
  • Enable communication: Explain your reasoning to stakeholders

Key Principle: BOE is about being right within an order of magnitude (2x-10x), not about being precise. In system design, precise numbers are often wrong anyway because assumptions change. Good approximate reasoning is more valuable than false precision.

Why It Matters

1. Quick Feasibility Assessment

In interviews and design meetings, you need to quickly determine if a design is in the right ballpark. BOE lets you reject obviously infeasible approaches early.

2. Bottleneck Identification

By estimating capacity of each component, you find the weakest link. This guides where to focus optimization effort.

3. Cost-Benefit Analysis

Quick estimates let you compare approaches. Is this expensive solution worth it for the expected scale?

4. Communication

Stating "this design can handle ~1M QPS with these 4 servers" is more convincing than "this should work."

Common Estimation Scenarios

The subsections below apply BOE to the estimates that come up most often: request rate, storage, bandwidth, database capacity, cache, and memory. The framework is the same each time.

Systematic Estimation Framework

Key Principles:

  • Start with constraints: What are you estimating? (traffic, storage, performance)
  • Use reasonable assumptions: Document assumptions explicitly
  • Calculate step by step: Show your work, don't skip steps
  • Identify bottlenecks: Which component fails first?
  • Add safety margin: Systems rarely perform as expected in production
  • Present with confidence: "We can handle ~1M QPS with these 4 servers" (not "exactly 987,654 QPS")

Request Rate Estimation

Starting Point: Number of users and their behavior

Helpful Numbers to Remember:

  • Active users: DAU (Daily Active Users) or MAU (Monthly Active Users)
  • Request patterns: Average requests per user per session/day
  • Peak vs Average: Peak is typically 2-5x average (plan for 3x as rough heuristic)

Estimation Steps:

  1. Estimate user base:

    • New product: Make reasonable assumptions (e.g., 10M users over 3 years)
    • Existing product: Use current numbers
  2. Estimate active users:

    • Consumer apps: 10-30% of registered users are DAU
    • Enterprise tools: 50-80% of registered users are DAU
    • Social media: 20-50% DAU/MAU ratio
  3. Estimate requests per user:

    • Read-heavy: 5-20 requests/day
    • Interactive: 20-100 requests/day
    • Write-heavy: 1-5 requests/day
  4. Calculate average QPS:

    DAU × requests_per_user_per_day = total_requests_per_day
    total_requests_per_day / 86400 = average_QPS
  5. Apply peak multiplier:

    peak_QPS = average_QPS × 3 (heuristic)

Example: Twitter-like Service

  • 100M registered users
  • 20% DAU (20M daily active)
  • 10 requests per user per day
  • Peak multiplier: 3x

Calculation:

Total requests/day = 20M users × 10 requests = 200M requests/day
Average QPS = 200,000,000 / 86,400 ≈ 2,300 QPS
Peak QPS = 2,300 × 3 ≈ 7,000 QPS

System Design Implications:

  • API Servers: 7K QPS is manageable with 4-8 servers (1K-2K QPS per server)
  • Database: Primary handles writes (~700 QPS, assuming roughly 10:1 read:write), read replicas handle reads (~6.3K QPS)
  • Cache: Critical for read-heavy workload (80% cache hit = 5K QPS from cache, 1.3K from DB)
  • Bandwidth: ~120 Mbps (7K QPS × ~2 KB responses ≈ 14 MB/s) is modest, well within a single 1 Gbps link
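The request-rate steps above can be sketched in a few lines. The inputs are the Twitter-like example's assumptions (100M users, 20% DAU, 10 requests/day, 3x peak), not measured data:

```python
# Rough peak-QPS estimate for the Twitter-like example.
# All inputs are assumptions from the text, not measurements.

SECONDS_PER_DAY = 86_400

def estimate_peak_qps(registered_users, dau_ratio,
                      requests_per_user_per_day, peak_multiplier=3):
    """Return (average_qps, peak_qps) as rough, order-of-magnitude figures."""
    dau = registered_users * dau_ratio
    requests_per_day = dau * requests_per_user_per_day
    average_qps = requests_per_day / SECONDS_PER_DAY
    return average_qps, average_qps * peak_multiplier

avg, peak = estimate_peak_qps(100_000_000, 0.20, 10)
print(f"average ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
# → average ≈ 2,315 QPS, peak ≈ 6,944 QPS (call it ~2.3K and ~7K)
```

Note the final rounding: ~7K QPS is the number to carry forward, not 6,944.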

Storage Estimation

Starting Point: Data model and retention requirements

Key Components:

  • Object size (average size per item)
  • Number of objects
  • Retention period
  • Replication factor

Estimation Steps:

  1. Estimate object size:

    • User profile: ~1-5 KB
    • Tweet/post: ~200 bytes - 1 KB
    • Photo: ~100 KB - 5 MB
    • Video: Minutes × bitrate
  2. Estimate object count:

    • Users × objects_per_user
    • Growth rate: new objects per day
  3. Calculate storage:

    total_storage = object_count × object_size × replication_factor
  4. Account for growth:

    • Plan for 1-3 years out
    • Consider expected growth rate

Example: Photo Storage Service

  • 10M users
  • 100 photos per user on average
  • Average photo size: 500 KB
  • 2 replicas (for availability)
  • 50% growth per year

Calculation:

Current storage = 10M users × 100 photos × 500 KB × 2 replicas = 1 PB
(10⁹ photos × 500 KB = 500 TB raw; × 2 for replication)

With 50% compound growth:
1 year = 1 PB × 1.5 = 1.5 PB
2 years = 1 PB × 1.5² ≈ 2.3 PB
3 years = 1 PB × 1.5³ ≈ 3.4 PB

Plan for: 3.5-4 PB storage
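The same storage arithmetic, as a sketch. Inputs are the example's assumptions (10M users, 100 photos each, 500 KB average, 2 replicas, 50% yearly growth); note the base figure works out to 1 PB (10⁹ photos × 500 KB × 2):

```python
# Storage estimate for the photo-service example, compounding 50%/yr growth.

KB, TB, PB = 10**3, 10**12, 10**15  # decimal units, as used in the text

def storage_after(years, users=10_000_000, photos_per_user=100,
                  avg_photo_bytes=500 * KB, replicas=2, yearly_growth=0.5):
    """Total replicated storage in bytes after `years` of compound growth."""
    base = users * photos_per_user * avg_photo_bytes * replicas
    return base * (1 + yearly_growth) ** years

for y in range(4):
    print(f"year {y}: {storage_after(y) / PB:.2f} PB")
# grows from 1 PB today to ~3.4 PB after 3 years
```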

Bandwidth Estimation

Starting Point: Request/response size and request rate

Key Components:

  • Request size (outbound)
  • Response size (inbound)
  • Request rate
  • Read/Write ratio (typically 10:1 read-to-write)

Estimation Steps:

  1. Calculate average request size:

    • API request: ~100 bytes - 1 KB
    • Photo upload: Up to 5 MB
    • Video: Highly variable
  2. Calculate average response size:

    • Read response: ~1-10 KB (depends on use case)
    • Write confirmation: ~100 bytes
  3. Calculate bandwidth:

    bandwidth_per_second = QPS × (request_size + response_size)

Example: API Service

  • Peak QPS: 10,000
  • Request size: 200 bytes
  • Response size: 2 KB
  • Read/write ratio: 9:1 (9 reads, 1 write)

Calculation:

Read bandwidth = 10,000 QPS × 0.9 × 2 KB = 18 MB/s
Write bandwidth = 10,000 QPS × 0.1 × 0.2 KB = 0.2 MB/s
Total bandwidth = 18.2 MB/s ≈ 146 Mbps

Add overhead: HTTP headers (20-30%), encryption (5-10%)
Final estimate: 146 Mbps × 1.3 ≈ 190 Mbps
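The bandwidth calculation above can be sketched as follows. The per-request sizes and the 30% overhead factor are the example's assumptions:

```python
# Bandwidth estimate for the API-service example; mirrors the worked numbers
# above (read responses dominate, writes are small confirmations).

def bandwidth_mbps(qps, read_frac, read_bytes, write_bytes, overhead=1.3):
    """Peak egress in Mbps, including ~30% header/TLS overhead."""
    bytes_per_sec = qps * (read_frac * read_bytes
                           + (1 - read_frac) * write_bytes)
    return bytes_per_sec * 8 / 1e6 * overhead  # bytes → bits → Mbps

print(f"{bandwidth_mbps(10_000, 0.9, 2_000, 200):.0f} Mbps")
# → 189 Mbps (the text rounds to ~190 Mbps)
```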

Network Cost Considerations:

  • Data transfer egress charges (AWS, GCP, Azure)
  • 1 TB/month is common free tier
  • Calculate monthly cost: bandwidth × seconds_per_month

Database Capacity Estimation

Starting Point: Data model and access patterns

Single Machine Limits:

  • Modern database server: ~10K-50K QPS (depends on query complexity)
  • IOPS limit: ~5K-10K random IOPS per SSD
  • Network: ~1 Gbps common

Estimation Steps:

  1. Estimate read vs write ratio:

    • Social media: 100:1 read-heavy
    • E-commerce: 10:1 to 50:1
    • Messaging: 1:1 balanced
  2. Calculate read and write QPS separately:

    read_QPS = total_QPS × (read_ratio / (read_ratio + write_ratio))
    write_QPS = total_QPS - read_QPS
  3. Estimate machines needed:

    machines_for_reads = read_QPS / machine_read_capacity
    machines_for_writes = write_QPS / machine_write_capacity
    total_machines = machines_for_reads + machines_for_writes
  4. Add replicas for availability:

    • Typical: 2-3 replicas
    • Factor into machine count

Example: Social Media Post

  • Peak QPS: 7,000
  • Read/write ratio: 100:1
  • Single machine capacity: 10K QPS (reads), 5K QPS (writes)
  • 3 copies of the data for availability (1 primary + 2 replicas)

Calculation:

Read QPS = 7,000 × (100 / 101) ≈ 6,930 QPS
Write QPS = 7,000 × (1 / 101) ≈ 70 QPS

Read machines = 6,930 / 10,000 ≈ 0.7 → Use 1 (with headroom)
Write machines = 70 / 5,000 ≈ 0.01 → Use 1 (with headroom)

Primary machines: 2 (1 for reads + 1 for writes, with headroom)
Replica machines: 4 (2 extra copies × 2 primaries, for HA)
Total machines: 6
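The machine-count arithmetic above, sketched in code. The per-machine capacities (10K read QPS, 5K write QPS) and the 3-copy replication are the example's assumptions, not benchmarks:

```python
import math

# Machine-count estimate for the social-media example.

def db_machines(total_qps, read_ratio, write_ratio,
                read_capacity=10_000, write_capacity=5_000, copies=3):
    """Total DB machines: primaries sized for read/write load, × copies."""
    reads = total_qps * read_ratio / (read_ratio + write_ratio)
    writes = total_qps - reads
    primaries = (max(1, math.ceil(reads / read_capacity))
                 + max(1, math.ceil(writes / write_capacity)))
    return primaries * copies  # each primary plus (copies - 1) replicas

print(db_machines(7_000, 100, 1))  # → 6 (2 primaries × 3 copies)
```

The `max(1, ...)` keeps at least one machine per role even when the raw quotient rounds below one, matching the "use 1 with headroom" step above.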

Cache Capacity Estimation

Starting Point: Cache hit ratio and working set size

Key Concepts:

  • Cache hit ratio: Percentage of requests served from cache
  • Working set: Frequently accessed data
  • Cache size: Should accommodate working set with room for growth

Estimation Steps:

  1. Estimate cacheable data:

    • What % of requests can be cached?
    • Hot data vs long-tail data
  2. Estimate hit ratio:

    • Well-designed cache: 70-90% hit ratio
    • Poor cache: 20-40% hit ratio
  3. Calculate cache size:

    • Estimate hot data set
    • Multiply by object size
    • Add 2-3x headroom

Example: Product Catalog Cache

  • Total products: 10M
  • Hot products (80% of traffic): 100K
  • Product data size: 5 KB per product
  • 90% hit ratio target

Calculation:

Hot data size = 100,000 × 5 KB = 500 MB
Add headroom: 500 MB × 2 = 1 GB
Add metadata/overhead: 1 GB × 1.2 = 1.2 GB

Plan per cache node: 1.5-2 GB
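The cache-sizing steps can be condensed to one expression. The hot-set size, 2x headroom, and 20% metadata overhead are the example's assumptions:

```python
# Cache-size estimate for the product-catalog example.

def cache_node_gb(hot_objects, object_bytes, headroom=2.0, overhead=1.2):
    """Hot-set size with headroom and per-key metadata overhead, in GB."""
    return hot_objects * object_bytes * headroom * overhead / 1e9

print(f"{cache_node_gb(100_000, 5_000):.1f} GB per node")
# → 1.2 GB per node, matching the worked example
```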

Memory Estimation

Starting Point: Application memory requirements

Key Components:

  • Base memory (framework, runtime overhead)
  • Per-connection memory
  • Cached data
  • Worker/Thread memory

Estimation Steps:

  1. Base memory per instance:

    • Simple API: 100-500 MB
    • Complex application: 1-4 GB
    • Java applications: 512 MB - 2 GB (JVM overhead)
  2. Per-connection memory:

    • HTTP connection: ~10-50 KB
    • Database connection: ~1-5 MB
  3. Calculate total memory:

    total_memory = base_memory + (max_connections × per_connection_memory)
  4. Add headroom (2-4x):

    • Prevents OOM
    • Allows for GC overhead (managed languages)
    • Peak traffic headroom

Example: API Server

  • Base memory: 500 MB
  • Max concurrent connections: 10,000
  • Per-connection memory: 20 KB
  • Headroom multiplier: 2

Calculation:

Connection memory = 10,000 × 20 KB = 200 MB
Total working memory = 500 MB + 200 MB = 700 MB
With headroom: 700 MB × 2 = 1.4 GB

Plan per instance: 2 GB memory
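The memory formula above, as a sketch using the example's assumed base memory, connection count, and headroom:

```python
# Per-instance memory estimate for the API-server example.

def instance_memory_gb(base_mb, max_connections, per_conn_kb, headroom=2.0):
    """Working memory (base + connections) with headroom, in GB."""
    total_mb = base_mb + max_connections * per_conn_kb / 1_000
    return total_mb * headroom / 1_000

print(f"{instance_memory_gb(500, 10_000, 20):.1f} GB")
# → 1.4 GB working memory; round up to a 2 GB instance
```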

Common Mistakes in BOE Estimation

1. Being Too Precise

Mistake: Calculating to many significant figures ("we need exactly 1,247 machines")

Why it's wrong: Assumptions are rough, so precise calculations give false impression of accuracy

Better approach: "We need roughly 1,200-1,300 machines" or "~1.3K machines"

2. Forgetting Peak vs Average

Mistake: Planning for average load only

Why it's wrong: Peak is when failures happen. Systems sized for average fail during peak

Better approach: Size for peak (typically 2-5x average, use 3x as heuristic)

3. Ignoring Replication and Overhead

Mistake: Calculating raw storage only

Why it's wrong: Need replication for availability, overhead for indexing/metadata

Better approach: Multiply by replication factor (2-3x) and overhead (1.2-1.5x)

4. Not Considering Growth

Mistake: Calculating for current load only

Why it's wrong: System will be outdated quickly if you don't plan for growth

Better approach: Plan 1-3 years out with growth rate (50-100% per year common)

5. Single-Component Focus

Mistake: Optimizing one component's estimate while ignoring others

Why it's wrong: Bottleneck moves to next component

Better approach: Estimate all components in path to find true bottleneck

6. Using Wrong Units

Mistake: Mixing bits/bytes, MB vs MiB, or confusing time units

Why it's wrong: Order-of-magnitude errors (8x difference between bits and bytes)

Better approach: Be explicit about units, convert carefully

Reference Numbers to Remember

User Behavior:

  • Active user ratio: 10-30% (consumer), 50-80% (enterprise)
  • Requests per user per day: 5-20 (read-heavy), 20-100 (interactive)
  • Session duration: 10-30 minutes (consumer apps), 2-8 hours (enterprise tools)

System Capacity (per modern machine):

  • API server: 1K-10K QPS (depends on complexity)
  • Database (reads): 10K-50K QPS (simple queries)
  • Database (writes): 5K-20K QPS (depends on transaction complexity)
  • Cache (Redis): 10K-100K QPS
  • Load balancer: 10K-100K QPS (L4), 1K-10K QPS (L7)

Storage Costs (rough AWS equivalents):

  • S3: ~$0.02-0.03/GB/month
  • EBS SSD: ~$0.08-0.15/GB/month
  • S3 Standard vs IA: 3-5x cost difference

Bandwidth Costs:

  • Outbound: ~$0.05-0.15/GB (varies by region)
  • Inbound: Often free

Data Sizes:

  • Short text (tweet): 200 bytes - 1 KB
  • User profile: 1-5 KB
  • Photo (compressed): 100 KB - 500 KB
  • Photo (original): 1-5 MB
  • Full HD movie: ~4-8 GB
  • Full HD TV show episode: ~1-3 GB

Practical Estimation Framework

Step 1: Clarify Requirements

  • Users: How many total? How many active?
  • Growth: What's expected growth rate?
  • Timeline: How many years should this design last?

Step 2: Identify Key Constraints

Which matters most for this design?

  • Read-heavy: Focus on read capacity, caching
  • Write-heavy: Focus on write capacity, database sharding
  • Storage-heavy: Focus on storage cost, retention
  • Latency-sensitive: Focus on geographic distribution, caching

Step 3: Estimate Component by Component

Start from user request, trace through each component:

  1. Entry layer: Load balancer capacity
  2. API servers: Request processing capacity
  3. Cache: Hit ratio, hot data size
  4. Database: Read/write split, per-machine limits
  5. Storage: Total storage, growth rate

Step 4: Find Bottleneck

Compare component capacities. The component with lowest capacity (relative to requirements) is the bottleneck.

Step 5: Sanity Check

  • Are numbers reasonable? (Not "we need 1 billion machines")
  • Did I account for replication and overhead?
  • Is there headroom for growth and failures?

BOE in System Design Interviews

Interview Format:

  1. Interviewer: "Design a URL shortener"
  2. You: Ask clarifying questions (scale, functional requirements)
  3. Interviewer: "100M URLs shortened per day, 100M daily reads"
  4. You: Do BOE estimation to guide design

Example Interview Estimation (URL Shortener):

Requirements:

  • 100M writes/day
  • 100M reads/day
  • 1:1 read:write ratio as given (real URL shorteners are usually far more read-heavy)
  • Store for 5 years

Calculations:

Write QPS = 100,000,000 / 86,400 ≈ 1,160 QPS
Peak writes = 1,160 × 3 ≈ 3,500 QPS

Read QPS = 100,000,000 / 86,400 ≈ 1,160 QPS
Peak reads = 1,160 × 3 ≈ 3,500 QPS

URL size: Short URL (7 chars) + Long URL (100 avg) = ~107 bytes
5 years storage = 100M × 365 days × 5 years × 107 bytes ≈ 20 TB
With replication (3x): 60 TB total

Memory cache (hot URLs, ~20% of URLs serve 80% of traffic):
Hot set = 100M URLs/day × 0.2 × 107 bytes ≈ 2 GB of recent hot URLs
Cache per node: 6-8 GB (with 2-3x headroom and per-key overhead)

Machine requirements (DB writes limited):
Single DB: 5K write QPS
DB machines needed = 3,500 / 5,000 ≈ 1
+ replicas for HA: 1 primary + 2 replicas = 3 machines
API servers (10K QPS each):
Combined peak ≈ 7,000 QPS / 10,000 = 1 API server (too tight)
Use 2-3 API servers for headroom
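The whole interview estimate fits in one short script. Every input (record size, 3x peak multiplier, replication factor) is an assumption from the requirements above:

```python
# End-to-end BOE for the URL-shortener interview example.

SECONDS_PER_DAY = 86_400

writes_per_day = reads_per_day = 100_000_000  # 1:1 as stated
peak_multiplier = 3

write_qps = writes_per_day / SECONDS_PER_DAY * peak_multiplier
read_qps = reads_per_day / SECONDS_PER_DAY * peak_multiplier

record_bytes = 7 + 100  # short code + average long URL
storage_tb = writes_per_day * 365 * 5 * record_bytes / 1e12
storage_with_replication = storage_tb * 3

print(f"peak writes ≈ {write_qps:,.0f} QPS, peak reads ≈ {read_qps:,.0f} QPS")
print(f"5-year storage ≈ {storage_tb:.0f} TB raw, "
      f"{storage_with_replication:.0f} TB with 3x replication")
```

Running this reproduces the ballpark figures above: ~3.5K peak QPS each way and roughly 20 TB raw (~60 TB replicated) over 5 years.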

Key Insights from BOE:

  • Write-heavy (DB bottleneck), need caching strategy
  • Storage manageable (60 TB over 5 years)
  • Read capacity needs optimization (aggressive caching)
  • Not many machines needed (can start small and scale)

When BOE is Insufficient

Need detailed calculation when:

  • Making purchasing decisions (exact cost matters)
  • Performance-critical systems (milliseconds matter)
  • Expensive infrastructure (cloud costs significant)
  • Compliance requirements (exact numbers needed)

Use BOE for:

  • Feasibility assessment (yes/no decision)
  • Architecture comparison (within 2-10x)
  • Interview design (order-of-magnitude sufficient)
  • Initial planning (before detailed design phase)

Common Estimation Shortcuts

Rough Conversions:

  • 1 K = 1,000
  • 1 M = 1,000 K = 1,000,000
  • 1 B (byte) = 8 bits
  • 1 GB = 1,000 MB
  • 1 TB = 1,000 GB
  • 8 hours = ~30,000 seconds
  • 1 day = ~86,000 seconds
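The two time shortcuts are approximations rather than exact values; a quick check of how much error they carry shows both are safely within BOE tolerance:

```python
# Sanity-checking the rough time conversions against exact values.

approximations = {
    "seconds per day": (86_000, 24 * 3600),     # exact: 86,400
    "seconds per 8 hours": (30_000, 8 * 3600),  # exact: 28,800
}

for name, (rough, exact) in approximations.items():
    error = abs(rough - exact) / exact
    print(f"{name}: rough {rough:,} vs exact {exact:,} ({error:.1%} off)")
```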

Useful Multiples:

  • 2×: Double, significant change
  • 10×: Order of magnitude
  • 100×: Two orders of magnitude
  • 1000×: Three orders of magnitude

Rule of Thumb:

  • If you're multiplying more than 3-4 numbers, write it out to avoid mistakes
  • Always sanity check: Does this number feel right?
  • When uncertain, round to 1 significant figure (1M, not 1.23M)
  • Use ranges for uncertain values (1-2M, not 1.5M)