Indexes

Why Indexes Matter

Indexes are the single most important performance tuning mechanism in MySQL:

1000x faster queries: Index lookup vs full table scan
Reduced I/O: Read only relevant data pages instead of entire table
Sorted access: B+ Tree maintains data in sorted order
Trade-off: Slower writes (index maintenance), more disk space

Real-world impact:

A missing index on user_id in an orders table with 10M rows can make a simple lookup take 10 seconds instead of 10ms
A single unoptimized query can degrade your entire application's performance

B+ Tree Structure

Why B+ Tree?

MySQL uses B+ Tree (not B Tree) for indexes because:

High fanout: Non-leaf nodes store only keys (not data), allowing more children per node
Shallow tree: 3-4 levels for millions of records (fewer disk I/Os)
Range queries: Leaf nodes linked in sorted order (efficient range scans)
Disk-friendly: Node size matches disk page size (16KB in InnoDB)

Example: Finding row with id = 250

Root: Read root node (1 I/O), determine to go to node C
Internal: Read internal node C (1 I/O), determine to go to leaf F
Leaf: Read leaf node F (1 I/O), find pointer to data page
Data: Read data page (1 I/O), return row

Total: 3-4 I/Os for millions of records (vs millions of I/Os for full table scan)

B+ Tree vs B Tree

Feature	B+ Tree	B Tree
Data storage	Only in leaf nodes	In all nodes
Leaf linking	Yes (linked list)	No
Fanout	Higher (more keys)	Lower
Range queries	Efficient (scan leaves)	Inefficient (tree traversal)
Height	Lower (fewer levels)	Higher

MySQL's choice: B+ Tree for better range query performance and lower tree height.

Index Types

Primary Index (Clustered Index)

Characteristics:

One per table: InnoDB's primary key is the clustered index
Data storage: Leaf nodes store the actual data rows
Sorted: Rows physically sorted by primary key

CREATE TABLE users (
    id INT PRIMARY KEY,           -- Clustered index
    name VARCHAR(100),
    email VARCHAR(100)
);

-- Clustered index lookup (fastest)
SELECT * FROM users WHERE id = 123;

If no primary key: InnoDB uses:

First unique index with NOT NULL columns
Hidden 6-byte row ID (auto-generated, not recommended)

Secondary Index

Characteristics:

Multiple per table: Create as many as needed
Leaf nodes: Store primary key value, not full row
Two lookups: Secondary index → primary key → data row

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    email VARCHAR(100),
    INDEX idx_email (email)        -- Secondary index
);

-- Two lookups:
-- 1. Find id via idx_email
-- 2. Find row via clustered index on id
SELECT * FROM users WHERE email = 'user@example.com';

Covering index optimization: If secondary index contains all queried columns, skip the second lookup.

Unique Index

Characteristics:

Enforces uniqueness: Prevents duplicate values
Performance: Same as regular index (B+ Tree)
Use case: Email, username, SKU

CREATE TABLE users (
    id INT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,   -- Unique index
    email VARCHAR(100) UNIQUE      -- Unique index
);

-- Error: Duplicate entry 'alice' for key 'username'
INSERT INTO users (username) VALUES ('alice');
INSERT INTO users (username) VALUES ('alice');

Composite Index

Characteristics:

Multiple columns: (name, age, city)
Order matters: Follow leftmost prefix rule
Use case: Multi-column WHERE clauses

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT,
    city VARCHAR(50),
    INDEX idx_name_age_city (name, age, city)  -- Composite index
);

-- ✅ Uses index (name is first column)
SELECT * FROM users WHERE name = 'Alice';

-- ✅ Uses index (name, age match prefix)
SELECT * FROM users WHERE name = 'Alice' AND age = 25;

-- ✅ Uses index (name, age, city match prefix)
SELECT * FROM users WHERE name = 'Alice' AND age = 25 AND city = 'NYC';

-- ❌ Does NOT use index (skips name)
SELECT * FROM users WHERE age = 25;

-- ✅ Uses index for name, ignores age and city (leftmost prefix)
SELECT * FROM users WHERE name = 'Alice' AND city = 'NYC';

Full-Text Index

Characteristics:

Text search: Search words within text columns
InnoDB support: Since MySQL 5.6
Use case: Article search, product descriptions

CREATE TABLE articles (
    id INT PRIMARY KEY,
    title VARCHAR(200),
    content TEXT,
    FULLTEXT INDEX ft_content (title, content)
);

-- Full-text search
SELECT * FROM articles
WHERE MATCH(title, content) AGAINST('MySQL tutorial' IN NATURAL LANGUAGE MODE);

Leftmost Prefix Rule

Rule Definition

For a composite index (col1, col2, col3), the index can be used for queries that:

Use col1 only
Use col1 AND col2
Use col1 AND col2 AND col3

Cannot skip columns: (col1, col3) skips col2, so col3 is not used.

Examples

-- Index: (name, age, city)
CREATE INDEX idx_user ON users (name, age, city);

-- ✅ Uses index: name matches prefix
SELECT * FROM users WHERE name = 'Alice';
EXPLAIN shows: key=idx_user, ref=const

-- ✅ Uses index: name, age match prefix
SELECT * FROM users WHERE name = 'Alice' AND age = 25;
EXPLAIN shows: key=idx_user, ref=const,const

-- ✅ Uses index: name, age, city match prefix
SELECT * FROM users WHERE name = 'Alice' AND age = 25 AND city = 'NYC';
EXPLAIN shows: key=idx_user, ref=const,const,const

-- ❌ Does NOT use index: skips name
SELECT * FROM users WHERE age = 25;
EXPLAIN shows: key=NULL, type=ALL (full table scan)

-- ✅ Uses index for name, ignores age and city
SELECT * FROM users WHERE name = 'Alice' AND city = 'NYC';
EXPLAIN shows: key=idx_user, ref=const (only name used)

-- ❌ Does NOT use index: name in function
SELECT * FROM users WHERE UPPER(name) = 'ALICE';
EXPLAIN shows: key=NULL, type=ALL

-- ✅ Range on last column: name, age used, city range
SELECT * FROM users WHERE name = 'Alice' AND age = 25 AND city > 'A';
EXPLAIN shows: key=idx_user, type=range

Design Implications

Column order matters:

-- For queries like: WHERE name = ? AND age = ?
-- Best: INDEX (name, age)

-- For queries like: WHERE age = ?
-- Best: INDEX (age)  -- Separate index, not composite

-- For queries like both:
CREATE INDEX idx_name_age ON users (name, age);
CREATE INDEX idx_age ON users (age);  -- Separate index

Covering Index

Definition

A covering index contains all columns required by a query, eliminating the need to access the table data rows.

Examples

-- Index: (user_id, status, created_at)
CREATE INDEX idx_order_stats ON orders (user_id, status, created_at);

-- ✅ Covering index (all columns in index)
SELECT user_id, status, created_at
FROM orders
WHERE user_id = 123 AND status = 'pending';
-- EXPLAIN shows: Extra='Using index' (no table lookup)

-- ❌ Not covering (SELECT * includes other columns)
SELECT * FROM orders
WHERE user_id = 123 AND status = 'pending';
-- EXPLAIN shows: Extra='' (needs table lookup)

-- ✅ Covering index (COUNT(*) uses index)
SELECT COUNT(*)
FROM orders
WHERE user_id = 123 AND status = 'pending';
-- EXPLAIN shows: Extra='Using index'

Benefits

No table lookup: Faster query execution
Reduced I/O: Read only index pages, not data pages
Cached in memory: Index pages more likely to be in buffer pool

Trade-off: Larger index size (more columns stored in index).

Index Design Principles

1. Selective Columns

High cardinality = better index:

-- ✅ Good: Many unique values
CREATE INDEX idx_email ON users (email);  -- Nearly unique

-- ❌ Bad: Few unique values
CREATE INDEX idx_gender ON users (gender);  -- Only 'M', 'F'

-- Rule of thumb: Selectivity > 95% for effective index
SELECT COUNT(DISTINCT email) / COUNT(*) FROM users;  -- 0.98 ✅
SELECT COUNT(DISTINCT gender) / COUNT(*) FROM users;  -- 0.5 ❌

2. Follow Leftmost Prefix

-- Queries: WHERE name = ?, WHERE name = ? AND age = ?
-- Best index: (name, age)
CREATE INDEX idx_name_age ON users (name, age);

-- Not: (age, name)  -- Skips name in first query

3. Consider Covering Indexes

-- Query: SELECT user_id, status, created_at FROM orders WHERE user_id = ?
-- Covering index: (user_id, status, created_at)
CREATE INDEX idx_covering ON orders (user_id, status, created_at);

4. Avoid Over-Indexing

Every index has a cost:

Slower inserts/updates/deletes: Each index must be updated
More disk space: Index pages consume storage
Buffer pool pressure: More indexes compete for memory

Rule: Create indexes for queries that run frequently or are performance-critical.

Common Pitfalls

1. Function on Column

-- ❌ Index invalidation: Function on column
SELECT * FROM users WHERE YEAR(created_at) = 2024;

-- ✅ Rewrite to use range
SELECT * FROM users
WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01';

Why: Functions break index usage because the index stores raw values, not computed results.

2. Type Conversion

-- ❌ Implicit type conversion: phone is VARCHAR
SELECT * FROM users WHERE phone = 13800138000;

-- ✅ Use string literal
SELECT * FROM users WHERE phone = '13800138000';

Why: MySQL converts the column value, not the constant, breaking index usage.

3. Leading Wildcard

-- ❌ Leading wildcard: Cannot use index
SELECT * FROM users WHERE name LIKE '%john%';

-- ✅ Use full-text search or remove leading wildcard
SELECT * FROM users WHERE name LIKE 'john%';  -- Uses index
-- Or use full-text index for search-in-middle
SELECT * FROM articles WHERE MATCH(content) AGAINST('john' IN BOOLEAN MODE);

4. OR Conditions

-- ❌ OR with different columns: May not use index efficiently
SELECT * FROM users WHERE name = 'Alice' OR age = 25;

-- ✅ Use UNION (each query uses index)
SELECT * FROM users WHERE name = 'Alice'
UNION
SELECT * FROM users WHERE age = 25;

5. Negative Conditions

-- ❌ Negative operator: May not use index
SELECT * FROM users WHERE status != 'pending';

-- ✅ Use IN with positive values
SELECT * FROM users WHERE status IN ('paid', 'shipped', 'cancelled');

EXPLAIN Analysis

Key Columns

EXPLAIN SELECT * FROM orders WHERE user_id = 123;

+----+-------------+-------+------+---------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+-------+------+---------------+---------+-------+------+-------------+
|  1 | SIMPLE      | orders| ref  | idx_user_id   | idx_u   | 5       | const|  100 | Using index |
+----+-------------+-------+------+---------------+---------+-------+------+-------------+

Type Column (Access Type)

BEST to WORST:

Type	Description	Example
system	Table has 1 row	`SELECT * FROM config WHERE id = 1`
const	Primary key or unique index lookup (1 row)	`WHERE id = 123`
eq_ref	Unique index scan (JOIN)	`JOIN users ON orders.user_id = users.id`
ref	Non-unique index lookup	`WHERE user_id = 123`
range	Index range scan	`WHERE id > 100 AND id < 200`
index	Full index scan	`SELECT COUNT(*) FROM orders`
ALL	Full table scan ❌	`WHERE name = 'Alice'` (no index)

Goal: Avoid type=ALL (full table scan).

Extra Column

Value	Meaning	Good/Bad
Using index	Covering index (no table lookup)	✅ Best
Using where	WHERE clause filtering	✅ Normal
Using filesort	Extra sort pass (ORDER BY not using index)	❌ Bad
Using temporary	Temporary table for GROUP BY/ DISTINCT	❌ Bad
Using index condition	Index condition pushdown optimization	✅ Good

Optimization Examples

1. Deep Pagination

-- ❌ Slow: Scans 1,000,010 rows
SELECT * FROM orders ORDER BY id LIMIT 1000000, 10;

-- ✅ Solution 1: Delayed association
SELECT o.* FROM orders o
INNER JOIN (
    SELECT id FROM orders ORDER BY id LIMIT 1000000, 10
) tmp ON o.id = tmp.id;

-- ✅ Solution 2: Remember last ID
SELECT * FROM orders WHERE id > last_seen_id ORDER BY id LIMIT 10;

2. Avoid Filesort

-- ❌ Filesort: ORDER BY not using index
SELECT * FROM orders WHERE user_id = 123 ORDER BY created_at;
-- Index: (user_id)  -- Not covering ORDER BY

-- ✅ Add covering index
CREATE INDEX idx_user_created ON orders (user_id, created_at);

3. Optimize JOIN

-- ❌ Subquery per row
SELECT * FROM users
WHERE id IN (SELECT user_id FROM orders WHERE status = 'pending');

-- ✅ JOIN (optimizer can optimize)
SELECT DISTINCT u.* FROM users u
INNER JOIN orders o ON u.id = o.user_id
WHERE o.status = 'pending';

Interview Questions

Q1: Why does MySQL use B+ Tree instead of B Tree?

Answer:

Higher fanout: Non-leaf nodes store only keys (not data), allowing more children per node
Shallower tree: Fewer levels = fewer disk I/Os
Range queries: Leaf nodes linked in sorted order, enabling efficient range scans
Disk-friendly: Node size matches disk page size (16KB)

Q2: What's the difference between clustered and secondary indexes?

Answer:

Clustered index: One per table, leaf nodes store actual data rows, primary key is clustered index in InnoDB
Secondary index: Multiple per table, leaf nodes store primary key value, requires two lookups (secondary → primary → data)

Q3: Explain the leftmost prefix rule with examples

Answer: For composite index (A, B, C):

✅ WHERE A = ? (uses A)
✅ WHERE A = ? AND B = ? (uses A, B)
✅ WHERE A = ? AND B = ? AND C = ? (uses A, B, C)
❌ WHERE B = ? (skips A, no index)
✅ WHERE A = ? AND C = ? (uses A, ignores B, C)

Q4: What is a covering index?

Answer: An index that contains all columns required by a query, eliminating the need to access the table data rows. Marked by Using index in EXPLAIN's Extra column.

Example:

-- Index: (user_id, status)
-- Query: SELECT user_id, status FROM orders WHERE user_id = 123
-- ✅ Covering index: No table lookup needed

Q5: Why does `WHERE YEAR(date) = 2024` not use an index?

Answer: The function YEAR(date) computes a value from the column, so the index (which stores raw dates) cannot be used. Rewrite to use range:

WHERE date >= '2024-01-01' AND date < '2025-01-01'

Q6: How do you optimize `SELECT * FROM table LIMIT 1000000, 10`?

Answer:

Delayed association: Use subquery to fetch only primary keys, then join to full rows
Remember last ID: WHERE id > last_seen_id ORDER BY id LIMIT 10
Seek method: Use covering index to seek to offset

Q7: What's the difference between `type=ref`, `type=range`, and `type=ALL`?

Answer:

ref: Non-unique index lookup (e.g., WHERE user_id = 123)
range: Index range scan (e.g., WHERE id > 100 AND id < 200)
ALL: Full table scan (no index used) ❌

Why Indexes Matter​

B+ Tree Structure​

Why B+ Tree?​

B+ Tree vs B Tree​

Index Types​

Primary Index (Clustered Index)​

Secondary Index​

Unique Index​

Composite Index​

Full-Text Index​

Leftmost Prefix Rule​

Rule Definition​

Examples​

Design Implications​

Covering Index​

Definition​

Examples​

Benefits​

Index Design Principles​

1. Selective Columns​

2. Follow Leftmost Prefix​

3. Consider Covering Indexes​

4. Avoid Over-Indexing​

Common Pitfalls​

1. Function on Column​

2. Type Conversion​

3. Leading Wildcard​

4. OR Conditions​

5. Negative Conditions​

EXPLAIN Analysis​

Key Columns​

Type Column (Access Type)​

Extra Column​

Optimization Examples​

1. Deep Pagination​

2. Avoid Filesort​

3. Optimize JOIN​

Interview Questions​

Q1: Why does MySQL use B+ Tree instead of B Tree?​

Q2: What's the difference between clustered and secondary indexes?​

Q3: Explain the leftmost prefix rule with examples​

Q4: What is a covering index?​

Q5: Why does WHERE YEAR(date) = 2024 not use an index?​

Q6: How do you optimize SELECT * FROM table LIMIT 1000000, 10?​

Q7: What's the difference between type=ref, type=range, and type=ALL?​

Further Reading​

Why Indexes Matter

B+ Tree Structure

Why B+ Tree?

B+ Tree vs B Tree

Index Types

Primary Index (Clustered Index)

Secondary Index

Unique Index

Composite Index

Full-Text Index

Leftmost Prefix Rule

Rule Definition

Examples

Design Implications

Covering Index

Definition

Examples

Benefits

Index Design Principles

1. Selective Columns

2. Follow Leftmost Prefix

3. Consider Covering Indexes

4. Avoid Over-Indexing

Common Pitfalls

1. Function on Column

2. Type Conversion

3. Leading Wildcard

4. OR Conditions

5. Negative Conditions

EXPLAIN Analysis

Key Columns

Type Column (Access Type)

Extra Column

Optimization Examples

1. Deep Pagination

2. Avoid Filesort

3. Optimize JOIN

Interview Questions

Q1: Why does MySQL use B+ Tree instead of B Tree?

Q2: What's the difference between clustered and secondary indexes?

Q3: Explain the leftmost prefix rule with examples

Q4: What is a covering index?

Q5: Why does `WHERE YEAR(date) = 2024` not use an index?

Q6: How do you optimize `SELECT * FROM table LIMIT 1000000, 10`?

Q7: What's the difference between `type=ref`, `type=range`, and `type=ALL`?

Further Reading