The problem: the full table scan
When you run a query with a filter, by default the database does a full table scan: it goes through all the rows one by one to check which ones meet the condition.
SELECT * FROM orders WHERE customer_id = 11;
With 10 rows you won't notice. With 10 million rows, scanning them all on every query is unfeasible. The cost grows linearly: O(n).
The solution: an index
An index is an auxiliary, sorted data structure that the database maintains separately from the table. It works like the alphabetical index of a book: instead of reading the whole book looking for a word, you go to the (sorted) index and jump straight to the page.
Internally, most engines (SQLite, PostgreSQL, MySQL) use a B+ tree (B-tree): a balanced tree where each level divides the search space. Looking up a value costs O(log n): with 10 million rows, instead of 10,000,000 comparisons, ~24 are enough.
| Rows (n) | Scan O(n) | Index O(log n) |
|---|---|---|
| 1,000 | 1,000 | ~10 |
| 1,000,000 | 1,000,000 | ~20 |
| 1,000,000,000 | 1,000,000,000 | ~30 |
Create an index
CREATE INDEX idx_orders_customer ON orders (customer_id);
From that moment on, queries that filter (WHERE), sort
(ORDER BY) or group by customer_id can use the index. The primary
key (PRIMARY KEY) is indexed automatically; that's why searching by
id is already fast without doing anything.
An index can span several columns (composite index). The order
matters: idx (country, status) works to filter by country or by
country AND status, but not to filter only by status.
CREATE INDEX idx_orders_country_status ON orders (country, status);
When should you create an index?
Create indexes on the columns that appear frequently in:
WHERE(filters), especially on large tables.JOIN ... ON(foreign keys almost always deserve an index).ORDER BYandGROUP BY.
The cost: they aren't free
Indexes aren't free magic. They have a price:
- Slower writes. Every
INSERT,UPDATEorDELETEmust also update all affected indexes. A table with many indexes writes more slowly. - Disk space. The index is a sorted copy of the columns; it takes up additional memory and disk.
- Selectivity. An index on a column with few distinct values
(e.g.
statuswith only 3 values) provides little: the engine might prefer a scan anyway.
Rule of thumb: index what you query a lot, not everything. An unused index is pure cost with no benefit.
See what the planner does: EXPLAIN QUERY PLAN
To find out whether a query uses an index or scans the table, prepend
EXPLAIN QUERY PLAN (SQLite syntax; in PostgreSQL/MySQL it's EXPLAIN):
EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE customer_id = 11;
SCAN orders→ full scan (no useful index).SEARCH orders USING INDEX idx_orders_customer→ it uses the index!
It's the #1 tool for diagnosing slow queries: first measure, don't guess.
Examples
Create an index and query using it
CREATE INDEX idx_orders_customer ON orders (customer_id);
SELECT id, total
FROM orders
WHERE customer_id = 11
ORDER BY id;
Diagnose the execution plan
EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE customer_id = 11;