Indexes: speeding up searches — Performance, transactions and NoSQL

The problem: the full table scan

When you run a query with a filter, by default the database does a full table scan: it goes through all the rows one by one to check which ones meet the condition.

SELECT * FROM orders WHERE customer_id = 11;

With 10 rows you won't notice. With 10 million rows, scanning them all on every query is unfeasible. The cost grows linearly: O(n).

The solution: an index

An index is an auxiliary, sorted data structure that the database maintains separately from the table. It works like the alphabetical index of a book: instead of reading the whole book looking for a word, you go to the (sorted) index and jump straight to the page.

Internally, most engines (SQLite, PostgreSQL, MySQL) use a B+ tree (B-tree): a balanced tree where each level divides the search space. Looking up a value costs O(log n): with 10 million rows, instead of 10,000,000 comparisons, ~24 are enough.

Rows (n)	Scan O(n)	Index O(log n)
1,000	1,000	~10
1,000,000	1,000,000	~20
1,000,000,000	1,000,000,000	~30

Create an index

CREATE INDEX idx_orders_customer ON orders (customer_id);

From that moment on, queries that filter (WHERE), sort (ORDER BY) or group by customer_id can use the index. The primary key (PRIMARY KEY) is indexed automatically; that's why searching by id is already fast without doing anything.

An index can span several columns (composite index). The order matters: idx (country, status) works to filter by country or by country AND status, but not to filter only by status.

CREATE INDEX idx_orders_country_status ON orders (country, status);

When should you create an index?

Create indexes on the columns that appear frequently in:

WHERE (filters), especially on large tables.
JOIN ... ON (foreign keys almost always deserve an index).
ORDER BY and GROUP BY.

The cost: they aren't free

Indexes aren't free magic. They have a price:

Slower writes. Every INSERT, UPDATE or DELETE must also update all affected indexes. A table with many indexes writes more slowly.
Disk space. The index is a sorted copy of the columns; it takes up additional memory and disk.
Selectivity. An index on a column with few distinct values (e.g. status with only 3 values) provides little: the engine might prefer a scan anyway.

Rule of thumb: index what you query a lot, not everything. An unused index is pure cost with no benefit.

See what the planner does: EXPLAIN QUERY PLAN

To find out whether a query uses an index or scans the table, prepend EXPLAIN QUERY PLAN (SQLite syntax; in PostgreSQL/MySQL it's EXPLAIN):

EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE customer_id = 11;

SCAN orders → full scan (no useful index).
SEARCH orders USING INDEX idx_orders_customer → it uses the index!

It's the #1 tool for diagnosing slow queries: first measure, don't guess.

Examples

Create an index and query using it

CREATE INDEX idx_orders_customer ON orders (customer_id);

SELECT id, total
FROM orders
WHERE customer_id = 11
ORDER BY id;

Diagnose the execution plan

EXPLAIN QUERY PLAN
SELECT * FROM orders WHERE customer_id = 11;