How Index and B-Tree Index Work

Introduction

Index

When creating an Index, PostgreSQL creates a separate physical file on the disk. Each index has a Relfilenode which is a unique identifier number. An index does not store the entire information of a row but only contains Index Entries, each entry including:

Key: The value of the column you index.
TID (Tuple Identifier): A physical pointer consisting of a BlockNumber and an OffsetNumber, used for reference to point to the location of that row in the main table (HEAP).

When executing a query, the Query Planner Cost Model in Postgres will calculate to choose between reading data from the index or retrieving it directly from the HEAP (Sequential Scan).

{Index Scan Cost} = {Index file reading cost} + {Random block reading cost in the table}
{Seq Scan Cost} = {Sequential block reading cost in the table}

Because the Index only contains the TID to point to the data in the main table, after loading the index content and finding the necessary values, it must perform random I/O according to the TID in the index to access the main table and retrieve the corresponding row content. This leads to the following cases:

If you index and query suitable fields already available in the index, Postgres only needs to read from the index to get data without using the TID to access the main table, which is the best case and the speed will be very fast.
If you need to retrieve data from columns that do not have an index but the data amount is moderate, Postgres uses the index to quickly search for suitable values and then relies on the TID to get the row in the main table, which is also fast.
If you need to query too many column data not present in the index, the Planner Cost Model will calculate. If it finds that reading the index and then reading the main table to get too much data (usually exceeding 10% to 15% of the total rows of the table, depending on hardware configuration and row size), it will not use the Index anymore and switch to Seq Scan directly in the main table instead.

Postgres supports many different types of Indexes, but first we will learn about the B-Tree Index.

B-Tree Index

B-Tree (Balanced Tree) is the default and most popular index type in PostgreSQL. It performs storage according to the B-tree data structure, making data retrieval very fast and efficient compared to a seq scan in the HEAP.
A B-Tree is divided into multiple levels, including the following components:

Root Node: The first touchpoint when you search. It contains value ranges to guide you down to the lower level.
Internal Nodes: The intermediate levels also contain value ranges and TIDs to corresponding nodes. If a value matches, it will rely on that TID to go to the next node.
Leaf Nodes: The bottom level of the tree. This is where column values are stored along with a TID pointing to the position of that data row on the HEAP.

B-Trees can self-balance to always automatically keep the height of the tree equal across all branches. This means whether you search for the smallest or largest number, the number of movement steps (number of disk reads) is the same.

The search process with an index

Note that a node in a B-Tree contains multiple values from the indexed column, but the number of those values is not fixed and depends on the size of the column data type.
You can calculate as follows: {number of values in 1 node} = {size of 1 page (~8kb)} / {size of column data type}

Thus, with columns having small data type sizes like INT (4 bytes), a node can contain more than types like BIGINT (8 bytes).
A node containing more values means the B-Tree will have fewer nodes and fewer levels, which leads to a faster tree traversal speed when searching for values in the index (due to processing fewer nodes).

Once the node containing the values is found, Postgres will use a Binary search to find values according to the condition, so it will be extremely fast.
As you can see, to optimize performance when using a B-Tree Index, you should place the index on columns with data types that are as compact as possible.

Example

B-Tree Structure Components

Each Index Entry consists of 2 main components:

Key: The value of the indexed column (for example, using a BIGINT type takes 8 bytes).
Pointer:
- At upper levels (Root, Internal): The pointer points to the Page ID of the next child level.
- At the bottom level (Leaf): The pointer is the Tuple ID (TID), which consists of a Page and an Offset to locate the exact row within the Heap.

To simplify the math, let's assume an average of 24 bytes/entry (including both Key + Pointer). The fan-out of an 8KB Page will be $8000 / 24 \approx 333$ entries/Page. This means any given Page in the tree can branch out to a maximum of 333 child nodes.

B-Tree Model

Assuming a table has 100 million records, with a branching factor (fan-out) of 333, the tree structure distributes from top to bottom as follows:

Level 1 (Root Node): Contains only a single Page holding a maximum of 333 Page IDs to branch down to the 333 Pages of Level 2.
Level 2 (Internal Nodes): Contains a maximum of 333 Pages, where each Page holds another 333 Page IDs. The total number of Page IDs at this level is 333 x 333 = ~110,000 to branch down to Level 3.
Level 3 (Internal Nodes): Contains a maximum of 110,000 Pages, with each Page continuing to hold 333 Page IDs. The total number of Page IDs at this level is 110,000 x 333 = ~36,600,000 to branch down to the Leaf Level.
Level 4 (Leaf Nodes): Contains a maximum of 36,600,000 Pages. This is the level that holds the actual TIDs. The maximum number of data rows this level can manage is: 36,600,000 x 333 rows/Page = ~12 billion rows.

How Index Searching Works

Suppose you run the following statement: SELECT * FROM table_name WHERE id = 87,654,321. The B-Tree algorithm in Postgres will perform a Binary Search on each Page to navigate rapidly:

Step 1: Read the Root Page (Level 1)
- Postgres loads the Root Page into memory. This page contains value ranges, for example: [up to 10M -> points to Page A], [10M to 50M -> points to Page B], [50M to 90M -> points to Page C].
- The number 87,654,321 falls within the 50M - 90M range. Postgres knows it needs to move down to Page C at Level 2.
Step 2: Read Page C (Level 2)
- Postgres loads Page C. This page further subdivides the value range from 50M to 90M into 333 smaller ranges.
- Using a comparison algorithm, Postgres finds the range containing 87,654,321 and retrieves the ID for Page X at Level 3.
Step 3: Read Page X (Level 3)
- Similarly, Page X subdivides the value range even deeper. Postgres scans the entries and determines that 87,654,321 must reside in Leaf Page Y (Level 4).
Step 4: Read Leaf Page Y (Level 4)
- Postgres loads Leaf Page Y. Here, it locates the exact Key: 87,654,321.
- Attached to this Key is the TID value (e.g., Block/Page: 45102, Offset: 12).

By the end of the index lookup, Postgres has found the exact TID by reading only 4 Pages within the index. After that, it only needs to use this TID to perform one additional read on Page 45102 on the disk to fetch the full row data.

You can see the visual model below:

When changing data

When data changes, the B-Tree will rearrange the nodes to balance the tree as follows:

When INSERT: Postgres will find the appropriate leaf node to insert the new value. If that leaf node is full, it will perform Node Splitting to split that node in half, meaning adding an intermediate node and pushing it to the upper level.
When DELETE: Because Postgres uses the MVCC (Multi-Version Concurrency Control) mechanism, when the value to be deleted is found, it only adds a dead tuple flag so that subsequent query operations will skip this value instead of deleting it completely on disk.
When UPDATE: This is performed similarly to Delete + Insert.

Similar to changing data in the HEAP, the Autovacuum mechanism will run in the background to delete dead values. After deletion, if that node is empty (because all values have been deleted), postgres will perform node merging to reduce the number of nodes.

Thus, you can see that this tree self-balancing helps queries in the index run very fast, but it will also significantly affect performance if your query leads to splitting or merging nodes.
Because this action is performed at the leaf node and will propagate up to the intermediate node and then to the root node, changing the B-Tree structure.
Therefore, the larger your data, the larger the B-Tree will be and balancing the tree will consume a correspondingly large processing cost.

Index Segmentation

Just like a Heap file, an Index file stored on disk also has a default limit of 1GB, if an Index is larger than this limit, it must be split into multiple files.
However, splitting an Index like that does not affect the logical structure of the B-tree, because Postgres has an automatic translator from the Page number to the exact position in a specific file on disk.
- For example, an Index is split into 3 files named file_index , file_index.1 and file_index.2.
- Suppose the TID storing the Page is 200.000.
- Each Index file is 1GB, 1 Page is 8KB, so 1 Index file has 131.072 Pages.
- Calculate 200.000 / 131.072 = 1.52.
- The integer part is 1, which means this Page is located in file_index.1.
- The remainder is 200.000 - 131.072 = 68.928, which means it is located at the 68.928th position inside that file.
- Postgres simply opens file_index.1, then seeks to the 68.928th Page to read the data of that node.

Data type

Btree supports well with popular data type groups in Postgres as follows:

Numeric Types:

Integers: smallint, integer, bigint.
Decimals/Reals: numeric, decimal, real, double precision.
Very fast when searching for IDs, prices, quantities, scores...

String Types: char, varchar, text

How B-Tree handles: It sorts these strings alphabetically (Alphabetical / Collation).
Used for exact matching of names, product codes or prefix searches like WHERE name LIKE 'NameValue%'.

Date/Time Types: date, time, timestamp, timestamptz (time with time zone), interval

Used in filtering data by date, finding system logs, revenue reports by month/year.

Boolean Type: true / false

Usually, a separate B-Tree index is rarely used for a Boolean column because the data distinctness is too low (only 2 values). However, it is extremely useful when making a Composite Index like (status, created_at).

UUID Type

B-Tree runs very well with UUID. However, if UUID v4 is used, it usually generates random data, which will cause continuous Node Splitting. Therefore, the current trend is to use UUID v7 (with an incremental time element) for better B-Tree operation.

Binary Data: bytea.

Btree Index and Primary key

In PostgreSQL, when you create a table and specify a column as a Primary Key, Postgres automatically does two things:

Creates a Unique Constraint on that column.
Automatically creates a B-Tree index to serve this constraint.

Thus, the Primary Key in Postgres always uses the B-Tree index by default and cannot be replaced by any other Index type (such as Hash, GIN,...) and in addition to the data types provided above, if you use any data type that Btree cannot operate on to make a primary key, an error will be reported during initialization.

Advantages of using Btree index as primary key:

Super-speed Uniqueness Check: When you INSERT a new id, Postgres needs to check if this id already exists. Thanks to the B-Tree structure, it only takes a few steps to know if that id is already in the table to block it if duplicated.
Fastest Record Search: Statements like WHERE id = value occur continuously, so B-Tree helps find that data row very quickly.
Sorting Support: Primary Keys are usually Sequence numbers or UUIDs. B-Tree is extremely suitable for managing and sorting this ordered data.

Use case

B-Tree Index will be effective for the following cases:

Exact search (=) and comparison operators like <, <=, >, >=, IN.
Range search (for example, WHERE date_value BETWEEN '2026-01-01' AND '2026-06-01')
Data sorting (ORDER BY) because the nature of B-Tree leaf nodes is already sorted from left to right.

Detail

CREATE TABLE Product (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  price DOUBLE PRECISION NOT NULL,
  "categoryId" INTEGER NOT NULL
);

After inserting an amount of data, you can check that the Btree index has been created by default for the primary key of that table like this:

SELECT
    i.relname AS "Index Name",
    ix.indisunique AS "Unique?",
    ix.indisprimary AS "PK?",
    pg_size_pretty(pg_relation_size(i.oid)) AS "Volume",
    idx.indexdef AS "Create Index Query"
FROM
    pg_index ix
JOIN
    pg_class t ON t.oid = ix.indrelid
JOIN
    pg_class i ON i.oid = ix.indexrelid
JOIN
    pg_namespace n ON n.oid = t.relnamespace
JOIN
    pg_indexes idx ON idx.indexname = i.relname AND idx.schemaname = n.nspname
WHERE
    t.relkind = 'r' 
    AND n.nspname = 'public' 
    AND t.relname = 'Product'
ORDER BY 
    pg_relation_size(i.oid) DESC;

The columns are all easy to understand so I will not explain much, only the Unique column shows whether duplicate data between rows is allowed and since this is a primary key, it certainly cannot be duplicated.

Next, when querying data, you can use EXPLAIN or EXPLAIN ANALYZE with the following meanings:

EXPLAIN:

Does not run the statement. Postgres only uses the Query Planner and old statistics to predict the execution plan.
Therefore, the response speed is very fast no matter how much data your query has to process.
There is no risk to data because it does not actually run the query.
Accuracy is for reference only because it is just a cost estimation.
Displayed information is general, including Scan plan, expected rows (Rows), expected cost (Cost).

EXPLAIN ANALYZE

Actually runs the statement in the Database to measure actual parameters, then outputs the report.
Response speed depends on whether the query has to process a lot or a little.
When running with insert/update/delete queries, data will be actually changed, so be careful in production environments.
High accuracy because this is actual execution.
Displayed information includes all information when using EXPLAIN plus Actual time, actual rows returned, memory/disk consumption.

EXPLAIN SELECT * FROM public."Product" where id = 100

EXPLAIN ANALYZE
SELECT * FROM public."Product" where id = 100

When running EXPLAIN

The content is very clear as you can see:

Postgres used Index Scan with Product_pkey as the index name of the Product table.
cost=0.42..8.44:

0.42 (Start-up cost) is the cost consumed for the system to find the Index and prepare to scan.
8.44 (Total cost) is the estimated score to complete the entire statement and return data to you.

rows=1: Postgres predicts this statement will return exactly 1 data row (Since id is the primary key, each ID has only 1 unique row).
width=38: The average size of the returned data row is estimated to be 38 bytes, which is the value of all columns.

When running EXPLAIN ANALYZE

actual time=0.034..0.035

0.034 ms: The actual time for Postgres to find the first row satisfying the condition.
0.035 ms: The actual time for Postgres to complete scanning and retrieve the entire data row.

rows=1.00: The number of actual data rows found (Perfectly matches the rows=1 prediction when using EXPLAIN).
loops=1: This Index Scan step only needs to be executed exactly 1 time (If the statement has loops or complex JOINs, the number of loops might be higher).
Index Searches: 1: Meaning Postgres only needs to perform 1 single lookup on the B-Tree Index to find the position of the id. This is the most optimal level.
Buffers: shared hit=4: indicates how Postgres interacts with RAM memory (Shared Buffers):

shared hit=4: Postgres read a total of 4 data blocks (each block is 8KB by default) and all 4 blocks were already in RAM (hit), no need to spend effort reading from Disk.
This is extremely good because reading data from RAM is much faster than reading from disk. If the result shows read=4 instead of hit=4, it means Postgres had to load from disk to get data.

Planning Time: 0.188 ms is the time for the Postgres scheduler to think, calculate alternatives and offer the most optimal execution plan for the SQL query.
Execution Time: 0.093 ms is the actual time for the Database to run the query, scan the Index, get data from RAM and return results.
The total actual execution time of this query is: 0.188 + 0.093 = 0.281 ms.