Using Hash Index

Introduction

In this article, we will explore another type of index, which is the Hash index. True to its name, when using this index type, Postgres passes the value to be indexed through a hash function to generate a number ranging from 0 to 4,294,697,295 with a uint32 (4 bytes) type.

Index Creation Process

As soon as you create a Hash index, Postgres pre-allocates Pages to store the indexes.

If the table does not have data yet, Postgres will still create:
- 1 Metapage: Always located at the first position (Block 0). This page stores the configuration of the index.
- 1 Bitmap page: Used to manage which pages in the file are currently empty to avoid Overflow pages.
- 10 default Primary Bucket Pages.
- Thus, right from the start, your newly created Hash Index file will have a size of 12 Pages x 8KB/Page = 96 KB.
If the table already contains data, based on the configuration information about the Ram limit that Postgres can use, it will create the maximum number of Pages (that it can create) to accommodate the existing column data.
Next, the hash value is converted into a number, then this number is divided by the number of existing Buckets to get the remainder. The result is precisely the Bucket that will hold that Index (for example, 100%30=3, then the index is stored in Bucket 3).
Modulo division to store into Buckets like this is a measure to attempt to distribute the index randomly and evenly across the Buckets.

Distinguish between Page and Bucket as follows:

A Page with an 8KB size is a storage concept on the physical disk.
A Bucket is a logical concept created by Postgres to manage Pages. 1 Bucket will point to the first Page and link to other Pages in the same Bucket (the Pages are actually the linked list of the Bucket).

Based on the above Hash index creation method, you can recognize the following issues:

Identical values will definitely produce identical hash numbers.
After hashing, it generates a uint32 which is only 4 bytes, while the stored values are infinite, so two different values can still generate the same number after hashing.
The Page size is fixed, so when inserting more data and running out of empty Pages, what happens?

Overflow Page

This is precisely the problem that occurs when having to store a lot of identical data, leading to identical hash values (called Hash Collision) and it stores too much into just 1 Page.
A Page has a fixed size, so when there is no more empty space to store, it is called an overflow and it will create an additional new Page to hold the new index.
That newly created Page will be linked with the existing old Page, forming a linked list within the Block.
This causes a huge impact on performance because instead of efficiently finding just 1 Page in a Block, Postgres must use a sequential search through each Page in the Block until it is found.

Bucket Split

This is the exact solution for the Overflow Page by creating more Buckets and redistributing data from the old Bucket, so that each Bucket only has 1 Page to help make querying efficient.
However, the mechanism of Bucket Split only works with unique data and it is not effective with duplicate data.
Therefore, if your data has columns containing too many duplicate values (such as gender or status), you should not choose a Hash Index, because at this point, the cost of using the Index will be even higher than a seq scan, causing the Cost-based optimizer to skip selecting the index for use, resulting in wasted index storage.

Advantages

The cost to search for 1 value in a Hash Index is O(1), which is much lower compared to a Btree Index which is O(log N).

After finding the Bucket in the Hash Index, a comparison of the hash values must still be performed to find the TID
However, this is a fixed cost because the Page has a fixed size, so it is not factored into the algorithm's complexity (except in the case of an Overflow Page).

No cost is required to balance the tree.
The index only stores the Hash number, so the size will be smaller compared to Btree.

Limitations

As mentioned above, if the data duplicates too much, the cost of using the Index is even higher than a Seq Scan.
At the same time, when data duplicates too much, when there is a change in data, updating back into the index also consumes a lot of cost (due to having to traverse through the Pages in the Bucket).
It does not support diverse comparison types like Btree (>, <, >=, <=).

Use Cases

Based on the above characteristics, you should choose a Hash Index when meeting the following conditions:

Unique data such as username, email, phone...
Used in cases of = comparison, because Hash Index only supports this.
The dataset is large enough.
- Because if the data is small, Postgres might choose Seq Scan instead of using the Index and you can replace it with a Btree Index with performance that is not much slower while also supporting other comparison types.
- Only when the data is large enough, leading to the inability to load the entire Btree Index into Ram (as Btree Index is larger than Hash Index) and the cost of balancing the tree becomes too large (Hash Index does not incur tree balancing costs), will the Hash Index achieve its best efficiency. Thus, in practice, you should choose a Btree Index for most cases, only when the data is large enough and you analyze and see that Ram is not sufficient to load the entire Btree Index anymore should you consider using a Hash Index.

Detail

First, let's create a table and add a Hash Index to the columns as follows:

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    username TEXT,
    status TEXT
);
CREATE INDEX idx_hash_user_username ON users USING hash (username);
CREATE INDEX idx_hash_user_status ON users USING hash (status);

After that, you seed data in, try to add unique data to the username column and for the status column, you only need to add a few fixed values.

Next, let's go into detail to see what the content of the Meta Page (Page number 0) of a Hash Index looks like:

CREATE EXTENSION pageinspect;
SELECT * FROM hash_metapage_info(get_raw_page('idx_hash_user_username', 0));

In Postgres, the Meta Page does not contain any search data, instead, it plays the role of the controlling Brain – containing all configuration metrics and the memory allocation state of the Index file.

It can be seen that there is a lot of information, but you only need to focus on the group of metrics regarding data scale:

ntuples = 1.000.000: Is the number of records that the Index is managing.
maxbucket = 3583: Is the number of Buckets, counting from 0 to 3583, the Index data is being randomly distributed into these Buckets.
ffactor = 307 (Fill Factor): Is the density target. Postgres calculates so that each Bucket contains an average of about 307 records. When the amount of data increases causing the average record count to exceed this number, it will automatically trigger the Split Bucket mechanism to expand the Index, limiting Overflow pages.

You can view information about the Data after Hashing, the CTID in the Index and the corresponding record in the Heap as follows:

SELECT * FROM hash_page_items(get_raw_page('idx_hash_user_username', 1));

SELECT 
    ctid,
    *
FROM users
WHERE ctid IN (
    SELECT ctid 
    FROM hash_page_items(get_raw_page('idx_hash_user_username', 1))
);

When querying with a condition on the username column:

select * from users where username = 'Ira41_0_0'

You will see the result is an Index Scan which includes both Scanning the Index and a Heap Scan. It will consist of 2 steps as follows:

Step 1 (At Hash Index): Postgres hashes the string 'Ira41_0_0' into a hash number, then based on this value to calculate the Bucket and retrieve the CTID, because inside the Index at this time there is no data of any column from the users table, so it needs to proceed to step 2.
Step 2 (At Heap table): Based on the existing CTID including the Page number and the row number, Postgres will access the Heap table to get the data of that row to return. It is necessary to distinguish this from an Index Only Scan, which means it only operates within the Index, this case only happens when the Index already has enough data that you need, so Postgres does not need a Heap scan.

But if querying by the status column like:

select * from users where status = 'inactive'

You will see the result is a Seq Scan, which as explained above is due to the data in the status column being duplicated too much leading to overflow pages, the Cost-based Optimizer realizes that scanning in the Index costs even more than a Seq Scan so it bypassed using the Index.

Happy coding!

See more articles here.