Hash Join

Introduction

In this article, we will learn about Hash Join, which is the preferred algorithm when joining two large tables without indexes, or when the data volume exceeds the optimization capability of a Nested Loop.

How It Works

Step 1: Build Phase
- First, PostgreSQL selects the smaller table to load into memory (RAM).
- Next, it takes each item and passes it into a Hash Function to receive a hash number. Postgres relies on this value to sort the results into the corresponding bucket, resulting in the creation of a Hash Table stored in RAM.
Step 2: Probe Phase
- Now, PostgreSQL scans the larger table, also taking each item to pass into the Hash Function just like in step 1.
- Once the hash number is obtained, it only needs to access the corresponding bucket in the Hash Table to check.

Hash Collision

When using a Hash Function, there are still cases where two different values produce the same result, which is called a Hash Collision. In this case, values with the same Hash Number will reside in the same bucket.
Furthermore, when the column used for the join has too many duplicate values, it leads to the creation of many identical Hash Numbers.
- Postgres still has a mechanism to run statistics to identify these most frequently occurring values, known as Most Common Values (MCV).
- When creating the hash table (Build Phase), Postgres dedicates a special memory region called a Skew Bucket to specifically hold these highly duplicated values.
- When scanning the remaining table (Probe Phase), Postgres checks this Skew Bucket area first. Because the duplicate data is located here, it can match a massive amount of data extremely fast without needing to split or spill data to disk (Multi-batch).

Advantages

When using a Hash Join, there is no need to loop multiple times between tables like a Nested Loop Join, as each table only needs to be read once.
- Read the first table to create the Hash Table.
- Read the next table and use the previously created Hash Table for comparison.
Hash Join has a complexity of O(1). Since the number of items in a single bucket is very small, after obtaining the Hash Number, accessing the bucket allows retrieving the value almost immediately.

Limitations

When using a Hash Join, creating a Hash Table in RAM first is mandatory. If the table selected for hashing is too large (exceeding the work_mem capacity), Postgres will have to split the Hash Table and write the remainder to disk. At this point, the speed will decrease significantly because each check requires loading these temporary files from disk into RAM sequentially before completing all comparisons.
Hash Join only works when you join using an equality condition (such as A.id = B.id). Because the generated Hash number is random and cannot be used for comparison, if other comparison types like greater than or less than (A.id > B.id) are used, Postgres will have to select a different join algorithm.

Prerequisites

The queries in this article will be reused from previous articles, which you can review if any information is missing.

Detail

First, let us execute and analyze the following queries:

-- Query 1
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
INNER JOIN orders ON users.id = orders.user_id;

-- Query 2
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
LEFT JOIN orders ON users.id = orders.user_id;

-- Query 3
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users 
RIGHT JOIN orders ON users.id = orders.user_id;

-- Query 4
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id;

The results are as follows:

Query 1:
- First, a Seq Scan is performed on the users table.
- Then, the Hash step is executed to create a Hash Table based on the id column of the users table.
- Next, a Seq Scan is performed on the orders table to load it into RAM.
- Afterward, the Hash Inner Join is executed, taking each user_id from the orders table to generate a Hash Number and then verifying it within the Hash Table.
- When the corresponding value is found, it returns the information (username, order_id and total_price) of that row.
Query 2: Most steps remain unchanged here, you only need to notice one point: the chosen algorithm is Hash Right Join.
- Because when using a Hash Join, priority is always given to creating a Hash Table for the smaller table, and the users table with fewer records is the chosen target.
- When using a Hash Join, the order of the tables is arranged as follows:
  - Probe table: The table used for comparison, located on the left.
  - Build table: The hash table created first for the small table, located on the right.
- Therefore, when you write a query like users LEFT JOIN orders, since users is on the left side, it does not match the Build table. Postgres will automatically convert the query into orders RIGHT JOIN users. Thus, users is on the right side, which is why it is called a Hash Right Join.
Query 3: Explaining similarly to the above, when querying users RIGHT JOIN orders, users is on the left and is swapped to the right by Postgres to become the Build table as follows: orders LEFT JOIN users. Therefore, the algorithm used is Hash Left Join.
Query 4: Using Hash Full Join, it still selects the users table to create the Hash Table.