Hash Join
Introduction
In this article, we will learn about Hash Join, which is the preferred algorithm when joining two large tables without indexes, or when the data volume exceeds the optimization capability of a Nested Loop.
How It Works
- Step 1: Build Phase
- First, PostgreSQL selects the smaller table to load into memory (RAM).
- Next, it takes each item and passes it into a Hash Function to receive a hash number. Postgres relies on this value to sort the results into the corresponding bucket, resulting in the creation of a Hash Table stored in RAM.
- Step 2: Probe Phase
- Now, PostgreSQL scans the larger table, also taking each item to pass into the Hash Function just like in step 1.
- Once the hash number is obtained, it only needs to access the corresponding bucket in the Hash Table to check.
Hash Collision
- When using a Hash Function, there are still cases where two different values produce the same result, which is called a Hash Collision. In this case, values with the same Hash Number will reside in the same bucket.
- Furthermore, when the column used for the join has too many duplicate values, it leads to the creation of many identical Hash Numbers.
- Postgres still has a mechanism to run statistics to identify these most frequently occurring values, known as Most Common Values (MCV).
- When creating the hash table (Build Phase), Postgres dedicates a special memory region called a Skew Bucket to specifically hold these highly duplicated values.
- When scanning the remaining table (Probe Phase), Postgres checks this Skew Bucket area first. Because the duplicate data is located here, it can match a massive amount of data extremely fast without needing to split or spill data to disk (Multi-batch).
Advantages
- When using a Hash Join, there is no need to loop multiple times between tables like a Nested Loop Join, as each table only needs to be read once.
- Read the first table to create the Hash Table.
- Read the next table and use the previously created Hash Table for comparison.
- Hash Join has a complexity of O(1). Since the number of items in a single bucket is very small, after obtaining the Hash Number, accessing the bucket allows retrieving the value almost immediately.
Limitations
- When using a Hash Join, creating a Hash Table in RAM first is mandatory. If the table selected for hashing is too large (exceeding the work_mem capacity), Postgres will have to split the Hash Table and write the remainder to disk. At this point, the speed will decrease significantly because each check requires loading these temporary files from disk into RAM sequentially before completing all comparisons.
- Hash Join only works when you join using an equality condition (such as A.id = B.id). Because the generated Hash number is random and cannot be used for comparison, if other comparison types like greater than or less than (A.id > B.id) are used, Postgres will have to select a different join algorithm.
Prerequisites
The queries in this article will be reused from previous articles, which you can review if any information is missing.
Detail
First, let us execute and analyze the following queries:
-- Query 1
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
INNER JOIN orders ON users.id = orders.user_id;
-- Query 2
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
LEFT JOIN orders ON users.id = orders.user_id;
-- Query 3
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
RIGHT JOIN orders ON users.id = orders.user_id;
-- Query 4
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id;
The results are as follows:
- Query 1:
- First, a
Seq Scanis performed on theuserstable. - Then, the
Hashstep is executed to create aHash Tablebased on theidcolumn of theuserstable. - Next, a
Seq Scanis performed on theorderstable to load it into RAM. - Afterward, the
Hash Inner Joinis executed, taking eachuser_idfrom theorderstable to generate aHash Numberand then verifying it within theHash Table. - When the corresponding value is found, it returns the information (
username,order_idandtotal_price) of that row.
- First, a
- Query 2: Most steps remain unchanged here, you only need to notice one point: the chosen algorithm is
Hash Right Join.- Because when using a
Hash Join, priority is always given to creating aHash Tablefor the smaller table, and theuserstable with fewer records is the chosen target. - When using a Hash Join, the order of the tables is arranged as follows:
Probe table: The table used for comparison, located on the left.Build table: The hash table created first for the small table, located on the right.
- Therefore, when you write a query like
users LEFT JOIN orders, sinceusersis on the left side, it does not match theBuild table. Postgres will automatically convert the query intoorders RIGHT JOIN users. Thus,usersis on the right side, which is why it is called aHash Right Join.
- Because when using a
- Query 3: Explaining similarly to the above, when querying
users RIGHT JOIN orders,usersis on the left and is swapped to the right by Postgres to become theBuild tableas follows:orders LEFT JOIN users. Therefore, the algorithm used isHash Left Join. - Query 4: Using
Hash Full Join, it still selects theuserstable to create theHash Table.
Happy coding!
Comments
Post a Comment