Merge Join
Introduction
In this article, we will learn about Merge Join.
How It Works
Simply put, how Merge Join works is like sorting two lists in ascending order first, then placing them side by side for comparison. Assuming we need to compare the id column of two tables, A and B, the specific process is as follows:
- Step 1: Sort Phase
- PostgreSQL must ensure that both lists are already sorted by the order of the join column (for example, from smallest to largest).
- If the two tables are already sorted, such as when the join column is a Primary Key or has an Index, Postgres will skip this step and jump to step 2.
- Step 2: Merge Phase
- PostgreSQL will place two pointers starting from the first row of the two tables and begin comparing:
- If the results on both sides are equal: Return the result, then move the pointer in table B to the next row to continue checking (since data in a column can be duplicated).
- If the id in table A is smaller: Move the table A pointer to the next row.
- If the id in table A is larger: Move the table B pointer to the next row.
- PostgreSQL will place two pointers starting from the first row of the two tables and begin comparing:
- Thanks to both lists being sorted in order, the two pointers just chase each other from start to finish. Each table only needs to be read exactly once, with no need to go back and check from the beginning.
Advantages
- If both tables are joined based on columns that already have an Index, Postgres will skip the Sort step. At this point, using Merge Join is the most optimal, even better than Hash Join because it does not incur the cost of creating a Hash Table in RAM.
- Supports comparison operators like
<,>,<=,>=rather than being limited to just equality comparisons likeHash Join. - Suitable for extremely large tables that exceed RAM.
- The limitation of Hash Join is that when the Hash Table is too large, it causes RAM overflow and must be written to disk and then each time it checks, it must load all data from disk to RAM.
- Merge Join is different, when RAM is insufficient to hold the data, it will use an External Sort algorithm to disk and only needs to store the information of the row currently being compared and a few subsequent rows as a buffer in RAM.
Limitations
If you have to join 2 tables where the columns do not have an index, choosing Merge Join at this point will incur additional CPU cost to sort the lists, so Postgres may choose Hash Join because it only needs to create a Hash Table in RAM.
Prerequisites
The queries in this article will be reused from previous articles, you can review them if any information is missing.
Detail
- First of all, to activate Merge Join, certain conditions must be met, such as the column used for comparison must have an Index, like a B-Tree Index, which means it is already sorted so Merge Join will not incur sorting costs.
- Next, the amount of data used for the Join must be large enough for the Query Planner to give up using Hash Join (since creating a Hash Table for large data will cause insufficient memory to operate inefficiently).
- Therefore, you should seed a large amount of data beforehand and create an Index for the
orderstable as follows (sinceuser_idinordersis a foreign key and by default is not an Index, the data is not sorted).
CREATE INDEX idx_orders_user_id ON orders(user_id);
After that, execute the query as follows:
-- Query 1
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
INNER JOIN orders ON users.id = orders.user_id
LIMIT 100
-- Query 2
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
LEFT JOIN orders ON users.id = orders.user_id
LIMIT 100
-- Query 3
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
RIGHT JOIN orders ON users.id = orders.user_id
LIMIT 100
-- Query 4
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id
LIMIT 100
- Query 1: Because the
user_idcolumn of theorderstable and theidcolumn of theuserstable both already have an Index and are pre-sorted, Postgres will rely on these 2 Indexes and useMerge Inner Jointo compare them directly.- When there is a result, it will rely on the
TIDof theIndexto load the corresponding data in the Heap. - Please pay attention to the
LIMIT 100part, if you remove this part, you will see that the algorithm switches toHash Inner Join.- The reason is that
Merge Joinhas a Startup cost of 0, comparing byIndexbut still having to rely onTIDto load data in the Heap, which is calledRandom I/Oand is very slow, only suitable when the amount of data to retrieve is small enough, which is why there is aLIMIT 100. - If the data to retrieve is large enough, about more than
15%of the table, Postgres will chooseHash Join, which although costs Startup code (to create theHash Table), usesSeq Scanso it will be faster and more efficient when retrieving a large amount of data.
- The reason is that
- When there is a result, it will rely on the
- Query 2: Explained similarly to the above, the algorithm used is
Merge Left Join. - Query 3: Here, the query is
RIGHT JOINbut also uses theMerge Left Joinalgorithm because changing the join order in this case does not change the result and does not incur additional costs, so Postgres chooses to bring it to the standard form of the system for processing. - Query 4: Uses
Merge Full Join.
You can use the following queries to change Postgres's use of Join algorithms to test the performance between them. Note that this method is only used in a Development environment, while in reality, Postgres's Query Planner has automatically handled it very efficiently based on your query and dataset to select the most optimal solution.
SET enable_hashjoin = off;
SET enable_nestloop = off;
SET enable_mergejoin = off;
RESET enable_hashjoin;
RESET enable_nestloop;
RESET enable_mergejoin;
You can see that if used with Hash Join or Nest Loop Join, the processing time will be slower.
Happy coding!
Comments
Post a Comment