Merge Join

Introduction

In this article, we will learn about Merge Join.

How It Works

Simply put, how Merge Join works is like sorting two lists in ascending order first, then placing them side by side for comparison. Assuming we need to compare the id column of two tables, A and B, the specific process is as follows:

Step 1: Sort Phase
- PostgreSQL must ensure that both lists are already sorted by the order of the join column (for example, from smallest to largest).
- If the two tables are already sorted, such as when the join column is a Primary Key or has an Index, Postgres will skip this step and jump to step 2.
Step 2: Merge Phase
- PostgreSQL will place two pointers starting from the first row of the two tables and begin comparing:
  - If the results on both sides are equal: Return the result, then move the pointer in table B to the next row to continue checking (since data in a column can be duplicated).
  - If the id in table A is smaller: Move the table A pointer to the next row.
  - If the id in table A is larger: Move the table B pointer to the next row.
Thanks to both lists being sorted in order, the two pointers just chase each other from start to finish. Each table only needs to be read exactly once, with no need to go back and check from the beginning.

Advantages

If both tables are joined based on columns that already have an Index, Postgres will skip the Sort step. At this point, using Merge Join is the most optimal, even better than Hash Join because it does not incur the cost of creating a Hash Table in RAM.
Supports comparison operators like <, >, <=, >= rather than being limited to just equality comparisons like Hash Join.
Suitable for extremely large tables that exceed RAM.
- The limitation of Hash Join is that when the Hash Table is too large, it causes RAM overflow and must be written to disk and then each time it checks, it must load all data from disk to RAM.
- Merge Join is different, when RAM is insufficient to hold the data, it will use an External Sort algorithm to disk and only needs to store the information of the row currently being compared and a few subsequent rows as a buffer in RAM.

Limitations

If you have to join 2 tables where the columns do not have an index, choosing Merge Join at this point will incur additional CPU cost to sort the lists, so Postgres may choose Hash Join because it only needs to create a Hash Table in RAM.

Prerequisites

The queries in this article will be reused from previous articles, you can review them if any information is missing.

Detail

First of all, to activate Merge Join, certain conditions must be met, such as the column used for comparison must have an Index, like a B-Tree Index, which means it is already sorted so Merge Join will not incur sorting costs.
Next, the amount of data used for the Join must be large enough for the Query Planner to give up using Hash Join (since creating a Hash Table for large data will cause insufficient memory to operate inefficiently).
Therefore, you should seed a large amount of data beforehand and create an Index for the orders table as follows (since user_id in orders is a foreign key and by default is not an Index, the data is not sorted).

CREATE INDEX idx_orders_user_id ON orders(user_id);

After that, execute the query as follows:

-- Query 1
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
INNER JOIN orders ON users.id = orders.user_id
LIMIT 100

-- Query 2
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
LEFT JOIN orders ON users.id = orders.user_id
LIMIT 100

-- Query 3
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users 
RIGHT JOIN orders ON users.id = orders.user_id
LIMIT 100

-- Query 4
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id
LIMIT 100

Query 1: Because the user_id column of the orders table and the id column of the users table both already have an Index and are pre-sorted, Postgres will rely on these 2 Indexes and use Merge Inner Join to compare them directly.
- When there is a result, it will rely on the TID of the Index to load the corresponding data in the Heap.
- Please pay attention to the LIMIT 100 part, if you remove this part, you will see that the algorithm switches to Hash Inner Join.
  - The reason is that Merge Join has a Startup cost of 0, comparing by Index but still having to rely on TID to load data in the Heap, which is called Random I/O and is very slow, only suitable when the amount of data to retrieve is small enough, which is why there is a LIMIT 100.
  - If the data to retrieve is large enough, about more than 15% of the table, Postgres will choose Hash Join, which although costs Startup code (to create the Hash Table), uses Seq Scan so it will be faster and more efficient when retrieving a large amount of data.
Query 2: Explained similarly to the above, the algorithm used is Merge Left Join.
Query 3: Here, the query is RIGHT JOIN but also uses the Merge Left Join algorithm because changing the join order in this case does not change the result and does not incur additional costs, so Postgres chooses to bring it to the standard form of the system for processing.
Query 4: Uses Merge Full Join.

You can use the following queries to change Postgres's use of Join algorithms to test the performance between them. Note that this method is only used in a Development environment, while in reality, Postgres's Query Planner has automatically handled it very efficiently based on your query and dataset to select the most optimal solution.

SET enable_hashjoin = off;
SET enable_nestloop = off;
SET enable_mergejoin = off;

RESET enable_hashjoin;
RESET enable_nestloop;
RESET enable_mergejoin;

You can see that if used with Hash Join or Nest Loop Join, the processing time will be slower.