Join multiple table

Introduction

When joining multiple tables together (such as 3 tables or more), PostgreSQL never joins all tables at the same time, but at any single point in time, it can only join 2 datasets. Suppose you need to join multiple tables, the process of bridging steps will be performed as follows

First, Postgres will select 2 tables to join together to create an Intermediate Result
Postgres will treat the Intermediate Result as a completely new table and will use it to join with the next table to create the next Intermediate Result
This process repeats over and over until all tables have been joined.

Algorithms used

When joining multiple tables, Postgres still uses algorithms like Nested Loop Join, Hash Join and Merge Join
But corresponding to each dataset and the index of the tables, Postgres will choose different algorithms when joining rather than fixing a single algorithm
For example, when joining 4 tables A, B, C and D together
- Suppose table A is small with about 100 rows joining with table B which has an index, Postgres may choose a Nested Loop Join
- After having the Intermediate Result, it joins with table C which has 100,000 rows, so it will choose a Hash Join because it can create a Hash Table in RAM
- Finally, to join with table D which has 1,000,000 rows, it chooses a Merge Join instead

Join order

When executing different Join commands, the Join order can also be different

INNER JOIN / FULL OUTER JOIN
- When using these Join queries, the join order of the tables is absolutely free
- Mathematically, this is an operation with commutative and associative properties like A x B x C = C x B x A, so no matter how Postgres changes the join order, the result remains unchanged
- The Query Planner of Postgres will calculate costs and assume scenarios in advance to select the scenario with the lowest cost to execute
- Its general rule is to always want to shrink data as early as possible. Therefore, it will choose the pair of tables whose join produces the fewest rows, meaning the smallest Intermediate Result to run first
LEFT JOIN / RIGHT JOIN
- These join queries do not have commutative properties, using A LEFT JOIN B gives a completely different result from B LEFT JOIN A.
- So when using them, Postgres will be bound logically and will not be able to automatically change the table join order
Combining multiple join types
- If your query uses a combination of multiple join types together, the execution order is the same as the theory mentioned above, the parts joining with INNER JOIN / FULL OUTER JOIN can have their order changed, while the places using LEFT JOIN / RIGHT JOIN cannot
- Therefore, in a case like this, you should bring the INNER JOINs to the top and leave the LEFT JOIN / RIGHT JOIN / FULL OUTER JOIN parts behind

GEQO (Genetic Query Optimizer)

When you only join 3-4 tables, the Query Planner can calculate all possible order combinations to find the option that costs the least.
However, if you join too many tables, by default from 12 tables or more, the number of order combinations increases significantly, making the Query Planner unable to operate efficiently because the amount of calculation is vast
At this time, Postgres will activate a smart algorithm called GEQO, which will randomly select a few table Join orders and combine them to find the most optimal options
This approach helps Postgres avoid calculating over an excessively large number of combinations while still finding a good enough join plan extremely fast without freezing the system during calculation.

Detail

First, create the tables as follows

CREATE TABLE shipping_methods (
    shipping_id INT PRIMARY KEY,
    provider_code VARCHAR(20) NOT NULL,
    service_name VARCHAR(50),
    is_active BOOLEAN DEFAULT TRUE
);

CREATE TABLE product_variants (
    product_id INT NOT NULL,
    sku TEXT NOT NULL,
    price DECIMAL(12, 2) NOT NULL,
    PRIMARY KEY (product_id, sku)
);

CREATE TABLE orders (
    order_id INT NOT NULL,
    shop_id INT NOT NULL,
    shipping_id INT NOT NULL,
    total_amount DECIMAL(12, 2),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (order_id, shop_id)
);

CREATE TABLE order_items (
    item_id SERIAL PRIMARY KEY,
    order_id INT NOT NULL,
    shop_id INT NOT NULL,
    product_id INT NOT NULL,
    sku TEXT NOT NULL,
    quantity INT NOT NULL
);

Then execute this query

SELECT 
    o.order_id,
    oi.item_id,
    v.sku,
    v.price,
    sm.service_name
FROM order_items oi
INNER JOIN orders o 
    ON oi.order_id = o.order_id AND oi.shop_id = o.shop_id
INNER JOIN shipping_methods sm 
    ON o.shipping_id = sm.shipping_id
INNER JOIN product_variants v 
    ON v.product_id = oi.product_id AND v.sku = oi.sku

Step 1: You can see that the query order I wrote is order_items x orders x shipping_methods x product_variants, but Postgres changed the order to orders x order_items x product_variants x shipping_methods
Step 2: First is a Seq Scan on table orders to create a Hash Table. You can see that the Actual Rows value is 66666.67 and Loops is 3, which means Postgres split the workload among 3 workers running concurrently to scan the table. The number 66666.67 is just the average value calculated from 200,000 / 3. In reality, each individual worker will scan a different integer number of rows.
Step 3: Then a Seq Scan on table order_items and using Hash Inner Join to compare according to Hash Cond: ((oi.order_id = o.order_id) AND (oi.shop_id = o.shop_id))
Step 4: The resulting Intermediate Result will be used to create a Hash Table
Step 5: Then another Seq Scan on table product_variants and using Hash Inner Join to compare according to Hash Cond: ((v.product_id = oi.product_id) AND (v.sku = oi.sku))
Step 6: Next is using Index Scan shipping_methods_pkey on table shipping_methods with Index Cond: (shipping_id = o.shipping_id), the result has only 1 row so it will continue to use Nested Loop Inner Join to join with the Intermediate Result created in step 5