Join multiple table
Introduction
When joining multiple tables together (such as 3 tables or more), PostgreSQL never joins all tables at the same time, but at any single point in time, it can only join 2 datasets. Suppose you need to join multiple tables, the process of bridging steps will be performed as follows
- First, Postgres will select 2 tables to join together to create an Intermediate Result
- Postgres will treat the Intermediate Result as a completely new table and will use it to join with the next table to create the next Intermediate Result
- This process repeats over and over until all tables have been joined.
Algorithms used
- When joining multiple tables, Postgres still uses algorithms like Nested Loop Join, Hash Join and Merge Join
- But corresponding to each dataset and the index of the tables, Postgres will choose different algorithms when joining rather than fixing a single algorithm
- For example, when joining 4 tables A, B, C and D together
- Suppose table A is small with about 100 rows joining with table B which has an index, Postgres may choose a Nested Loop Join
- After having the Intermediate Result, it joins with table C which has 100,000 rows, so it will choose a Hash Join because it can create a Hash Table in RAM
- Finally, to join with table D which has 1,000,000 rows, it chooses a Merge Join instead
Join order
When executing different Join commands, the Join order can also be different
- INNER JOIN / FULL OUTER JOIN
- When using these Join queries, the join order of the tables is absolutely free
- Mathematically, this is an operation with commutative and associative properties like A x B x C = C x B x A, so no matter how Postgres changes the join order, the result remains unchanged
- The Query Planner of Postgres will calculate costs and assume scenarios in advance to select the scenario with the lowest cost to execute
- Its general rule is to always want to shrink data as early as possible. Therefore, it will choose the pair of tables whose join produces the fewest rows, meaning the smallest Intermediate Result to run first
- LEFT JOIN / RIGHT JOIN
- These join queries do not have commutative properties, using A LEFT JOIN B gives a completely different result from B LEFT JOIN A.
- So when using them, Postgres will be bound logically and will not be able to automatically change the table join order
- Combining multiple join types
- If your query uses a combination of multiple join types together, the execution order is the same as the theory mentioned above, the parts joining with INNER JOIN / FULL OUTER JOIN can have their order changed, while the places using LEFT JOIN / RIGHT JOIN cannot
- Therefore, in a case like this, you should bring the INNER JOINs to the top and leave the LEFT JOIN / RIGHT JOIN / FULL OUTER JOIN parts behind
GEQO (Genetic Query Optimizer)
- When you only join 3-4 tables, the Query Planner can calculate all possible order combinations to find the option that costs the least.
- However, if you join too many tables, by default from 12 tables or more, the number of order combinations increases significantly, making the Query Planner unable to operate efficiently because the amount of calculation is vast
- At this time, Postgres will activate a smart algorithm called GEQO, which will randomly select a few table Join orders and combine them to find the most optimal options
- This approach helps Postgres avoid calculating over an excessively large number of combinations while still finding a good enough join plan extremely fast without freezing the system during calculation.
Detail
First, create the tables as follows
CREATE TABLE shipping_methods (
shipping_id INT PRIMARY KEY,
provider_code VARCHAR(20) NOT NULL,
service_name VARCHAR(50),
is_active BOOLEAN DEFAULT TRUE
);
CREATE TABLE product_variants (
product_id INT NOT NULL,
sku TEXT NOT NULL,
price DECIMAL(12, 2) NOT NULL,
PRIMARY KEY (product_id, sku)
);
CREATE TABLE orders (
order_id INT NOT NULL,
shop_id INT NOT NULL,
shipping_id INT NOT NULL,
total_amount DECIMAL(12, 2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (order_id, shop_id)
);
CREATE TABLE order_items (
item_id SERIAL PRIMARY KEY,
order_id INT NOT NULL,
shop_id INT NOT NULL,
product_id INT NOT NULL,
sku TEXT NOT NULL,
quantity INT NOT NULL
);
Then execute this query
SELECT
o.order_id,
oi.item_id,
v.sku,
v.price,
sm.service_name
FROM order_items oi
INNER JOIN orders o
ON oi.order_id = o.order_id AND oi.shop_id = o.shop_id
INNER JOIN shipping_methods sm
ON o.shipping_id = sm.shipping_id
INNER JOIN product_variants v
ON v.product_id = oi.product_id AND v.sku = oi.sku- Step 1: You can see that the query order I wrote is
order_itemsxordersxshipping_methodsxproduct_variants, but Postgres changed the order toordersxorder_itemsxproduct_variantsxshipping_methods - Step 2: First is a Seq Scan on table
ordersto create a Hash Table. You can see that the Actual Rows value is 66666.67 and Loops is 3, which means Postgres split the workload among 3 workers running concurrently to scan the table. The number 66666.67 is just the average value calculated from 200,000 / 3. In reality, each individual worker will scan a different integer number of rows. - Step 3: Then a Seq Scan on table
order_itemsand usingHash Inner Jointo compare according toHash Cond: ((oi.order_id = o.order_id) AND (oi.shop_id = o.shop_id)) - Step 4: The resulting
Intermediate Resultwill be used to create aHash Table - Step 5: Then another Seq Scan on table
product_variantsand usingHash Inner Jointo compare according toHash Cond: ((v.product_id = oi.product_id) AND (v.sku = oi.sku)) - Step 6: Next is using
Index Scanshipping_methods_pkeyon tableshipping_methodswithIndex Cond: (shipping_id = o.shipping_id), the result has only 1 row so it will continue to useNested Loop Inner Jointo join with theIntermediate Resultcreated in step 5
Happy coding!
Comments
Post a Comment