Join with Multiple Column Comparisons

Introduction

When performing a table join that compares multiple columns simultaneously (such as ON A.col1 = B.col1 AND A.col2 = B.col2), PostgreSQL will still use familiar algorithms like Hash Join, Merge Join or Nested Loop Join. However, the way these algorithms process multiple columns involves very distinctive strategies as follows.

Hash Join

  • This is usually the number one choice for Postgres when joining multiple columns on large tables. Instead of hashing each individual column, Postgres performs string concatenation to hash the combination of all those columns at the same time.
  • Build Phase: Postgres takes the values of all participating join columns and groups them together (like combining value = col1 + col2 + col3), then inputs this entire combined string into the Hash Function to calculate the Hash Number to be placed into the Bucket.
  • Probe Phase: Next, it also combines the corresponding columns of the remaining table to hash similarly and then compares it with the Hash Table.
  • Performance will be extremely fast because joining 3 columns or 5 columns has almost no difference in hashing and lookup speed compared to joining 1 column, because they all ultimately resolve to a single hash code in RAM.

Merge Join

  • To use Merge Join for multiple columns, Postgres must sort both tables according to the priority from left to right of the join columns. This is called multi-level sorting, starting from the first column and if there are duplicate values in this column, it will sort the corresponding values in the next column.
  • When running the algorithm, two pointers will compare the first column first. If they are equal, it then compares the second and third columns to decide which pointer to move next.
  • Merge Join operates most efficiently when a Composite Index is already available for the columns used in the join, in which case Postgres does not have to spend CPU overhead to sort but only needs to let the pointer run sequentially for comparison.

Nested Loop Join

  • Nested Loop Join processes multiple columns in a sequential, step-by-step manner where Postgres takes each item from table A to compare with the index in table B.
  • If there is an Index on only 1 column, Postgres will use the Index to retrieve all corresponding rows according to that column first, then perform a secondary condition check (on the remaining columns that do not have an index) manually on these rows.
  • If there is a Composite Index on all necessary join columns, Postgres can immediately retrieve the corresponding row in the table without needing to filter manually for the other columns.

Effective Usage

  • Thus, for joining with multiple column filters to perform best, you still need to create an additional Composite Index for the join columns, which is effective for both Nested Loop Join and Merge Join.
  • Note that if you create a separate index for each column, it will only be effective when filtering for those columns individually, so if you filter multiple columns, only the index of the first column can be used and then you must also filter manually for the remaining columns.

Detail

First, let us create the table and the Indexes as follows:

CREATE TABLE product_variants (
    product_id INT NOT NULL,
    sku TEXT,
    color VARCHAR(30),
    size VARCHAR(10),
    price DECIMAL(12, 2) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE inventory_status (
    product_id INT NOT NULL,
    sku TEXT,
    warehouse_id INT NOT NULL,
    stock_quantity INT NOT NULL DEFAULT 0,
    last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE INDEX idx_variants_prod_sku ON product_variants (product_id, sku);
CREATE INDEX idx_inventory_prod_sku ON inventory_status (product_id, sku);

Next are the queries as follows:

-- Query 1
SELECT 
    v.product_id, 
    v.sku, 
    v.price, 
    i.warehouse_id, 
    i.stock_quantity
FROM product_variants v
INNER JOIN inventory_status i 
    ON v.product_id = i.product_id AND v.sku = i.sku;

-- Query 2
SELECT 
    v.product_id, 
    v.sku, 
    v.price, 
    i.stock_quantity
FROM product_variants v
INNER JOIN inventory_status i 
    ON v.product_id = i.product_id AND v.sku = i.sku
WHERE v.product_id IN (105, 206, 307);

-- Query 3
SELECT 
    v.product_id, 
    v.sku, 
    v.price, 
    i.stock_quantity
FROM product_variants v
INNER JOIN inventory_status i 
    ON v.product_id = i.product_id AND v.sku = i.sku
LIMIT 100 OFFSET 0
  • Query 1: Retrieve information about products currently in stock
    • Because the amount of data retrieved will be very large, Postgres will prioritize choosing Hash Join.
    • The process will involve using Seq Scan for the table product_variants to create the Hash Table and the table inventory_status will rely on that for comparison.
    • You can see in the Hash Cond section that both columns (product_id, sku) will be used to create the Hash table.
  • Query 2: Retrieve information for some specific products currently in stock
    • Since the amount of data retrieved is not large, Postgres chooses Nested Loop Inner Join.
    • Combined with using 2 Composite Indexes which are idx_variants_prod_sku and idx_inventory_prod_sku.
  • Query 3: Retrieve information about products currently in stock, using LIMIT 100 OFFSET 0 for pagination functionality
    • You can see that Postgres uses 2 Composite Indexes idx_variants_prod_sku and idx_inventory_prod_sku which already have sorted data for the 2 columns (product_id, sku) used in the Join.
    • Because the amount of data retrieved is moderate, Merge Inner Join is selected, rather than spending cost to create a Hash Table compared to a Hash Index.



Happy coding!

See more articles here.

Comments

Popular posts from this blog

All Practice Series

Kubernetes Deployment for Zero Downtime

Deploying a NodeJS Server on Google Kubernetes Engine

Sitemap

React Practice Series

A Handy Guide to Using Dynamic Import in JavaScript

Helm for beginer - Deploy nginx to Google Kubernetes Engine

DevOps Practice Series

Docker Practice Series

Setting up Kubernetes Dashboard with Kind