Nested Loop Join

Introduction

In SQL and specifically in PostgreSQL, there are 4 types of Joins as follows:

  • INNER JOIN: Only retrieves records that have a match in both tables.
  • LEFT JOIN: Retrieves all records from the left table, and if there is no match in the right table, the values are set to NULL.
  • RIGHT JOIN: The opposite of LEFT JOIN, rarely used because it can be rewritten in reverse using LEFT JOIN.
  • FULL OUTER JOIN: Retrieves all records from both tables, filling with NULL where there is no match.

Nested Loop Join

  • When executing a join, the Postgres Optimizer automatically performs an analysis based on the datasets of the 2 tables to select the most efficient and suitable algorithm for the current situation
  • First, let us look at the Nested Loop Join. This is the most basic join algorithm, and its operational mechanism is very straightforward: it takes each item from one table to compare it with every item from the other table, acting exactly like two nested for loops in programming

Use cases

This algorithm is utilized and achieves its best efficiency in the following cases:

  • One table is very small: If a table has only 2-3 rows, PostgreSQL only needs to iterate through the loop 2-3 times.
  • One large table already has an Index: At this point, for each value that needs to be compared, Postgres uses the index to retrieve the data immediately without needing to loop through all items.

Out-of-Scope

If both tables are large, such as table A having 100,000 records, table B having 1,000,000 records and there is no Index:

  • At this point, if a standard loop is executed, it must perform 100,000 x 1,000,000 = 100,000,000,000 (100 billion) comparison operations
  • The system will freeze or run extremely slowly, so PostgreSQL will typically switch automatically to another algorithm such as a Hash Join or Merge Join.

Detail

First, let us create the tables as follows, which is an example used in E-commerce for a users table and an orders table

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    username VARCHAR(50) UNIQUE NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    phone VARCHAR(15) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE orders (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT REFERENCES users(id) NULL,
    status VARCHAR(20) NOT NULL,
    total_price DECIMAL(12, 2) NOT NULL,
    shipping_address TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Let us move on to the join queries first

-- Query 1
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
INNER JOIN orders ON users.id = orders.user_id;

-- Query 2
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
LEFT JOIN orders ON users.id = orders.user_id;

-- Query 3
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users 
RIGHT JOIN orders ON users.id = orders.user_id;

-- Query 4
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
FULL OUTER JOIN orders ON users.id = orders.user_id;
  • Query 1: Used to find orders that have a buyer name in the system, meaning user_id in the orders table must exist in the users table
  • Query 2: Shows orders by users in the system, including both users who have placed orders and users who have not, the result for users who have not placed an order will have (order_id, total_price) as NULL
  • Query 3: Shows all orders for payment calculation purposes along with the user if available, if user_id is NULL it means a guest user (a user not saved in the users table)
  • Query 4: Views a consolidated list including all users and orders, containing users/guest users who have or have not placed orders




Next, let us analyze this query to see how the Nested Loop join algorithm operates

EXPLAIN ANALYZE
SELECT users.username, orders.id AS order_id, orders.total_price
FROM users
JOIN orders ON users.id = orders.user_id and users.id = 10;

The result is as follows

  • First is the Seq Scan for the users table, because the record count is only 100, the Query Planner knows that using a direct Seq Scan is faster than using an Index, so it bypasses the existing primary key index, removing 99 rows to fetch only the 1 matching row
  • Next is the Seq Scan for the orders table, the reason being similar to the above, it simply filters out 989 rows to fetch 11 rows
  • Next, the Nested Loop is applied (because the dataset is relatively small) to join the 2 lists of data, when your data needing a join increases, Postgres will automatically utilize the Index and switch to another more suitable join algorithm

Happy coding!

See more articles here.

Comments

Popular posts from this blog

All Practice Series

Kubernetes Deployment for Zero Downtime

Deploying a NodeJS Server on Google Kubernetes Engine

Sitemap

React Practice Series

Setting up Kubernetes Dashboard with Kind

Monitoring with cAdvisor, Prometheus and Grafana on Docker

Helm for beginer - Deploy nginx to Google Kubernetes Engine

A Handy Guide to Using Dynamic Import in JavaScript

Installing PostgreSQL with Docker