Using EXISTS and NOT EXISTS

Introduction

The EXISTS clause is used to solve a single problem in data logic which is checking the existence of data based on a condition, without worrying about how many times it exists or what the specific value is.

How it works

  • Just like when using IN with EXISTS, Postgres will also automatically use Semi-Join algorithms (or Anti-Semi-Join for NOT EXISTS)
    • Semi Join (EXISTS): Returns rows from the main table (left table) if there is at least one matching row in the sub-table. The optimizer stops searching as soon as it finds the first matching row for each record of the main table.
    • Anti Join (NOT EXISTS): Contrary to Semi Join, it returns rows from the main table if no matching rows are found in the sub-table.
  • It still uses popular algorithms such as Nested Loop Semi Join, Hash Semi Join, Merge Semi Join with a Short-circuit mechanism, meaning that for each row value, it only needs to iterate until the first value is found, rather than finding all matching values like when using a regular Join
  • However, in some cases when the data volume in the Sub table is unique or Postgres has grouped the data for uniqueness, it can directly use the Join algorithm (Nested Loop Join, Hash Join, Merge Join)

Working with NULL

  • When using IN and NOT IN, it only cares about the value inside the row, which leads to incorrect logic when data contains NULL (as when used with NOT IN).
  • Meanwhile, EXISTS and NOT EXISTS only care about the existence of data, regardless of what data the row contains (even if it contains all NULL values), it is still considered to exist.

Use cases

  • Used in queries with complex logical conditions in the subquery rather than simply comparing a single column like IN
  • If the subquery result is too large and you use NOT IN, it can lead to executing a Seq Scan on both tables (if the sub table has NULL), so using EXISTS/NOT EXISTS will provide better performance
  • Using NOT EXISTS (also known as a Null-Safe operator) can replace NOT IN when used with a Dynamic List to avoid logical issues that return empty results caused by NULL.
  • Used with conditional Update / Delete statements to modify data in table A based on the state of table B

Prerequisites

The queries in this article will be reused from previous articles, you can review them if any information is missing

Detail

First, let's try the following queries

SELECT o.order_id, o.customer_id
FROM orders o
WHERE o.order_id NOT IN (
    SELECT oi.order_id
    FROM order_items oi
    WHERE oi.order_id IS NOT NULL
);

SELECT o.order_id, o.customer_id
FROM orders o
WHERE NOT EXISTS (
    SELECT 1
    FROM order_items oi
    WHERE oi.order_id = o.order_id
);

SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.order_id IS NULL;
  • You can see that these queries all give the same result, because when you query a Dynamic List with NOT IN, it can be rewritten using NOT EXISTS and LEFT/RIGHT JOIN to give the same result
  • However, when using NOT EXISTS and LEFT/RIGHT JOIN, it can efficiently use a Hash Anti Join, while using NOT IN uses a Seq Scan because it does not work well with NULL values and must perform a detailed check on each row





Let's move on to the next queries

-- Query 1
SELECT o.order_id, o.customer_id, o.order_date
FROM orders o
WHERE EXISTS (
    SELECT 1 
    FROM order_items oi 
    WHERE oi.order_id = o.order_id 
      AND oi.product_id = 100
);

-- Query 2
SELECT o.order_id, o.customer_id
FROM orders o
WHERE EXISTS (
    SELECT 1 
    FROM order_items oi 
    WHERE oi.order_id = o.order_id 
      AND oi.price_per_unit > 400.00
);

-- Query 3
SELECT o.order_id, o.status
FROM orders o
WHERE EXISTS (
    SELECT 1 
    FROM order_items oi 
    WHERE oi.order_id = o.order_id
);
  • Query 1: Find orders that purchased product ID 100
    • You can see that it uses a Nested Loop Inner Join because when using a Seq Scan on table order_items with the condition product_id = 100, only 3 rows are retrieved
    • With such a small amount, Postgres can group the data to ensure this list is unique, so there is no need to use a Semi Join
    • Because if a Semi Join is chosen, Postgres always has to check to ensure that each value is checked only once, which is not as efficient as using a regular Join if the data is small
    • Therefore, Postgres will find a way to ensure the comparison list only includes unique values to use a Join if possible
  • Query 2: Find orders that purchased products worth over 400 with Merge Semi Join
  • Query 3: Find orders that actually purchased products
    • Because the orders table has fewer rows than the order_items table, it is selected by Postgres to create a Hash Table
    • Therefore, the orders table is shifted to the right to become the Build table, which is called a Hash Right Semi Join




Happy coding!

See more articles here.

Comments

Popular posts from this blog

All Practice Series

Kubernetes Deployment for Zero Downtime

Deploying a NodeJS Server on Google Kubernetes Engine

Sitemap

React Practice Series

A Handy Guide to Using Dynamic Import in JavaScript

Helm for beginer - Deploy nginx to Google Kubernetes Engine

DevOps Practice Series

Docker Practice Series

Setting up Kubernetes Dashboard with Kind