Using EXISTS and NOT EXISTS
Introduction
The EXISTS clause is used to solve a single problem in data logic which is checking the existence of data based on a condition, without worrying about how many times it exists or what the specific value is.
How it works
- Just like when using
INwithEXISTS, Postgres will also automatically useSemi-Joinalgorithms (orAnti-Semi-JoinforNOT EXISTS)Semi Join (EXISTS): Returns rows from the main table (left table) if there is at least one matching row in the sub-table. The optimizer stops searching as soon as it finds the first matching row for each record of the main table.Anti Join (NOT EXISTS): Contrary toSemi Join, it returns rows from the main table if no matching rows are found in the sub-table.
- It still uses popular algorithms such as
Nested Loop Semi Join, Hash Semi Join, Merge Semi Joinwith aShort-circuitmechanism, meaning that for each row value, it only needs to iterate until the first value is found, rather than finding all matching values like when using a regularJoin - However, in some cases when the data volume in the
Sub tableis unique orPostgreshas grouped the data for uniqueness, it can directly use theJoinalgorithm (Nested Loop Join, Hash Join, Merge Join)
Working with NULL
- When using
INandNOT IN, it only cares about the value inside the row, which leads to incorrect logic when data contains NULL (as when used with NOT IN). - Meanwhile,
EXISTSandNOT EXISTSonly care about the existence of data, regardless of what data the row contains (even if it contains all NULL values), it is still considered to exist.
Use cases
- Used in queries with complex logical conditions in the subquery rather than simply comparing a single column like IN
- If the subquery result is too large and you use NOT IN, it can lead to executing a Seq Scan on both tables (if the sub table has NULL), so using EXISTS/NOT EXISTS will provide better performance
- Using NOT EXISTS (also known as a Null-Safe operator) can replace NOT IN when used with a Dynamic List to avoid logical issues that return empty results caused by NULL.
- Used with conditional Update / Delete statements to modify data in table A based on the state of table B
Prerequisites
The queries in this article will be reused from previous articles, you can review them if any information is missing
Detail
First, let's try the following queries
SELECT o.order_id, o.customer_id
FROM orders o
WHERE o.order_id NOT IN (
SELECT oi.order_id
FROM order_items oi
WHERE oi.order_id IS NOT NULL
);
SELECT o.order_id, o.customer_id
FROM orders o
WHERE NOT EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.order_id = o.order_id
);
SELECT o.order_id, o.customer_id
FROM orders o
LEFT JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.order_id IS NULL;
- You can see that these queries all give the same result, because when you query a
Dynamic ListwithNOT IN, it can be rewritten usingNOT EXISTSandLEFT/RIGHT JOINto give the same result - However, when using
NOT EXISTSandLEFT/RIGHT JOIN, it can efficiently use aHash Anti Join, while usingNOT INuses aSeq Scanbecause it does not work well withNULLvalues and must perform a detailed check on each row
Let's move on to the next queries
-- Query 1
SELECT o.order_id, o.customer_id, o.order_date
FROM orders o
WHERE EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.order_id = o.order_id
AND oi.product_id = 100
);
-- Query 2
SELECT o.order_id, o.customer_id
FROM orders o
WHERE EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.order_id = o.order_id
AND oi.price_per_unit > 400.00
);
-- Query 3
SELECT o.order_id, o.status
FROM orders o
WHERE EXISTS (
SELECT 1
FROM order_items oi
WHERE oi.order_id = o.order_id
);
- Query 1: Find orders that purchased product ID
100- You can see that it uses a
Nested Loop Inner Joinbecause when using aSeq Scanon tableorder_itemswith the conditionproduct_id = 100, only 3 rows are retrieved - With such a small amount, Postgres can group the data to ensure this list is unique, so there is no need to use a
Semi Join - Because if a
Semi Joinis chosen, Postgres always has to check to ensure that each value is checked only once, which is not as efficient as using a regularJoinif the data is small - Therefore, Postgres will find a way to ensure the comparison list only includes unique values to use a
Joinif possible
- You can see that it uses a
- Query 2: Find orders that purchased products worth over
400withMerge Semi Join - Query 3: Find orders that actually purchased products
- Because the
orderstable has fewer rows than theorder_itemstable, it is selected by Postgres to create aHash Table - Therefore, the
orderstable is shifted to the right to become theBuild table, which is called aHash Right Semi Join
- Because the
Comments
Post a Comment