Using IN and NOT IN

Introduction

The IN clause is usually used in two main cases:

Literal Values: When you already know the exact values to filter and the number of elements is small (from a few elements to a few dozen elements).
- Note that when parsing, Postgres will treat each item in Literal Values as a single node to execute the Parse Tree, so if the number of items is too large, it will consume a lot of RAM and CPU to process
- By the time planning is executed, the Query Planner will have to traverse this Tree to process, so you should be careful not to query with too many Literal Values as it will significantly affect performance
- It can be rewritten with EXISTS but it will make the query more complex, so if using Literal Values, it is best to use IN to keep the query simple and easy to understand
- An example of usage is filtering out orders with statuses like processing or completed.
Dynamic List: When the filter list depends on the data of another table.

In most cases, it can be replaced by using EXISTS or LEFT/RIGHT JOIN to process more efficiently and accurately
An example of usage is finding all customers who have purchased products belonging to the best-selling categories.

Algorithms Used

When using IN with a Subquery, the Postgres Query Planner will usually perform a technique called Subquery Flattening to convert the IN syntax into a Semi-Join.
Unlike a Join which retrieves all matching pairs and can duplicate rows if the sub-table has multiple matching rows, a Semi-Join only checks for the existence of a matching row in the sub-table, if found, it immediately selects the row from the main table and stops, ensuring the main table never duplicates rows.
Therefore, when used with a Semi-Join, the maximum number of rows received is equal to the main table instead of {main table x sub table} as when using a normal Join

Depending on the data size and Index, Postgres will choose one of the following three Semi-Join algorithms:

Hash Semi Join
- This is the most commonly used algorithm when the sub-table (the table in the Subquery) has a medium or large size and lacks a suitable index, but the data can fit into memory (work_mem).
- Postgres scans the table in the Subquery and hashes the values to be checked into a Hash Table in RAM.
- After that, it scans the main table, also takes the value of the column to be compared to generate a Hash Number and compares it with the Hash Table
- If there is a match, it returns the row result in the main table immediately, without needing to continue searching for other matching values like a standard Hash Join
Nested Loop with Index Scan
- Best used when the table in the Subquery is small, OR the main table is large and has an Index on the column to be used.
- First, Postgres will retrieve the list of values from the Subquery.
- Then it will take each value to search in the Index of the main table to get the full data
Merge Semi Join
- Best used when the joining columns of both the main table and the sub-table are already sorted, such as when an index exists
- Postgres will run two pointers in parallel on the two sorted tables to compare. This algorithm is extremely efficient in saving CPU and RAM because it does not have to create a hash table.

NOT IN

Contrary to IN (which is optimized into a Semi-Join), when you use NOT IN with a Subquery, PostgreSQL will usually convert it into an operation called an Anti-Semi-Join (or Anti-Join for short).
The Anti-Join operation works on the opposite principle of a Semi-Join to search for the absence of data, meaning that if any matching row is found in the child table, it will immediately discard that corresponding row in the parent table (only retrieving rows from the main table that DO NOT FIND any matching rows in the sub-table).
Postgres will check each row of the main table to match against the list from the Subquery. If that row completely does not appear in the list of the Subquery, then it is kept to return the result.

Limitations with NULL

In SQL, NULL is not a value, it represents Unknown. Therefore, any comparison with NULL (such as = NULL or != NULL) returns a result of Unknown.
If your Subquery statement returns even a single row with a NULL value, the logic of NOT IN will turn into WHERE id != 1 AND id != 2 AND id != NULL
Because id != NULL returns Unknown, the entire WHERE clause expression will be Unknown. As a result, your NOT IN statement will return 0 rows (completely empty), regardless of how much data the main table has.
A specific example is as follows
- When using IN, a query like Select id from table WHERE id IN (1, NULL) will operate like this Select id from table WHERE id = 1 OR id = NULL, using OR only requires one true condition, so even when comparing with NULL, results can still be obtained
- When using NOT IN, a query like Select id from table WHERE id NOT IN (1, NULL) will operate like this Select id from table WHERE id != 1 AND id != NULL, using AND requires all conditions to be true, so just one comparison id != NULL will result in Unknown, leading to no matching results
Because of this complex NULL handling behavior, the Query Planner encounters many difficulties when using NOT IN, so you need to be careful when using it to avoid performance bugs and skewed results
- If the column in the Subquery has a NOT NULL constraint: Postgres will use Hash Anti Join or Merge Anti Join so it can operate efficiently.
- If the column in the Subquery is Nullable: Postgres must switch to using a Sequential Scan (similar to a Nested Loop) to check each row, however, even with the comparison col != NULL, the result is Unknown, so in the end this query runs slowly and the result is also incorrect
You can leverage the logical characteristic that NOT IN will return empty if the Subquery contains a NULL value for Data Auditing to alert on system data integrity errors if any row in the sub-table is found to be NULL

Solutions

Using IN
- If using Literal Values with a large amount of data, you should switch to using it in combination with an Array or Temporary table and then join
- If using a Dynamic List, it can be replaced with EXISTS which will work more efficiently with NULL values
Using NOT IN
- Must eliminate NULL right in the subquery
- Switch to using NOT EXISTS because it can guarantee accurate processing even with NULL values

Detail

First, create the table as follows

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT NOT NULL,
    order_date TIMESTAMP NOT NULL,
    status VARCHAR(50) NOT NULL
);

CREATE TABLE order_items (
    item_id INT PRIMARY KEY,
    order_id INT REFERENCES orders(order_id),
    product_id INT NOT NULL,
    quantity INT NOT NULL,
    price_per_unit NUMERIC(12,2) NOT NULL
);

Next is the form of using Literal Values, which is most commonly used with IN

SELECT order_id 
FROM order_items 
WHERE product_id IN (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)

SELECT order_id 
FROM order_items 
WHERE product_id = ANY ('{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}'::int[]);

When you analyze both queries above, you will see that they execute exactly the same, but using ANY is better for the following reasons

Maximize the use of Prepared Statements to avoid SQL Injection and Plan Re-forcing
- If using the format IN ($1, $2, $3), every time the array length changes, Postgres considers it a completely new query. It will consume CPU to re-parse the SQL, analyze and rebuild the Execution Plan from scratch. This wastes system resources when the query frequency is high.
- If using ANY: The query structure is always fixed with a single parameter (which is the array entity). Whether the passed array has 2 elements or 1,000 elements, the sent SQL string does not change. Postgres only parses exactly once the first time, stores the Plan in the Prepared Statement Cache and reuses it for subsequent times. The response speed will be faster and more stable.
Clean code, easy to handle Logic at the Backend when working with a Dynamic Array:
- If using IN: You have to write code to concatenate strings, join commas (?,?,?) or transform arrays into dynamic strings, which is very troublesome and prone to security vulnerabilities if not handled carefully.
- If using ANY: You just need to directly use the Array variable into the Postgres driver safely and professionally.

Next is using a Dynamic List

-- Query 1
SELECT order_id, customer_id, status 
FROM orders 
WHERE order_id IN (
    SELECT order_id 
    FROM order_items 
    WHERE product_id IN (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
);

-- Query 2
SELECT o.order_id, o.status
FROM orders o
WHERE o.customer_id = 99
  AND o.order_id IN (
      SELECT oi.order_id 
      FROM order_items oi
      WHERE oi.order_id IS NOT NULL
  );

-- Query 3
SELECT o.order_id, o.status 
FROM orders o
WHERE o.order_id NOT IN (NULL)

Query 1: Used to find orders that purchased the fixed products passed in
- You can see the algorithm used is Hash Semi Join
- First, it will run a Seq Scan for the table order_items to create a Hash Table
- Next is a Seq Scan on table orders and then compare according to Hash Cond: (orders.order_id = order_items.order_id)
Query 2: Find orders that purchased products of a specific customer
- The algorithm used is Nested Loop Semi Join, using WHERE oi.order_id IS NOT NULL is precisely to eliminate NULL right in the Sub-query
- First, it runs a Bitmap Index Scan with idx_orders_customer_date to create a Bitmap, then runs a Bitmap Heap Scan to get data from table orders
- Then it uses that data to check with the Index Only Scan of table order_items and retrieves the corresponding rows
Query 3: This is to show you the limitation of using NOT IN
- According to the understanding of the syntax, we might expect from this query to get all orders where order_id is not in the NULL data set, meaning we want to get orders where order_id != NULL
- However, when you query like that, the result will be empty because as I explained above, the comparison order_id != NULL is Unknown
- To compare with NULL, you can only use IS NULL or IS NOT NULL