Using IN and NOT IN
Introduction
The IN clause is usually used in two main cases:
Literal Values: When you already know the exact values to filter and the number of elements is small (from a few elements to a few dozen elements).- Note that when parsing, Postgres will treat each item in Literal Values as a single node to execute the Parse Tree, so if the number of items is too large, it will consume a lot of RAM and CPU to process
- By the time planning is executed, the Query Planner will have to traverse this Tree to process, so you should be careful not to query with too many Literal Values as it will significantly affect performance
- It can be rewritten with
EXISTSbut it will make the query more complex, so if usingLiteral Values, it is best to useINto keep the query simple and easy to understand - An example of usage is filtering out orders with statuses like
processingorcompleted.
Dynamic List: When the filter list depends on the data of another table.- In most cases, it can be replaced by using
EXISTSorLEFT/RIGHT JOINto process more efficiently and accurately - An example of usage is finding all customers who have purchased products belonging to the best-selling categories.
Algorithms Used
- When using
INwith aSubquery, the PostgresQuery Plannerwill usually perform a technique calledSubquery Flatteningto convert the IN syntax into aSemi-Join. - Unlike a
Joinwhich retrieves all matching pairs and can duplicate rows if the sub-table has multiple matching rows, aSemi-Joinonly checks for the existence of a matching row in the sub-table, if found, it immediately selects the row from the main table and stops, ensuring the main table never duplicates rows. - Therefore, when used with a Semi-Join, the maximum number of rows received is equal to the
main tableinstead of{main table x sub table}as when using a normalJoin
Depending on the data size and Index, Postgres will choose one of the following three Semi-Join algorithms:
Hash Semi Join- This is the most commonly used algorithm when the sub-table (the table in the Subquery) has a medium or large size and lacks a suitable index, but the data can fit into memory (work_mem).
- Postgres scans the table in the Subquery and hashes the values to be checked into a Hash Table in RAM.
- After that, it scans the main table, also takes the value of the column to be compared to generate a Hash Number and compares it with the Hash Table
- If there is a match, it returns the row result in the main table immediately, without needing to continue searching for other matching values like a standard Hash Join
Nested LoopwithIndex Scan- Best used when the table in the Subquery is small, OR the main table is large and has an Index on the column to be used.
- First, Postgres will retrieve the list of values from the
Subquery. - Then it will take each value to search in the Index of the main table to get the full data
Merge Semi Join- Best used when the joining columns of both the main table and the sub-table are already sorted, such as when an index exists
- Postgres will run two pointers in parallel on the two sorted tables to compare. This algorithm is extremely efficient in saving CPU and RAM because it does not have to create a hash table.
NOT IN
- Contrary to
IN(which is optimized into aSemi-Join), when you useNOT INwith a Subquery,PostgreSQLwill usually convert it into an operation called anAnti-Semi-Join(orAnti-Joinfor short). - The
Anti-Joinoperation works on the opposite principle of aSemi-Jointo search for the absence of data, meaning that if any matching row is found in the child table, it will immediately discard that corresponding row in the parent table (only retrieving rows from the main table that DO NOT FIND any matching rows in the sub-table). Postgreswill check each row of the main table to match against the list from theSubquery. If that row completely does not appear in the list of theSubquery, then it is kept to return the result.
Limitations with NULL
- In
SQL,NULLis not a value, it representsUnknown. Therefore, any comparison withNULL(such as = NULL or != NULL) returns a result ofUnknown. - If your Subquery statement returns even a single row with a
NULLvalue, the logic of NOT IN will turn intoWHERE id != 1 AND id != 2 AND id != NULL - Because
id != NULLreturnsUnknown, the entireWHEREclause expression will beUnknown. As a result, yourNOT INstatement will return 0 rows (completely empty), regardless of how much data the main table has. - A specific example is as follows
- When using
IN, a query likeSelect id from table WHERE id IN (1, NULL)will operate like thisSelect id from table WHERE id = 1 OR id = NULL, usingORonly requires one true condition, so even when comparing withNULL, results can still be obtained - When using
NOT IN, a query likeSelect id from table WHERE id NOT IN (1, NULL)will operate like thisSelect id from table WHERE id != 1 AND id != NULL, usingANDrequires all conditions to be true, so just one comparisonid != NULLwill result inUnknown, leading to no matching results
- When using
- Because of this complex
NULLhandling behavior, theQuery Plannerencounters many difficulties when usingNOT IN, so you need to be careful when using it to avoid performance bugs and skewed results- If the column in the
Subqueryhas aNOT NULLconstraint: Postgres will useHash Anti JoinorMerge Anti Joinso it can operate efficiently. - If the column in the
SubqueryisNullable: Postgres must switch to using aSequential Scan(similar to aNested Loop) to check each row, however, even with the comparisoncol != NULL, the result isUnknown, so in the end this query runs slowly and the result is also incorrect
- If the column in the
- You can leverage the logical characteristic that NOT IN will return empty if the Subquery contains a NULL value for Data Auditing to alert on system data integrity errors if any row in the sub-table is found to be NULL
Solutions
- Using
IN- If using
Literal Valueswith a large amount of data, you should switch to using it in combination with anArrayorTemporary tableand then join - If using a
Dynamic List, it can be replaced withEXISTSwhich will work more efficiently withNULLvalues
- If using
- Using
NOT IN- Must eliminate
NULLright in the subquery - Switch to using
NOT EXISTSbecause it can guarantee accurate processing even withNULLvalues
- Must eliminate
Detail
First, create the table as follows
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT NOT NULL,
order_date TIMESTAMP NOT NULL,
status VARCHAR(50) NOT NULL
);
CREATE TABLE order_items (
item_id INT PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT NOT NULL,
quantity INT NOT NULL,
price_per_unit NUMERIC(12,2) NOT NULL
);
Next is the form of using Literal Values, which is most commonly used with IN
SELECT order_id
FROM order_items
WHERE product_id IN (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
SELECT order_id
FROM order_items
WHERE product_id = ANY ('{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}'::int[]);
When you analyze both queries above, you will see that they execute exactly the same, but using ANY is better for the following reasons
- Maximize the use of
Prepared Statementsto avoidSQL Injectionand Plan Re-forcing- If using the format
IN ($1, $2, $3), every time the array length changes, Postgres considers it a completely new query. It will consume CPU to re-parse the SQL, analyze and rebuild theExecution Planfrom scratch. This wastes system resources when the query frequency is high. - If using
ANY: The query structure is always fixed with a single parameter (which is the array entity). Whether the passed array has 2 elements or 1,000 elements, the sent SQL string does not change. Postgres only parses exactly once the first time, stores the Plan in thePrepared Statement Cacheand reuses it for subsequent times. The response speed will be faster and more stable.
- If using the format
- Clean code, easy to handle Logic at the Backend when working with a
Dynamic Array:- If using
IN: You have to write code to concatenate strings, join commas (?,?,?) or transform arrays into dynamic strings, which is very troublesome and prone to security vulnerabilities if not handled carefully. - If using
ANY: You just need to directly use theArrayvariable into thePostgresdriver safely and professionally.
- If using
Next is using a Dynamic List
-- Query 1
SELECT order_id, customer_id, status
FROM orders
WHERE order_id IN (
SELECT order_id
FROM order_items
WHERE product_id IN (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
);
-- Query 2
SELECT o.order_id, o.status
FROM orders o
WHERE o.customer_id = 99
AND o.order_id IN (
SELECT oi.order_id
FROM order_items oi
WHERE oi.order_id IS NOT NULL
);
-- Query 3
SELECT o.order_id, o.status
FROM orders o
WHERE o.order_id NOT IN (NULL)
- Query 1: Used to find orders that purchased the fixed products passed in
- You can see the algorithm used is
Hash Semi Join - First, it will run a Seq Scan for the table
order_itemsto create a Hash Table - Next is a Seq Scan on table
ordersand then compare according toHash Cond: (orders.order_id = order_items.order_id)
- You can see the algorithm used is
- Query 2: Find orders that purchased products of a specific customer
- The algorithm used is
Nested Loop Semi Join, usingWHERE oi.order_id IS NOT NULLis precisely to eliminateNULLright in the Sub-query - First, it runs a
Bitmap Index Scanwithidx_orders_customer_dateto create a Bitmap, then runs aBitmap Heap Scanto get data from tableorders - Then it uses that data to check with the
Index Only Scanof tableorder_itemsand retrieves the corresponding rows
- The algorithm used is
- Query 3: This is to show you the limitation of using
NOT IN- According to the understanding of the syntax, we might expect from this query to get all
orderswhereorder_idis not in theNULLdata set, meaning we want to get orders whereorder_id != NULL - However, when you query like that, the result will be empty because as I explained above, the comparison
order_id != NULLisUnknown - To compare with
NULL, you can only useIS NULLorIS NOT NULL
- According to the understanding of the syntax, we might expect from this query to get all
Happy coding!
Comments
Post a Comment