GIN Index with JsonB
Introduction
In this article, we will continue exploring GIN Index with the JsonB data type.
- JsonB, short for JSON Binary, is a data type developed from the JSON data type and supported by PostgreSQL since version 9.2.
- The key difference between JSON and JsonB lies in how they are stored. JsonB supports binary storage and resolves the limitations of the JSON data type by optimizing the insert process and supporting indexing.
Creating an Index
When creating an index for a JsonB column, Postgres supports two strategies as follows: jsonb_ops: This is the default strategy that flattens the JSON to extract independent paths, keys and values to create Entries. For example, if you have a JSON like this:
{
"shop": "ShopA",
"products": [
{"product_name": "mouse", "price": 50},
{"product_name": "keyboard", "price": 100}
]
}
It will create a GIN Index as follows, note that Entries are stored as Hash values and the data structure is a B-tree:
| Entry (Hash value) | Posting List (TID) |
| ----------------------------------- | ------------------ |
| 'shop' (Key) | [10, ...] |
| 'ShopA' (Value) | [10, ...] |
| 'shop': 'ShopA' (Key-Value) | [10, ...] |
| 'products' (Array key) | [10, ...] |
| 'product_name' (Key) | [10, ...] |
| 'mouse' (Value) | [10, ...] |
| 'product_name': 'mouse' (Key-Value) | [10, ...] |
| 50 (Value) | [10, ...] |
You can see that using this method creates highly detailed Entries, so if your JSON data is large, the created index could be even larger than the data in the Heap.
jsonb_path_ops: As the name suggests, this solution only hashes the combined path from the root to a specific value, instead of extracting individual keys and values like jsonb_ops.
- Therefore, the number of created Entries will be significantly fewer, matching exactly the number of leaf nodes in the JSON, which saves storage volume.
- However, the limitation is that it cannot perform detailed filtering like
jsonb_ops, such as searching for a specific individual key.
With the JSON content above, when using jsonb_path_ops, the GIN Index will store data as follows:
| Entry (Hash value) | Posting List (TID) |
| ---------------------------------------- | ------------------ |
| 'shop' > 'ShopA' | [10, ...] |
| 'products' > 'product_name' > 'mouse' | [10, ...] |
| 'products' > 'price' > 50 | [10, ...] |
| 'products' > 'product_name' > 'keyboard' | [10, ...] |
| 'products' > 'price' > 100 | [10, ...] |
Detail
Let us create a table and indexes as follows:
CREATE TABLE product (
id SERIAL PRIMARY KEY,
name TEXT,
attributes JSONB,
tags_and_features JSONB
);
CREATE INDEX idx_product_attributes ON product USING gin (attributes jsonb_path_ops);
CREATE INDEX idx_product_tags_features ON product USING gin (tags_and_features);
Explanation of the related code: The attributes column is annotated to use the jsonb_path_ops index strategy to optimize capacity. The tags_and_features column is annotated to use the default jsonb_ops index strategy to optimize flexible search capabilities.
- By default, if you do not pass a function, it will use
jsonb_ops. - Meanwhile, using
jsonb_path_opswill limit search features, but the storage capacity will be significantly reduced because it only stores the path.
First, let us use jsonb_path_ops as follows, which only supports the Index with the @> operator, if you use it with other operators, the query will still return results but the Index will not be active:
SELECT id, name, attributes FROM product WHERE attributes @> '{"color": "red"}';
SELECT * FROM product WHERE attributes @> '{"dimensions": {"unit": "cm"}}';
SELECT * FROM product WHERE attributes @@ '$.dimensions.width > 20';
- Find products that have the color red.
- Find products with a nested attribute structure containing unit as cm inside dimensions.
- Find products with a width greater than 20 using the jsonb_path operator, the statement executes but does not use the index, because the GIN Index is created based on the JSON structure rather than just the single value of width, making comparison impossible.
Next is using jsonb_ops:
-- 1
SELECT * FROM product WHERE tags_and_features @> '{"tags": ["Rustic", "Handmade"]}';
-- 2
SELECT * FROM product WHERE tags_and_features @> '{"tags": ["Rustic"]}' OR tags_and_features @> '{"tags": ["Handmade"]}';
-- 3
SELECT * FROM product WHERE tags_and_features ? 'tags';
-- 4
SELECT * FROM product WHERE tags_and_features ?& ARRAY['rating', 'warranty_months'];
-- 5
SELECT * FROM product WHERE (tags_and_features->'tags') ? 'Rustic';
-- 6
SELECT * FROM product WHERE (tags_and_features->'tags') ?& ARRAY['Rustic', 'Handmade'];
-- 7
SELECT tags_and_features->>'tags' as tags FROM product;
-- 8
SELECT tags_and_features FROM product WHERE (tags_and_features->>'rating') > '2';
-- 9
SELECT tags_and_features->'tags' as tags FROM product;
-- 10
SELECT (tags_and_features->'tags')[1] as tag1 FROM product;
The first 2 queries will work with the Index
- Query 1: Finds products where the tags array contains both "Rustic" and "Handmade" values simultaneously
- Query 2: Finds products where the tags array contains either the "Rustic" value or the "Handmade" value
These 2 queries can also work with the Index because tags_and_features is at the outermost level, however, since the number of matching records equals the number in the Heap table, Postgres chose a Seq Scan instead
- Query 3: Checks and finds products where the JSON object contains a key named 'tags'
- Query 4: Finds products where the JSON object contains all the keys specified in the input array ('rating' and 'warranty_months')
The remaining queries will not use the Index because they do not directly utilize tags_and_features (which is the only thing indexed) but instead access internal fields (like tags or rating)
- Query 5: Extracts the array from the
tagskey and checks if that array contains the 'Rustic' element - Query 6: Extracts the
tagsarray and checks if that array simultaneously contains all elements in the['Rustic', 'Handmade']array - Query 7: Retrieves the
tagsfield withintags_and_features, note that the->>operator converts the retrieved value into theTEXTtype - Query 8: Compares with the
ratingfield- Since
tags_and_features->>'rating'is used, the retrieved value is ofTEXTtype, so the comparison value2must also be ofTEXTtype - If you use
tags_and_features->'rating', the retrieved value will be ofJSONBtype and cannot be compared
- Since
- Query 9: Using
tags_and_features->'tags'will extract aJSONBtype - Query 10: Because the data retrieved from
tags_and_features->'tags'isJSONB, it is possible to further access element1in the array
Happy coding!
Comments
Post a Comment