GIN Index with JsonB

Introduction

In this article, we will continue exploring GIN Index with the JsonB data type.

JsonB, short for JSON Binary, is a data type developed from the JSON data type and supported by PostgreSQL since version 9.2.
The key difference between JSON and JsonB lies in how they are stored. JsonB supports binary storage and resolves the limitations of the JSON data type by optimizing the insert process and supporting indexing.

Creating an Index

When creating an index for a JsonB column, Postgres supports two strategies as follows: jsonb_ops: This is the default strategy that flattens the JSON to extract independent paths, keys and values to create Entries. For example, if you have a JSON like this:

{
  "shop": "ShopA",
  "products": [
    {"product_name": "mouse", "price": 50},
    {"product_name": "keyboard", "price": 100}
  ]
}

It will create a GIN Index as follows, note that Entries are stored as Hash values and the data structure is a B-tree:

| Entry (Hash value)                  | Posting List (TID) |
| ----------------------------------- | ------------------ |
| 'shop' (Key)                        | [10, ...]          |
| 'ShopA' (Value)                     | [10, ...]          |
| 'shop': 'ShopA' (Key-Value)         | [10, ...]          |
| 'products' (Array key)              | [10, ...]          |
| 'product_name' (Key)                | [10, ...]          |
| 'mouse' (Value)                     | [10, ...]          |
| 'product_name': 'mouse' (Key-Value) | [10, ...]          |
| 50 (Value)                          | [10, ...]          |

You can see that using this method creates highly detailed Entries, so if your JSON data is large, the created index could be even larger than the data in the Heap.

jsonb_path_ops: As the name suggests, this solution only hashes the combined path from the root to a specific value, instead of extracting individual keys and values like jsonb_ops.

Therefore, the number of created Entries will be significantly fewer, matching exactly the number of leaf nodes in the JSON, which saves storage volume.
However, the limitation is that it cannot perform detailed filtering like jsonb_ops, such as searching for a specific individual key.

With the JSON content above, when using jsonb_path_ops, the GIN Index will store data as follows:

| Entry (Hash value)                       | Posting List (TID) |
| ---------------------------------------- | ------------------ |
| 'shop' > 'ShopA'                         | [10, ...]          |
| 'products' > 'product_name' > 'mouse'    | [10, ...]          |
| 'products' > 'price' > 50                | [10, ...]          |
| 'products' > 'product_name' > 'keyboard' | [10, ...]          |
| 'products' > 'price' > 100               | [10, ...]          |

Detail

Let us create a table and indexes as follows:

CREATE TABLE product (
    id SERIAL PRIMARY KEY,
    name TEXT,
    attributes JSONB,
    tags_and_features JSONB
);

CREATE INDEX idx_product_attributes ON product USING gin (attributes jsonb_path_ops);
CREATE INDEX idx_product_tags_features ON product USING gin (tags_and_features);

Explanation of the related code: The attributes column is annotated to use the jsonb_path_ops index strategy to optimize capacity. The tags_and_features column is annotated to use the default jsonb_ops index strategy to optimize flexible search capabilities.

By default, if you do not pass a function, it will use jsonb_ops.
Meanwhile, using jsonb_path_ops will limit search features, but the storage capacity will be significantly reduced because it only stores the path.

First, let us use jsonb_path_ops as follows, which only supports the Index with the @> operator, if you use it with other operators, the query will still return results but the Index will not be active:

SELECT id, name, attributes FROM product WHERE attributes @> '{"color": "red"}';
SELECT * FROM product WHERE attributes @> '{"dimensions": {"unit": "cm"}}';
SELECT * FROM product WHERE attributes @@ '$.dimensions.width > 20';

Find products that have the color red.
Find products with a nested attribute structure containing unit as cm inside dimensions.
Find products with a width greater than 20 using the jsonb_path operator, the statement executes but does not use the index, because the GIN Index is created based on the JSON structure rather than just the single value of width, making comparison impossible.

Next is using jsonb_ops:

-- 1
SELECT * FROM product WHERE tags_and_features @> '{"tags": ["Rustic", "Handmade"]}';

-- 2
SELECT * FROM product WHERE tags_and_features @> '{"tags": ["Rustic"]}' OR tags_and_features @> '{"tags": ["Handmade"]}';

-- 3
SELECT * FROM product WHERE tags_and_features ? 'tags';

-- 4
SELECT * FROM product WHERE tags_and_features ?& ARRAY['rating', 'warranty_months'];

-- 5
SELECT * FROM product WHERE (tags_and_features->'tags') ? 'Rustic';

-- 6
SELECT * FROM product WHERE (tags_and_features->'tags') ?& ARRAY['Rustic', 'Handmade'];

-- 7
SELECT tags_and_features->>'tags' as tags FROM product;

-- 8
SELECT tags_and_features FROM product WHERE (tags_and_features->>'rating') > '2';

-- 9
SELECT tags_and_features->'tags' as tags FROM product;

-- 10
SELECT (tags_and_features->'tags')[1] as tag1 FROM product;

The first 2 queries will work with the Index

Query 1: Finds products where the tags array contains both "Rustic" and "Handmade" values simultaneously
Query 2: Finds products where the tags array contains either the "Rustic" value or the "Handmade" value

These 2 queries can also work with the Index because tags_and_features is at the outermost level, however, since the number of matching records equals the number in the Heap table, Postgres chose a Seq Scan instead

Query 3: Checks and finds products where the JSON object contains a key named 'tags'
Query 4: Finds products where the JSON object contains all the keys specified in the input array ('rating' and 'warranty_months')

The remaining queries will not use the Index because they do not directly utilize tags_and_features (which is the only thing indexed) but instead access internal fields (like tags or rating)

Query 5: Extracts the array from the tags key and checks if that array contains the 'Rustic' element
Query 6: Extracts the tags array and checks if that array simultaneously contains all elements in the ['Rustic', 'Handmade'] array
Query 7: Retrieves the tags field within tags_and_features, note that the ->> operator converts the retrieved value into the TEXT type
Query 8: Compares with the rating field
- Since tags_and_features->>'rating' is used, the retrieved value is of TEXT type, so the comparison value 2 must also be of TEXT type
- If you use tags_and_features->'rating', the retrieved value will be of JSONB type and cannot be compared
Query 9: Using tags_and_features->'tags' will extract a JSONB type
Query 10: Because the data retrieved from tags_and_features->'tags' is JSONB, it is possible to further access element 1 in the array