GIN Index with Array
Introduction
GIN
- GIN (Generalized Inverted Index) also uses a B-tree data structure, so it shares similar characteristics with B-Tree.
- However, unlike B-tree, which stores each value along with a Tid to a single row in the main table, GIN stores the value along with information to access multiple rows sharing that same value (referred to as a Posting List).
- Therefore, the difference in usage is that B-tree is used to search for a row containing a specific value, while GIN is used to search for small values nested inside a row.
Posting List
- You can understand it as a list containing IDs, which are primary keys corresponding to each row in the main table.
- If the table does not have an ID, it will use a physical identifier called ctid (Column Tuple Identifier) created by Postgres, with a structure consisting of (Page number, row index within the Page).
- Based on this ctid, Postgres can access any row in the main table.
- However, as you know, data in the heap is arranged arbitrarily and when data changes, the positions of the rows within the Page also change, leading to a modified ctid.
- This forces the GIN index to update the Posting List containing the ctids.
- It incurs additional processing overhead compared to using a fixed primary key, so to optimize performance for a GIN index, you should always add a primary key to the table, allowing Postgres to automatically use the primary key instead of the ctid.
Creating a GIN Index
Because the GIN index structure is very complex, where a single row of data is split into multiple Entries and each Entry has its own Posting List, the process of changing data in GIN consumes a significant amount of resources. To reduce the load, Postgres divides this process into two mechanisms:
- Direct update mechanism (fastupdate = off): After data changes, Postgres will look up the corresponding entries and modify the Post listing directly. Using this mode helps the index update immediately but increases the time required after data changes, because extra time is spent updating the index.
- Pending List (fastupdate = on, default): Using this mode, Postgres will not update the GIN index immediately when data changes, but will instead store the changes in a buffer memory space called the Pending list.
- Operations on the Pending list are sequential, making them faster than searching and updating within the index.
- When executing a query, Postgres checks data in both the GIN index and the Pending list, ensuring no data is missed, though it still incurs an extra cost to load data from the Pending list.
- When the Pending list becomes full or when Vacuum runs, it will update the data from the Pending list to the GIN index, then free up the Pending list.
Thus, you can see that a GIN index also incurs B-tree balancing costs when data changes and the cost is even greater because:
- Its number of Entries can be very large because a single field can be split into multiple Entries, rather than a 1-1 relationship as in a B-tree index.
- It must additionally update the Posting list containing a list of TIDs with the same value to access the row in the heap.
Application
Using a GIN Index effectively solves the following problems:
- Array Data: When a column contains a list of elements and you frequently search to see if an element exists within that array.
- Full-Text Search: When you want to find keywords that appear within a long article. GIN will split the article into individual words and index each word.
- JSONB Data: When you store complex JSON files and want to quickly search for key: value pairs or arrays nested deep inside the JSON.
GIN with Array
In this article, we will first explore how a GIN index handles the array data type. PostgreSQL supports GIN indexes for arrays of most common, popular base data types, such as:
- Integer arrays: smallint[], integer[], bigint[]
- Character string arrays: text[], varchar[], char[]
- Time arrays: date[], timestamp[], timestamptz[]
- Real number arrays: numeric[], real[]
- Identifier arrays: uuid[], oid[] Note that GIN does not support arrays of complex or unordered data types, such as
JSONB[]or multidimensional arrays likeint[][].
The index creation process for an array proceeds as follows:
- Postgres reads through each row to extract each item of the array.
- It removes duplicate items to get a list of unique Entries.
- These Entries are then hashed into numbers.
- The reason is to keep the size of the Entry fixed, since hashing produces a 4-byte uint32 number.
- When querying data, searching within the GIN index is done via comparison and comparing numbers is simpler than comparing text.
- After that, the Entries are sorted in ascending order to be inserted into the B-tree.
- Corresponding to each Entry, it stores an additional Post listing, which is information to access the corresponding rows, such as an ID or ctid.
Consequently, if you store a text array, Postgres will create a GIN index precisely for each item, for example, storing the array ['Text value']:
- Searches are case-sensitive: Searching for 'text value' will yield no results.
- Partial matching is not supported: Searching for the word 'value' will yield no results.
Detail
First, let us create a table with a TEXT[] column and create its corresponding index as follows:
CREATE TABLE test (
id SERIAL PRIMARY KEY,
name VARCHAR(100),
tags TEXT[]
);
CREATE INDEX idx_test_tags ON test USING gin(tags);
You can insert data like this:
INSERT INTO test (name, tags)
VALUES ('name 1', ARRAY['Value 1', 'value 2', 'vAlue 3']);
However, for the index to work, you should insert a large enough amount of data so that the Cost-based Optimizer of Postgres decides to read from the Index instead of performing a Seq Scan, because if the data is too small, it will determine that reading directly from the Heap is faster.
Check the results:
select * from test
You can perform queries like this:
-- 1
SELECT * FROM test WHERE tags @> ARRAY['clothing', 'beauty'];
-- 2
SELECT * FROM test WHERE tags @> ARRAY['Clothing'];
-- 3
SELECT * FROM test WHERE tags <@ ARRAY['clothing', 'sports'];
-- 4
SELECT * FROM test WHERE tags && ARRAY['clothing', 'home'];
- Query 1 uses the
@>operator to search for rows whose tag contains both['clothing', 'beauty']. - Query 2 will not work because arrays only support exact case-sensitive searches.
- Query 3 uses the
<@operator to find rows whose tags contain all items or whose tags only contain a single item belonging to the list['clothing', 'sports']. - Query 4 uses the
&&operator to find rows that only need to have at least one tag belonging to the list['clothing', 'home'].
Results
We will execute EXPLAIN ANALYZE to check if the Index is operational.
EXPLAIN ANALYZE
SELECT tags FROM test WHERE tags @> ARRAY['clothing', 'beauty'];
The result is as follows:
"Bitmap Heap Scan on test (cost=262.00..2171.10 rows=34088 width=87) (actual time=6.202..10.925 rows=39175.00 loops=1)"
" Recheck Cond: (tags @> '{clothing,beauty}'::text[])"
" Heap Blocks: exact=1483"
" Buffers: shared hit=1489 read=19"
" -> Bitmap Index Scan on idx_test_tags (cost=0.00..253.48 rows=34088 width=0) (actual time=6.061..6.061 rows=39175.00 loops=1)"
" Index Cond: (tags @> '{clothing,beauty}'::text[])"
" Index Searches: 1"
" Buffers: shared hit=6 read=19"
"Planning:"
" Buffers: shared hit=8 read=2"
"Planning Time: 0.338 ms"
"Execution Time: 11.937 ms"
Based on the above result, there are two additional concepts:
- Bitmap Index Scan: Postgres will scan the Index first to find rows matching the condition. Instead of fetching the data immediately, it creates a Bitmap in memory. Each bit represents a page containing a matching row.
- Bitmap Heap Scan: Postgres relies on the ordered bitmap to access the Pages on the Heap directly to fetch the rows.
You might wonder why Postgres still has to use a Heap scan when we only query the indexed column, the reason is that:
- The GIN index inherently does not store the full raw value of that row, because during index creation, it already split the value to create Entries in the index.
- Therefore, even if you query only that column, the GIN index cannot reconstruct the complete value from these Entries.
- Hence, Postgres is forced to perform an additional Heap scan to retrieve the complete data.
Analyzing the specific results, Postgres begins execution from the line marked with ->:
-> Bitmap Index Scan on idx_test_tags (cost=0.00..253.48 rows=34088 width=0) (actual time=6.061..6.061 rows=39175.00 loops=1) Index Cond: (tags @> '{clothing,beauty}'::text[])- Postgres used the index
idx_test_tagsto find rows containing both tags {clothing,beauty}. width=0: Shows that it does not retrieve data at this step, because the purpose of the Bitmap Index Scan is to extract TIDs and create a Bitmap on RAM.- Actual time is 6.061, the number of rows found is 39175, while the predicted number of rows is 34088.
- Postgres used the index
Index Searches: 1: The number of lookups in the index, it completed running in just 1 pass.Buffers: shared hit=6 read=19: This is the amount of resources consumed solely for reading the Index tree:shared hit=6: Postgres read 6 index data blocks already available on RAM.read=19: There are 19 index data blocks not yet on RAM, which Postgres had to read from disk, this is also the main factor causing queries to run slowly.
Bitmap Heap Scan on test (cost=262.00..2171.10 rows=34088 width=87) (actual time=6.202..10.925 rows=39175.00 loops=1)- After obtaining the Bitmap from the Index scan, this step will rely on those TIDs to perform direct Sequential I/O in the Heap to fetch all corresponding data, which is much faster than Random I/O.
- Actual time started at 6.202ms and finished at 10.925ms.
- width=87 means the returned data is 87 bytes.
Recheck Cond: (tags @> '{clothing,beauty}'::text[]): To verify whether the raw data fetched from the Heap matches the target value {clothing,beauty}.Heap Blocks: exact=1483- After running the Bitmap Index Scan, it will extract the number of Blocks/Pages in the Heap containing the corresponding data, which is 1483.
- The Bitmap operates in
exactmode, meaning the Bitmap contains both the block numbers and the row numbers, so during the Heap scan, it only needs to read exactly as specified to load the data. - If the Bitmap operates in
lossymode due to insufficient database RAM configuration, it only stores Block number information without specific row numbers, leading to the Heap scan having to load the entire Block and check each row, which incurs extra CPU processing overhead.
Buffers: shared hit=1489 read=19- shared hit=1489: There are 1,489 data blocks already residing on RAM (Cache).
- read=19: There are only 19 data blocks that Postgres had to read from disk.
Happy coding!
Comments
Post a Comment