GIN Index with TEXT

Introduction

In this article, we will continue to explore how to use GIN Index with the text data type
First, please note that GIN Index only works with multi-valued data types, meaning fields containing multiple values, so you can easily create an index with the array type
However, if you create a similar index with the text type, an error will be reported because text is only a single-valued type, containing only one content string, so you must pass an additional function that defines how to split the text value into items in an array before the index can be created
If you remember, when using GIN Index with a text array, you will encounter limitations regarding partial search and must enter the exact word to search for it to work, now you can solve that problem by using text combined with two functions, to_tsvector and pg_trgm

Full-Text Search

This method uses the to_tsvector function and its index creation process will be as follows

Tokenization & Normalization
- Split the text value into individual words based on spaces and punctuation
- Lowercasing
- Remove Stop Words, which are words that appear too frequently but carry no search value, such as the words the, a, an and and
- Reduce to root words, for example, running and runs both become run
Entry Creation
- At this point, long text content only contains the main meaningful words, so you have a list of words just like when using a Text array
- The next process is similar to working with an array, where the items will be used as Entries along with a Posting list containing information to access the row
- However, its Posting list will also contain additional information about the position of that item in the original text, because full-text search also supports searching related to the position of words
Thus, using this method has solved partial search and no longer has case-sensitivity constraints
As for the removed stop words not being searchable, it is not really a problem because these are considered words that do not carry specific meaning to form the content of a sentence, so when searching, users will not use these words to look for real content

Trigram

This method uses the pg_trgm function, as the name suggests, you can also guess that it will automatically split the text value into a list of items, where each item has 3 characters This number of 3 characters is fixed and you cannot change it because the Postgres team considers this to be the perfect balance between storage volume and performance

You will not have to create a list of too many meaningless sets as when splitting by 2 characters, causing volume waste
Nor do you have to create a set containing too many characters such as 4 characters, which would make it impossible to search for a smaller number of characters

The text search process will be as follows

Postgres will perform Tokenization & Normalization on your search content according to Full-text search or Trigram to split it into corresponding string lists
Next, use GIN Index to search that list by Entry, then extract the list of TIDs
Based on the TID, extract the row values and then perform a recheck with the raw data to ensure accurate results

Thus, you can see that the size of the GIN Index is directly proportional to the stored Text content

If the content does not duplicate much, the Entry will be larger
If the content duplicates a lot, the Posting list will be larger

Detail

First, let's create a table as follows

CREATE TABLE post (
    id SERIAL PRIMARY KEY,
    gin_title TEXT,
    b_tree_title TEXT,
    english_description TEXT,
    simple_description TEXT
);

If you try to create a normal Index as with the array type, you will see an error indicating a missing default operator class, because TEXT is a single-valued type, so a GIN Index cannot be created directly

CREATE INDEX idx_post_gin_title ON post USING gin(gin_title);

Full-Text Search

First, let's look at the examples of Full-Text Search

CREATE INDEX idx_post_simple_description ON post USING gin (to_tsvector('simple', simple_description));
CREATE INDEX idx_post_english_description ON post USING gin (to_tsvector('english', english_description));

When you use to_tsvector with simple, it simply splits the text into clusters of 3 characters and lowercases them
When used with english or a specific language, stop words will be deleted and the remaining words will be converted to their origin

The usage examples are very diverse as follows

SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'apollotech');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'executive & black');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'wireless | seat');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'apollotech <1> b340 | chair <2> in');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', '(apollotech <1> b340 & mouse <2> reliable) | chair <2> in');

Search for posts containing the keyword 'apollotech'
Search for posts containing both keywords 'executive' and 'black'
Search for posts containing one of the two keywords, either 'wireless' or 'seat'
Search for posts containing the cluster 'apollotech' standing immediately before 'b340', or the phrase 'chair' exactly 2 positions away from 'in' (meaning there is 1 word between chair and in)
Combined search: contains the cluster ('apollotech' next to 'b340' and 'mouse' 2 positions away from 'reliable') or the cluster ('chair' 2 positions away from 'in')

The limitation of this method is that it cannot find stop words and does not support partial finding

SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'the');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'aff');

Does not work because 'the' is a stop word removed from the index
Does not work because full-text search does not support partial word matching (partial finding)

Trigram

Next, create an index for Trigram as follows

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_post_gin_title ON post USING gin (gin_title gin_trgm_ops);
CREATE INDEX idx_post_btree_title ON post (btree_title text_pattern_ops);

I create 1 GIN Index and 1 B-Tree Index, because B-Tree Index can also be used for TEXT values, but the differences are

B-Tree Index does not split the content but will store the entire Text value in the Index
Thus, B-Tree Index will not support partial finding and must be case-sensitive
However, the size of B-Tree Index will be much lighter than GIN Index
The text_pattern_ops class will force B-Tree to sort data by character byte (like ASCII code), helping Postgres accurately identify the data region to perform an efficient Index Scan

First, let's use the following queries with a GIN Index:

-- Query 1
SELECT * FROM post WHERE gin_title LIKE 'Data%';
SELECT * FROM post WHERE gin_title LIKE '%Part 1';
SELECT * FROM post WHERE gin_title LIKE '%Data%';

-- Query 2
SELECT * FROM post WHERE gin_title LIKE 'Da%';
SELECT * FROM post WHERE gin_title LIKE '%36';
SELECT * FROM post WHERE gin_title LIKE '%Da%';

-- Query 3
SELECT * FROM post WHERE gin_title LIKE 'da%';
SELECT * FROM post WHERE gin_title LIKE '%da%';

-- Query 4
SELECT * FROM post WHERE gin_title ILIKE 'data%';
SELECT * FROM post WHERE gin_title ILIKE '%part 1';
SELECT * FROM post WHERE gin_title ILIKE '%data%';

-- Query 5
SELECT * FROM post WHERE gin_title ILIKE 'da%';
SELECT * FROM post WHERE gin_title ILIKE '%36';
SELECT * FROM post WHERE gin_title ILIKE '%da%';

-- Query 6
SELECT * FROM post WHERE gin_title % 'real time data processing using logica replica';

-- Query 7
SELECT * FROM post WHERE 'usings' <% gin_title;

Query 1: These queries will trigger the GIN Index, but only support case-sensitive matching. As you can see, using the % wildcard at the beginning, end, or both sides of the string all work properly.

Query 2: These queries can also utilize the GIN Index. Even though a trigram splits text into 3-character chunks, you can still query with fewer characters; Postgres will automatically pad spaces to support the search.

Query 3: These queries will yield no results because using LIKE performs a case-sensitive search.

Query 4: Here, ILIKE is used for case-insensitive searching.

Query 5: Continuing with ILIKE, searching with fewer than 3 characters still works properly.

Query 6: The % operator is not a substring search (like LIKE):

It calculates the similarity percentage of 3-character chunks (trigrams) shared between the search keyword and the entire string in the column.
This query is used to find rows in gin_title that share high similarity with the string value 'real time data processing using logica replica'.

Query 7: Postgres supports the <% (word similarity) operator:

It is used to compare the keyword 'usings' specifically against the most similar word/phrase inside the string, rather than comparing it against the entire long string.
As you can see, the results do not contain 'usings', but only 'using'. Because they are sufficiently similar, matching results are still returned.

Next, let's take a look at how using a B-Tree index for TEXT searching differs.

SELECT * FROM post WHERE btree_title LIKE 'A%'
SELECT btree_title FROM post WHERE btree_title LIKE 'Awe%';
SELECT * FROM post WHERE btree_title LIKE '%zza';
SELECT * FROM post WHERE btree_title LIKE '%Bron%';
SELECT * FROM post WHERE btree_title % 'smal';

When using B-Tree Index, it only works with the A% pattern

If you only query the exact column in the Index, Postgres only needs an Index scan instead of a Heap scan

When performing a search with '%zza', '%Bron%' and the % operator, you still see results, but it is just a Seq Scan and B-Tree cannot operate

However, if you want B-Tree to work with the %text pattern, you can add an Index with the reverse function to reverse the string, the way it works is that it will bring it back to the text% format that B-Tree supports and the Index scan will be activated

CREATE INDEX idx_post_btree_title_reverse ON post (reverse(btree_title) text_pattern_ops);
SELECT * FROM post WHERE reverse(btree_title) LIKE reverse('%zza');

Thus, if the requirement is only case-sensitive searching with the text% or %text format, you can use B-Tree instead of GIN to save storage volume for the Index effectively

Happy coding!

See more articles here.