GIN Index with TEXT

Introduction

  • In this article, we will continue to explore how to use GIN Index with the text data type
  • First, please note that GIN Index only works with multi-valued data types, meaning fields containing multiple values, so you can easily create an index with the array type
  • However, if you create a similar index with the text type, an error will be reported because text is only a single-valued type, containing only one content string, so you must pass an additional function that defines how to split the text value into items in an array before the index can be created
  • If you remember, when using GIN Index with a text array, you will encounter limitations regarding partial search and must enter the exact word to search for it to work, now you can solve that problem by using text combined with two functions, to_tsvector and pg_trgm

Full-Text Search

This method uses the to_tsvector function and its index creation process will be as follows

  • Tokenization & Normalization

    • Split the text value into individual words based on spaces and punctuation
    • Lowercasing
    • Remove Stop Words, which are words that appear too frequently but carry no search value, such as the words the, a, an and and
    • Reduce to root words, for example, running and runs both become run
  • Entry Creation

    • At this point, long text content only contains the main meaningful words, so you have a list of words just like when using a Text array
    • The next process is similar to working with an array, where the items will be used as Entries along with a Posting list containing information to access the row
    • However, its Posting list will also contain additional information about the position of that item in the original text, because full-text search also supports searching related to the position of words
  • Thus, using this method has solved partial search and no longer has case-sensitivity constraints

  • As for the removed stop words not being searchable, it is not really a problem because these are considered words that do not carry specific meaning to form the content of a sentence, so when searching, users will not use these words to look for real content

Trigram

This method uses the pg_trgm function, as the name suggests, you can also guess that it will automatically split the text value into a list of items, where each item has 3 characters This number of 3 characters is fixed and you cannot change it because the Postgres team considers this to be the perfect balance between storage volume and performance

  • You will not have to create a list of too many meaningless sets as when splitting by 2 characters, causing volume waste
  • Nor do you have to create a set containing too many characters such as 4 characters, which would make it impossible to search for a smaller number of characters

The text search process will be as follows

  • Postgres will perform Tokenization & Normalization on your search content according to Full-text search or Trigram to split it into corresponding string lists
  • Next, use GIN Index to search that list by Entry, then extract the list of TIDs
  • Based on the TID, extract the row values and then perform a recheck with the raw data to ensure accurate results

Thus, you can see that the size of the GIN Index is directly proportional to the stored Text content

  • If the content does not duplicate much, the Entry will be larger
  • If the content duplicates a lot, the Posting list will be larger

Detail

First, let's create a table as follows

CREATE TABLE post (
    id SERIAL PRIMARY KEY,
    gin_title TEXT,
    B-Tree_title TEXT,
    english_description TEXT,
    simple_description TEXT
);

If you try to create a normal Index as with the array type, you will see an error indicating a missing default operator class, because TEXT is a single-valued type, so a GIN Index cannot be created directly

CREATE INDEX idx_post_gin_title ON post USING gin(gin_title);

Full-Text Search

First, let's look at the examples of Full-Text Search

CREATE INDEX idx_post_simple_description ON post USING gin (to_tsvector('simple', simple_description));
CREATE INDEX idx_post_english_description ON post USING gin (to_tsvector('english', english_description));
  • When you use to_tsvector with simple, it simply splits the text into clusters of 3 characters and lowercases them
  • When used with english or a specific language, stop words will be deleted and the remaining words will be converted to their origin

The usage examples are very diverse as follows

SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'apollotech');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'executive & black');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'wireless | seat');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'apollotech <1> b340 | chair <2> in');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', '(apollotech <1> b340 & mouse <2> reliable) | chair <2> in');
  • Search for posts containing the keyword 'apollotech'
  • Search for posts containing both keywords 'executive' and 'black'
  • Search for posts containing one of the two keywords, either 'wireless' or 'seat'
  • Search for posts containing the cluster 'apollotech' standing immediately before 'b340', or the phrase 'chair' exactly 2 positions away from 'in' (meaning there is 1 word between chair and in)
  • Combined search: contains the cluster ('apollotech' next to 'b340' and 'mouse' 2 positions away from 'reliable') or the cluster ('chair' 2 positions away from 'in')


The limitation of this method is that it cannot find stop words and does not support partial finding

SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'the');
SELECT * FROM post WHERE to_tsvector('english', english_description) @@ to_tsquery('english', 'aff');
  • Does not work because 'the' is a stop word removed from the index
  • Does not work because full-text search does not support partial word matching (partial finding)

Trigram

Next, create an index for Trigram as follows

CREATE INDEX idx_post_gin_title ON post USING gin (gin_title gin_trgm_ops);
CREATE INDEX idx_post_B-Tree_title ON post (B-Tree_title text_pattern_ops);
  • I create 1 GIN Index and 1 B-Tree Index, because B-Tree Index can also be used for TEXT values, but the differences are
    • B-Tree Index does not split the content but will store the entire Text value in the Index
    • Thus, B-Tree Index will not support partial finding and must be case-sensitive
    • However, the size of B-Tree Index will be much lighter than GIN Index
    • The text_pattern_ops class will force B-Tree to sort data by character byte (like ASCII code), helping Postgres accurately identify the data region to perform an efficient Index Scan
SELECT * FROM post WHERE gin_title LIKE '%Awe%';
SELECT * FROM post WHERE gin_title LIKE 'A%';
SELECT * FROM post WHERE gin_title LIKE '%zza';
SELECT * FROM post WHERE gin_title % 'smal';
SELECT gin_title FROM post WHERE gin_title LIKE '%w%';

Because Trigram only splits words and does not lowercase, the search text must be case-sensitive

  • Search for rows where gin_title contains the string 'Awe'
  • Search for rows where gin_title starts with the character 'A'
  • Search for rows where gin_title ends with the string 'zza'
  • Search for rows that have string similarity with the word 'smal' based on the Trigram operator %, which is used in cases allowing mild typo searching
  • Search containing 'w' (less than 3 characters), note that at this time, GIN Index cannot work and will switch to seq scan


SELECT * FROM post WHERE btree_title LIKE 'A%'
SELECT btree_title FROM post WHERE btree_title LIKE 'Awe%';
SELECT * FROM post WHERE btree_title LIKE '%zza';
SELECT * FROM post WHERE btree_title LIKE '%Bron%';
SELECT * FROM post WHERE btree_title
% 'smal';

When using B-Tree Index, it only works with the A% pattern


If you only query the exact column in the Index, Postgres only needs an Index scan instead of a Heap scan


When performing a search with '%zza', '%Bron%' and the % operator, you still see results, but it is just a Seq Scan and B-Tree cannot operate


However, if you want B-Tree to work with the %text pattern, you can add an Index with the reverse function to reverse the string, the way it works is that it will bring it back to the text% format that B-Tree supports and the Index scan will be activated

CREATE INDEX idx_post_btree_title_reverse ON post (reverse(btree_title) text_pattern_ops);
SELECT * FROM post WHERE reverse(btree_title)
LIKE reverse('%zza');


Thus, if the requirement is only case-sensitive searching with the text% or %text format, you can use B-Tree instead of GIN to save storage volume for the Index effectively

Happy coding!

See more articles here.

Comments

Popular posts from this blog

All Practice Series

Kubernetes Deployment for Zero Downtime

Deploying a NodeJS Server on Google Kubernetes Engine

Sitemap

Setting up Kubernetes Dashboard with Kind

React Practice Series

Monitoring with cAdvisor, Prometheus and Grafana on Docker

DevOps Practice Series

Helm for beginer - Deploy nginx to Google Kubernetes Engine

Using Kafka with Docker and NodeJS