Explanation of PostgreSQL Operations

Introduction

This article will provide some basic concepts regarding the internal mechanics of PostgreSQL as a foundation for explanations in future articles.

Heap

First of all, it is necessary to distinguish the Heap in Postgres from Max-Heap or Min-Heap in data structures and algorithms, as well as Heap memory in RAM, they share the same name but are completely different. The meaning of the name Heap in Postgres signifies a jumble or a lack of order, because row data (called tuples) are stacked on top of each other randomly.
In PostgreSQL, the Heap is the main storage area containing all the actual data of a table. When you create a table and INSERT rows of data, Postgres writes those rows into the Heap.
Regarding storage, the Heap is simply one or more binary files created by Postgres when you insert data into a table. This Heap file is stored on disk and has a default limit of 1GB per file, if there is more data in the table, it will be split into multiple files for storage. Wherever there is empty space available in the storage file, Postgres will insert data right there.
Separation from Index: The Heap only contains raw data. If you want to find data quickly, you must use an Index. An Index is also a binary file stored independently from the Heap but it only contains data from indexed columns along with pointers containing information to access the corresponding row data in the Heap.

Page

Pages are small segments of equal size, with a default of 8KB, mechanically divided from beginning to end within a Heap file.
Forcing Pages to have a fixed size (8KB = 8192 Bytes) allows Postgres to calculate the exact position of each Page on the hard drive using simple multiplication operations, ensuring that the offset remains constant.
For example, when storing pointer information to access a page in the Heap, if it needs to read Page 0, the operating system will read from byte 0. If it wants to read Page 2, the operating system simply needs to seek to byte 16384 (8192 * 2) of the file to read. This makes read and write operations extremely fast.

Data Operations

Although the Heap is a file stored on physical disk, to speed up read and write operations, Postgres does not interact directly with the file on disk. Instead, when you execute a query that modifies data, it loads the data into a cache area on RAM called Shared Buffers.
Postgres will check which Pages contain the required rows, if they are not already on RAM, it will perform the action of loading those Pages onto RAM before executing insert, delete or update operations, after which it aggregates and writes them down to disk later.
Even if you only query a single row, it must load the entire Page for processing rather than loading that single row individually.

Insert Data

When you execute an INSERT command for a new row of data, Postgres performs the following steps:

Find an empty Page: Postgres will check the FSM (Free Space Map) to see if there is any Page with enough empty space to insert this row data. If all old Pages are full, it will create a new Page at the end of the Heap file.
Write data from the bottom up: The new row of data will be pushed into the bottom of the Free Space.
Write the index from the top down: A new Pointer is created at the top of the Free Space to point to the row data that was just written.
Why stack from both ends towards the middle? This design choice is extremely clever. Since Postgres does not know in advance how many rows you will insert into a page (some rows are short, others are long), letting the index run from the top down and data run from the bottom up maximizes the utilization of the gap in the middle without any risk of waste or overlap.
Update the FSM: there are cases where an old Page has empty space but not enough size to insert the row data, in this situation, Postgres will still create a new Page to store that row and record the status information of that old empty Page so that when new data arrives, it can insert the row there.

Delete Data

PostgreSQL uses a mechanism called MVCC (Multi-Version Concurrency Control) to manage data versions.
Therefore, when executing a DELETE, Postgres does not immediately erase the data on the Page. It merely places a flag on that row indicating it is a dead tuple, showing that this row has been deleted, but in terms of storage, that space remains occupied.
Due to this mechanism, after a period of operation, a phenomenon called bloat will occur, expanding the data size when Pages become filled with dead tuples, vastly increasing storage size and disk space in a wasteful manner and slowing down data queries.

Update Data

When executing an UPDATE, Postgres operates in a manner similar to a Delete combined with an Insert.
It does not modify the row data directly but marks the old row with a dead tuple flag, then INSERTS a completely new row with the new data into the remaining free space of the Page, or moves to another Page if that Page runs out of space.

VACUUM

To prevent Pages from holding too many rows flagged as dead tuples, which wastes storage capacity, Postgres features an Auto-Vacuum mechanism, when it scans through a Page, it operates as follows:

It completely clears out all rows flagged as dead tuples.
It compacts the remaining active rows down to the bottom of the Page, reclaiming the empty space in the middle, thereby ensuring Pages have room to continue inserting data.

WAL

WAL (Write-Ahead Logging) is a recovery mechanism that Postgres uses to restore data in case the system stops operating in the middle of processing (possibly due to a system crash or power outage).
As mentioned, Postgres data is stored on disk, which is Persistent Storage, but operations that change data all need to load Pages onto RAM to ensure speed. After making modifications within the Pages (at this point they are called Dirty Pages), this data is not immediately rewritten into the Heap on disk but is written into the WAL file first.
The reason is that writing to a WAL file is performed as a direct Sequential Write, which is much faster than searching for each individual Page across Heap files to modify data, which consumes more time. Only when writing to the WAL file is successful will Postgres return a Commit OK response.
Once the WAL file has been successfully written, if the database system happens to shut down (when data has not yet been saved to the Heap), Postgres will rely on the WAL file to recover the data.
Data written to the WAL file includes information about the Page that needs modification and the necessary data, so that when a recovery is required, Postgres still knows which Pages to modify along with what information.
The process of saving data down to the Heap file on disk is as follows:

The Dirty Pages remain on RAM like that to serve subsequent commands, because when you want to read or modify them further, the data is already available on RAM, avoiding the need to load from disk again.
Postgres will aggregate these Dirty Pages and write them into the Heap file via two background processes:

BgWriter (Background Writer): used to check RAM to grab Dirty Pages and write them down to disk.
Checkpointer: after a pre-configured time interval (default is every 5 minutes) or when the WAL file is full, the Checkpoint process is triggered. It forces the system to write all remaining Dirty Pages from RAM into the Heap file.

Detail

You can use the following query to view the data directory:

SHOW data_directory;

Next, use this query to get the OID and Filenode:

SELECT current_database(), d.oid AS db_oid, c.relfilenode 
FROM pg_class c 
JOIN pg_database d ON d.datname = current_database()
WHERE c.relname = 'Category';

After that, simply combine them as follows: `{path}/base/[Database_OID]/[Table_Filenode]`, as in the Docker example.

However, since that is a binary file, it is almost impossible to view, if you want to inspect the Page information within the Heap, do the following

CREATE EXTENSION pageinspect;
SELECT * FROM page_header(get_raw_page('"public"."Category"', 0));

pageinspect: the extension required to view Page information
public: schema
Category: table name
0: the Page from which to retrieve information

Explanation of values:

lsn (Log Sequence Number) is a pointer indicating the position of the last WAL (Write-Ahead Log) record that modified data on this page. Postgres uses the LSN to ensure data integrity (if the system crashes, it relies on the LSN to know how far this page has been written in order to recover).
checksum is a data safety check code, which is a numerical value calculated from all the data within the page. When Postgres reads this page from disk up to RAM, it recalculates the checksum. If the recalculated number differs from the original value (such as -28659) stored here, Postgres knows this page has been corrupted due to disk or hardware errors and will block it to protect the system.
flags is a bitmap representing the state of the page. In Postgres, a value of 4 (0x0004) typically corresponds to the PD_PAGE_FULL flag. It signals to the system that this page is currently full (or nearly full), so it should not attempt to INSERT into here anymore to save search time.
lower and upper are the two most important parameters to understand how a Page operates because data inside a Page is stored in a "both ends towards the middle" fashion:

lower: The ending point of the area containing Item Pointers (rows containing pointers that point to the actual data location). This region grows from the top down. The number 764 means the pointers occupy from byte 24 (past the Header) to byte 764.
upper: The starting point of the area containing Tuple Data (the actual data of the rows). This region grows from the bottom up. The number 792 means the actual data is stacked backwards from byte 8192 up to byte 792.
Free Space: The distance between lower and upper is precisely the remaining empty space of the page. Here: 792 - 764 = 28, meaning this page has exactly 28 bytes of free space left, which aligns with why the flags parameter reports 4 (Page Full).

special is the starting point of the Special Space. This region is located at the bottom of the Page, usually reserved only for Index pages (such as B-Tree to store links between pages). For a regular table (Heap Page), this region is unused, so the number 8192 means this special region starts at the very last byte (meaning its size is 0).
pagesize is the size of the page, which is 8192 bytes = 8KB.
version is the layout format version of the Page (Page Layout Version). The number 4 is the standard page layout version for PostgreSQL from version 8.3 up to the latest versions today.
prune_xid is the oldest transaction ID that can be cleaned up. This field stores the ID of the Transaction that deleted or updated old data in this page, helping the VACUUM process know that it can enter to prune dead tuples. A value of 0 means that currently this page has no data that can be pruned.

Here is an example to show you that data stored in a HEAP does not follow any specific order, but simply finds an empty space to be inserted into. You can follow these steps:

First, create a few rows of data (about 5 rows).
Next, delete a row in the middle (such as row 3).
Then, update the column data of the very first row.
After querying the data again, you will see that the first row has now moved to the position where row 3 used to be.

The reason for this is that when row 3 is deleted, it is flagged as 'dead' but its space remains reusable. Therefore, when Postgres looks for an available slot within the Page to insert the updated data of row 1, it finds row 3's empty slot and simply places the new data of row 1 right into it.