Explanation of PostgreSQL Operations
Introduction
This article will provide some basic concepts regarding the internal mechanics of PostgreSQL as a foundation for explanations in future articles.
Heap
- First of all, it is necessary to distinguish the Heap in Postgres from Max-Heap or Min-Heap in data structures and algorithms, as well as Heap memory in RAM, they share the same name but are completely different. The meaning of the name Heap in Postgres signifies a jumble or a lack of order, because row data (called tuples) are stacked on top of each other randomly.
- In PostgreSQL, the Heap is the main storage area containing all the actual data of a table. When you create a table and INSERT rows of data, Postgres writes those rows into the Heap.
- Regarding storage, the Heap is simply one or more binary files created by Postgres when you insert data into a table. This Heap file is stored on disk and has a default limit of 1GB per file, if there is more data in the table, it will be split into multiple files for storage. Wherever there is empty space available in the storage file, Postgres will insert data right there.
- Separation from Index: The Heap only contains raw data. If you want to find data quickly, you must use an Index. An Index is also a binary file stored independently from the Heap but it only contains data from indexed columns along with pointers containing information to access the corresponding row data in the Heap.
Page
- Pages are small segments of equal size, with a default of 8KB, mechanically divided from beginning to end within a Heap file.
- Forcing Pages to have a fixed size (8KB = 8192 Bytes) allows Postgres to calculate the exact position of each Page on the hard drive using simple multiplication operations, ensuring that the offset remains constant.
- For example, when storing pointer information to access a page in the Heap, if it needs to read Page 0, the operating system will read from byte 0. If it wants to read Page 2, the operating system simply needs to seek to byte 16384 (8192 * 2) of the file to read. This makes read and write operations extremely fast.
Data Operations
- Although the Heap is a file stored on physical disk, to speed up read and write operations, Postgres does not interact directly with the file on disk. Instead, when you execute a query that modifies data, it loads the data into a cache area on RAM called Shared Buffers.
- Postgres will check which Pages contain the required rows, if they are not already on RAM, it will perform the action of loading those Pages onto RAM before executing insert, delete or update operations, after which it aggregates and writes them down to disk later.
- Even if you only query a single row, it must load the entire Page for processing rather than loading that single row individually.
Insert Data
When you execute an INSERT command for a new row of data, Postgres performs the following steps:
- Find an empty Page: Postgres will check the FSM (Free Space Map) to see if there is any Page with enough empty space to insert this row data. If all old Pages are full, it will create a new Page at the end of the Heap file.
- Write data from the bottom up: The new row of data will be pushed into the bottom of the Free Space.
- Write the index from the top down: A new Pointer is created at the top of the Free Space to point to the row data that was just written.
- Why stack from both ends towards the middle? This design choice is extremely clever. Since Postgres does not know in advance how many rows you will insert into a page (some rows are short, others are long), letting the index run from the top down and data run from the bottom up maximizes the utilization of the gap in the middle without any risk of waste or overlap.
- Update the FSM: there are cases where an old Page has empty space but not enough size to insert the row data, in this situation, Postgres will still create a new Page to store that row and record the status information of that old empty Page so that when new data arrives, it can insert the row there.
Delete Data
- PostgreSQL uses a mechanism called MVCC (Multi-Version Concurrency Control) to manage data versions.
- Therefore, when executing a DELETE, Postgres does not immediately erase the data on the Page. It merely places a flag on that row indicating it is a dead tuple, showing that this row has been deleted, but in terms of storage, that space remains occupied.
- Due to this mechanism, after a period of operation, a phenomenon called bloat will occur, expanding the data size when Pages become filled with dead tuples, vastly increasing storage size and disk space in a wasteful manner and slowing down data queries.
Update Data
- When executing an UPDATE, Postgres operates in a manner similar to a Delete combined with an Insert.
- It does not modify the row data directly but marks the old row with a dead tuple flag, then INSERTS a completely new row with the new data into the remaining free space of the Page, or moves to another Page if that Page runs out of space.
VACUUM
To prevent Pages from holding too many rows flagged as dead tuples, which wastes storage capacity, Postgres features an Auto-Vacuum mechanism, when it scans through a Page, it operates as follows:
- It completely clears out all rows flagged as dead tuples.
- It compacts the remaining active rows down to the bottom of the Page, reclaiming the empty space in the middle, thereby ensuring Pages have room to continue inserting data.
WAL
- WAL (Write-Ahead Logging) is a recovery mechanism that Postgres uses to restore data in case the system stops operating in the middle of processing (possibly due to a system crash or power outage).
- As mentioned, Postgres data is stored on disk, which is Persistent Storage, but operations that change data all need to load Pages onto RAM to ensure speed. After making modifications within the Pages (at this point they are called Dirty Pages), this data is not immediately rewritten into the Heap on disk but is written into the WAL file first.
- The reason is that writing to a WAL file is performed as a direct Sequential Write, which is much faster than searching for each individual Page across Heap files to modify data, which consumes more time. Only when writing to the WAL file is successful will Postgres return a Commit OK response.
- Once the WAL file has been successfully written, if the database system happens to shut down (when data has not yet been saved to the Heap), Postgres will rely on the WAL file to recover the data.
- Data written to the WAL file includes information about the Page that needs modification and the necessary data, so that when a recovery is required, Postgres still knows which Pages to modify along with what information.
- The process of saving data down to the Heap file on disk is as follows:
- The Dirty Pages remain on RAM like that to serve subsequent commands, because when you want to read or modify them further, the data is already available on RAM, avoiding the need to load from disk again.
- Postgres will aggregate these Dirty Pages and write them into the Heap file via two background processes:
- BgWriter (Background Writer): used to check RAM to grab Dirty Pages and write them down to disk.
- Checkpointer: after a pre-configured time interval (default is every 5 minutes) or when the WAL file is full, the Checkpoint process is triggered. It forces the system to write all remaining Dirty Pages from RAM into the Heap file.
Detail
You can use the following query to view the data directory:
SHOW data_directory;
Next, use this query to get the OID and Filenode:
SELECT current_database(), d.oid AS db_oid, c.relfilenode
FROM pg_class c
JOIN pg_database d ON d.datname = current_database()
WHERE c.relname = 'Category';
After that, simply combine them as follows: `{path}/base/[Database_OID]/[Table_Filenode]`, as in the Docker example.
However, since that is a binary file, it is almost impossible to view, if you want to inspect the Page information within the Heap, do the following
CREATE EXTENSION pageinspect;
SELECT * FROM page_header(get_raw_page('"public"."Category"', 0));
- pageinspect: the extension required to view Page information
- public: schema
- Category: table name
- 0: the Page from which to retrieve information
Explanation of values:
lsn(Log Sequence Number) is a pointer indicating the position of the last WAL (Write-Ahead Log) record that modified data on this page. Postgres uses the LSN to ensure data integrity (if the system crashes, it relies on the LSN to know how far this page has been written in order to recover).checksumis a data safety check code, which is a numerical value calculated from all the data within the page. When Postgres reads this page from disk up to RAM, it recalculates the checksum. If the recalculated number differs from the original value (such as-28659) stored here, Postgres knows this page has been corrupted due to disk or hardware errors and will block it to protect the system.flagsis a bitmap representing the state of the page. In Postgres, a value of4(0x0004) typically corresponds to thePD_PAGE_FULLflag. It signals to the system that this page is currently full (or nearly full), so it should not attempt to INSERT into here anymore to save search time.lowerandupperare the two most important parameters to understand how a Page operates because data inside a Page is stored in a "both ends towards the middle" fashion:lower: The ending point of the area containing Item Pointers (rows containing pointers that point to the actual data location). This region grows from the top down. The number764means the pointers occupy from byte 24 (past the Header) to byte764.upper: The starting point of the area containing Tuple Data (the actual data of the rows). This region grows from the bottom up. The number792means the actual data is stacked backwards from byte8192up to byte792.- Free Space: The distance between
lowerandupperis precisely the remaining empty space of the page. Here: 792 - 764 = 28, meaning this page has exactly 28 bytes of free space left, which aligns with why theflagsparameter reports4(Page Full). specialis the starting point of the Special Space. This region is located at the bottom of the Page, usually reserved only for Index pages (such as B-Tree to store links between pages). For a regular table (Heap Page), this region is unused, so the number8192means this special region starts at the very last byte (meaning its size is 0).pagesizeis the size of the page, which is 8192 bytes = 8KB.versionis the layout format version of the Page (Page Layout Version). The number4is the standard page layout version for PostgreSQL from version 8.3 up to the latest versions today.prune_xidis the oldest transaction ID that can be cleaned up. This field stores the ID of the Transaction that deleted or updated old data in this page, helping theVACUUMprocess know that it can enter to prune dead tuples. A value of0means that currently this page has no data that can be pruned.
- First, create a few rows of data (about 5 rows).
- Next, delete a row in the middle (such as row 3).
- Then, update the column data of the very first row.
- After querying the data again, you will see that the first row has now moved to the position where row 3 used to be.
Happy coding!
Comments
Post a Comment