Normalization and Denormalization
Introduction
Data Anomalies
- Before diving into the main content of this article, we need to understand Data Anomalies, which are logical flaws or data discrepancies that occur in relational databases when proper table normalization has not been applied.
- This leads to information being stored repeatedly in a redundant manner. When data operations (Insert, Update, Delete) are performed, they cause the database to fall into an inconsistent state where data is correct in some places but incorrect in others.
Data Anomalies are divided into 3 main types:
- Insertion Anomaly: Occurs when you cannot add a new record into the database because the system forces you to enter another piece of information that does not yet exist.
- Deletion Anomaly: Occurs when you delete one piece of information but accidentally lose another completely different and important piece of information.
- Update Anomaly: Occurs when a piece of information is repeated across too many rows. As a result, when you edit that information, you must modify it in all rows. If even a single row is missed, data will be contradictory where some places show one thing and other places show another.
Example
Suppose there is a summary table containing information including [Student ID, Student Name, Class ID, Classroom Name].
- Insertion Anomaly: If the school wants to open a new class but no students have registered yet, it is impossible to add a record into the database because doing so requires entering a Student ID as the primary key which does not exist yet since no students have registered.
- Deletion Anomaly: If all participating students in a class drop out, when you delete the corresponding student records, you will realize that you have also completely deleted the information about that class.
- Update Anomaly: If the table storage approach above is maintained:
- When 100 students register for 1 class, the data must store that exact same class information 100 times.
- When updating information for the class, you must also edit all 100 records. If for any reason you miss just 1 record, it will lead to data discrepancies. At that point, students taking the same class will end up with different class information.
You can see that the reason behind these anomaly issues is combining too many different entities into a single table, causing data duplication. To resolve this, we need to apply Normalization.
Normalization
- This is the process of organizing and restructuring tables in a database to minimize data redundancy (repeating the same information) and eliminate anomalies when performing Insert, Update and Delete operations.
- Technically, Normalization achieves this by splitting a large, bulky table into multiple specialized smaller tables and establishing relationships between them using keys (Primary Key, Foreign Key). This process is divided into levels from low to high called Normal Forms.
- The main goal is to ensure Data Integrity so that each piece of information resides in exactly one place.
- It is suitable for Write-Heavy systems like order management systems or banking systems, where data needs to be absolutely accurate when updated.
Normal Form
Normal Form (NF) is a set of rules for calculating table structures, ranking from low to high. The higher the Normal Form your database achieves, the cleaner it becomes with less data duplication. To reach a higher normal form, the data table must satisfy all the conditions of the preceding lower normal forms. Below are the definitions of the popular NFs:
- First Normal Form (1NF): Atomicity
- A table achieves 1NF if and only if each cell (the intersection of a row and a column) contains only a single value, and cannot contain a set, an array or a list of values. At the same time, the table must have a Primary Key to distinguish rows.
- A specific example is that you cannot store multiple phone numbers or multiple item names inside a single cell in the table. Each cell must contain exactly 1 item.
- Second Normal Form (2NF): Full Functional Dependency
- A table achieves 2NF when it meets 1NF and all non-key columns must fully depend on the entire Primary Key. This rule usually applies to tables with a "Composite Primary Key" (a primary key made up of multiple columns combined).
- For example, if the primary key of the table is [order_id + product_id], then a Delivery Address column violates the rule because it only depends on order_id and has nothing to do with product_id (this is called a partial dependency). Therefore, the table needs to be split.
- Third Normal Form (3NF): No Transitive Dependency
- A table achieves 3NF when it meets 2NF and no non-key attribute transitively depends on the primary key. This means if column A determines column B, and column B determines column C, then column C must not reside in the same table as column A.
- For example, in an Employee table, Employee ID determines Department ID, and from Department ID one can deduce Department Name. Thus, Department Name has a transitive dependency. According to 3NF, you must split Department Name into a separate Department table.
- Boyce-Codd Normal Form (BCNF): Also known as 3.5NF
- This is a stricter, upgraded version of 3NF. A table achieves BCNF when it meets 3NF and every attribute acting as a Determinant must be a Superkey.
- BCNF was created to resolve a minor issue in 3NF where a non-key column is still allowed to determine a part of a composite primary key. BCNF blocks this completely, forcing all data dependency relationships to originate from Keys. In addition, there are other higher Normal Forms, but for most practical use cases, using
3NForBCNFis sufficient to meet requirements.
Advantages
- Saves storage space: Information is not stored repeatedly.
- Data Integrity: When changes occur, you only need to modify the data in 1 place across the entire system.
- Easy to extend: The structure is coherent and clear, just like separate building blocks.
Disadvantages
- More complex queries: Querying data in 1 table often does not provide enough information, so you will have to join multiple tables together.
- Read Performance can decrease: Joining too many tables will consume more CPU resources.
Denormalization
- This is the process of intentionally reintroducing data redundancy into a previously normalized database. By combining tables or adding columns that contain duplicate or pre-computed data, Denormalization helps the system increase Query Speed.
- It is not careless database design from the start, but rather a calculated performance optimization strategy. When data is over-normalized, SELECT statements have to JOIN too many tables together, slowing down the system. Denormalization accepts sacrificing storage space and resources during Write operations to exchange for extremely fast speeds during Read operations.
- The main goal is to optimize Read Performance.
- It is suitable for Read-Heavy systems or reporting and data analysis systems (Data Warehouse, Dashboard), where users need to view aggregated information instantly.
Strategic Denormalization
Here are the strategies commonly used for denormalization:
- Storing Pre-computed Values: Examples include the total amount of an order, the total number of posts by a user or the average score of a student.
- Table Flattening: If the application always calls User along with Address, instead of splitting them into 2 tables, you should merge the address components (House number, Street, District) directly into the User table so that a table JOIN is not needed.
- Using Materialized Views in Postgres: This is an excellent feature. You still keep the database in a proper 3NF form, but you create a "Materialized View" (a form of cached table) to pre-store the results of that complex JOIN statement. You can also refresh to update the view as needed.
The problems
When choosing Strategic Denormalization, you must face Write Performance & Complexity issues such as:
- More complex code: Storing the same data in multiple places means that when you Insert/Update/Delete, you must implement logic to synchronize data to other places to avoid data consistency errors.
- Slower write operations: Due to having to sync data in multiple places, this process brings along updates for indexes or even triggers, making it slower than just writing to a single table as usual.
Detail
The following is a Database Architecture design for an eCommerce system, applying Normalization and Denormalization as follows:
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
email VARCHAR(100) UNIQUE NOT NULL,
phone VARCHAR(15) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE products (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
description TEXT,
category_id INT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE product_variants (
id BIGSERIAL PRIMARY KEY,
product_id BIGINT REFERENCES products(id),
sku VARCHAR(100) UNIQUE NOT NULL,
variation_name VARCHAR(100) NOT NULL,
price DECIMAL(12, 2) NOT NULL,
stock INT NOT NULL CHECK (stock >= 0)
);
CREATE TABLE carts (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users(id) UNIQUE,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE cart_items (
id BIGSERIAL PRIMARY KEY,
cart_id BIGINT REFERENCES carts(id) ON DELETE CASCADE,
variant_id BIGINT REFERENCES product_variants(id),
quantity INT NOT NULL CHECK (quantity > 0),
UNIQUE(cart_id, variant_id)
);
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT REFERENCES users(id),
status VARCHAR(20) NOT NULL,
total_price DECIMAL(12, 2) NOT NULL,
shipping_address TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE order_items (
id BIGSERIAL PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
variant_id BIGINT REFERENCES product_variants(id),
quantity INT NOT NULL,
price_at_purchase DECIMAL(12, 2) NOT NULL
);
Normalization
- Applying
1NF: Atomicity- Create a separate product_variants table to store information about variants and stock quantities, instead of storing everything inside a single row in the
productstable.
- Create a separate product_variants table to store information about variants and stock quantities, instead of storing everything inside a single row in the
- Applying
2NF: Eliminating partial dependency- The
cart_itemstable is an intermediary table that resolves the many-to-many relationship between carts andproduct_variants. It only contains the key pair (cart_id, variant_id) and quantity as shared information, without columns containing specific details of the cart orproduct_variantsbecause that would violate partial dependency.
- The
- Applying
3NF: Eliminating transitive dependency- The
product_variantstable hasproduct_idbut does not contain specific data about the product. - The
cartsandorderstables both haveuser_idbut do not contain specific data about the user. - Designing this way avoids storing redundant data and transitive dependencies.
- The
BCNF- According to the schema above, once
3NFis achieved,BCNFis also automatically met because there is no regular column that can determine back a part of a composite primary key. - Tables like
cart_itemsandorder_itemswhich have composite primary keys do not violateBCNF.
- According to the schema above, once
Strategic Denormalization
- Historical Snapshot
- The
orderstable adds theshipping_addresscolumn (according to3NF,addressshould belong to theuserstable). - The
order_itemstable adds theprice_at_purchasecolumn (according to 3NF, price should belong to theproduct_variantstable). - If
Denormalizationis not used, any future changes in product prices or users changing their delivery addresses will cause past purchase orders to show incorrect price and delivery address information (they would all be changed to the new information). - Applying
Denormalizationthis way will preserve the product price and delivery address information exactly at the moment the user placed the order. This accepts data redundancy but protects the correctness of financial history.
- The
- Pre-computed
- The
orderstable adds thetotal_pricecolumn to store the total order amount calculated right when the user places the order. - Because the system must have a feature for users to review their purchase history along with the price of each order.
- If it follows 3NF and stores information separately, every time a query is made for this feature, the system has to recalculate the total price of each order for the user.
- Calculating this value once beforehand and storing it for later use will help reduce calculation costs for subsequent queries.
- The
Happy coding!
Comments
Post a Comment