Last updated: October 26, 2023
Database normalization is a systematic approach to designing relational databases that reduces data redundancy and improves data integrity. It organizes columns and tables so that the database contains less repetition and is easier to maintain.
Normalization structures a database according to a series of so-called "normal forms", and it is a key concept in relational database design. Its primary goals are to reduce data redundancy and to improve data integrity.
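Without normalization, databases can suffer from several anomalies: insertion anomalies (a fact cannot be recorded without also supplying unrelated data), update anomalies (the same fact is stored in many rows, and changing only some of them leaves the data inconsistent), and deletion anomalies (deleting one fact unintentionally erases another).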
Normalization helps prevent these anomalies by breaking down large tables into smaller, more manageable ones and defining relationships between them.
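As an illustration, an update anomaly can be reproduced in a few lines. This is only a sketch using Python's built-in `sqlite3` module; the table name `order_lines` and the new name "Alice Jones" are made up for the example.

```python
import sqlite3

# Hypothetical unnormalized table: the customer name is repeated on every order line.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines (order_id INTEGER, customer_name TEXT, product TEXT)"
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?)",
    [(101, "Alice Smith", "Widget A"), (101, "Alice Smith", "Gadget B")],
)

# Update anomaly: the name is changed on only one of the two rows for order 101.
conn.execute(
    "UPDATE order_lines SET customer_name = 'Alice Jones' "
    "WHERE order_id = 101 AND product = 'Widget A'"
)

# The same order now reports two different customer names.
names = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT customer_name FROM order_lines "
        "WHERE order_id = 101 ORDER BY customer_name"
    )
]
print(names)
```

If the customer name were stored once per order, the partial update above could not leave the data in a contradictory state.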
Normalization typically proceeds through several "normal forms". The most commonly used are the First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). Higher normal forms exist, but 3NF is sufficient for most practical applications.
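A relation is in 1NF if and only if the domain of each attribute contains only atomic (indivisible) values and each attribute holds a single value from that domain. In simpler terms, every column holds one value per row, and there are no repeating groups or nested lists inside a cell.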
Unnormalized Table (Violates 1NF):
| OrderID | CustomerName | Items (Quantity, Product) |
|---|---|---|
| 101 | Alice Smith | (2, Widget A), (1, Gadget B) |
Table in 1NF:
This requires breaking down the repeating group into separate rows or tables.
| OrderID | ItemSequence | Quantity | Product | CustomerName |
|---|---|---|---|---|
| 101 | 1 | 2 | Widget A | Alice Smith |
| 101 | 2 | 1 | Gadget B | Alice Smith |
Note: Even in 1NF, there's still redundancy with CustomerName being repeated for each order item.
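The flattening step shown in the tables above can be sketched in plain Python. The function name `to_first_normal_form` and the list-of-dicts representation are assumptions made for this illustration, not part of any library.

```python
# The unnormalized order: one row whose "Items" cell holds a repeating group.
unnormalized = [
    {
        "OrderID": 101,
        "CustomerName": "Alice Smith",
        "Items": [(2, "Widget A"), (1, "Gadget B")],
    },
]


def to_first_normal_form(orders):
    """Emit one row per item so that every column holds a single atomic value."""
    rows = []
    for order in orders:
        # ItemSequence numbers the items within each order, starting at 1.
        for sequence, (quantity, product) in enumerate(order["Items"], start=1):
            rows.append(
                {
                    "OrderID": order["OrderID"],
                    "ItemSequence": sequence,
                    "Quantity": quantity,
                    "Product": product,
                    "CustomerName": order["CustomerName"],
                }
            )
    return rows


rows_1nf = to_first_normal_form(unnormalized)
for row in rows_1nf:
    print(row)
```

Each resulting row is identified by the pair (OrderID, ItemSequence), and CustomerName is copied onto every row, which is the redundancy the note above points out.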
A relation is in 2NF if it is in 1NF and every non-prime attribute is fully functionally dependent on every candidate key.
This means that if a table has a composite primary key (a key made up of two or more columns), no non-key attribute can be dependent on only *part* of that composite key.
Consider the table with the composite key (OrderID, ItemSequence) from the 1NF example above, and suppose we replace Product with a ProductID and also store a ProductDescription:
| OrderID (PK) | ItemSequence (PK) | Quantity | ProductID | ProductDescription | CustomerName |
|---|---|---|---|---|---|
| 101 | 1 | 2 | WIDGETA | High-quality widget | Alice Smith |
| 101 | 2 | 1 | GADGETB | Advanced gadget | Alice Smith |
| 102 | 1 | 3 | WIDGETA | High-quality widget | Bob Johnson |
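Here, CustomerName depends only on OrderID, which is just one part of the composite key (OrderID, ItemSequence). This partial dependency violates 2NF. ProductDescription depends only on ProductID, which is a non-key attribute rather than part of the key; strictly speaking that is a transitive dependency, the subject of 3NF, but it causes the same kind of redundancy and is removed in the same decomposition.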
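Dependencies like these can be checked against the sample rows. The helper below, with the hypothetical name `holds`, is a minimal sketch: it can show that sample data contradicts a functional dependency, but it cannot prove that a dependency holds for all possible data, since that is a statement about the business rules.

```python
def holds(rows, determinant, dependent):
    """Return True if no two rows agree on `determinant` but differ on `dependent`."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in determinant)
        value = row[dependent]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


columns = ["OrderID", "ItemSequence", "Quantity", "ProductID",
           "ProductDescription", "CustomerName"]
data = [
    (101, 1, 2, "WIDGETA", "High-quality widget", "Alice Smith"),
    (101, 2, 1, "GADGETB", "Advanced gadget", "Alice Smith"),
    (102, 1, 3, "WIDGETA", "High-quality widget", "Bob Johnson"),
]
rows = [dict(zip(columns, values)) for values in data]

# CustomerName is determined by OrderID alone: a dependency on part of the key.
partial = holds(rows, ["OrderID"], "CustomerName")
# Quantity needs the whole key; OrderID alone is not enough.
quantity_by_order = holds(rows, ["OrderID"], "Quantity")
quantity_by_key = holds(rows, ["OrderID", "ItemSequence"], "Quantity")
# ProductDescription is determined by ProductID.
description_by_product = holds(rows, ["ProductID"], "ProductDescription")

print(partial, quantity_by_order, quantity_by_key, description_by_product)
```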
Table in 2NF:
We split this into multiple tables:
Orders Table:
| OrderID (PK) | CustomerName |
|---|---|
| 101 | Alice Smith |
| 102 | Bob Johnson |
OrderItems Table:
| OrderID (FK, PK) | ItemSequence (PK) | Quantity | ProductID (FK) |
|---|---|---|---|
| 101 | 1 | 2 | WIDGETA |
| 101 | 2 | 1 | GADGETB |
| 102 | 1 | 3 | WIDGETA |
Products Table:
| ProductID (PK) | ProductDescription |
|---|---|
| WIDGETA | High-quality widget |
| GADGETB | Advanced gadget |
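Now every non-key attribute depends on the whole key of its own table: CustomerName on OrderID in Orders, Quantity and ProductID on (OrderID, ItemSequence) in OrderItems, and ProductDescription on ProductID in Products. Each table satisfies 2NF.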
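One possible way to declare these three tables is sketched below with Python's built-in `sqlite3` module. The column types and `NOT NULL` constraints are assumptions, since the tables above do not specify them. The final query joins the tables back together to show that the decomposition loses no information.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Orders (
    OrderID      INTEGER PRIMARY KEY,
    CustomerName TEXT NOT NULL
);
CREATE TABLE Products (
    ProductID          TEXT PRIMARY KEY,
    ProductDescription TEXT NOT NULL
);
CREATE TABLE OrderItems (
    OrderID      INTEGER NOT NULL REFERENCES Orders(OrderID),
    ItemSequence INTEGER NOT NULL,
    Quantity     INTEGER NOT NULL,
    ProductID    TEXT    NOT NULL REFERENCES Products(ProductID),
    PRIMARY KEY (OrderID, ItemSequence)
);
""")

conn.executemany("INSERT INTO Orders VALUES (?, ?)",
                 [(101, "Alice Smith"), (102, "Bob Johnson")])
conn.executemany("INSERT INTO Products VALUES (?, ?)",
                 [("WIDGETA", "High-quality widget"), ("GADGETB", "Advanced gadget")])
conn.executemany("INSERT INTO OrderItems VALUES (?, ?, ?, ?)",
                 [(101, 1, 2, "WIDGETA"), (101, 2, 1, "GADGETB"), (102, 1, 3, "WIDGETA")])

# Joining the three tables reproduces the rows of the original wide table.
joined = conn.execute("""
    SELECT oi.OrderID, oi.ItemSequence, oi.Quantity, oi.ProductID,
           p.ProductDescription, o.CustomerName
    FROM OrderItems AS oi
    JOIN Orders   AS o ON o.OrderID   = oi.OrderID
    JOIN Products AS p ON p.ProductID = oi.ProductID
    ORDER BY oi.OrderID, oi.ItemSequence
""").fetchall()
for row in joined:
    print(row)
```

With the foreign keys enabled, an order item that refers to a missing order or product is rejected with `sqlite3.IntegrityError`.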
A relation is in 3NF if it is in 2NF and every non-prime attribute is non-transitively dependent on every candidate key.
This means that non-key attributes should not be dependent on other non-key attributes. A transitive dependency exists when a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.
Consider a simplified Customers table:
| CustomerID (PK) | CustomerName | City | State | StateCapital |
|---|---|---|---|---|
| C1001 | Alice Smith | New York | NY | Albany |
| C1002 | Bob Johnson | Los Angeles | CA | Sacramento |
| C1003 | Charlie Brown | Albany | NY | Albany |
Here, StateCapital (Albany) depends on State (NY), and State depends on the primary key CustomerID. This is a transitive dependency: CustomerID -> State -> StateCapital. This violates 3NF.
Table in 3NF:
We split this into two tables:
Customers Table:
| CustomerID (PK) | CustomerName | City | State (FK) |
|---|---|---|---|
| C1001 | Alice Smith | New York | NY |
| C1002 | Bob Johnson | Los Angeles | CA |
| C1003 | Charlie Brown | Albany | NY |
States Table:
| State (PK) | StateCapital |
|---|---|
| NY | Albany |
| CA | Sacramento |
Now, StateCapital is directly dependent on the primary key State in the States table, and State is directly dependent on the primary key CustomerID in the Customers table. There are no transitive dependencies, satisfying 3NF.
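The same split can be sketched in plain Python. The function name `decompose` is hypothetical, and the consistency check inside it is an added assumption about how conflicting input should be handled.

```python
customers = [
    {"CustomerID": "C1001", "CustomerName": "Alice Smith", "City": "New York",
     "State": "NY", "StateCapital": "Albany"},
    {"CustomerID": "C1002", "CustomerName": "Bob Johnson", "City": "Los Angeles",
     "State": "CA", "StateCapital": "Sacramento"},
    {"CustomerID": "C1003", "CustomerName": "Charlie Brown", "City": "Albany",
     "State": "NY", "StateCapital": "Albany"},
]


def decompose(rows):
    """Split rows into a Customers list without StateCapital and a States mapping."""
    states = {}
    slim_customers = []
    for row in rows:
        # Record each state's capital once; reject rows that contradict it.
        capital = states.setdefault(row["State"], row["StateCapital"])
        if capital != row["StateCapital"]:
            raise ValueError(f"conflicting capitals for state {row['State']}")
        slim_customers.append(
            {key: value for key, value in row.items() if key != "StateCapital"}
        )
    return slim_customers, states


customers_3nf, states = decompose(customers)

# Joining on State rebuilds the original rows, so nothing was lost.
rejoined = [{**c, "StateCapital": states[c["State"]]} for c in customers_3nf]
print(states)
print(rejoined == customers)
```

After the split, each state's capital is stored in exactly one place, so correcting it means changing a single entry rather than every customer row in that state.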
While normalization is crucial for design, sometimes it's necessary to denormalize a database for performance reasons. This involves strategically reintroducing some redundancy to speed up read operations, particularly in data warehousing or reporting scenarios where complex joins can be slow. However, denormalization should be done cautiously and with a clear understanding of the trade-offs.
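The trade-off can be seen in a small sketch, again using Python's built-in `sqlite3` module. The report table `OrderItemReport`, the helper `rebuild_report`, and the new description "Premium widget" are all hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID TEXT PRIMARY KEY, ProductDescription TEXT);
CREATE TABLE OrderItems (
    OrderID INTEGER, ItemSequence INTEGER, Quantity INTEGER, ProductID TEXT,
    PRIMARY KEY (OrderID, ItemSequence)
);
INSERT INTO Products VALUES
    ('WIDGETA', 'High-quality widget'), ('GADGETB', 'Advanced gadget');
INSERT INTO OrderItems VALUES
    (101, 1, 2, 'WIDGETA'), (101, 2, 1, 'GADGETB'), (102, 1, 3, 'WIDGETA');
""")


def rebuild_report(connection):
    """Recreate the denormalized report table from the normalized tables."""
    connection.executescript("""
    DROP TABLE IF EXISTS OrderItemReport;
    CREATE TABLE OrderItemReport AS
        SELECT oi.OrderID, oi.ItemSequence, oi.Quantity, p.ProductDescription
        FROM OrderItems AS oi
        JOIN Products AS p ON p.ProductID = oi.ProductID;
    """)


def count_rows(connection, description):
    """Count report rows that carry the given product description."""
    return connection.execute(
        "SELECT COUNT(*) FROM OrderItemReport WHERE ProductDescription = ?",
        (description,),
    ).fetchone()[0]


rebuild_report(conn)  # reads now need no join
conn.execute(
    "UPDATE Products SET ProductDescription = 'Premium widget' "
    "WHERE ProductID = 'WIDGETA'"
)
stale = count_rows(conn, "High-quality widget")  # the report still has the old text
rebuild_report(conn)
fresh = count_rows(conn, "Premium widget")  # the refresh picks up the change
print(stale, fresh)
```

Reads from the report table avoid the join, but the copy goes stale whenever the source tables change, so some refresh step has to be scheduled or triggered.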
Database normalization is a fundamental principle for building robust, maintainable, and efficient relational databases. By understanding and applying normal forms, developers can significantly reduce data redundancy, improve data integrity, and avoid common database anomalies.
For further reading, explore the concepts of Boyce-Codd Normal Form (BCNF) and higher normal forms, as well as techniques for practical database design and performance tuning.