Data Normalization

Data normalization is a database design technique used to organize data in a database to reduce data redundancy and improve data integrity. It involves structuring tables in a way that each piece of data is stored in only one place, preventing inconsistencies and simplifying data management.

Why Normalize?

The Normal Forms

Normalization is achieved through a series of guidelines called normal forms. The most commonly used are the first three normal forms (1NF, 2NF, 3NF).

First Normal Form (1NF)

A table is in 1NF if:

Example of a non-1NF table:

OrderID CustomerName Products
101 Alice Smith Laptop, Mouse, Keyboard
102 Bob Johnson Monitor

In this example, the Products column contains multiple values. To achieve 1NF, we would typically split this into multiple rows or a separate related table.

Second Normal Form (2NF)

A table is in 2NF if it is in 1NF and:

Consider a table storing order items. If the primary key is a composite of OrderID and ProductID, any attribute that depends only on ProductID (like ProductName) would violate 2NF if placed in this table.

Example: If we have an OrderItems table with OrderID, ProductID, Quantity, ProductName, and ProductPrice.

To achieve 2NF, we would split this into two tables:

  1. Orders Table: OrderID (PK), OrderDate, CustomerID
  2. OrderItems Table: OrderID (PK, FK), ProductID (PK, FK), Quantity
  3. Products Table: ProductID (PK), ProductName, ProductPrice

Third Normal Form (3NF)

A table is in 3NF if it is in 2NF and:

Example: Consider a Customers table with CustomerID (PK), CustomerName, City, and State.

If the State is determined by the City (e.g., all cities in 'CA' are in California), then State is transitively dependent on CustomerID through City.

To achieve 3NF, we would split this into two tables:

  1. Customers Table: CustomerID (PK), CustomerName, CityID (FK)
  2. Cities Table: CityID (PK), CityName, State

Glossary

Atomic Value
A value that cannot be further divided into smaller, meaningful parts.
Data Integrity
The accuracy, consistency, and reliability of data stored in a database.
Data Redundancy
The unnecessary duplication of data in a database.
Primary Key
A column or set of columns that uniquely identifies each row in a table.
Foreign Key
A column or set of columns in one table that refers to the primary key in another table, establishing a link between them.
Composite Primary Key
A primary key that consists of two or more columns.
Functional Dependency
A relationship between two sets of attributes in a table, where one set uniquely determines the other.
Transitive Dependency
A dependency where a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.