SQL Database Normalization - MSDN Documentation

Introduction to Database Normalization

Database normalization is a systematic process of organizing data in a database. The goals of normalization include:

Eliminating redundant data (duplication).
Ensuring data dependencies make sense, i.e., data is logically stored.
Improving data integrity.
Making the database more flexible and easier to maintain.

Normalization involves dividing larger tables into smaller tables and defining relationships between them. This process is guided by a series of rules called normal forms.

Why Normalize Your Database?

Normalizing a database helps prevent data anomalies, which are inconsistencies that can arise during data insertion, update, or deletion. Common anomalies include:

Insertion Anomaly: Difficulty in adding new data because some required data is missing or not yet applicable.
Update Anomaly: Needing to update the same piece of information in multiple places, leading to inconsistencies if not all instances are updated.
Deletion Anomaly: Accidentally deleting desired data when deleting unrelated data.

By organizing data efficiently, normalization reduces storage space and improves query performance, though extreme normalization can sometimes lead to performance overhead due to complex joins.

Understanding Functional Dependencies

Normalization relies heavily on the concept of functional dependencies. A functional dependency exists when the value of one attribute (or a set of attributes) determines the value of another attribute. We denote this as X → Y, meaning X determines Y.

For example, in a table of students, StudentID → StudentName. If you know the StudentID, you can determine the StudentName.

Example of Functional Dependency

Consider a table with EmployeeID, EmployeeName, and DepartmentName.

EmployeeID → EmployeeName (Each employee ID uniquely identifies an employee name)
EmployeeID → DepartmentName (Each employee ID uniquely identifies their department name)

However, if multiple employees are in the same department, DepartmentName alone does not determine EmployeeID or EmployeeName.

The Normal Forms

The most commonly used normal forms are 1NF, 2NF, and 3NF. Higher normal forms exist (BCNF, 4NF, 5NF, DKNF) but are less frequently encountered in everyday practice.

First Normal Form (1NF)

A table is in 1NF if:

It contains only atomic values (no repeating groups or multi-valued attributes within a single cell).
Each column has a unique name.
Each row is unique.

In simpler terms, each cell in the table should contain a single piece of data, and there should be no repeating columns for the same type of data.

Example: Not in 1NF

A table storing customer orders:

OrderID	CustomerID	Products
101	C001	Laptop, Mouse
102	C002	Keyboard

Here, the Products column contains multiple values, violating 1NF.

Example: In 1NF

To achieve 1NF, we can split the data into two tables:

Orders Table:

OrderID	CustomerID
101	C001
102	C002

OrderItems Table:

OrderItemID	OrderID	Product
1	101	Laptop
2	101	Mouse
3	102	Keyboard

Second Normal Form (2NF)

A table is in 2NF if:

It is in 1NF.
All non-key attributes are fully functionally dependent on the primary key. This means that if the primary key is a composite key (made up of multiple columns), no non-key attribute should be dependent on only *part* of the composite key.

2NF applies only to tables with composite primary keys.

Example: Not in 2NF

Consider a table storing course enrollments:

StudentID	CourseID	StudentName	CourseName	EnrollmentDate
S001	CS101	Alice	Introduction to CS	2023-09-01
S002	MATH203	Bob	Calculus III	2023-09-01
S001	PHYS101	Alice	University Physics	2023-09-02

Here, the primary key is the composite key (StudentID, CourseID).

StudentName is dependent only on StudentID (partial dependency).
CourseName is dependent only on CourseID (partial dependency).
EnrollmentDate is dependent on both StudentID and CourseID (full dependency).

This violates 2NF due to partial dependencies.

Example: In 2NF

Split into three tables:

Students Table:

StudentID	StudentName
S001	Alice
S002	Bob

Courses Table:

CourseID	CourseName
CS101	Introduction to CS
MATH203	Calculus III
PHYS101	University Physics

Enrollments Table:

StudentID	CourseID	EnrollmentDate
S001	CS101	2023-09-01
S002	MATH203	2023-09-01
S001	PHYS101	2023-09-02

Third Normal Form (3NF)

A table is in 3NF if:

It is in 2NF.
There are no transitive dependencies. A transitive dependency exists when a non-key attribute is dependent on another non-key attribute. In other words, if A → B and B → C, and B is not a candidate key, then C is transitively dependent on A.

3NF aims to remove data that is not directly dependent on the primary key.

Example: Not in 3NF

Consider a table storing employee information and their department details:

EmployeeID	EmployeeName	DepartmentID	DepartmentName	DepartmentLocation
E101	Alice Smith	D1	Sales	New York
E102	Bob Johnson	D2	Marketing	London
E103	Charlie Brown	D1	Sales	New York

Primary key: EmployeeID.

Dependencies:

EmployeeID → EmployeeName
EmployeeID → DepartmentID
DepartmentID → DepartmentName
DepartmentID → DepartmentLocation

We have a transitive dependency: EmployeeID → DepartmentID and DepartmentID → DepartmentName. This means DepartmentName is transitively dependent on EmployeeID via DepartmentID. If the DepartmentLocation of 'Sales' changes, we'd have to update it for both E101 and E103.

Example: In 3NF

Split into two tables:

Employees Table:

EmployeeID	EmployeeName	DepartmentID
E101	Alice Smith	D1
E102	Bob Johnson	D2
E103	Charlie Brown	D1

Departments Table:

DepartmentID	DepartmentName	DepartmentLocation
D1	Sales	New York
D2	Marketing	London

Higher Normal Forms

Boyce-Codd Normal Form (BCNF): A stricter version of 3NF. A relation is in BCNF if for every non-trivial functional dependency X → Y, X is a superkey.

Fourth Normal Form (4NF): Deals with multi-valued dependencies. If a table has multiple independent multi-valued attributes, they should be separated into different tables.

Fifth Normal Form (5NF): Deals with join dependencies, ensuring that if a table can be reconstructed by joining it with other tables, it's in 5NF.

These higher normal forms are important for complex database designs, especially in theoretical contexts or for specific advanced applications.

Denormalization: When and Why?

While normalization aims to reduce redundancy, sometimes it can lead to performance issues due to the need for numerous joins to retrieve data. Denormalization is the process of intentionally introducing redundancy back into a database design, typically by combining tables or adding calculated fields, to improve read performance.

Denormalization should be approached cautiously and only after careful analysis, as it can reintroduce data anomalies if not managed properly.

Common reasons for denormalization include:

Improving query speed for frequently accessed, complex reports.
Reducing the number of joins required for common operations.
Simplifying certain application logic.

Practical Tips for Normalization

Start with 3NF: Aiming for 3NF is generally a good balance between data integrity and complexity.
Identify Keys and Dependencies: Thoroughly understand your data and its relationships to correctly identify primary keys and functional dependencies.
Use Tools: Database design tools can assist in visualizing schemas and identifying potential normalization issues.
Document Your Design: Keep clear documentation of your tables, columns, relationships, and the normalization level achieved.
Consider Performance: If performance becomes an issue, evaluate whether denormalization is a suitable solution, but always prioritize data integrity first.