Normalization Examples
Normalization is a database design technique used to organize data in a database. Its goals are to reduce data redundancy and improve data integrity.
Understanding Normal Forms
Normalization involves a series of guidelines called normal forms. The most common ones are:
- First Normal Form (1NF): Each column must contain atomic values, and each record must be unique.
- Second Normal Form (2NF): Must be in 1NF, and all non-key attributes must be fully dependent on the primary key.
- Third Normal Form (3NF): Must be in 2NF, and all non-key attributes must not be transitively dependent on the primary key.
Example: De-normalizing a Student Table
Consider a table that stores student information and their enrolled courses. Without normalization, it might look like this:
CREATE TABLE StudentCourseEnrollment (
StudentID INT,
StudentName VARCHAR(100),
StudentAddress VARCHAR(255),
CourseID VARCHAR(10),
CourseName VARCHAR(100),
InstructorName VARCHAR(100),
PRIMARY KEY (StudentID, CourseID)
);
INSERT INTO StudentCourseEnrollment (StudentID, StudentName, StudentAddress, CourseID, CourseName, InstructorName) VALUES
(101, 'Alice Smith', '123 Main St', 'CS101', 'Introduction to Programming', 'Dr. Evans'),
(101, 'Alice Smith', '123 Main St', 'MA201', 'Calculus II', 'Prof. Davis'),
(102, 'Bob Johnson', '456 Oak Ave', 'CS101', 'Introduction to Programming', 'Dr. Evans'),
(103, 'Charlie Brown', '789 Pine Ln', 'PH101', 'General Physics', 'Dr. Lee');
This table has several issues:
- Redundancy: Student name and address are repeated for each course a student enrolls in. Course name and instructor name are repeated for each student enrolled in that course.
- Update Anomalies: If a student's address changes, it needs to be updated in multiple rows.
- Deletion Anomalies: If a student enrolls in only one course and then drops it, their name and address might be lost entirely if not handled carefully.
Applying Normal Forms
1NF: Atomic Values and Unique Records
The initial table is already in 1NF as each cell contains a single value, and the primary key (StudentID, CourseID) ensures uniqueness. However, the redundancy is apparent.
2NF: Removing Partial Dependencies
To achieve 2NF, we identify attributes dependent on only part of the primary key. In our case, `CourseName` and `InstructorName` depend only on `CourseID`. `StudentName` and `StudentAddress` depend only on `StudentID`.
We can split this into three tables:
-- Students Table
CREATE TABLE Students (
StudentID INT PRIMARY KEY,
StudentName VARCHAR(100),
StudentAddress VARCHAR(255)
);
-- Courses Table
CREATE TABLE Courses (
CourseID VARCHAR(10) PRIMARY KEY,
CourseName VARCHAR(100),
InstructorName VARCHAR(100)
);
-- Enrollment Table
CREATE TABLE Enrollments (
StudentID INT,
CourseID VARCHAR(10),
PRIMARY KEY (StudentID, CourseID),
FOREIGN KEY (StudentID) REFERENCES Students(StudentID),
FOREIGN KEY (CourseID) REFERENCES Courses(CourseID)
);
Data after 2NF:
-- Students
INSERT INTO Students (StudentID, StudentName, StudentAddress) VALUES
(101, 'Alice Smith', '123 Main St'),
(102, 'Bob Johnson', '456 Oak Ave'),
(103, 'Charlie Brown', '789 Pine Ln');
-- Courses
INSERT INTO Courses (CourseID, CourseName, InstructorName) VALUES
('CS101', 'Introduction to Programming', 'Dr. Evans'),
('MA201', 'Calculus II', 'Prof. Davis'),
('PH101', 'General Physics', 'Dr. Lee');
-- Enrollments
INSERT INTO Enrollments (StudentID, CourseID) VALUES
(101, 'CS101'),
(101, 'MA201'),
(102, 'CS101'),
(103, 'PH101');
3NF: Removing Transitive Dependencies
The tables are now in 2NF. Let's consider if there are any transitive dependencies. In our current setup, there are none. For instance, in the Students
table, StudentName
and StudentAddress
are directly dependent on StudentID
. In the Courses
table, CourseName
and InstructorName
are directly dependent on CourseID
.
A common example of a transitive dependency might be if we had an InstructorID
in the Courses
table, and also an InstructorName
and InstructorOffice
in a separate Instructors
table. If InstructorName
and InstructorOffice
were directly in the Courses
table, they would be transitively dependent on CourseID
(via InstructorID
).
Example of a transitive dependency that would require further normalization (not present in our current 2NF example):
If Courses
table had CourseID, CourseName, InstructorName, InstructorOffice
, and InstructorName
and InstructorOffice
were also in a separate Instructors
table linked by InstructorID
. Then InstructorName
and InstructorOffice
would be transitively dependent on CourseID
.
To achieve 3NF in such a scenario, we'd further split the Courses
table.
Benefits of Normalization
- Reduced Redundancy: Saves storage space and prevents inconsistencies.
- Improved Data Integrity: Updates, insertions, and deletions are simpler and less error-prone.
- Easier Maintenance: Changes to the database schema are less impactful.
- More Flexible Design: Easier to extend the database with new features.
While normalization is crucial for good database design, in some high-performance scenarios, a degree of de-normalization might be considered for read-heavy workloads to optimize query performance, but this should be a conscious decision made after understanding the trade-offs.