Data Normalization
Data normalization is a database design technique used to organize data in a database to reduce data redundancy and improve data integrity. It involves structuring tables in a way that each piece of data is stored in only one place, preventing inconsistencies and simplifying data management.
Why Normalize?
- Reduce Redundancy: Storing the same information multiple times leads to wasted storage space and potential inconsistencies.
- Improve Data Integrity: By eliminating redundancy, changes to data only need to be made in one location, ensuring accuracy.
- Simplify Updates and Deletions: Operations become more straightforward when data is not duplicated across multiple records.
- Make Database More Flexible: A normalized database is easier to extend and modify as requirements change.
The Normal Forms
Normalization is achieved through a series of guidelines called normal forms. The most commonly used are the first three normal forms (1NF, 2NF, 3NF).
First Normal Form (1NF)
A table is in 1NF if:
- Each column contains atomic (indivisible) values.
- Each row is unique.
- There are no repeating groups of columns.
Example of a non-1NF table:
| OrderID | CustomerName | Products |
|---|---|---|
| 101 | Alice Smith | Laptop, Mouse, Keyboard |
| 102 | Bob Johnson | Monitor |
In this example, the Products column contains multiple values. To achieve 1NF, we would typically split this into multiple rows or a separate related table.
Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and:
- All non-key attributes are fully functionally dependent on the entire primary key. This primarily applies to tables with composite primary keys.
Consider a table storing order items. If the primary key is a composite of OrderID and ProductID, any attribute that depends only on ProductID (like ProductName) would violate 2NF if placed in this table.
Example: If we have an OrderItems table with OrderID, ProductID, Quantity, ProductName, and ProductPrice.
Quantitydepends on bothOrderIDandProductID.ProductNameandProductPricedepend only onProductID.
To achieve 2NF, we would split this into two tables:
- Orders Table:
OrderID(PK),OrderDate,CustomerID - OrderItems Table:
OrderID(PK, FK),ProductID(PK, FK),Quantity - Products Table:
ProductID(PK),ProductName,ProductPrice
Third Normal Form (3NF)
A table is in 3NF if it is in 2NF and:
- There are no transitive dependencies. A transitive dependency exists when a non-key attribute depends on another non-key attribute.
Example: Consider a Customers table with CustomerID (PK), CustomerName, City, and State.
If the State is determined by the City (e.g., all cities in 'CA' are in California), then State is transitively dependent on CustomerID through City.
To achieve 3NF, we would split this into two tables:
- Customers Table:
CustomerID(PK),CustomerName,CityID(FK) - Cities Table:
CityID(PK),CityName,State
Glossary
- Atomic Value
- A value that cannot be further divided into smaller, meaningful parts.
- Data Integrity
- The accuracy, consistency, and reliability of data stored in a database.
- Data Redundancy
- The unnecessary duplication of data in a database.
- Primary Key
- A column or set of columns that uniquely identifies each row in a table.
- Foreign Key
- A column or set of columns in one table that refers to the primary key in another table, establishing a link between them.
- Composite Primary Key
- A primary key that consists of two or more columns.
- Functional Dependency
- A relationship between two sets of attributes in a table, where one set uniquely determines the other.
- Transitive Dependency
- A dependency where a non-key attribute depends on another non-key attribute, which in turn depends on the primary key.