databases
How to Normalize Databases (I)
author
Luis Paredes
published
Jun 29, 2023
Database normalization is a crucial concept that every full-stack or backend web developer should be familiar with. It provides a framework for organizing and structuring data efficiently, leading to improved data integrity and system performance.
In this two-part blog series, we will explore the fundamentals of database normalization, its benefits, and the various normal forms that define progressive levels of data organization.
By understanding the principles of database normalization, developers can design and maintain well-structured databases that support robust and scalable applications. Let's dive in!
What is normalization and why does it matter?
Database normalization is the process of structuring a database in a way that minimizes data redundancy and dependency issues. It aims to eliminate anomalies and inconsistencies in data by organizing it into logical and efficient structures.
The primary objective of normalization is to ensure data integrity, improve database performance, and simplify data management and maintenance. By following specific rules and guidelines, known as normal forms, developers can eliminate data duplication and establish relationships between tables.
Normalization helps in achieving efficient data storage and retrieval, reducing data anomalies such as update anomalies, insertion anomalies, and deletion anomalies. It also facilitates scalability and flexibility, allowing for easier modifications and expansions of the database structure.
The normal forms
The normal forms serve as guidelines for structuring databases in a normalized and efficient manner. They provide a progressive framework (each step builds on top of the previous one) that helps eliminate data redundancy and dependency issues.
In this section, we will explore the different normal forms, including First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Boyce-Codd Normal Form (BCNF), Fourth Normal Form (4NF), and Fifth Normal Form (5NF). Understanding these normal forms is essential for designing robust and well-optimized database schemas.
First Normal Form (1NF)
The First Normal Form (1NF) is the initial step in the normalization process. It focuses on eliminating repeating groups and ensuring that each attribute in a table contains only atomic values of the same type. In 1NF, data is organized in a way that avoids storing multiple values within a single cell, and avoids cells containing different data types if they are in the same column.
For example, consider a "Books" table. In 1NF, instead of having a single column for "Authors" that stores multiple author names separated by commas, we would create a separate table for authors and establish a relationship between the "Books" table and the "Authors" table. This ensures that each author is represented as a distinct value, promoting data integrity and allowing for easier data retrieval and manipulation.
On the other hand, if we consider a "Name" column for the "Author" table, it should only contain names, not numbers, not binary data, nor any other data type.
Second Normal Form (2NF)
In the context of database normalization, a dependency refers to a relationship between attributes within a table. A dependency occurs when the value of one attribute determines the value of another attribute.
Specifically, a partial dependency arises when an attribute depends on only a portion of the table's primary key, rather than the entire key, with the primary key being either a combination of cell values that guarantees unique identification of the record or a single cell that guarantees uniqueness. This means that a non-key attribute is functionally dependent on only a subset of the primary key attributes.
For instance, let's consider a "Student Grades" table with columns for "Student ID," "Course ID," "Course Name," and "Grade." If "Course Name" is functionally dependent on "Course ID" but not on the "Student ID," it represents a partial dependency. This is because "Course Name" does not depend on the entire primary key (which consists of both "Student ID" and "Course ID").
To achieve Second Normal Form (2NF), we would restructure the table by creating separate tables for "Courses" and "Student Grades." The "Courses" table would contain columns for "Course ID" and "Course Name," while the "Student Grades" table would include "Student ID," "Course ID," and "Grade." By doing so, we ensure that non-key attributes ("Course Name") are functionally dependent on the entirety of the primary key (the combination of "Student ID" and "Course ID"), eliminating the partial dependency.
Note that 2NF does not put any restriction on dependencies between attributes that are not part of the primary key set. That is addressed in third normal form.
Third Normal Form (3NF)
Third Normal Form (3NF) is the next level of normalization. It focuses on eliminating transitive dependencies within a table.
A transitive dependency occurs when a non-key attribute depends on another non-key attribute rather than directly on the primary key. In 3NF, all non-key attributes should depend only on the primary key, not on other non-key attributes.
For example, consider a "Students" table with columns for "Student ID," "Student Name," "Course Name," and "Course Instructor." If "Course Instructor" depends on "Course Name" rather than directly on the "Student ID," it represents a transitive dependency.
To achieve 3NF, we would split the table into multiple tables. In this case, we would have a "Students" table with "Student ID" and "Student Name," a "Courses" table with "Course Name" and "Course Instructor," and a junction table, such as "Enrollments," linking the students to their respective courses.
Boyce-Codd Normal Form (BCNF)
Boyce-Codd Normal Form is a stricter version of 3NF, that removes non-trivial dependencies between attributes within a table, i.,e. only functional dependencies out of candidate keys (a minimal set of attributes that can uniquely identify each record in a table) are allowed.
In rare cases, a table in 3NF may not satisfy the requirements of BCNF. However, if a 3NF table does not have multiple candidate keys that overlap or duplicate each other, it can be guaranteed to comply with BCNF.
Let's consider one of such cases:
Dance Room | Start | End | Rate Type |
---|---|---|---|
Yellow Room | 9:30 | 10:30 | Standard Guest |
Yellow Room | 11:00 | 12:00 | Standard Guest |
Yellow Room | 14:00 | 15:00 | Standard Member |
Grey Room | 9:00 | 11:00 | Premium Guest |
Grey Room | 12:00 | 13:00 | Premium Guest |
Grey Room | 14:00 | 15:30 | Premium Member |
We have a table holding the data about the bookings for today in a dance academy. The academy has two dance rooms: a basic "Yellow Room" and a better equipped "Grey Room", and each room has two associated price rates: "Premium Guest" / "Premium Member" for the "Grey Room" and "Standard Member" / "Standard Guest" for the "Yellow Room".
Based on this information, we can uniquely identify each booking using any of these sets (candidate keys):
- {"Dance Room", "Start"}
- {"Dance Room", "End"}
- {"Rate Type", "Start"}
- {"Rate Type", "End"}
On the other hand, we notice that:
- 1NF is observed because the data is atomic and attribute values in each column correspond to the same time
- 2NF is observed because all of the attributes are part of some candidate primary key, i.e. there no non-key attributes and without non-key attributes there cannot exist any partial dependency
- 3NF is also observed because of a similar reasoning as the previous one: since there are no non-key attributes, then there cannot exist any transitive dependency
However, even though the table satisfies 1NF, 2NF, and 3NF, it does not comply with Boyce-Codd Normal Form (BCNF). The presence of a dependency between "Rate Type" and "Dance Room" violates the BCNF requirement because "Rate Type" is not dependent on any of the candidates keys but is instead dependent on the "Dance Room" attribute.
A possible way in which the design can be amended is breaking the table into two:
Rate Type | Dance Room | Member Flag |
---|---|---|
Standard Guest | Yellow Room | No |
Standard Member | Yellow Room | Yes |
Premium Member | Grey Room | Yes |
Premium Guest | Grey Room | No |
Dance Room | Start | End | Member Flag |
---|---|---|---|
Yellow Room | 09:30 | 10:30 | No |
Yellow Room | 11:00 | 12:00 | No |
Yellow Room | 14:00 | 15:30 | Yes |
Grey Room | 10:00 | 11:30 | No |
Grey Room | 11:30 | 13:30 | No |
Grey Room | 15:00 | 16:30 | Yes |
The Rate Types table has two candidate keys: {"Rate Type"} and {"Dance Room", "Member Flag"}. Similarly, the Today's Bookings table has two candidate keys: {"Dance Room", "Start"} and {"Dance Room", "End"}. Both tables satisfy the Boyce-Codd Normal Form (BCNF) because each attribute depends solely on the respective candidate keys, adhering to the principle of depending on the key, the whole key, and nothing but the key.
By utilizing {"Rate Type"} as a key in the Rate Types table, we ensure that each "Rate Type" is associated with a specific combination of "Dance Room" and "Member Flag," preventing any possibility of a single "Rate Type" being linked to multiple combinations. This modification effectively resolves the anomaly present in the original table.
Conclusion
In this first part of our blog series on database normalization, we delved into the fundamental normalization concepts and explored the for most important normal forms.
In the second installment of the series, we'll explore the remaining normal forms that come handy when removing unnecessary redundancy in database tables.