Define Normalization, list its types.
Of course. Here is a comprehensive definition of Normalization, followed by a list and explanation of its most common types (normal forms).
Normalization is the systematic process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity. The primary goal is to structure the database in a way that eliminates undesirable characteristics like Insertion, Update, and Deletion Anomalies.
In simple terms, normalization is like organizing a messy closet. You start with one big pile of clothes (an unnormalized table) and methodically sort them into smaller, well-organized drawers and sections (normalized tables), where each item has a specific and logical place. This ensures you don't have duplicate items and makes it easier to add, update, or remove items without creating a mess.
The process involves applying a series of rules or tests, known as Normal Forms (NF), to the database schema.
The main objectives of Normalization are:
Eliminate Data Redundancy: Storing the same piece of data in multiple places is avoided.
Improve Data Integrity: By reducing redundancy, the risk of data inconsistency (where different copies of the data have different values) is greatly reduced.
* Prevent Data Anomalies: It helps avoid problems that can occur when adding, modifying, or deleting data.
* **Insertion Anomaly:** Inability to add data because some other data is missing.
* **Update Anomaly:** A single data change requires updating multiple rows.
* **Deletion Anomaly:** Deleting one piece of data unintentionally causes other, unrelated data to be lost.
Normalization is a sequential process. To be in a higher normal form, a table must first satisfy all the rules of the lower normal forms. The most common forms are 1NF, 2NF, 3NF, and BCNF.
Example:
* Not in 1NF (Bad): A Courses
table where the Students
column contains a comma-separated list.
| CourseID | CourseName | Students |
| :--- | :--- | :--- |
| CS101 | Intro to C++ | "Alice, Bob, Charlie" |
CourseName
, which 2NF will fix.)Example:
* Not in 2NF (Bad): Consider a table where the primary key is (StudentID, CourseID)
.
| StudentID | CourseID | **StudentName** | Grade |
| :--- | :--- | :--- | :--- |
| 101 | CS101 | **Alice** | A |
| 101 | MTH203 | **Alice** | B |
| 102 | CS101 | **Bob** | B |
Here, `StudentName` depends only on `StudentID`, which is just a part of the primary key. This is a **partial dependency**. This causes redundancy (Alice's name is repeated).
In 2NF (Good): We split the table to remove the partial dependency.
Students
Table:
| StudentID | StudentName |
| :--- | :--- |
| 101 | Alice |
| 102 | Bob |
Enrollment
Table:
| StudentID | CourseID | Grade |
| :--- | :--- | :--- |
| 101 | CS101 | A |
| 101 | MTH203 | B |
| 102 | CS101 | B |
A -> B -> C
, where A is the primary key and B and C are non-key attributes).Example:
* Not in 3NF (Bad): Consider a Students
table where Major
determines the HeadOfDepartment
.
| StudentID (PK) | StudentName | Major | **HeadOfDepartment** |
| :--- | :--- | :--- | :--- |
| 101 | Alice | Computer Science| **Dr. Smith** |
| 102 | Bob | Biology | **Dr. Jones** |
| 103 | Charlie | Computer Science| **Dr. Smith** |
Here, `HeadOfDepartment` depends on `Major`, and `Major` depends on the primary key `StudentID`. This is a transitive dependency. It causes redundancy (Dr. Smith's name is repeated for every CS student).
In 3NF (Good): We split the table to remove the transitive dependency.
Students
Table:
| StudentID | StudentName | Major |
| :--- | :--- | :--- |
| 101 | Alice | Computer Science|
| 102 | Bob | Biology |
| 103 | Charlie | Computer Science|
Majors
Table:
| Major (PK) | HeadOfDepartment |
| :--- | :--- |
| Computer Science | Dr. Smith |
| Biology | Dr. Jones |
X -> Y
, X must be a superkey.There are also higher normal forms, but they are less common in practical database design:
Fourth Normal Form (4NF): Deals with multi-valued dependencies.
Fifth Normal Form (5NF): Deals with join dependencies.
* Sixth Normal Form (6NF): Deals with temporal data (related to time).
For the vast majority of real-world applications, achieving 3NF is considered the standard for a well-designed relational database.