Normalization in SQL: 1NF, 2NF, 3NF, and BCNF

Normalization is the process of organizing data in a database while addressing inefficiencies arising from improperly organized tables. Consider a company's database that manages customer information, product details, and sales transactions. Without normalization, for instance, customer details like names and addresses might be duplicated for each purchase, increasing the risk of errors and wasting storage space. Normalization addresses this issue by organizing the database into well-structured tables, eliminating data redundancy.

Note: The article about normalization in SQL requires prior knowledge of important related topics like keys in DBMS, functional dependencies. I’ve discussed a few of them at the bottom of this article.

What is Normalization in SQL?

Normalization in SQL was introduced to address the inefficiencies arising from improperly organized tables. Consider a table with student and teacher data combined, leading to confusion and unstructured information. This approach lacks a proper querying mechanism as using student_id as a key retrieves unwanted teacher data, and vice versa.

Normalization tackles three main anomalies:

Insertion anomalies arise when dependent information is required for data insertion.
Deletion anomalies occur when deleting a record unintentionally removes independent information.
Modification anomalies involve updating information in multiple places.

To resolve these issues, normalization divides large tables into smaller ones, following a set of rules to eliminate redundancy and enhance data organization. In summary, normalization in SQL is a technique that prevents anomalies, ensuring efficient data management and minimizing redundancy by adhering to specific rules.

Types of Normal Forms

We divide the normalization process into a set of normal forms. A normal form is nothing but a form of a table that follows some norms (rules) that prevent the above anomalies to some extent. We have a list of normal forms as discussed below:

Normal Form	Description
1NF	A relation satisfies 1NF if it includes only atomic values.
2NF	A relation achieves 2NF when it is in 1NF, and all non-key attributes are fully functionally dependent on the primary key.
3NF	A relation reaches 3NF if it is in 2NF, and there are no transitive dependencies present.
BCNF	A more stringent form of 3NF, known as BCNF, is attained when a relation is in Boyce Codd's normal form.
4NF	A relation satisfies 4NF if it is in Boyce Codd's normal form and does not exhibit any multi-valued dependencies.
5NF	A relation is in 5NF if it is in 4NF and does not possess any join dependency, ensuring that joining operations are lossless.

Complexity to divide tables into a particular normal form increases as we go down. Also, as we go down, all tables should also follow all the normal forms mentioned above them. Let’s discuss each normal form one by one along with an example.

Note: Article contains the usage of functional dependency occasionally when required. Discussing it here will divert us from the main topic but in short, it’s a way to represent the relationship between the set of attributes of a table.

For a table attributes (A,B,C,D), functional dependency could be like,

A->BC

It says that we can get the values of A, B, and C from attribute A.

First Normal Form (1NF)

It’s the first normal form and almost all tables already follow rules specified in it.

Rules are,

Each cell of a row must have a single value.
Each record should be unique.
Each column should have the same type of data.

Let’s understand them.

The first rule is easy. It says, a cell should have a single value or you can say it as a cell shouldn’t contain composite values. For example, if in an attribute asking for the name of a student, we shouldn’t add 2 or more names in it. Just a single value.

The second rule is related to the concept of keys that identifies each record uniquely. Basically, each record inside a table must have something unique in it because if it’s not then it’s nothing but redundant data (because of duplicacy).

The third rule is quite intuitive. See this with an example.

Name (string)	Age (number)	Portfolio_link (string)

There are 3 attributes in the above table i.e. name, age, and portfolio_link. And we define data type string, number, a string for each corresponding attribute. Now, according to the rule, each value inside a name (string type) attribute must be a string. It can’t have numbers. Similarly, for age (number type), each value must be numerical.

Let’s see a table that doesn’t follow 1NF (though it’s very rare in real-life scenarios).

student_id	Name (string)	Age (number)	portfolio_url (string)
1	A	ABC	www.url.com
2	B	12	www.url1.com www.url2.com
1	A	ABC	www.url.com

Why is it not in 1NF?

The first row has defined ABC (string value) inside an age attribute (number type) – breaking rule 3
The second row defines multiple values in the portfolio_link attribute. – breaking rule 1
The third row is identical to the first row. – breaking rule 2

Let’s see a table that follows 1NF through an example.

Attribute1	Attribute2	Attribute3
1	A	value1
2	B	value2
3	C	value3

This table does follow each rule of 1NF since each cell has a single (atomic) value as well as each record is unique so can be uniquely identified.

Second Normal Form (2NF)

A table is in 2NF when,

It’s in 1NF.
It shouldn’t have a partial dependency.

What is Partial Dependency?

Suppose a relation R, having attributes A, B, C, D. To represent it:

R (A,B,C,D)

functional dependency defined on it is,

AB->D

B->C

From given functional dependencies, we can conclude that AB is a primary key because through AB we can find the value of each attribute.

Therefore, the closure of R (A,B,C,D) is AB.

Now, all the attributes which are part of the candidate key, i.e., A and B are called prime attributes. And all other attributes that are not part of the candidate key are called non-prime attributes.

So, according to 2NF, non-prime attributes should depend on all prime attributes, not on part of the prime attribute. From the above example,

AB->D

B->C

A and B are prime attributes. C and D are non-prime attributes.

D follows the rule because though it’s a non-prime it depends on all prime attributes.

But, C doesn’t follow the rule since it depends on B only (part of the prime attribute) and this is called partial dependency.

So, according to 2NF, a table shouldn’t have a partial dependency.

To make this relation into 2NF,

Make a relation that has prime attributes as well as those non-prime attributes that are totally dependent on prime attributes.
Make a separate relation for attributes that are in partial dependency.

So, to make the above example into 2NF, we do,

R (A,B,D) AB->D
R (B,C) B->C

Third Normal Form (3NF)

A table is in 3NF,

It must be in 2NF.
It shouldn’t have a transitive dependency.

What is Transitive Dependency? Let’s take a relation, R (A,B,C) with functional dependencies defined as,

A->B

B->C

From the given dependencies, we can conclude that A is the primary key because from A we can get both B and C attributes (from A we get B and from B we get C). Therefore, the closure of R (A,B,C) for given dependencies is A.

Now, A is a prime attribute, and B and C are non-primes.

It’s in 2NF because there is no non-prime attribute that is partially dependent on part of the primary key since B (non-prime) is dependent on A (primary key) and C (non-prime) is dependent on B (non-prime). Also, if a given relation has only a single attribute as a primary key then partial dependency is not even possible because it’s single. Got the point?

But here, a non-prime attribute C is dependent on non-prime attribute B. And, this is called transitive dependency when a non-prime attribute is dependent on another non-prime.

To make this relation into 3NF,

Make a separate table of prime attributes and all other attributes directly dependent on it.
Make a separate table for non-prime attributes that are dependent on other non-primes and make determining non-prime as the primary key in this new table.

So, to make above example into 3NF, we do,

R (A,B) A->B
R (B,C) B->C

Let’s see this relation, R (A,B,C,D)

A->B

B->C

C->D

Closure of this relation is A. Why?

Now, is it in a 2NF? Yes, because there is no partial dependency.

Is it in a 3NF? No, because there are 2 non-primes that are dependent on other non-primes.

To make it into 3NF,

R (A,B)
R (B,C)
R (C,D)

Boyce-Codd Normal Form (BCNF)

Boyce-Codd's normal form comes into the picture because of some anomalies that are not handled by above mentioned normal forms.

If we notice, 2NF and 3NF both deal with non-prime attributes. If there is a functional dependency from A->B then whatever the A is B is always non-prime. They both didn’t deal with situations when B is prime.

So, that’s why BCNF.

A table is in BCNF,

It must be in 3NF
For any functional dependency A -> B, A must be a super key.

Let’s take a relation, R (A,B,C)

AB->C

C->B

Here, we can conclude AB and AC are 2 candidate keys.

Is it in 2NF?

AB -> C is not a partial dependency because AB is prime and C is also a prime attribute.

Similarly, C -> B is not a partial dependency. Though C (part of prime attribute) is determining an attribute but since B is also prime so partial dependency doesn’t exist here. If you remember, partial dependency exists when there is a dependency from X->Y and X is prime and Y is non-prime. Here dependency is from prime to prime not prime to non-prime.

Is it in 3NF?

According to transitive dependency, a non-prime attribute is dependent on other non-prime but here it’s not the case. So, it’s definitely in 3NF.

But what is the problem? Now, we’re determining even prime attributes. But that shouldn’t happen. Ideally, prime attributes should be the one who determines others. Right? Because that’s why we make keys to uniquely identify the record. That’s what they’re capable of.

So, is the above relation in BCNF?

AB -> C holds true because AB is a candidate key and also super key.

C -> B doesn’t hold because C is a prime attribute which is determining another prime attribute B, and C is not a super key as well. Therefore, the given relation is not in BCNF.

To make it into BCNF

R (C,B) C->B
R (A,C) AC->AC

This decomposition might be confusing at first but it’s logical. Let’s see how.

See, whenever we decompose a table into smaller tables then we’ve to do it in a way that we are able to get the original table once we join these smaller tables again. This is called lossless decomposition.

So, for the above example, if we make tables,

R (C, B) C -> B
R (A,B) AB -> AB

We can’t have A,B,C in a single table because that’s what we’re solving. A separate table for C -> B (C being a super key) is clear because according to BCNF any functional dependency from X -> Y, X should be a super key.

Now, there is confusion between R (A,C) and R (A,B). Why did we choose R (A,C)? It’s because of lossless decomposition. If we choose R (A,B) then how do we connect R (A,B) and R (C,B)? To combine them, there must be one common prime attribute which is a key because then only we are able to relate the two. Isn’t it? In R (A,B) AB is the key and in R (C,B) C is the key (no common prime attribute found).

But in R (A,C) AC is the key and in R (C,B) C is the key. As you can see, C is the prime attribute which is common and also super key in the other table. So, that’s why we chose R (A,C) over R (A,B).

Note: Lossless decomposition is a separate topic altogether. We just discussed the glimpse of it here. Please have a look at it on a separate blog.

Fourth Normal Form (4NF)

A table is in 4NF,

It must be in BCNF.
It shouldn’t have a multivalued dependency.

What is Multivalued Dependency? Let’s assume a table Person where each person has 2 mobile numbers and 2 email_ids. How can we store them?

Name	Mobile_number	Email_Id
A	99XXXXXXXX	aa@aa.com
A	88XXXXXXXX	a@a.com
A	99XXXXXXXX	a@a.com
A	88XXXXXXXX	aa@aa.com

We can’t store multiple values in a single cell (condition of 1NF). So, this is the only possible way. But try to see the problem here. Mobile_number and Email_id don’t have anything related. They’re independent entities, yet relevant enough to describe an entity (person). We can’t omit them but storing them produces tuples that are not required. And that’s why this representation has a multivalued dependency. So, to avoid these issues we use 4NF.

To make it into 4NF,

R (Name, Mobile_number)
R (Name, Email_id)

Name	Mobile_number
A	99XXXXXXXX
A	88XXXXXXXX

Name	Email_id
A	aa@aa.com
A	aa@aa.com

Now, both the relations will have only 2 tuples. And, using this we maintained all the information and prevented redundant rows.

Fifth Normal Form (5NF)

A table is in 5NF,

It must be in 4NF
It shouldn’t have join dependency

What is Join Dependency? The concept is similar to lossless decomposition. If we divide a relation into smaller tables and combine them again then we must get the original table. To ensure it, we have to find a common attribute that is prime as well as candidate key in the other table.

So, according to 5NF, if we are able to divide a table further into a smaller table and still the information is lossless then it’s not in 5NF.

Basically, 5NF is the last level and any further decomposition will cause data loss.

Sixth Normal Form (6NF)

6NF is yet to be discussed completely by experienced professionals. It’s proposed by experts and hopefully will be implemented in the near future. Until then 1NF to 5NF is the only major normal form to read and learn.

Conclusion

So, in this article, we learned what normalization is, why it’s introduced and how it prevents all the anomalies discussed, i.e., insertion, deletion and modification anomalies. We also discussed all the types of normal forms from 1NF to 6NF and their rules set. We also learned to decompose a relation into a particular form with examples.