What is Collation In MySQL?



Collation in MySQL is a set of rules that defines how character data in a particular character set is compared and sorted. These rules reflect the conventions of the language and culture that the character set serves.

Collations can be set at several levels in MySQL, which we'll discuss later in this article. For now, it's important to know that the collation affects every string comparison and sort. Choosing the wrong collation leads to inaccurate or inefficient sorting and searching.

For example, if the collation of a database is set to utf8_general_ci, strings in the UTF-8 character set are compared and sorted without regard to case.

MySQL Collation

Now that we have seen the technical purpose of collation in MySQL, let's look at it from a beginner's point of view.

Every language has conventions of its own, and software that handles text has to respect them. For instance, some languages have both accented and unaccented characters. An accented character such as é may or may not count as distinct from its unaccented counterpart e, depending on the language. Conventions like this determine the alphabetical order in which data should be sorted.

This is where collation comes into the picture. It is a set of rules that defines how the character data should be compared and sorted, taking into account the specific rules of a language. There are many different collations available in MySQL, each with its own set of rules.

As mentioned in the example above, utf8_general_ci is a collation for comparing and sorting text in the UTF-8 character set. utf8_bin serves the same character set. What differentiates the two is their comparison criteria.

utf8_general_ci performs case-insensitive comparison, so uppercase and lowercase characters are considered equivalent. It is also accent-insensitive: accented characters are treated the same as their unaccented counterparts.

utf8_bin compares strings by the binary value of each character, so uppercase and accented characters are distinct from their lowercase and unaccented counterparts.
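The difference can be seen by comparing the same pair of strings under each kind of collation. The sketch below is illustrative only: it assumes a server where the utf8mb4 character set is available and uses utf8mb4_general_ci and utf8mb4_bin as stand-ins for the utf8 collations named above. The _utf8mb4 introducer keeps the literals independent of the connection character set.

```sql
-- Case-insensitive collation: 'abc' and 'ABC' compare as equal (returns 1).
SELECT _utf8mb4'abc' COLLATE utf8mb4_general_ci = _utf8mb4'ABC' AS ci_equal;

-- Binary collation: the same comparison is not equal (returns 0).
SELECT _utf8mb4'abc' COLLATE utf8mb4_bin = _utf8mb4'ABC' AS bin_equal;

-- X'C3A9' is the UTF-8 encoding of é. Under the accent-insensitive
-- collation it is expected to compare as equal to a plain e.
SELECT _utf8mb4 X'C3A9' COLLATE utf8mb4_general_ci = _utf8mb4'e' AS accent_equal;
```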

You can also list all the character sets available in MySQL along with their default collations by using the SHOW CHARACTER SET statement.
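A minimal sketch of that listing follows. The 'utf8%' pattern in the second statement is only an example filter.

```sql
-- List every character set with its description, default collation,
-- and maximum bytes per character.
SHOW CHARACTER SET;

-- Narrow the list to character sets whose name starts with utf8.
SHOW CHARACTER SET LIKE 'utf8%';
```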


Notice that each collation name starts with the name of its character set and ends with a suffix such as _ci, _cs, or _bin. ci stands for case-insensitive, cs stands for case-sensitive, and bin means that strings are compared by the binary values of their characters.

Setting Character Set and Collation

Collations that are based on the Unicode Collation Algorithm (UCA) sort and compare characters by their collation weights rather than by their raw Unicode values. To inspect these weights, MySQL provides the WEIGHT_STRING() function, which returns a binary string representing the weight of the input string under its collation. To show the hexadecimal representation, we can use HEX(WEIGHT_STRING(str)).

The character set and collation can be set at four levels: server, database, table, and column.
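As a rough illustration of HEX(WEIGHT_STRING(str)), the sketch below assumes the utf8mb4_unicode_ci (UCA-based) and utf8mb4_bin collations are available. Strings that a collation treats as equal produce the same weight string, so the first pair of values should match and the second pair should differ.

```sql
-- Under a case-insensitive UCA collation, 'abc' and 'ABC' get the same weights.
SELECT HEX(WEIGHT_STRING(_utf8mb4'abc' COLLATE utf8mb4_unicode_ci)) AS weights_lower,
       HEX(WEIGHT_STRING(_utf8mb4'ABC' COLLATE utf8mb4_unicode_ci)) AS weights_upper;

-- Under a binary collation, the weights differ because the characters differ.
SELECT HEX(WEIGHT_STRING(_utf8mb4'abc' COLLATE utf8mb4_bin)) AS bin_lower,
       HEX(WEIGHT_STRING(_utf8mb4'ABC' COLLATE utf8mb4_bin)) AS bin_upper;
```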

Setting at the Server Level

Setting the collation at the server level in MySQL means that it becomes the default for all databases, tables, and columns unless it is explicitly overridden at a lower level.

To set the collation at the server level, you can modify the server configuration file "my.cnf" or "my.ini" depending on your operating system.

We can set only the character set, or both the character set and a collation for it. If only the character set is specified at server start-up, MySQL uses the default collation of that character set. Both settings can also be passed as options on the mysqld command line.

If we set both the character set and collation explicitly at server start-up, MySQL will use those settings for all databases created on that MySQL server.

For example, starting the server with --character-set-server=utf8mb4 and --collation-server=utf8mb4_general_ci makes utf8mb4 the default character set and utf8mb4_general_ci the default collation for databases created on the server, and for the tables and columns in them that don't specify their own.
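The server-level defaults can also be inspected, and on recent versions changed, from a SQL session. This is a sketch under assumptions: SET PERSIST requires MySQL 8.0 or later and an account with sufficient administrative privileges, and it is shown here as an alternative to editing the configuration file, not as the method the original example used.

```sql
-- Inspect the current server-level defaults.
SHOW VARIABLES LIKE 'character_set_server';
SHOW VARIABLES LIKE 'collation_server';

-- Change the defaults at runtime and persist them across restarts.
-- Set the character set first, then the collation that belongs to it.
SET PERSIST character_set_server = 'utf8mb4';
SET PERSIST collation_server = 'utf8mb4_general_ci';
```

Databases that already exist keep their own defaults; the new values apply to databases created afterwards without an explicit character set or collation.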

Setting at the Database Level

Setting the character set and collation at the database level is optional, because a database inherits the server defaults. They can still be overridden in the CREATE DATABASE as well as the ALTER DATABASE statement.

While creating a database, we can set the character set and collation with the CHARACTER SET and COLLATE clauses.
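The general form is CREATE DATABASE db_name CHARACTER SET charset_name COLLATE collation_name. A minimal sketch follows, where collation_demo is a hypothetical database name chosen for illustration.

```sql
-- Create a database with an explicit character set and collation.
CREATE DATABASE IF NOT EXISTS collation_demo
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_general_ci;

-- Confirm the defaults recorded for the database.
SELECT SCHEMA_NAME, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'collation_demo';
```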

If the database already exists, the ALTER DATABASE statement changes its character set and collation.

When the collation is set at the database level, it overrides the server default and is used for the tables created in that database that don't specify their own. Tables and columns that already exist keep their current settings.
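A short sketch of the ALTER DATABASE form, again using the hypothetical collation_demo database and utf8mb4_unicode_ci as an example target collation:

```sql
-- Make sure the example database exists so the block can run on its own.
CREATE DATABASE IF NOT EXISTS collation_demo;

-- Change the database defaults; existing tables are not converted.
ALTER DATABASE collation_demo
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;
```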

Setting at the Table Level

A table can also have its own character set and collation, different from the defaults of the database it belongs to.

A collation set at the table level applies to all character columns in the table that don't specify a collation of their own, overriding the database default for them.

Let's create a table named users with the columns id, name, and email, and give it a character set and collation.

Alternatively, we can modify an existing table with the help of the ALTER TABLE statement. Suppose the users table was created without specifying a character set and collation, so it picked up the database defaults.

We can then change the character set and collation with the ALTER TABLE statement.
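One way such a table might be defined is sketched below. The column types and lengths, and the collation_demo database, are assumptions made for illustration.

```sql
CREATE DATABASE IF NOT EXISTS collation_demo;
USE collation_demo;

-- Table-level defaults apply to name and email, the two character columns.
CREATE TABLE IF NOT EXISTS users (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    email VARCHAR(255) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

-- The generated definition shows the character set and collation in effect.
SHOW CREATE TABLE users;
```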
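The two steps might look like the sketch below, which assumes a scratch database named collation_demo where it is safe to drop and recreate the users table. CONVERT TO CHARACTER SET converts the existing character columns as well as the table default, whereas ALTER TABLE ... DEFAULT CHARACTER SET changes only the default used for columns added later.

```sql
CREATE DATABASE IF NOT EXISTS collation_demo;
USE collation_demo;

-- Start from a table that inherits the database defaults.
DROP TABLE IF EXISTS users;
CREATE TABLE users (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    name  VARCHAR(100) NOT NULL,
    email VARCHAR(255) NOT NULL
);

-- Convert the table default and all existing character columns.
ALTER TABLE users
    CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
```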

Setting at the Column Level

Finally, the character set and collation can be defined at the most granular level, i.e., for individual columns in a table. Columns with a character data type such as CHAR, VARCHAR, or TEXT can each have their own character set and collation; numeric columns have neither.

But, what's the significance of setting the character set and collation at the column level?

We already discussed that collation in MySQL contains the rules for sorting and searching data in a specific character set. When we set it at the column level, we control how the data in that one column is sorted and searched. For instance, you may want the column that holds user names to be case-insensitive, so that searching for ABC and abc fetches the same result.

Also, if you already know that a column holds only ASCII characters, you can give it a single-byte character set such as ascii, which needs less storage per character and may improve query performance.

The collation is set at the column level with the CHARACTER SET and COLLATE attributes in the column definition.
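A sketch of column-level settings follows. The customers table, its columns, and the collation_demo database are hypothetical names used only for this example.

```sql
CREATE DATABASE IF NOT EXISTS collation_demo;
USE collation_demo;

-- Each character column carries its own character set and collation.
CREATE TABLE IF NOT EXISTS customers (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    code CHAR(8) CHARACTER SET ascii COLLATE ascii_bin
);

-- Change one column later without touching the others.
ALTER TABLE customers
    MODIFY name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

-- Inspect the per-column settings.
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'collation_demo' AND TABLE_NAME = 'customers';
```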

Example

We begin by fetching the names of the collations we are interested in. To do so, we can use the LIKE clause of the SHOW COLLATION statement.
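For instance, the following sketch lists the collations of one character set; the utf8mb4 prefix and the latin1 filter are example choices.

```sql
-- Collations whose name starts with utf8mb4.
SHOW COLLATION LIKE 'utf8mb4%';

-- The same statement also accepts a WHERE clause on its output columns.
SHOW COLLATION WHERE Charset = 'latin1';
```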


Consider a table of book reviews in which the reviews can be written in different languages. For each language, we can use a different collation to define the rules for sorting and comparison.

In this case, every review column uses the utf8mb4 character set, but each column gets the collation that suits the language its reviews are written in.
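One possible layout is sketched below. The table and column names, the choice of languages, and the column lengths are assumptions for illustration; the collations used (utf8mb4_general_ci, utf8mb4_spanish_ci, utf8mb4_german2_ci) ship with MySQL.

```sql
CREATE DATABASE IF NOT EXISTS collation_demo;
USE collation_demo;

-- One character set for all review columns, one collation per language.
CREATE TABLE IF NOT EXISTS book_reviews (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    book_title VARCHAR(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    review_en  VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    review_es  VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_spanish_ci,
    review_de  VARCHAR(500) CHARACTER SET utf8mb4 COLLATE utf8mb4_german2_ci
);

-- Sorting uses the collation of the column being sorted.
SELECT id, review_es FROM book_reviews ORDER BY review_es;

-- A COLLATE clause overrides the column collation for a single query.
SELECT id, review_en FROM book_reviews ORDER BY review_en COLLATE utf8mb4_bin;
```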

Conclusion

  • Collation in MySQL is a set of rules that defines how character data in a particular character set is compared and sorted.
  • Setting the collation at the server level in MySQL means that it becomes the default for all databases, tables, and columns unless it is explicitly overridden at a lower level.
  • To set the collation at the server level, you can modify the server configuration file my.cnf or my.ini depending on your operating system.
  • The character set and collation can be modified at the database level in the CREATE DATABASE as well as the ALTER DATABASE statement.
  • A collation set at the table level applies to all character columns in the table that don't specify their own, overriding the database default for them.