Categorical Data in Machine Learning

Overview

Data preparation is one of the most important parts of a machine learning project, because a model's performance depends on how we process the variables we feed to it. Most machine learning libraries accept only numerical input. Categorical data often carries valuable information, so we must encode it as numbers before training.

Introduction

What is Categorical Data?

Categorical variables usually take the form of ‘strings’ or ‘categories’ and have a finite number of possible values, for example types of fish, cars, blood groups, or places. There are two types of categorical data:

Ordinal: The data that has an inherent order.
Nominal: The data that does not have an inherent order.

While encoding ordinal categorical data, one must preserve the order of the categories. For example, a person's highest degree (such as High School, Ph.D., or MD) gives valuable information about their qualifications, and the degree matters when deciding whether a person is suitable for a post.

While encoding nominal data, one only needs to capture the presence or absence of each category. Since there is no inherent order, the encoding should treat all categories equally. For example, consider the place where a person lives: whether they stay in Delhi or Kolkata, both places should be given equal weight.

What is Encoding?

Encoding is the process of converting data from one form to another, here from categories to numbers, so that the information it carries becomes usable by our machine learning model.

Encoding Techniques

Label Encoding / Ordinal Encoding

This type of encoding is used when the variable is ordinal. Ordinal encoding converts each label into an integer, so that the encoded values preserve the order of the labels.

We will use the category_encoders library for this purpose.

Code Implementation
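As a minimal, library-independent sketch of the mapping described above, the snippet below ranks a hypothetical list of degrees. The data, the chosen order, and the helper name `ordinal_encode` are assumptions made for illustration; in category_encoders, the `OrdinalEncoder` class plays this role.

```python
# Hypothetical example data: the highest degree of five people.
degrees = ["High School", "Ph.D.", "Bachelor's", "Master's", "High School"]

# Explicit order from lowest to highest (an assumption for this example).
degree_order = ["High School", "Bachelor's", "Master's", "Ph.D."]
degree_to_int = {label: rank for rank, label in enumerate(degree_order, start=1)}


def ordinal_encode(values, mapping):
    """Replace each label with its integer rank; unknown labels raise KeyError."""
    return [mapping[value] for value in values]


encoded_degrees = ordinal_encode(degrees, degree_to_int)
print(encoded_degrees)  # [1, 4, 2, 3, 1]
```

Passing the order explicitly matters: if the integers were assigned alphabetically or by first appearance, the encoded values would no longer reflect the ranking of the degrees.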

One-hot Encoding

In this encoding, we map each category to a vector of 0s and 1s, where 1 denotes the presence of the category and 0 its absence. The length of each vector, and hence the number of new columns, equals the number of categories present.

Code Implementation
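The following is a small sketch of one-hot encoding in plain Python, assuming a hypothetical list of cities. The helper name `one_hot_encode` and the alphabetical column order are choices made here for illustration; category_encoders offers a `OneHotEncoder` class for real datasets.

```python
# Hypothetical example data: the city each person lives in.
cities = ["Delhi", "Kolkata", "Mumbai", "Delhi"]


def one_hot_encode(values):
    """Return the column order and one 0/1 row per input value."""
    categories = sorted(set(values))  # one output column per distinct category
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, one_hot_rows = one_hot_encode(cities)
print(categories)  # ['Delhi', 'Kolkata', 'Mumbai']
for city, row in zip(cities, one_hot_rows):
    print(city, row)
# Delhi [1, 0, 0]
# Kolkata [0, 1, 0]
# Mumbai [0, 0, 1]
# Delhi [1, 0, 0]
```

Every row contains exactly one 1, so no city is treated as larger or smaller than another.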

Binary Encoding

This technique first encodes the categories as ordinal integers, then converts those integers into binary code, and finally splits the digits of each binary string into separate columns.

This technique is suitable when there are many categories and you don't want to increase the number of dimensions as much as one-hot encoding would. For n categories, binary encoding needs only about log2(n) columns instead of n.

Code Implementation
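The three steps described above can be sketched in plain Python as follows. The fish names are hypothetical, and numbering the categories from 1 in order of first appearance is an assumption of this sketch; the `BinaryEncoder` class in category_encoders may number categories differently.

```python
# Hypothetical example data: five distinct fish types, one repeated.
fish = ["Salmon", "Tuna", "Cod", "Trout", "Carp", "Tuna"]


def binary_encode(values):
    """Ordinal-encode the values, then split each code into binary digits."""
    # Step 1: assign integer codes in order of first appearance, starting at 1.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Step 2: number of binary digits needed for the largest code.
    width = max(codes.values()).bit_length()
    # Step 3: write each code as a fixed-width binary string, one digit per column.
    return [[int(bit) for bit in format(codes[value], f"0{width}b")]
            for value in values]


binary_rows = binary_encode(fish)
for name, row in zip(fish, binary_rows):
    print(name, row)
# Salmon [0, 0, 1]
# Tuna [0, 1, 0]
# Cod [0, 1, 1]
# Trout [1, 0, 0]
# Carp [1, 0, 1]
# Tuna [0, 1, 0]
```

Five categories fit into three columns here, whereas one-hot encoding would need five.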

Output:-

Target Encoding

Target encoding replaces each categorical value with the mean of the target variable for that category.

It is a type of Bayesian encoding: Bayesian encoders use the target variable to encode the categorical values.

You will get a better understanding with the example below:-
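As a minimal sketch with hypothetical data (the names and marks are invented for illustration), the plain per-category mean can be computed as follows. Library encoders such as `TargetEncoder` in category_encoders typically add smoothing toward the global mean, so their output can differ from the raw means computed here.

```python
from collections import defaultdict

# Hypothetical example data: student names (feature) and marks (target).
names = ["Asha", "Ravi", "Asha", "Meera", "Ravi", "Asha"]
marks = [80, 60, 90, 70, 70, 100]


def target_encode(categories, targets):
    """Replace each category with the mean target value of its rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


encoded_names, mean_marks = target_encode(names, marks)
print(mean_marks)     # {'Asha': 90.0, 'Ravi': 65.0, 'Meera': 70.0}
print(encoded_names)  # [90.0, 65.0, 90.0, 70.0, 65.0, 90.0]
```

In practice the means should be computed on the training split only and then applied to validation or test rows, otherwise target information leaks into the features.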

Output:-

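Here you can see the names are replaced by their average marks. This is a useful form of encoding, but a model trained on it may be prone to overfitting, because the encoding is computed from the target itself and so leaks target information into the feature, especially for categories with only a few rows.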

Hash Encoding

Hashing transforms a string of characters into a fixed-length, usually shorter, value using a hash function; the hash value then stands in for the original string.

Specifically, we use the MD5 algorithm here. If you set the number of output components (the n_components hyperparameter in category_encoders) to 5, then irrespective of the length of the string, the encoder maps it to a vector of size 5, which finally gives us 5 columns representing our categorical value. Because different categories can hash to the same column, some collisions are possible.

Code Implementation
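The sketch below illustrates the hashing idea with Python's standard `hashlib` module: each category is hashed with MD5 and the hash picks one of 5 columns. The helper name `hash_encode` and the example cities are assumptions for illustration; the `HashingEncoder` class in category_encoders follows a similar approach but may differ in detail.

```python
import hashlib

# Hypothetical example data.
cities = ["Delhi", "Kolkata", "Mumbai", "Delhi"]


def hash_encode(value, n_components=5):
    """Map a string to a row of n_components columns using its MD5 hash."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    # Interpret the hex digest as an integer and fold it into the column range.
    index = int(digest, 16) % n_components
    row = [0] * n_components
    row[index] = 1
    return row


hashed_rows = [hash_encode(city) for city in cities]
for city, row in zip(cities, hashed_rows):
    print(city, row)
```

The same string always lands in the same column, and the number of columns stays at 5 no matter how many distinct categories appear. The trade-off is that two different categories may share a column.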

Output:-

Base-N Encoding

In binary encoding, we convert the integers into binary, that is, base 2. BaseN encoding lets us convert the integers to any base. If you have many categories, you might want to use BaseN, because a higher base needs fewer digits, and therefore fewer columns, to represent them.

Code Implementation
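A minimal sketch of the same idea with an arbitrary base is given below. The fish names, the helper names, and the numbering of categories from 1 in order of first appearance are assumptions of this example; category_encoders provides a `BaseNEncoder` class with a `base` parameter for real use.

```python
# Hypothetical example data: five distinct fish types.
fish = ["Salmon", "Tuna", "Cod", "Trout", "Carp"]


def to_base_n(number, base, width):
    """Return the digits of `number` in the given base, padded to `width` digits."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base=3):
    """Ordinal-encode the values, then split each code into base-N digits."""
    if base < 2:
        raise ValueError("this sketch requires base >= 2")
    # Assign integer codes in order of first appearance, starting at 1.
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Smallest number of digits that can represent the largest code.
    width = 1
    while base ** width <= len(codes):
        width += 1
    return [to_base_n(codes[value], base, width) for value in values]


base3_rows = base_n_encode(fish, base=3)
for name, row in zip(fish, base3_rows):
    print(name, row)
# Salmon [0, 1]
# Tuna [0, 2]
# Cod [1, 0]
# Trout [1, 1]
# Carp [1, 2]
```

With base 3 the five categories need only two columns; calling `base_n_encode(fish, base=2)` reproduces the three-column binary encoding.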

Output:-

Effect Encoding

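In this encoding type, also known as sum or deviation coding, the encoder assigns the values -1, 0, and 1 to the categories. It uses one column fewer than the number of categories, and one reference category is encoded as -1 in every column. In fact, these -1 values are the only difference from one-hot encoding with one column dropped, where the reference category would be all 0s.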

Code Implementation
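The following sketch shows effect encoding on a hypothetical list of cities. Choosing the alphabetically last category as the reference is an assumption of this example; category_encoders provides a `SumEncoder` class for this scheme, and its choice of reference category and column layout may differ.

```python
# Hypothetical example data.
cities = ["Delhi", "Kolkata", "Mumbai", "Delhi"]


def effect_encode(values):
    """Encode k categories in k-1 columns; the reference category is all -1."""
    categories = sorted(set(values))
    *kept, reference = categories  # last category acts as the reference
    rows = []
    for value in values:
        if value == reference:
            rows.append([-1] * len(kept))
        else:
            rows.append([1 if value == category else 0 for category in kept])
    return kept, reference, rows


columns, reference, effect_rows = effect_encode(cities)
print(columns, reference)  # ['Delhi', 'Kolkata'] Mumbai
for city, row in zip(cities, effect_rows):
    print(city, row)
# Delhi [1, 0]
# Kolkata [0, 1]
# Mumbai [-1, -1]
# Delhi [1, 0]
```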

Frequency Encoding

In this encoding method, each category is replaced by its frequency, that is, how often it occurs in the data. When the frequency is related to the target variable, this helps the model assign weight in direct or inverse proportion to how common a category is, depending on the nature of the data.

Code Implementation
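As a small sketch using only the standard library, the snippet below replaces each category with its count, or with its relative frequency when `normalize=True`. The data and the helper name `frequency_encode` are assumptions for illustration; category_encoders offers a `CountEncoder` class for the same purpose.

```python
from collections import Counter

# Hypothetical example data.
cities = ["Delhi", "Kolkata", "Delhi", "Mumbai", "Delhi", "Kolkata"]


def frequency_encode(values, normalize=False):
    """Replace each value with its count, or its share of all rows if normalize."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total if normalize else counts[value]
            for value in values]


city_counts = frequency_encode(cities)
city_shares = frequency_encode(cities, normalize=True)
print(city_counts)  # [3, 2, 3, 1, 3, 2]
print(city_shares)
```

Note that two categories with the same frequency receive the same encoded value, so the model cannot tell them apart from this feature alone.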

Output:-

Conclusion

The key takeaways from this article are:

Categorical variables usually take the form of ‘strings’ or ‘categories’ and have a finite number of possible values.
The two types of categorical data are ordinal and nominal.
There are various encoding techniques, such as label, one-hot, binary, target, hash, baseN, effect, and frequency encoding.
Of these, label and one-hot encoding are the most commonly used.