Ordinal Encoding
Overview
Ordinal encoding is a technique for transforming categorical features into a numerical format. Each label is mapped to a number that reflects its position in the ordering of the categories. For example, a feature with the values {low, medium, high} can be converted to {1, 2, 3}, where 1 represents low, 2 represents medium, and 3 represents high. Encoding is an essential step before training an ML model, as many ML algorithms do not support categorical data directly and require it to be converted into a numerical format.
Introduction to Categorical Data
Before getting into ordinal encoding, it is necessary to understand what categorical data is and the different kinds of categorical features.
A real-world dataset in a Data Science project generally consists of numerical and categorical features. Numerical features contain only numbers, i.e., integers or decimals. Categorical features can take only a limited, fixed set of values. These values represent categories, groups, or labels associated with the data, and are often stored as words or strings rather than numbers. A few examples of categorical features include -
- An animal variable with values of dog, cat, and bird
- A country variable with values India, USA, and Germany
- A product variable with values Samsung, Apple, and LG
Further, categorical features/variables can be divided into two categories as described below -
- Ordinal Categorical Variable - An ordinal categorical variable is a type of categorical variable in which the categories can be ordered or ranked. In other words, its categories have a clear, natural, and intrinsic order. A few examples of ordinal variables are economic status (low income, middle income, high income), educational experience (high school, bachelor's, master's), customer feedback ratings (strongly dislike, dislike, neutral, like, strongly like), etc.
- Nominal Categorical Variable - In nominal categorical variables, the categories have no inherent order or ranking among them. For example, gender (male, female, transgender), colors (blue, red, green, yellow), blood group (A+, B+, O+, O-), etc.
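The difference between the two kinds of variable can be sketched with pandas, which lets a categorical column be declared as ordered or unordered. This is a minimal illustration rather than part of the original walkthrough; the sample values are hypothetical and reuse the income and color examples above.

```python
import pandas as pd

# Ordinal variable: the category order is declared explicitly.
income = pd.Categorical(
    ["middle income", "low income", "high income"],
    categories=["low income", "middle income", "high income"],
    ordered=True,
)
print(income.codes.tolist())  # [1, 0, 2]
print(income.min())           # low income

# Nominal variable: the categories are distinct labels with no order.
colors = pd.Categorical(["blue", "red", "green"], ordered=False)
print(colors.ordered)  # False
```

Because the income categories are ordered, operations such as taking the minimum are meaningful; for the unordered color variable they are not.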
What is Encoding?
- The encoding of a categorical variable can be defined as the process of transforming the categorical variables into a numerical format. This is often necessary before training ML models, as most machine learning and deep learning algorithms require data to be in a numerical format.
- A few of the most common techniques to encode categorical variables include ordinal encoding, one-hot encoding, and binary encoding. The choice of technique depends on the characteristics of the categorical variable. For example, one-hot encoding is typically used for nominal variables, and ordinal encoding is used for ordinal variables.
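To make the contrast concrete, here is a small hand-written sketch in plain Python that one-hot encodes a nominal variable. The `colors` list is a hypothetical sample, and the column order (alphabetical) is an arbitrary choice made for this illustration.

```python
# Hypothetical nominal feature: the colors have no natural order.
colors = ["red", "blue", "green", "blue"]

# One column per category; alphabetical order is only a convention here.
categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Each value becomes a row with a single 1 in the column of its category.
one_hot = [[1 if value == c else 0 for c in categories] for value in colors]

for value, row in zip(colors, one_hot):
    print(value, row)
# red [0, 0, 1]
# blue [1, 0, 0]
# green [0, 1, 0]
# blue [1, 0, 0]
```

No row is "greater" than another in this representation, which is why one-hot encoding suits variables without an order.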
What is Ordinal Encoding?
- Ordinal encoding is a technique that transforms a categorical variable into a numerical format by assigning a unique integer to each of its categories, following the order of the categories. It is sometimes loosely referred to as label encoding, although that term usually describes assigning integers without regard to order (for example, scikit-learn's LabelEncoder, which is intended for target labels). For example, suppose we have customer feedback data collected through a survey or online feedback mechanism. It contains the categories very dissatisfied, dissatisfied, neutral, satisfied, and very satisfied. To encode this variable using ordinal encoding, we can assign numerical values as mentioned below -
- very dissatisfied - 1
- dissatisfied - 2
- neutral - 3
- satisfied - 4
- very satisfied - 5
- Ordinal encoding assumes that the categories of a variable have a clear, natural, and intrinsic order. It is not appropriate for nominal categorical variables, because no order exists between their categories. In the previous example, we assigned the lowest value, 1, to the very dissatisfied category and the highest value, 5, to the very satisfied category. This preserves the natural ordering of the categories: very dissatisfied < dissatisfied < neutral < satisfied < very satisfied is retained as 1 < 2 < 3 < 4 < 5. Now suppose we have another categorical variable with the categories red, blue, and green. We could encode it by assigning 1 to red, 2 to blue, and 3 to green, but this may lead to incorrect results. The encoded values carry the ordering 1 < 2 < 3, so a model may treat the colors as if red < blue < green, even though no such relationship exists.
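The mapping described above can be sketched in plain Python with a dictionary. The `feedback` list is a hypothetical sample of survey responses; the category names and codes are the ones used in this section.

```python
# Hypothetical survey responses using the categories from the example above.
feedback = ["neutral", "very satisfied", "dissatisfied", "satisfied", "very dissatisfied"]

# Explicit mapping that mirrors the natural order of the categories.
rank = {
    "very dissatisfied": 1,
    "dissatisfied": 2,
    "neutral": 3,
    "satisfied": 4,
    "very satisfied": 5,
}

# Replace each label with its code.
encoded = [rank[value] for value in feedback]
print(encoded)  # [3, 5, 2, 4, 1]

# Comparisons on the codes agree with the order of the categories.
print(rank["dissatisfied"] < rank["neutral"] < rank["satisfied"])  # True
```

Writing the mapping by hand keeps the order explicit, which is the property ordinal encoding relies on.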
Example: Encoding Categorical Data using Ordinal Encoding
Let’s understand how you can apply ordinal encoding to categorical features using Python libraries. We will use the OrdinalEncoder class provided by the sklearn library.
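A minimal sketch of this, assuming scikit-learn and NumPy are installed, is shown below. The feedback values are a hypothetical sample reusing the categories from the earlier example. Note that OrdinalEncoder assigns codes starting at 0 rather than 1 and returns floating-point values by default.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feedback column; the encoder expects a 2D array
# of shape (n_samples, n_features).
X = np.array([
    ["neutral"],
    ["very satisfied"],
    ["dissatisfied"],
    ["satisfied"],
    ["very dissatisfied"],
])

# Pass the category order explicitly. Without it, the encoder orders the
# categories alphabetically, which would not match their meaning.
order = ["very dissatisfied", "dissatisfied", "neutral", "satisfied", "very satisfied"]
encoder = OrdinalEncoder(categories=[order])

encoded = encoder.fit_transform(X)
print(encoded.ravel())  # [2. 4. 1. 3. 0.]

# The fitted encoder can map the codes back to the original labels.
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```

The `categories` argument takes one list per feature, which is why `order` is wrapped in an outer list here.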
Conclusion
- In ordinal encoding, categorical variables are transformed into numerical variables by assigning a unique number to each category according to the order of the categories.
- Ordinal encoding is only suitable for ordinal categorical features and can lead to incorrect results for nominal variables, whose categories have no order.