
Rare Label Encoding

 



Introduction

So far we have seen several techniques for encoding categorical variables, each with its own strengths. But let me pose a question before diving into another technique.

Question: Suppose a variable has around 50 distinct values, a few with a very high frequency and the rest with very little representation. Which encoding technique would you use here, and why?

Please share your answers below in the comment section. Even if you don't know the correct answer, please give it a try. By engaging yourself you will definitely learn more. DO NOT MOVE AHEAD TILL YOU HAVE THOUGHT/COMMENTED ON AN ANSWER. 

Now, on to our topic. Rare Label Encoding is a technique that groups together the values with very little representation compared to the others and assigns them to a common "Rare" label.

Let's take an example to understand it better. Suppose we have a dataset of 1000 cars and we need to predict something based on their brands. Out of these 1000 cars, 400 are from Tata, 300 from Maruti, 200 from Honda, 30 each from Kia and MG, and 10 each from Land Rover, Jaguar, BMW and Audi. The brands are clearly not equally represented: six of the nine brands together make up only 10% of the data. To handle this, we can club Kia, MG, Land Rover, Jaguar, BMW and Audi into one group, "Other brands". Doing so creates a new group of 100 cars, whose representation is comparable to that of the remaining brands.
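To make the arithmetic concrete, here is a minimal sketch of the car example in plain Python. The 5% cut-off (`TOL`) is an assumption chosen for illustration; the brand counts are the ones from the example.

```python
from collections import Counter

# Brand counts from the car example (1000 cars in total).
brand_counts = Counter({
    "Tata": 400, "Maruti": 300, "Honda": 200,
    "Kia": 30, "MG": 30,
    "Land Rover": 10, "Jaguar": 10, "BMW": 10, "Audi": 10,
})

# Assumed cut-off: a brand is "rare" if it covers less than 5% of the rows.
TOL = 0.05
total = sum(brand_counts.values())

# Keep frequent brands as they are and pool the rare ones under one label.
grouped = Counter()
for brand, count in brand_counts.items():
    label = brand if count / total >= TOL else "Other brands"
    grouped[label] += count

print(dict(grouped))
# {'Tata': 400, 'Maruti': 300, 'Honda': 200, 'Other brands': 100}
```

The six small brands end up as a single group of 100 cars, while the three large brands are left untouched.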

Hope that makes the technique clear. Next, we look at a few points to keep in mind and then try it out in practice.


Some Important Points

When performing Categorical Variable Encoding using Rare Label Encoding, there are a few points that one should keep in mind:


Before using this technique, we need to divide the dataset into train and test sets. 

  • Fit the encoder only on the train set, so that the list of frequent labels is learned from the train data alone.
  • Using this fitted encoder, encode the values of both the train and test sets.
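  • If a value never appears in the train set but is encountered in the test set, it is not among the learned frequent labels, so it is grouped under the rare label as well. Missing values (NaN) are a different matter: clean them beforehand, because the encoder may raise an error on them.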
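The fit-on-train, apply-to-both workflow can be sketched in a few lines of plain Python. The helper names `learn_frequent_labels` and `encode`, the 10% tolerance and the tiny lists are all made up for illustration, and sending labels unseen in train to the rare group is a design choice of this sketch, not a statement about any particular library.

```python
from collections import Counter

def learn_frequent_labels(train_values, tol=0.05):
    """Return the set of labels whose share of the train set is at least tol."""
    counts = Counter(train_values)
    n = len(train_values)
    return {label for label, c in counts.items() if c / n >= tol}

def encode(values, frequent, rare_label="Rare"):
    """Keep frequent labels and replace everything else with rare_label."""
    return [v if v in frequent else rare_label for v in values]

# Hypothetical data: "Kia" is rare in train, "Audi" never appears in train.
train = ["Tata"] * 8 + ["Maruti"] * 6 + ["Honda"] * 5 + ["Kia"] * 1
test = ["Tata", "Honda", "Kia", "Audi"]

frequent = learn_frequent_labels(train, tol=0.1)  # learned from train only
train_encoded = encode(train, frequent)
test_encoded = encode(test, frequent)

print(test_encoded)
# ['Tata', 'Honda', 'Rare', 'Rare']
```

Nothing is learned from the test set here: it is only transformed with what was learned on the train set.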

Advantages

  • This technique is quite simple to implement.
  • It does not expand the feature space.

Disadvantages

  • If rare labels are present only in the train set, they may cause over-fitting.
  • If rare labels are present only in the test set, our machine learning model will not know how to score them.

Practical


We will be using the feature-engine library for Python for demo purposes.

1. Importing the Libraries


Importing Rare Label Encoder, libraries and data


2. Data Cleaning & View


Data Cleaning & View

Since the data had many NaN (NULL) values, we had to clean them before proceeding further.
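How the cleaning was done is not shown, so the following is only a sketch of two common options with pandas, on a made-up frame with hypothetical `brand` and `price` columns: dropping the rows with a missing brand, or keeping them under an explicit "Missing" label.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "brand": ["Tata", np.nan, "Honda", "Tata", None, "Kia"],
    "price": [5.0, 7.5, np.nan, 6.0, 8.0, 9.0],
})

# View: how many values are missing in each column?
print(df.isnull().sum())

# Option 1: drop the rows where the categorical variable is missing.
cleaned = df.dropna(subset=["brand"])

# Option 2: keep the rows and treat "Missing" as a label of its own.
filled = df.assign(brand=df["brand"].fillna("Missing"))

print(cleaned.shape, filled.shape)
```

Which option is better depends on the data: dropping loses rows, while a "Missing" label keeps them and may itself end up in the rare group later.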

3. Initializing the Rare Label Encoder


Initializing the Rare Label Encoder


4. Transforming using Rare Label Encoder


Transforming using Rare Label Encoder

 

5. Verifying Data


Rare Label Encoding Result


Resources


Please comment below to get the complete dataset and libraries. 

Learn to install Anaconda Here.

Summary 


In this Quick Read, we studied Rare Label Encoding, a technique commonly used for Categorical Variable Encoding. We had a quick overview of the technique, saw its positives and negatives, and went through some quick points to remember.

We also performed a practical demo of this technique using the popular Python library "feature-engine".

Last but not least... "Practice Makes One Perfect". So what are you waiting for? Practice this technique and comment below with your views, doubts or anything else. We are here to help you.
