Introduction
Count Frequency Encoding is one of the most commonly used methods for encoding categorical variables. It replaces each category either with the number of times it occurs in the data (its count) or with its share of all observations (its frequency).
Let's look at an example to understand it better.
[Image: Dummy Count Frequency Encoding]
Here we have created dummy data for 6 car companies and the colour of their best-selling car, shown on the left-hand side. On the right-hand side is the same list of cars, but the categorical variable, i.e. colour, has been encoded using the Count Frequency Encoder, by both count and percentage.
Two companies, Tata and Jaguar, have Grey as their best-selling colour. When encoding by count, both therefore get the value 2, because Grey appears twice in the dataset.
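The same idea can be sketched in a few lines of pandas. Only the Tata and Jaguar rows with Grey come from the example above; the other four companies and their colours are hypothetical placeholders, since the original table is not reproduced here.

```python
import pandas as pd

# Tata/Jaguar with Grey follow the example in the text;
# the remaining rows are hypothetical placeholders.
cars = pd.DataFrame({
    "company": ["Tata", "Jaguar", "Company C", "Company D", "Company E", "Company F"],
    "colour": ["Grey", "Grey", "Red", "White", "Red", "Red"],
})

# Count encoding: replace each colour with how often it occurs.
cars["colour_count"] = cars["colour"].map(cars["colour"].value_counts())

# Frequency encoding: replace each colour with its share of all rows.
cars["colour_freq"] = cars["colour"].map(cars["colour"].value_counts(normalize=True))

print(cars)
```

Both Grey rows receive a count of 2 and a frequency of 2/6.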
Some Important Points
When performing Categorical Variable Encoding using Count Frequency Encoding, there are a few points to keep in mind:
- Before using this technique, divide the dataset into train and test sets.
- Fit the encoder on the train set only.
- Use the fitted encoder to encode the values in both the train and test sets.
- This technique can be used for both numerical and categorical fields.
- If a value is absent from the train set when the encoder is fitted but appears in the test set, the encoder has no count for it. Depending on the library and its settings, this results in a missing value or an error.
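The points above can be illustrated with a minimal pandas sketch. The data is hypothetical, and the split is written by hand rather than with a splitting utility. With plain pandas, a value that was never seen in the train set becomes a missing value; a dedicated encoder may instead warn or raise.

```python
import pandas as pd

# Hypothetical train and test sets; "Blue" never appears in train.
train = pd.DataFrame({"colour": ["Grey", "Grey", "Red", "White"]})
test = pd.DataFrame({"colour": ["Grey", "Red", "Blue"]})

# "Fit": learn the counts from the train set only.
counts = train["colour"].value_counts()

# "Transform": apply the same learned counts to both sets.
train_encoded = train["colour"].map(counts)
test_encoded = test["colour"].map(counts)

print(train_encoded.tolist())
print(test_encoded.tolist())  # the unseen "Blue" has no learned count
```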
Advantages
- This technique is quite simple to implement.
- It does not expand the feature space.
- It can be used for both numerical and categorical fields.
Disadvantages
- If two different values of a variable appear the same number of times, both are replaced by the same count.
- Values that share a count become indistinguishable after encoding, which may diminish the importance of the variable.
[Image: Disadvantage Count Frequency Encoding]
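A small sketch of this collision, using hypothetical colours and only the Python standard library:

```python
from collections import Counter

# Hypothetical data: "Red" and "Blue" each occur twice.
colours = ["Red", "Red", "Blue", "Blue", "Green"]

counts = Counter(colours)
encoded = [counts[c] for c in colours]

# "Red" and "Blue" are both encoded as 2, so after encoding
# the two colours can no longer be told apart.
print(encoded)
```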
Practical
We will be using the feature-engine library for Python for demo purposes.
1. Importing the Libraries
[Image: Importing Count Frequency Encoder & Data]
2. Viewing the Data
[Image: Dataset preview]
3. Initializing the Count Frequency Encoder
[Image: Initializing the Count Frequency Encoder]
Here we have set "encoding_method" to 'count'. To replace the values with their frequency instead, use 'frequency' in place of 'count'.
4. Transforming using Count Frequency Encoder
[Image: Count Encoder transform]
Notice that the data contains a total of 577 males and 314 females. These counts are the values that will be used for encoding.
5. Verifying Data
[Image: Verifying Data]
Resources
Please comment below to get the complete dataset and libraries.
Summary
In this Quick Reads, we studied Count Frequency Encoding, a technique that is commonly used for Categorical Variable Encoding. We had a quick overview of the technique, looked at its advantages and disadvantages, and covered some quick points to remember.
We also performed a practical demo of this technique using the popular Python library "feature-engine".
Last but not least, "Practice makes perfect". So what are you waiting for? Practice this technique and comment below with your views, doubts or anything else. We are here to help you.