Count Frequency Encoding

Introduction

The first method commonly used for Categorical Variable Encoding is "Count Frequency Encoding". It replaces each value of a categorical variable either with the number of times that value occurs in the data or with its percentage share of all observations.

Let's see an example to understand it better:

Dummy Count Frequency Encoding

Here we have created dummy data of 6 car companies and the colour of their best-selling cars, shown on the left-hand side. On the right-hand side we can see the same list of cars, but the Categorical Variable, i.e. colour, has been encoded using the Count Frequency Encoder, by both Count and Percentage.

Two companies, Tata and Jaguar, have Grey as their most sold colour. Therefore, when encoding by count, both get the value 2, denoting that Grey appears twice in the dataset.
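To make the arithmetic concrete, here is a minimal sketch in plain Python. Only the two Grey rows (Tata and Jaguar) come from the example above; the other four companies and colours are made up for illustration, since the original table is not reproduced here.

```python
from collections import Counter

# Hypothetical stand-in for the dummy table: only Tata and Jaguar
# with Grey are taken from the text; the other rows are made up.
cars = [
    ("Tata", "Grey"),
    ("Jaguar", "Grey"),
    ("Honda", "White"),
    ("Ford", "Red"),
    ("Audi", "White"),
    ("Kia", "White"),
]

colours = [colour for _, colour in cars]
counts = Counter(colours)                            # colour -> number of rows
total = len(colours)
shares = {c: n / total for c, n in counts.items()}   # colour -> share of all rows

# Print each row with its count encoding and its percentage-share encoding
for company, colour in cars:
    print(company, colour, counts[colour], round(shares[colour], 2))
```

Both Grey rows receive the count 2 and the share 2/6, which is what the encoded table shows for Tata and Jaguar.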

Some Important Points

When performing Categorical Variable Encoding using Count Frequency Encoding, there are a few points that one should keep in mind:

Before using this technique, we need to divide the dataset into train and test sets.
Fit the encoder on the train set only.
Use the fitted encoder to encode the values in both the train and test sets.
This technique can be used for both Numerical and Categorical fields.
If a value is absent from the train set when the encoder is fitted but appears in the test set, the encoder has no count for it. Depending on the library and its settings, such values produce an error or a missing value.
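The points above can be sketched in plain Python. This is an illustration of the idea rather than any library's implementation; the helper names `fit_counts` and `encode` and the sample colours are hypothetical.

```python
from collections import Counter

def fit_counts(values):
    """Learn a value -> count mapping from the training column only."""
    return dict(Counter(values))

def encode(values, mapping, unseen=None):
    """Replace each value by its learned count; unknown values get `unseen`."""
    return [mapping.get(v, unseen) for v in values]

train = ["Grey", "White", "Grey", "Red", "White", "Grey"]
test = ["White", "Blue", "Grey"]   # "Blue" never appears in the train set

mapping = fit_counts(train)              # fitted on train only
train_encoded = encode(train, mapping)   # same mapping applied to train...
test_encoded = encode(test, mapping)     # ...and to test

print(mapping)
print(train_encoded)
print(test_encoded)
```

In this sketch the unseen value "Blue" becomes `None`. A stricter version could raise an exception instead, which matches the error case described above.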

Advantages

This technique is quite simple to implement.
It does not expand the feature space.
Can be used for both Numerical and Categorical fields.

Disadvantages

If two different values in a category appear the same number of times, both will be replaced by the same count.
Replacing different values with the same count may diminish the importance of the variable, since the encoded column can no longer tell those values apart.

Disadvantage of Count Frequency Encoding
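A small sketch of this collision, using made-up colours: "Red" and "Blue" each occur twice, so after count encoding they become indistinguishable.

```python
from collections import Counter

# Made-up column in which two different values share the same count
colours = ["Red", "Blue", "Red", "Blue", "Green"]

counts = Counter(colours)
encoded = [counts[c] for c in colours]

distinct_before = len(set(colours))   # distinct categories in the raw column
distinct_after = len(set(encoded))    # distinct values left after encoding

print(encoded)
print(distinct_before, distinct_after)
```

Three categories go in but only two distinct encoded values come out, so some information is lost.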

Practical

We will be using the feature-engine library for Python for demo purposes.

1. Importing the Libraries

Importing Count Frequency Encoder & Data

2. Viewing the Data

Dataset preview

3. Initializing the Count Frequency Encoder

Initializing the Count Frequency Encoder

Here we have set "encoding_method" to "count". To replace the values with their frequency instead, use 'frequency' in place of 'count'.

4. Transforming using Count Frequency Encoder

Count Encoder transform

We can see here that there were 577 males and 314 females in total. These counts are the values that will be used for encoding.

5. Verifying Data

Verifying Data

Resources

Please comment below to get the complete dataset and libraries.

Learn to install Anaconda Here.

Learn about the Feature-Engine library Here.

Summary

In this Quick Read, we studied Count Frequency Encoding, a technique that is commonly used for Categorical Variable Encoding. We had a quick overview of the technique, looked at its advantages and disadvantages, and covered some important points to remember.

We also performed a practical demo of this technique using a well-known Python library, "feature-engine".

Last but not least... "Practice Makes ~~a Man~~ One Perfect". So what are you waiting for? Practice this technique and comment below with your views, doubts or anything else. We are here to help you.

QuickDataScience | Quick & Easy Data Science