
Mean Encoding or Target Encoding

 



Introduction 


The mean is one of the most widely used statistics. The first thing that comes to a data scientist's mind on seeing a large dataset is "calculate the mean". So why not use the same idea here and encode our categorical variables using the mean?

This technique of encoding the categorical variable with the Mean is known as "Mean Encoding" or "Target Encoding". 

It is called Target Encoding because each category of a variable is replaced by the mean of the target values observed for that category. Let's look at an example to understand it better.

Suppose we have a categorical variable holding the make of each car and a target variable holding its mileage. If the Tata cars in the data have a mean mileage of 50, every "Tata" value is encoded as 50; if the Honda cars have a mean mileage of 30, every "Honda" value is encoded as 30.
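A minimal sketch of the idea in plain Python: it computes the mean of the target for each category and then replaces each category with that mean. The rows and mileage figures are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical rows of (car make, mileage); the numbers are illustrative only.
rows = [("Tata", 48.0), ("Tata", 52.0), ("Honda", 28.0), ("Honda", 32.0)]

# Accumulate the sum and count of the target for each category.
totals = defaultdict(float)
counts = defaultdict(int)
for make, mileage in rows:
    totals[make] += mileage
    counts[make] += 1

# Mean target value per category.
means = {make: totals[make] / counts[make] for make in totals}

# Replace each category with its mean target value.
encoded = [means[make] for make, _ in rows]

print(means)
print(encoded)
```

Every row of the same make receives the same number, so the encoded column has as many distinct values as there are categories.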

Dummy Mean Encoding


Some Important Points


When performing categorical variable encoding with Mean Encoding, there are a few points to keep in mind:

  • Before using this technique, divide the dataset into train and test sets.
  • Fit the encoder on the train set only.
  • Use the fitted encoder to encode the values in both the train and test sets.
  • This technique can be used for both numerical and categorical fields.
  • If a category is absent from the train set when the encoder is fitted but appears in the test set, the encoder has no mean for it. Depending on the library and version, such values raise an error or are left as missing values.
  • By default, the feature-engine mean encoder encodes only the categorical variables.
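The points above can be sketched with plain pandas. This is a hand-rolled illustration under stated assumptions, not the feature-engine implementation: the `make` and `mileage` columns and their values are hypothetical, and the split into train and test is done by hand.

```python
import pandas as pd

# Hypothetical toy data: car make (categorical) and mileage (target).
train = pd.DataFrame(
    {"make": ["Tata", "Tata", "Honda", "Honda"],
     "mileage": [48.0, 52.0, 28.0, 32.0]}
)
# "Ford" never appears in the train set.
test = pd.DataFrame({"make": ["Tata", "Honda", "Ford"]})

# "Fit": learn one mean per category from the train set only.
category_means = train.groupby("make")["mileage"].mean()

# "Transform": apply the learned mapping to both sets.
train["make_encoded"] = train["make"].map(category_means)
test["make_encoded"] = test["make"].map(category_means)

print(train)
print(test)
```

In this sketch the unseen category "Ford" has no learned mean, so `map` leaves it as a missing value. A library encoder may instead raise an error or fill in a fallback value.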

Advantages

  • It captures information within the category, therefore creating more predictive features.
  • It creates a monotonic relationship between the encoded variable and the target, which makes it suitable for linear models.
  • It does not expand the feature space.
  • It is quite simple to implement.
  • It can be used for both numerical and categorical fields.

Disadvantages

  • It is prone to over-fitting, because the encoded values are derived from the target itself.
  • It is difficult to cross-validate with current libraries.

Practical

We will use the feature-engine library for Python for demo purposes.

1. Importing the Libraries

Importing Mean Encoder, libraries and data


2. Viewing the Data

Dataset Preview


3. Initializing the Mean Encoder

Initializing Mean Encoder


Here, while fitting the encoder (that is, training it on the data), we also need to pass the target variable.

4. Transforming using Mean Encoder 

Mean Encoder transform


5. Verifying Data

Mean Encoded 'Sex' variable


Resources


Please comment below to get the complete dataset and libraries. 

Learn to install Anaconda Here.


Summary 


In this Quick Read, we studied Mean Encoding, a technique commonly used for categorical variable encoding. We had a quick overview of the technique, looked at its advantages and disadvantages, and noted some points to remember.

We also performed a practical demo of this technique using the popular Python library "feature-engine".

Last but not least: "Practice Makes One Perfect". So what are you waiting for? Practice this technique and comment below with your views, doubts, or anything else. We are here to help you.
