
One Hot Encoding

Introduction

One of the most popular and widely used methods for encoding categorical variables is "One Hot Encoding". Most of us have come across it at some point in our data science journey, and it appears in countless Data Science and Machine Learning tutorials.

So, what makes this technique so special that everyone likes it?

One Hot Encoding encodes each category of a categorical variable as a separate binary variable taking the values 1 and 0 only: 1 indicates that the category is present in a given row, and 0 indicates that it is absent. In other words, one new column is added for each distinct value of the variable, indicating whether that value occurs in the row.

Let's have an example to understand it better 

Dummy One Hot Encoding
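As a quick sketch of the idea, the pandas `get_dummies` function produces exactly this kind of table (the brand names below are illustrative dummy data, not the article's exact dataset):

```python
import pandas as pd

# Illustrative dummy data, similar in spirit to the article's car example
cars = pd.DataFrame({"Brand": ["Tata", "Honda", "Ford", "Toyota"]})

# One new 0/1 column per distinct brand
encoded = pd.get_dummies(cars, columns=["Brand"], dtype=int)

# encoded has columns Brand_Ford, Brand_Honda, Brand_Tata, Brand_Toyota;
# the row for 'Tata' is 0, 0, 1, 0 — a 1 only in its own column
print(encoded)
```

Each row contains exactly one 1, marking its own category, and 0 everywhere else.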

Some Important Points

While performing Categorical Variable Encoding using One Hot Encoding, there are a few points to keep in mind:

  • Before using this technique, divide the dataset into train and test sets.
  • Fit the encoder only on the train set.
  • Use this fitted encoder to transform the values in both the train and test sets.
  • The technique is meant for categorical fields; discrete numeric codes can also be treated as categories if needed.
  • If a category that was absent from the train set at fitting time is encountered in the test set, the encoder will raise an error for such values.
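The fit-on-train / transform-both workflow, and the error on unseen categories, can be sketched like this (shown with scikit-learn's `OneHotEncoder` for illustration; the article's demo uses feature-engine, and the colour data here is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test split for illustration
train = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"colour": ["blue", "yellow"]})  # 'yellow' never appears in train

enc = OneHotEncoder(handle_unknown="error")  # default: error on unseen categories
enc.fit(train)                               # fit on the train set only

# Transform the train set: one binary column per distinct category
print(enc.transform(train).toarray())

# The test set contains 'yellow', which was missing when the encoder
# was fitted, so transforming it raises an error
try:
    enc.transform(test).toarray()
    unseen_ok = True
except ValueError:
    unseen_ok = False
print("unseen category accepted?", unseen_ok)
```

Fitting on the train set alone is what prevents information from the test set leaking into the encoding.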

Advantages

  • Easy and straightforward to implement.
  • Makes no assumption about the distribution or categories of the categorical variable.
  • All the information in the categorical variable is kept intact.
  • Best suited for linear models.

Disadvantages

  • It expands the feature space.
  • No extra information is added while encoding.
  • Dummy variables may carry redundant information (for example, the last column is fully determined by the others).

A Hidden Concept

One major concept in One Hot Encoding that we would like to explain is how the encoding is actually done. To understand it better, let's go back to our "Cars" dummy data.


 
In the above dummy example, we can see there are seven brands of cars (7 rows), but after encoding, only 6 columns appear on the right-hand side, i.e. N-1 columns, where N is the number of distinct categories.

This is because, looking closely, every car has already been assigned a unique binary pattern on the right-hand side, so adding a seventh column for 'Jaguar' is not required.

Each column is marked '1' for its corresponding row value. For example, Ford is represented as 001000, Tata as 100000, and so on. Since Tata is the first row, its first value is 1 and the rest are 0; similarly, Ford is third, so its third value is 1 and the rest are 0. Jaguar, the last value, can then be represented as 000000, meaning "none of the previous six brands".

So, in short, One Hot Encoding needs only N-1 binary variables to represent N distinct categories.
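This N-1 representation corresponds to dropping one dummy column, which pandas exposes as `drop_first=True` (a sketch; the brand list is dummy data, and pandas drops the alphabetically first category rather than the last):

```python
import pandas as pd

cars = pd.DataFrame(
    {"Brand": ["Tata", "Maruti", "Ford", "Hyundai", "Honda", "Toyota", "Jaguar"]}
)

# Full encoding: 7 distinct brands -> 7 columns
full = pd.get_dummies(cars, columns=["Brand"], dtype=int)

# Dropping one column still identifies every brand uniquely:
# the dropped brand becomes the all-zeros row (N - 1 = 6 columns)
reduced = pd.get_dummies(cars, columns=["Brand"], drop_first=True, dtype=int)

print(full.shape[1], reduced.shape[1])  # 7 6
```

Exactly one row in the reduced table is all zeros: the row whose category was dropped.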

Practical

We will be using the feature-engine library of Python for demo purposes. We could also use the pandas `get_dummies` method for the same.

1. Importing the Libraries

Importing One Hot Encoder, libraries and data


2. Viewing the Data

Dataset Preview


3. Initializing the One Hot Encoder

Initializing One Hot Encoder


4. Transforming using One Hot Encoder 


One Hot Encoder transform


5. Verifying Data

One Hot Encoded 'Sex' Variable


Here we can notice that the 'Sex' column has been encoded into two columns, 'Sex_males' & 'Sex_female'. This is because, by default, OneHotEncoder generates a binary column for every value; to drop one of the columns we need to specify it when initializing OneHotEncoder.

Resources


Please comment below to get the complete dataset and libraries. 

Learn to install Anaconda Here.


Summary 


In this Quick Read, we studied One Hot Encoding, a technique commonly used for Categorical Variable Encoding. We had a quick overview of the technique, saw its positives and negatives, and covered some quick points to remember.

We also walked through a practical demo of this technique using the popular Python library "feature-engine".

Last but not least... "Practice makes perfect". So what are you waiting for? Practice this technique and comment below with your views, doubts or anything else. We are here to help you.
