
Posts

Showing posts with the label Count Frequency Encoding

Rare Label Encoding

Introduction

So far we have seen several techniques for encoding categorical variables, each with its own strengths. Before diving into another one, let me pose a question.

Question: Suppose a variable has around 50 distinct values, a few of them occurring very frequently and others only rarely. Which encoding technique would you use here, and why?

Please share your answers in the comment section below. Even if you don't know the correct answer, give it a try; engaging with the question will help you learn more. DO NOT MOVE AHEAD UNTIL YOU HAVE THOUGHT ABOUT OR COMMENTED AN ANSWER.

Now, on to our topic. Rare Label Encoding is a technique that groups values together under a common "Rare" label when they have very little representation compared with the other values. Let's look at an example to understand it better. Suppose we have a dataset of 100

Ordinal Encoding

Introduction

When we talk about encoding, one question that usually comes to mind is: why can't we simply list all the values of a variable and number them 1, 2, 3, 4 and so on, just like we did as children while playing?

The answer is YES, we can, and that is exactly what we are going to do here.

Ordinal Encoding replaces the values of a categorical variable with ordinal numbers such as 1, 2, 3, 4, etc. The numbers can either be assigned arbitrarily or be derived from some quantity, such as the mean of the target.

Arbitrary Ordinal Encoding: the ordinal numbers are assigned to the categories in an arbitrary order.

Mean Ordinal Encoding: the ordinal numbers are assigned to the categories according to their target mean (just as in Mean/Target Encoding).
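The two variants can be sketched in a few lines of plain Python. This is only an illustration, not the code from the post: the helper names, the colour list, and the `sold` target are all made up for the example, and "arbitrary" is taken here to mean order of first appearance.

```python
from collections import defaultdict


def arbitrary_ordinal_mapping(values):
    """Assign 1, 2, 3, ... to categories in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
    return mapping


def mean_ordinal_mapping(values, targets):
    """Rank categories by their mean target value (lowest mean gets 1)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in sums}
    ordered = sorted(means, key=means.get)
    return {v: rank for rank, v in enumerate(ordered, start=1)}


# Hypothetical data: a colour column and a binary target.
colours = ["Red", "Grey", "Blue", "Grey", "Red", "Blue"]
sold = [1, 0, 1, 0, 1, 0]

arbitrary = arbitrary_ordinal_mapping(colours)
by_mean = mean_ordinal_mapping(colours, sold)

encoded_arbitrary = [arbitrary[c] for c in colours]
encoded_by_mean = [by_mean[c] for c in colours]
```

With the arbitrary mapping the numbers carry no meaning beyond identity, whereas the mean-based mapping orders the categories by how they relate to the target.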

One Hot Encoding

Introduction

One of the best-known, most talked-about and most common methods of categorical variable encoding is "One Hot Encoding". We have all seen or heard of this method somewhere in our DS journey, and it is often shown in Data Science and Machine Learning videos.

So, what makes this technique so special that everyone likes it?

One Hot Encoding represents each distinct value of a categorical variable with its own binary variable, i.e. one that takes only the values 1 and 0: 1 indicates that the value is present in that row and 0 that it is absent. As many new columns are added as there are distinct values in the variable.

Let's look at an example to understand it better.

Dummy One Hot Encoding
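As a rough sketch of the idea in plain Python (not the post's own code; the function name, the `is_` column prefix, and the colour list are assumptions made for the illustration):

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        {f"is_{c}": int(v == c) for c in categories}
        for v in values
    ]
    return categories, rows


# Hypothetical colour column.
colours = ["Grey", "Red", "Grey", "Blue"]
categories, rows = one_hot_encode(colours)

for colour, row in zip(colours, rows):
    print(colour, row)
```

Each row contains exactly one 1, in the column that matches its original value, and there is one column per distinct value.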

Mean Encoding or Target Encoding

Introduction

One technique that is used almost everywhere is the mean. The first thing that comes to a Data Scientist's mind on seeing a large dataset is "calculate the mean". So why not use the same idea here and encode our categorical variables using the mean?

This technique of encoding a categorical variable with the mean is known as "Mean Encoding" or "Target Encoding". It is called Target Encoding because the mean for each value of the variable is calculated from the target values. Let's look at an example to understand it better.

Suppose we have a variable containing car makes and another variable containing the mileage of the cars. If a car from Tata has a mileage of 50, its value is encoded as 0.5; another car from Honda with a mileage of 30 is encoded as 0.3.

Dummy Mean Encoding
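A minimal sketch of the per-category mean computation in plain Python follows. The function name and the mileage figures are hypothetical, and the sketch returns the raw mean of the target for each make; any rescaling of that mean, such as the 0.5 and 0.3 values in the example above, is deliberately left out.

```python
from collections import defaultdict


def mean_encode(values, targets):
    """Replace each category with the mean target of its rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values], means


# Hypothetical data: car make and mileage (the target).
makes = ["Tata", "Honda", "Tata", "Honda"]
mileage = [50, 30, 40, 20]

encoded, means = mean_encode(makes, mileage)
```

In practice the means are usually computed on the training data only and then applied to unseen data, so that the target does not leak into the features.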

Count Frequency Encoding

Introduction

The first method commonly used for categorical variable encoding is "Count Frequency Encoding". It replaces each value of a categorical variable either with the number of times that value occurs or with its percentage share of all observations.

Let's look at an example to understand it better.

Dummy Count Frequency Encoding

Here we have created dummy data of 6 car companies and the colour of their best-selling cars, shown on the left-hand side. On the right-hand side is the same list of cars, but with the categorical variable, i.e. colour, encoded using the Count Frequency Encoder, by both count and percentage.

Two companies, Tata and Jaguar, have Grey as their best-selling colour. Therefore, when encoding by count, both get the value 2, denoting that their colour appears twice in the dataset.
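The same encoding can be sketched in plain Python. Only the fact that Grey appears twice among six rows comes from the example above; the other colours and the function name are made up for the illustration.

```python
from collections import Counter


def count_frequency_encode(values):
    """Return each value's count and its share of all rows."""
    counts = Counter(values)
    total = len(values)
    by_count = [counts[v] for v in values]
    by_share = [counts[v] / total for v in values]
    return by_count, by_share


# Six rows; Grey occurs twice, the remaining colours are hypothetical.
colours = ["Grey", "Red", "Blue", "Grey", "White", "Black"]

by_count, by_share = count_frequency_encode(colours)
```

Both Grey rows receive the count 2 and the share 2/6, while every other colour receives 1 and 1/6. Note that two different colours with the same count become indistinguishable after this encoding.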