Posts

Showing posts with the label encoding

Decision Tree Encoding

  Introduction A Decision Tree is a flowchart-like structure in which each internal node represents a condition on an attribute with binary outputs (e.g. Head or Tail in a coin flip). It consists of nodes and branches, where a node represents a condition and the branches represent the possible outcomes. Decision trees are very helpful in predicting the binary outcomes of an action, and they can be used not only for building predictive models but also for Imputation, Encoding, etc. In the case of Variable Encoding, the variables are encoded based on the predictions of the Decision Tree: a single feature and the target variable are used to fit a decision tree, and then the values of the original dataset are replaced with the predictions from the Decision Tree.
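The fit-then-replace steps above can be sketched with scikit-learn and pandas. This is a minimal illustration, not the post's own demo; the toy data, the `max_depth` of 2 and the column names are all assumptions.

```python
# Minimal sketch of decision tree encoding (illustrative data).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: one categorical feature and a numeric target.
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C", "C"],
    "price": [100, 110, 200, 210, 300, 310],
})

# Step 1: give each category a provisional numeric code so the tree can use it.
df["city_code"] = df["city"].astype("category").cat.codes

# Step 2: fit a shallow decision tree on this single feature vs. the target.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(df[["city_code"]], df["price"])

# Step 3: replace the original values with the tree's predictions.
df["city_encoded"] = tree.predict(df[["city_code"]])
print(df[["city", "city_encoded"]])
```

With a depth of 2 the tree separates all three categories, so each value is encoded with the mean target of its category.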

Rare Label Encoding

  Introduction So far we have seen many techniques for encoding categorical variables, all with impressive capabilities and performance. But let me pose a question before diving into another new technique.  Ques.:-   Suppose we have around 50 different values for a variable, a few having a very high frequency of representation and some with very little representation. Which technique would you use for encoding here, and why?  Please share your answers below in the comment section. Even if you don't know the correct answer, please give it a try; by engaging yourself you will definitely learn more. DO NOT MOVE AHEAD UNTIL YOU HAVE THOUGHT OF/COMMENTED AN ANSWER.  Now, continuing with our topic: Rare Label Encoding is a technique used to group values together and assign them a common "Rare" label if they have very little representation compared to the other values. Let's look at an example to understand it better. Suppose we have a dataset of 100
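The grouping described above can be sketched in a few lines of pandas. This is an illustrative sketch, not the post's demo: the 5% threshold, the category counts and the literal "Rare" label are all assumptions.

```python
# Minimal sketch of rare label encoding with pandas (illustrative data).
import pandas as pd

# 100 observations: two dominant colours and two rare ones.
s = pd.Series(["red"] * 50 + ["blue"] * 45 + ["green"] * 3 + ["pink"] * 2)

# Compute each category's share of the data.
freq = s.value_counts(normalize=True)

# Categories below the threshold get grouped under one "Rare" label.
rare = freq[freq < 0.05].index
encoded = s.where(~s.isin(rare), "Rare")
print(encoded.value_counts())
```

Here "green" (3%) and "pink" (2%) fall below the 5% cut-off and are merged into a single "Rare" category, leaving only three labels to encode downstream.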

Ordinal Encoding

  Introduction When we talk about encoding, one thing that usually comes to mind is: why can't we simply write down all the values from a variable in a list and assign them the numbers 1, 2, 3, 4... and so on? Just like we did in our childhood while playing..!!!  The answer is YES..!!! We can do it... in fact, we are going to do it here...  Ordinal Encoding is encoding the categorical variables with ordinal numbers like 1, 2, 3, 4, etc. The numbers can either be assigned as 'Arbitrary' values or be based on some value like the target mean.   Arbitrary Ordinal Encoding:- Here the ordinal numbers are allotted to the categories in an arbitrary order. Mean Ordinal Encoding:- Here the ordinal numbers are allotted to the categories based on the target mean value (just like we did in Mean/Target Encoding).
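Both flavours can be sketched with pandas. This is an illustrative sketch under assumed data; in the "arbitrary" case the codes here simply follow alphabetical category order, and the mean-based ranking starts from 0 rather than 1.

```python
# Minimal sketch of arbitrary vs. mean ordinal encoding (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "S", "L", "L"],
    "price": [10, 20, 30, 12, 28, 32],
})

# Arbitrary ordinal encoding: codes assigned without regard to the target
# (pandas uses alphabetical category order here).
df["size_arbitrary"] = df["size"].astype("category").cat.codes

# Mean ordinal encoding: rank categories by their target mean,
# then assign 0, 1, 2, ... in that order.
order = df.groupby("size")["price"].mean().sort_values().index
mapping = {cat: rank for rank, cat in enumerate(order)}
df["size_mean_ordinal"] = df["size"].map(mapping)
print(df)
```

Note how the mean-based codes preserve the target ordering (S < M < L by average price), which the arbitrary codes do not.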

One Hot Encoding

  Introduction One of the most famous, most talked-about and most common methods when it comes to categorical variable encoding is "One Hot Encoding". We have all seen or heard of this method somewhere in our DS journey, and it is often shown in Data Science or Machine Learning videos.  So, what makes this technique so special that everyone likes it..!!!  One Hot Encoding is defined as encoding each categorical value with its own binary variable, i.e. 1 & 0 only, such that 1 represents that the value is present and 0 that it is absent. In other words, as many new columns are added as there are distinct values in the variable, each indicating whether that value is present or not.  Let's look at an example to understand it better  Dummy One Hot Encoding
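The one-column-per-value idea can be sketched with pandas' built-in `get_dummies`; the colour data below is made up for illustration.

```python
# Minimal sketch of one hot encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({"colour": ["Red", "Grey", "Blue", "Grey"]})

# Each distinct colour becomes its own 0/1 column.
one_hot = pd.get_dummies(df["colour"], dtype=int)
print(one_hot)
```

With three distinct colours, three new binary columns appear, and each row has exactly one 1 marking its colour.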

Mean Encoding or Target Encoding

  Introduction  A technique that is most commonly used anywhere and everywhere is the 'Mean'. The first thing that comes to a Data Scientist's mind on seeing huge data is "calculate the Mean". So, why not use the same technique here as well and encode our categorical variables using the Mean?  This technique of encoding a categorical variable with the Mean is known as "Mean Encoding" or "Target Encoding".  It is called Target Encoding because the mean for each value of the variable is calculated over the target values. Let's look at an example to understand it better...  Suppose we have a variable of car brands and another variable containing the mileage of the cars. If a car from Tata has a mileage of 50, its value is encoded with 0.5, and another car from Honda with a mileage of 30 is encoded with 0.3.  Dummy Mean Encoding
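The replacement step can be sketched with a pandas groupby. This is an illustrative sketch: the data is made up, and it encodes with the raw per-brand target mean (the excerpt's 0.5/0.3 figures additionally scale the mileage to a 0–1 range, which would just be a division by 100 afterwards).

```python
# Minimal sketch of mean/target encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "brand": ["Tata", "Tata", "Honda", "Honda"],
    "mileage": [50, 50, 30, 30],
})

# Mean of the target within each category...
means = df.groupby("brand")["mileage"].mean()

# ...replaces the category itself.
df["brand_encoded"] = df["brand"].map(means)
print(df)
```

Every "Tata" row gets the Tata mean and every "Honda" row the Honda mean, so the categorical column becomes a numeric one that already carries target information.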

Count Frequency Encoding

Introduction The first method that is most commonly used for Categorical Variable Encoding is "Count Frequency Encoding". This method replaces each value of a categorical variable either with its count or with its percentage share of the total.  Let's see an example to understand it better Dummy Count Frequency Encoding Here we have created dummy data of 6 car companies and the colour of their best-selling cars on the left-hand side, while on the right-hand side we can see the same list of cars but with the categorical variable, i.e. colour, encoded using the Count Frequency Encoder, by both count and percentage.  Since there were 2 companies, Tata and Jaguar, with Grey as their most-sold colour, both got the value 2 when encoding by count, denoting that the value was repeated twice in the dataset.
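The count and percentage variants described above can be sketched with pandas. The six companies and colours below are made up to mirror the dummy example (only the Tata/Jaguar-both-Grey detail comes from the post).

```python
# Minimal sketch of count/frequency encoding (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "company": ["Tata", "Jaguar", "Honda", "Toyota", "Ford", "Kia"],
    "colour": ["Grey", "Grey", "Red", "Blue", "Red", "White"],
})

counts = df["colour"].value_counts()               # raw counts
freqs = df["colour"].value_counts(normalize=True)  # percentage share

df["colour_count"] = df["colour"].map(counts)
df["colour_freq"] = df["colour"].map(freqs)
print(df)
```

Grey appears twice, so both the Tata and Jaguar rows are encoded as 2 (count) or 2/6 ≈ 0.33 (frequency). One caveat worth noting: any two colours with the same count collapse to the same encoded value.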

Variable Encoding

Introduction  Computers are one of the best creations of Human Beings. They are so powerful and useful that what was once a luxury item has now become so common that it can be seen everywhere: in watches, cars, spaceships, etc. They have become so common that imagining life without them is like going back to the 'Stone Age'...  These computerised systems might be great, but they have one serious limitation: they work only on Numerical Data, more specifically Binary Data, i.e. 1 & 0 only. But the data we see around us can be Numerical, Alphabetical, Categorical, Visual, Audible and more.  Now, coming to the point: whether it is Machine Learning, Data Science, Deep Learning, or Artificial Intelligence, all of these work on data, i.e. they use data to deliver results. But as we know, datasets are, or can be, a mixture of Numerical, Alphabetical & Categorical data (let's ignore Audio & Visual data for now). Dealing with Numerical data is not an issue with computers

Encoding

  Welcome to another series of Quick Reads... This series of Quick Reads focuses on another major step in the process of Data Preprocessing, i.e. Variable Encoding.  We will be studying everything from what Variable Encoding is to which techniques we use, with their strengths and shortcomings, together with a practical demo. All of this in our series of Quick Reads. Trust us, when we say Quick Reads, we truly mean teaching and explaining some heavy concepts in Data Science in the same time it takes to cook our 'Maggi'.

  INDEX
  1. What is Variable Encoding?
  2. Techniques used for Variable Encoding
    2.1 Count Frequency Encoding
    2.2 Mean/Target Encoding
    2.3 One Hot Encoding
    2.4 Ordinal Encoding
    2.5 Rare Label Encoding
    2.6 Decision Tree Encoding