

Feature Scaling -- Scaling to Unit Length

Let's see a more technical feature scaling method that we can use for scaling our dataset. It is popularly known as "Scaling to Unit Length", because all the features of an observation are scaled down using one common value. Unlike the previous methods we have studied so far, which scale each feature using some value specific to that variable, here all the variables of an observation are combined to compute the scaling value. The scaling is done row-wise, so that each complete vector has a length of 1; in other words, the normalisation procedure normalises each observation's feature vector rather than each individual feature (column).

Note:- Scikit-learn recommends this scaling procedure for text classification or clustering.

Formula Used:-

Scaling to Unit Length can be done in 2 different ways:-

1. Using the L1 Norm:- The L1 Norm, popularly known as the Manhattan Distance, can be used to scale the dataset: each value is divided by the L1 norm of its row, x_scaled = x / l1(x), where l1(x) = |x1| + |x2| + ... + |xn| (the Manhattan Distance formula).

2. Using the L2 Norm:- The L2 Norm, popularly known as the Euclidean Distance, works the same way: x_scaled = x / l2(x), where l2(x) = sqrt(x1^2 + x2^2 + ... + xn^2).
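To make this concrete, here is a minimal sketch of row-wise scaling to unit length with scikit-learn's Normalizer; the small array of observations is made up purely for illustration:

import numpy as np
from sklearn.preprocessing import Normalizer

# Dummy data: each row is one observation with three numeric features
X = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 9.0],
              [5.0, 7.0, 5.0]])

# L1 scaling: divide every row by the sum of absolute values in that row
X_l1 = Normalizer(norm='l1').fit_transform(X)
print(X_l1.sum(axis=1))              # every row now sums to 1

# L2 scaling: divide every row by its Euclidean length
X_l2 = Normalizer(norm='l2').fit_transform(X)
print(np.linalg.norm(X_l2, axis=1))  # every row now has length 1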

Ordinal Encoding

Introduction

When we talk about encoding, one thing that usually comes to mind is: why can't we simply write down all the values of a variable in a list and assign them the numbers 1, 2, 3, 4... and so on, just like we did in our childhood while playing..!!!

The answer is YES..!!! We can do it... in fact, that is exactly what we are going to do here...

Ordinal Encoding replaces the categories of a categorical variable with ordinal numbers like 1, 2, 3, 4, etc. The numbers can either be assigned arbitrarily or be based on some value, such as the mean of the target.

Arbitrary Ordinal Encoding:- Here the ordinal numbers are allotted randomly to the categories.
Mean Ordinal Encoding:- Here the ordinal numbers are allotted to the categories based on the Target Mean value (just like we did in Mean/Target Encoding).
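As a quick illustration, here is a minimal sketch of both variants using plain pandas; the tiny brand/mileage DataFrame is invented for the example:

import pandas as pd

# Dummy data: car brand (categorical) and mileage (target)
df = pd.DataFrame({
    'brand':   ['Tata', 'Honda', 'Tata', 'Jaguar', 'Honda'],
    'mileage': [50, 30, 45, 20, 35],
})

# Arbitrary ordinal encoding: number the categories in the order they appear
arbitrary_map = {cat: i for i, cat in enumerate(df['brand'].unique())}
df['brand_arbitrary'] = df['brand'].map(arbitrary_map)

# Mean ordinal encoding: rank the categories by their mean target value,
# then assign ordinal numbers following that order
ordered = df.groupby('brand')['mileage'].mean().sort_values().index
mean_map = {cat: i for i, cat in enumerate(ordered)}
df['brand_mean_ordered'] = df['brand'].map(mean_map)

print(df)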

Mean Encoding or Target Encoding

Introduction

A technique that is commonly used anywhere and everywhere is the 'Mean'. The first thing that comes to a Data Scientist's mind on seeing huge data is "calculate the Mean". So, why not use the same technique here as well and encode our categorical variables using the Mean?

This technique of encoding a categorical variable with the Mean is known as "Mean Encoding" or "Target Encoding". It is called Target Encoding because the mean for each category of the variable is calculated from the Target values. Let's have an example to understand it better...

Suppose we have a variable containing car brands and a target variable containing the mileage of the cars. If the average mileage of Tata cars is 50 (0.5 once the target is scaled to the 0-1 range), the brand Tata is encoded as 0.5; similarly, if Honda cars average a mileage of 30, Honda is encoded as 0.3.

[Figure: Dummy Mean Encoding]
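Here is a minimal sketch of mean/target encoding with pandas; the dummy brand/mileage data is invented for illustration:

import pandas as pd

# Dummy data: car brand (categorical) and mileage (target)
df = pd.DataFrame({
    'brand':   ['Tata', 'Honda', 'Tata', 'Honda', 'Jaguar'],
    'mileage': [50, 30, 50, 30, 40],
})

# Mean/Target encoding: replace each brand with the mean mileage of that brand
target_means = df.groupby('brand')['mileage'].mean()
df['brand_encoded'] = df['brand'].map(target_means)

print(df)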

Count Frequency Encoding

Introduction

The first method that is commonly used for Categorical Variable Encoding is "Count Frequency Encoding". It replaces each category either with its count of occurrences or with its percentage share of the total observations.

Let's see an example to understand it better.

[Figure: Dummy Count Frequency Encoding]

Here we have created dummy data of 6 car companies and the colour of their best-selling cars on the left-hand side, while on the right-hand side we can see the same cars with the categorical variable, i.e. colour, encoded using the Count Frequency Encoder, by both Count and Percentage. Since 2 companies, Tata and Jaguar, had Grey as their most sold colour, both were encoded with the value 2 when encoding by count, denoting that Grey appears twice in the dataset and that both companies share the same value.
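A minimal sketch of both the count and the frequency (percentage) variants using pandas; the 6-row colour column is invented to mimic the dummy data described above:

import pandas as pd

# Dummy data: best-selling colour for 6 car companies
df = pd.DataFrame({
    'company': ['Tata', 'Jaguar', 'Honda', 'BMW', 'Audi', 'Kia'],
    'colour':  ['Grey', 'Grey', 'White', 'Black', 'Blue', 'Red'],
})

# Count encoding: replace each colour with how many times it appears
counts = df['colour'].value_counts()
df['colour_count'] = df['colour'].map(counts)

# Frequency encoding: replace each colour with its share of all observations
freqs = df['colour'].value_counts(normalize=True)
df['colour_freq'] = df['colour'].map(freqs)

print(df)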

Variable Encoding

Introduction

Computers are one of the best creations of Human Beings. They are so powerful and useful that what was once a luxury item has now become so common that it can be seen everywhere: in watches, cars, spaceships and more. They have become so common that imagining life without them is like going back to the 'Stone Age'...

These computerised systems might be great, but they have one serious issue: they work only on Numerical Data, more specifically Binary Data, i.e. 1s & 0s. But the data we see around us can be Numerical, Alphabetical, Categorical, Visual, Audible and others.

Now, coming to the point: whether it is Machine Learning, Data Science, Deep Learning or Artificial Intelligence, all of these work on data, i.e. they use data to deliver results. But as we know, datasets are usually a mixture of Numerical, Alphabetical & Categorical data (let's ignore Audio & Visual data for now). Dealing with Numerical data is not an issue with computers...

Encoding

Welcome to another series of Quick Reads... This series of Quick Reads focuses on another major step in the process of Data Preprocessing, i.e. Variable Encoding. We will study every detail, from what Variable Encoding is to which techniques we use, with their strengths and shortcomings, together with a practical demo. All of this is in our series of Quick Reads. Trust us, when we say Quick Reads, we truly mean teaching and explaining some heavy Data Science concepts in the time it takes to cook our 'Maggie'.

INDEX
1. What is Variable Encoding?
2. Techniques used for Variable Encoding
   2.1 Count Frequency Encoding
   2.2 Mean/Target Encoding
   2.3 One Hot Encoding
   2.4 Ordinal Encoding
   2.5 Rare Label Encoding
   2.6 Decision Tree Encoding

Imputation Using Feature Engine

Welcome back, folks...!!! Missing Data is one of the most unavoidable issues; it is always resting peacefully in our datasets, waiting to destroy our final Machine Learning models. Thus, when it comes to building a Machine Learning model for our requirement, the majority of the time is spent Cleaning, Analysing and Preparing the Dataset for the final model.

We will be focusing here on Imputing Missing Data, which is indeed a difficult, manual & time-consuming job. In this regard, in our previous articles we studied Imputation and the various techniques that can be used to ease our life. To reduce the time spent imputing variables, there are a few Python libraries that can automate the Imputation task to some extent. We have already studied one such tool, sklearn's SimpleImputer(), in previous articles. Here, we will be focusing on a new library, Feature Engine.
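As a taste of what the library offers, here is a minimal sketch of median imputation with Feature Engine, assuming a recent version (1.x) where the imputers live in feature_engine.imputation; the toy DataFrame and its column names are made up for the example:

import numpy as np
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

# Dummy data with missing values in two numeric columns
df = pd.DataFrame({
    'age':    [25, np.nan, 40, 35, np.nan],
    'salary': [50000, 60000, np.nan, 80000, 55000],
})

# Impute both columns with their median; fit() learns the medians from the data
imputer = MeanMedianImputer(imputation_method='median', variables=['age', 'salary'])
imputer.fit(df)

df_imputed = imputer.transform(df)
print(imputer.imputer_dict_)   # median learned for each variable
print(df_imputed)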