
Posts

Showing posts with the label missing data imputation

Imputation Using Feature Engine

Welcome back, folks...!!! Missing data is one of the most unavoidable issues: it sits quietly in our datasets, waiting to ruin our final Machine Learning models. When it comes to building a Machine Learning model, the majority of our time is therefore spent cleaning, analysing and preparing the dataset for the final model. Here we will focus on imputing missing data, which is indeed a difficult, manual and time-consuming job. In our previous articles we studied imputation and the various techniques that can make this easier. To reduce the time spent imputing variables, there are a few Python libraries that can automate the imputation task to some extent. We have already studied one such tool, sklearn's SimpleImputer(), in previous articles. Here, we will focus on a new library, Feature Engine.
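As a taste of what the full post covers, here is a minimal sketch of median imputation with Feature Engine. The column names ("age", "fare") are placeholders, and the import path is an assumption that may differ between library versions (older releases exposed the imputers under feature_engine.missing_data_imputers).

```python
# A minimal sketch of median imputation with Feature Engine.
# Column names ("age", "fare") are placeholders; the import path may differ
# in older versions of the library (feature_engine.missing_data_imputers).
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

df = pd.DataFrame({
    "age": [25, None, 40, 31, None],
    "fare": [7.25, 71.83, None, 8.05, 13.00],
})

imputer = MeanMedianImputer(imputation_method="median", variables=["age", "fare"])
imputer.fit(df)                      # learns the median of each listed variable
df_imputed = imputer.transform(df)   # fills the NaNs with the learned medians

print(imputer.imputer_dict_)         # the medians used for imputation
print(df_imputed)
```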

Imputation Using SimpleImputer

Welcome back, friends..!!! So far we have seen quite a few techniques that we can use for imputing missing values in a dataset. We have studied the theory behind them and seen basic code for applying them. But, as mentioned, there are other libraries we can use to implement these techniques, so we are going to study one of them: scikit-learn (sklearn) for imputation. Let's get our hands dirty and learn this library. *Please note: the theory is already covered in previous articles, so we will move directly to using the library. For demo purposes, we will be using the COVID-19 cases dataset.
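As a rough sketch of the idea, here is how sklearn's SimpleImputer can fill numeric columns with their mean. The column names below are invented for illustration, not taken from the COVID-19 dataset used in the post.

```python
# A small sketch of mean imputation with scikit-learn's SimpleImputer.
# Column names are placeholders, not the actual COVID-19 dataset columns.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "confirmed": [100, 250, np.nan, 400],
    "recovered": [80, np.nan, 150, 300],
})

imputer = SimpleImputer(strategy="mean")   # also: "median", "most_frequent", "constant"
imputed = imputer.fit_transform(df)        # returns a NumPy array

df_imputed = pd.DataFrame(imputed, columns=df.columns)
print(imputer.statistics_)                 # the per-column means that were learned
print(df_imputed)
```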

Missing Indicator Imputation

Welcome back, friends..!! We are back with another imputation technique, one that is a bit different from the techniques we have studied so far and serves an important purpose that, knowingly or unknowingly, we have been skipping. We studied many techniques such as Mean/Median, Arbitrary Value, CCA, Missing Category, End of Tail and Random Sample imputation. All of these were good enough to impute the missing values, but most of them failed to mark or flag the observations whose values were missing and had been imputed. Hence the Missing Indicator technique, designed with the sole purpose of marking the observations that have a missing value. It is mostly used together with one of the previously described imputation techniques. In simple terms, in this technique we use another column/variable to maintain a flag (a binary value: 0/1 or true/false) recording whether the original value was missing.
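A minimal sketch of the idea with pandas follows. The column name "age" is a placeholder, and median imputation stands in for whichever technique the indicator is paired with.

```python
# A minimal sketch of missing-indicator imputation with pandas.
# The column name "age" is a placeholder; median imputation is just one
# of the techniques the indicator is typically paired with.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, np.nan, 35, np.nan, 41]})

# 1. Add a binary flag that records which observations were missing.
df["age_missing"] = df["age"].isna().astype(int)

# 2. Impute the original column with any standard technique (median here).
df["age"] = df["age"].fillna(df["age"].median())

print(df)
# scikit-learn offers the same idea via SimpleImputer(add_indicator=True)
# or sklearn.impute.MissingIndicator.
```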

Random Sample Imputation

Till now we have seen techniques that were applicable to either numerical or categorical variables, but not both. So we would like to introduce a new technique that can easily be used for both numerical and categorical variables. Random Sample Imputation is widely used for both. Do not confuse it with Arbitrary Value Imputation; the names may sound similar, but the techniques are totally different. In terms of the principle behind the imputation, it is closer to Mean/Median/Mode Imputation: like those techniques, it preserves the statistical parameters of the original variable's distribution when filling in the missing data. Now let's look at the assumptions we need to keep in mind, and the advantages and limitations of this technique; after that we will get our hands dirty with some code.
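A rough sketch of the technique with pandas, under the assumption that filling each gap with a value drawn at random from the observed values of the same column is what the post implements; the column names and the helper function below are invented for illustration.

```python
# A rough sketch of random sample imputation with pandas.
# Column names and the helper are placeholders; a fixed random_state
# keeps the sampled fill values reproducible.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":  [25, np.nan, 40, np.nan, 31, 29],                   # numerical
    "city": ["Delhi", None, "Pune", "Delhi", None, "Mumbai"],   # categorical
})

def random_sample_impute(series, random_state=0):
    """Fill NaNs by sampling (with replacement) from the observed values."""
    series = series.copy()
    n_missing = series.isna().sum()
    if n_missing:
        samples = series.dropna().sample(n_missing, replace=True,
                                         random_state=random_state)
        samples.index = series[series.isna()].index   # align to the missing rows
        series[series.isna()] = samples
    return series

for col in ["age", "city"]:
    df[col] = random_sample_impute(df[col])

print(df)
```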

Missing Category Imputation

Till now, we have seen imputation techniques that could only be used for numerical variables, and said nothing about categorical variables/columns. So now we are going to discuss a technique that is mostly used for imputing categorical variables. Missing Category Imputation is the technique in which we add an additional category, such as "Missing", to the variable/column for the missing values. In simple terms, we do not take on the load of predicting or calculating a value (as we did for Mean/Median or End of Tail Imputation); we simply put "Missing" as the value. Now a doubt may arise: if we are only replacing the value with "Missing", why is it said that this method can be used for categorical variables only? Here is the answer: we can use it for numerical variables as well, but since we can't introduce a categorical value into a numerical variable/column, we are required to introduce some numerical value that is unique for the variable.
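A minimal sketch of both cases with pandas; the column names and the sentinel value -999 for the numerical column are assumptions made for illustration.

```python
# A minimal sketch of missing category imputation with pandas.
# Column names and the sentinel -999 for the numerical case are placeholders.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cabin": ["C85", None, "E46", None],   # categorical
    "age":   [22.0, np.nan, 35.0, 28.0],   # numerical
})

# Categorical variable: add an explicit "Missing" category.
df["cabin"] = df["cabin"].fillna("Missing")

# Numerical variable: use a value that cannot occur naturally in the column.
df["age"] = df["age"].fillna(-999)

print(df)
```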

Missing Data -- Understanding The Concepts

Introduction: Machine Learning seems to be a big, fascinating term that attracts a lot of people, and knowing what we can achieve through it takes our sci-fi imagination to another level. There is no doubt that it is a great field: we can build everything from an automated reply system to house-cleaning robots, from recommending a movie or a product to helping detect disease. Most of the things we see today have already started using ML to improve themselves. But while building a model is quite easy, the most challenging task is preprocessing the data and filtering out the data of use. So here I am going to address one of the biggest and most common issues we face at the start of the journey of building a good ML model: Missing Data. Missing data can cause many issues and can lead to wrong predictions from our model, making it look like the model has failed and we must start over again. To explain it in simple terms, data is like the fuel of our model.