
Posts

Showing posts with the label imputing data

Multi-variate Imputation of Chained Equation

We have already studied many techniques for Missing Data Imputation. Most of the techniques we studied are, or can be, used in our final production-ready model. But when it comes to imputing, there is always a chance of doing better, because we are never sure whether the imputed values are correct. Thus, to improve the imputation, we use Multiple Imputation, i.e. using more than one way to predict the values and then taking the average, or some other rule, to get the most suitable value. We have already seen a technique using similar logic, KNN Imputation, which uses the K-Nearest Neighbours algorithm to find the most suitable value. These techniques are better known as "Multi-Variate Imputation". Now we would like to introduce you to a newer and better technique, which has become a principal technique for Missing Data Imputation, known as MICE (Multi-variate Imputation of Chained Equation). …
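As a rough illustration of the chained-equations idea, here is a minimal sketch using scikit-learn's IterativeImputer (an assumption on our part: the full post may use a different library; the toy data is made up):

```python
# MICE-style imputation: each feature with missing values is modelled
# as a function of the other features, cycling for several rounds.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data with missing values (np.nan) in both columns.
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
print(X_filled)  # no NaNs remain
```

Note that IterativeImputer is still marked experimental, hence the explicit enabling import.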

Multiple Imputation

Imputation seems to be a simple term: "Replacing Missing Data". We have also learned many techniques to perform such imputation in a few lines of code. So, let me ask you a question now. Do you think that, in practical scenarios where we have very sensitive information like medical data, imputing missing data based on some random value would suffice? Will it impact the end analysis? Before reading ahead, do think about the above question and try to answer it for yourself. Coming to the answer: there is a high probability that we might bias the dataset with a static-value imputation. Imputation is never a simple job; it takes a lot of time and expertise to impute the correct values, and even then you can't be sure how your end model will perform, or whether you imputed the correct values. Thus, there was a need to devise a technique that could impute different plausible values and keep the best one. As of now, all the imputation techniques we saw …
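To make the "different plausible values" idea concrete, here is a hedged sketch of multiple imputation: draw several plausible completions of the data and pool them. The data and the choice of scikit-learn are our own illustration, not taken from the post (a full treatment would also pool variances via Rubin's rules; here we only average the point estimates):

```python
# Multiple imputation sketch: m stochastic completions, then pooling.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0],
              [4.0, np.nan],
              [np.nan, 5.0],
              [8.0, 9.0]])

m = 5  # number of imputations
completions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]
# Pool by averaging the m completed datasets.
X_pooled = np.mean(completions, axis=0)
print(X_pooled)
```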

Imputation Using Feature Engine

Welcome back, folks...!!! Missing data is one of the most unavoidable issues, always resting peacefully in our datasets, waiting to destroy our final Machine Learning models. Thus, when it comes to building a Machine Learning model for our requirement, a majority of the time is spent Cleaning, Analysing and Preparing our dataset for the final model. We will be focusing here on Imputing Missing Data, which indeed is a difficult, manual and time-consuming job. In this regard, in our previous articles, we studied Imputation and the various techniques that can be used to ease our life. To avoid, or better said reduce, the time spent imputing variables, there are a few Python libraries that can be used to automate the imputation task to some extent. We have already studied one such tool, sklearn's SimpleImputer(), in previous articles. Here, we will be focusing on a new library, Feature Engine.

Missing Indicator Imputation

Welcome back, friends..!! We are back with another imputation technique, one which is a bit different from the techniques we have studied so far, and which serves an important role that we have, knowingly or unknowingly, been skipping throughout the previous techniques. We studied many techniques like Mean/Median, Arbitrary Value, CCA, Missing Category, End of Tail and Random Samples. If we look closely, all these techniques were good enough to impute the missing values, but the majority of them failed to mark/flag the observations whose values were missing and were imputed. Thus, we bring here the Missing Indicator technique, which was designed with the sole purpose of marking the observations that have a missing value. This technique is mostly used together with one of the previously described imputation techniques. In simple terms, in this technique we use another column/variable to maintain a flag (a binary value: 0/1, true/false) …
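One way to combine a flag column with a standard imputation, sketched here with scikit-learn's SimpleImputer and its add_indicator option (an assumption: the post may instead build the flag manually with pandas; the data is a toy example):

```python
# Median imputation plus a missing-indicator column in one step.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0], [np.nan], [3.0], [np.nan]])

imputer = SimpleImputer(strategy="median", add_indicator=True)
X_out = imputer.fit_transform(X)
# Column 0: imputed values; column 1: 1.0 where the value was missing.
print(X_out)
```

The indicator column lets the downstream model learn whether "was missing" itself carries signal, which plain imputation throws away.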

Random Sample Imputation

Till now we have seen techniques that were applicable to either numerical or categorical variables, but not both. So we would like to make you familiar with a new technique that can easily be used for both numerical and categorical variables. Random Sample Imputation is a technique widely used for both numerical and categorical variables. Do not confuse it with Arbitrary Value Imputation; it may seem similar by name, but in fact it is totally different. Compared on the principle used for imputation, it is more similar to the Mean/Median/Mode Imputation techniques. This technique also preserves the statistical parameters of the original variable distribution for the missing data, just like Mean/Median/Mode Imputation. Now let's go ahead and have a look at the assumptions we need to keep in mind, and the advantages and limitations of this technique; after that, we will get our hands dirty with some code.
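The core idea can be sketched in a few lines of pandas: missing entries are replaced by values drawn at random from the observed values of the same column. The DataFrame and column names below are illustrative, not from the post:

```python
# Random sample imputation for a categorical and a numerical column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["NY", np.nan, "LA", "NY", np.nan],
                   "age": [22.0, np.nan, 35.0, 28.0, np.nan]})

rng = np.random.RandomState(0)  # fix the seed for reproducibility
for col in df.columns:
    missing = df[col].isna()
    # Sample with replacement from the observed (non-missing) values.
    sampled = df.loc[~missing, col].sample(n=missing.sum(),
                                           replace=True,
                                           random_state=rng).values
    df.loc[missing, col] = sampled
print(df)
```

Because the fill values are drawn from the observed distribution itself, the column's spread is preserved far better than with a single constant such as the mean.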

End of Tail Imputation

End of Tail Imputation is another important imputation technique. It was developed as an enhancement to, and to overcome the problems of, the Arbitrary Value Imputation technique. In the Arbitrary Value Imputation method, the biggest problem was selecting the arbitrary value to impute for a variable. This made it hard for the user to pick a value, and in the case of a large dataset it became even more difficult to select an arbitrary value every time and for every variable/column. So, to overcome the problem of selecting an arbitrary value every time, End of Tail Imputation was introduced, where the value is selected automatically from the variable's distribution. Now the questions are: How do we select the value? And how do we impute the end-of-tail value? There is a simple rule, given below. In the case of a normal distribution of the variable, we can use the mean plus/minus 3 times the standard deviation. In case the variable is skewed, …
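The normal-distribution rule quoted above (mean plus 3 times the standard deviation) can be sketched as follows; the skewed-variable variant is not shown since the excerpt cuts off there, and the data is a toy example:

```python
# End-of-tail imputation under the Gaussian rule: mean + 3 * std.
import numpy as np
import pandas as pd

s = pd.Series([4.0, 5.0, 6.0, np.nan, 5.5, np.nan])

# Compute the end-of-tail value from the observed values only
# (pandas skips NaN in mean() and std() by default).
end_of_tail = s.mean() + 3 * s.std()
s_filled = s.fillna(end_of_tail)
print(s_filled)
```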

Imputation Techniques

Welcome to a series of articles about Imputation techniques. We will be publishing short articles (quick notes) about the various imputation techniques used, their advantages, disadvantages, when to use them, and the coding involved. Not sure what Imputation is? And what Missing Data is? Why are they important? Click on the links to learn more about them.

1. Mean or Median Imputation
2. End of Tail Imputation
3. Missing Category Imputation
4. Random Sample Imputation
5. Missing Indicator Imputation
6. Mode Imputation
7. Arbitrary Value Imputation
8. Complete Case Analysis (CCA)

Python libraries used for quick and easy imputation:

9. SimpleImputer
10. Feature Engine
11. Multi-Variate Imputation

Mean or Median Imputation

To understand Mean or Median Imputation, we first need to revise the concepts of Mean and Median. Then it will be easy for us to see why this is such a widely used imputation method, and to identify its issues. We already studied Missing Data and defined Imputation and its basics in previous articles.

What is the Mean? The Mean is nothing but the arithmetic average of a set of numbers, which is why it is also referred to as the Average or Arithmetic Average. Finding the mean is quite simple: we add all the given values, keeping their signs (+/-), and then divide the total sum by the number of observations.

Average/Mean = (Sum of all the observations) / (No. of observations)
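Applied to imputation, the mean of the observed values simply replaces every missing entry. A minimal pandas sketch with made-up data:

```python
# Mean imputation: fill NaNs with the average of the observed values.
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, np.nan, 30.0])

# Mean of the observed values: (10 + 20 + 30) / 3 = 20.0
mean_value = s.mean()  # pandas skips NaN by default
s_filled = s.fillna(mean_value)
print(s_filled.tolist())  # [10.0, 20.0, 20.0, 30.0]
```

Swapping `s.mean()` for `s.median()` gives median imputation, which is more robust when the variable is skewed or has outliers.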