
Missing Indicator Imputation



Welcome back, friends! We are back with another imputation technique, one that is a bit different from the techniques we have studied so far and that fills an important role we have, knowingly or unknowingly, been skipping until now.

We have studied many techniques so far: Mean/Median, Arbitrary Value, CCA, Missing Category, End of Tail, and Random Sample imputation. All of them were good enough to impute the missing values, but most of them did nothing to mark/flag the observations that had missing values and were imputed.

Thus, we bring in the Missing Indicator technique, which was designed with the sole purpose of marking the observations that have a missing value. It is mostly used together with one of the previously discussed imputation techniques.

In simple terms, we add another column/variable that holds a binary flag (0/1 or true/false) denoting whether the observation has a missing value in the original variable.

So, let's go ahead and look at the assumptions we need to keep in mind and the advantages and limitations of this technique. After that, we will get our hands dirty with some code.

Key Points to Remember

  • We assume here that the data is not MAR (Missing At Random).
  • Every missing value has something to say, i.e. the missingness itself is predictive.
  • It is always used together with another imputation technique.
  • We can use this technique for both Numerical and Categorical Variables.
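The last two points can be sketched for a categorical variable as follows. This is only an illustration: the "City" column and its values are hypothetical, and the "Missing" label stands in for the Missing Category technique mentioned earlier.

```python
import pandas as pd

# Hypothetical categorical column with one missing value.
df = pd.DataFrame({"City": ["Pune", None, "Delhi"]})

# Step 1: flag the missing observations (1 = missing, 0 = present).
df["City_NA"] = df["City"].isnull().astype(int)

# Step 2: impute with another technique, here a separate "Missing" category.
df["City"] = df["City"].fillna("Missing")

print(df)
```

The flag has to be created before the fill, because after the fill there is nothing left to detect.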

Advantages

  • It is easy to implement.
  • This technique can capture the importance of "missingness".
  • No assumptions are made about the distribution of the variable's values.

Limitations

  • It expands the feature space.
  • This technique cannot be used alone; we still need another technique to impute the missing values.
  • If many variables have missing values, we need one missing indicator per variable. When the values are missing in the same observations, these indicators may end up identical or very highly correlated.
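The last limitation is easy to see on a toy example. In the sketch below, all column names and values are made up for illustration; "Age" and "Fare" are deliberately missing in the same rows, so their indicators come out identical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Age and Fare are missing in exactly the same rows.
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
        "Fare": [7.25, np.nan, 9.50, np.nan, 13.00],
        "Height": [np.nan, 160.0, 172.0, 168.0, 181.0],
    }
)

# One 0/1 indicator per column, named <column>_NA.
indicators = df.isnull().astype(int).add_suffix("_NA")

# Pairwise correlation between the indicators.
indicator_corr = indicators.corr()
print(indicator_corr)

# True when the two indicator columns carry exactly the same information.
same_flags = bool((indicators["Age_NA"] == indicators["Fare_NA"]).all())
print(same_flags)
```

When two indicators are identical like this, keeping only one of them loses nothing.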

Code


1. Importing the Libraries and the data.
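The dataset used for this step is not named in this note, so the sketch below builds a small stand-in DataFrame instead of reading a file. Only the "Age" column name comes from this article; the "Fare" and "City" columns and all the values are hypothetical. With a real file, we would load it with pd.read_csv instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; only the "Age" column name is from the article.
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
        "Fare": [7.25, 71.28, np.nan, 8.05, 13.00],
        "City": ["Pune", None, "Delhi", "Mumbai", "Pune"],
    }
)

print(df.shape)
print(df.head())
```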





2. Checking the percentage of missing values in each column.
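A minimal sketch of this check, again on hypothetical stand-in data: isnull() marks each missing cell as True, and the column-wise mean of those True/False values is the fraction of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
        "Fare": [7.25, 71.28, np.nan, 8.05, 13.00],
    }
)

# Fraction of missing values per column, expressed as a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```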




3. Performing Imputation



Here we simply check whether a value in the column is null. If it is, we write '1' to a new column "Age_NA"; otherwise we write '0'.
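One way to write this step is with np.where, sketched here on hypothetical "Age" values. The original notebook may have done it differently.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# 1 where Age is missing, 0 where it is present.
df["Age_NA"] = np.where(df["Age"].isnull(), 1, 0)

print(df)
```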

4. Checking Imputation. 
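A simple check, sketched on the same kind of hypothetical data, is to look at the original column next to its indicator and confirm that the number of flagged rows matches the number of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with the indicator already added.
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})
df["Age_NA"] = np.where(df["Age"].isnull(), 1, 0)

# Look at the original column and its indicator side by side.
print(df[["Age", "Age_NA"]].head())

# The number of flagged rows should equal the number of missing Age values.
flagged = int(df["Age_NA"].sum())
missing = int(df["Age"].isnull().sum())
print(flagged == missing)
```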




We can see that a new column was created with '1' for missing values and '0' for non-missing values. But the imputation is still incomplete: the original column still contains the missing values, so we do not have a complete dataset yet. Thus, we need to apply one of the other techniques to fill them in.
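To complete the dataset, the indicator can be paired with any of the earlier techniques. The sketch below uses median imputation purely as an example, and the data is again hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0, np.nan, 58.0]})

# Create the indicator first, while the missing values are still visible.
df["Age_NA"] = np.where(df["Age"].isnull(), 1, 0)

# Then fill the gaps with the median of the observed values.
age_median = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median)

print(df)
```

The order matters: if we fill the column first, there is nothing left for the indicator to flag.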

There are also many Python libraries that can perform Missing Indicator Imputation directly in a single line of code. We will cover that part in a separate article.
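As a small preview, one such option is scikit-learn's SimpleImputer with add_indicator=True, which imputes and appends the indicator columns in one step. The library choice, the median strategy and the data below are assumptions made for this sketch, not necessarily what the follow-up article will use.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; only Age has missing values.
df = pd.DataFrame(
    {
        "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
        "Fare": [7.25, 71.28, 8.05, 13.00, 26.55],
    }
)

# Median imputation plus one indicator column per feature that had missing values.
imputer = SimpleImputer(strategy="median", add_indicator=True)
result = imputer.fit_transform(df)

# Output columns: imputed Age, imputed Fare, then the indicator for Age.
result_df = pd.DataFrame(result, columns=["Age", "Fare", "Age_NA"])
print(result_df)
```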

Summary


In this Quick Note, we studied a popular imputation technique, Missing Indicator Imputation. We looked at the assumptions, advantages and limitations of the method, along with some basic Python code to implement it.

That's all from here. Until next time, this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.

