Missing Category Imputation


So far, we have seen imputation techniques that could only be used for numerical variables, but we haven't said anything about categorical variables/columns.

So now, we are going to discuss a technique that is mostly used for imputing categorical variables.

Missing Category Imputation is a technique in which we add an additional category, such as "Missing", for the missing values in the variable/column. In simple terms, we do not take on the load of predicting or calculating the value (like we did for Mean/Median or End tail Imputation); we simply put "Missing" as the value.

Now, we may wonder: if we are only replacing the value with "Missing", why is it said that this method can be used for categorical variables only?

Here is the answer: we can use it for numerical variables too. Since we can't introduce a categorical value into a numerical variable/column, we need to introduce some numerical value that does not otherwise occur in that variable/column. If we recall our previous topics, we were doing the same thing in "Arbitrary Value Imputation", where we used an arbitrary value to replace all the missing values.

Thus, this technique can also be called "Arbitrary Value Imputation for Categorical Variables". 
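The parallel between the two cases can be sketched as follows. This is a minimal illustration using pandas on a made-up DataFrame; the column names and the sentinel value -1 are assumptions chosen for the example, not taken from the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one categorical and one numerical column with gaps.
df = pd.DataFrame({
    "colour": ["red", np.nan, "blue", "red", np.nan],
    "age": [25.0, 31.0, np.nan, 40.0, np.nan],
})

# Categorical column: add a new label for the missing entries.
df["colour_imputed"] = df["colour"].fillna("Missing")

# Numerical column: use a number that does not occur in the observed values
# (-1 is an arbitrary choice for this sketch).
df["age_imputed"] = df["age"].fillna(-1)

print(df)
```

In both columns the missing entries end up sharing one distinct value, which is why the two techniques are essentially the same idea.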

Easy..!! isn't it...!!!

So now, let's go ahead and look at the key points we need to keep in mind, along with the advantages and limitations of this technique. After that, we will get our hands dirty with some code.

Key Points to Remember

  • We treat missing data as an additional label/category. 
  • We can use any arbitrary value (not necessarily "Missing").
  • This technique is most widely used for categorical variables.

Advantages

  • It is easy to implement.
  • We can get the complete dataset very quickly.
  • We can use this technique in production.
  • This technique can capture the importance of "missingness".
  • No assumptions are made about the data.
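To illustrate what capturing "missingness" can look like, here is a small sketch on invented data. The column names garage_type and price are hypothetical. Once the missing entries have their own label, we can compare the target across categories, including the "Missing" group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the names and values are made up for this sketch.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})

# Give the missing entries their own category.
data["garage_type"] = data["garage_type"].fillna("Missing")

# Average target value per category, with "Missing" as its own group.
mean_price = data.groupby("garage_type")["price"].mean()
print(mean_price)
```

If the "Missing" group behaves differently from the others, as it does in this toy data, the new label carries information that a model can use.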

Limitations

  • If only a few values are missing, creating an additional category may cause tree-based models to over-fit.
  • If the number of NA values is small, creating an additional category is in essence adding another rare label.
  • If a large number of values are missing, the new category will hold a good share of the data.
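One way to check whether the new category would become a rare label is to look at its relative frequency after imputation. The sketch below uses made-up data, and the 5% cut-off is only an assumed threshold for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 100 rows, only 2 of them missing.
values = ["A"] * 60 + ["B"] * 38 + [np.nan] * 2
col = pd.Series(values, name="category").fillna("Missing")

# Relative frequency of every label after imputation.
freq = col.value_counts(normalize=True)

# Assumed cut-off for this sketch: labels under 5% of the rows count as rare.
rare_threshold = 0.05
rare_labels = freq[freq < rare_threshold].index.tolist()

print(freq)
print("Rare labels:", rare_labels)
```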

Code


1. Importing the Libraries, Data and checking the data.


[Image: Importing the Libraries and data]
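A minimal sketch of this step, assuming pandas and NumPy are installed. The article's dataset is not reproduced here, so a small made-up DataFrame stands in for it; the column names garage_type and price are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the article's dataset. With a real file this
# would typically be loaded with pd.read_csv instead.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})

# Quick look at the first rows and the column types.
print(data.head())
print(data.dtypes)
```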



2. Checking percentage missing values in the data frame.


[Image: Checking percentage missing values in the data frame]
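One common way to get the percentage of missing values per column is to take the mean of the boolean mask returned by isnull(). The sketch below rebuilds the same made-up DataFrame so that it runs on its own; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, repeated so the snippet is self-contained.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})

# isnull() gives True/False per cell; the column mean of that mask is the
# fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct = data.isnull().mean() * 100
print(missing_pct)
```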


3. Checking the values in the column


[Image: Checking the values in the column]
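A sketch of how the values of a column might be inspected before imputing, again on the hypothetical toy data. Passing dropna=False to value_counts keeps the missing entries in the tally.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, repeated so the snippet is self-contained.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})

# Distinct labels currently present in the column (NaN is listed too).
print(data["garage_type"].unique())

# Count per label; dropna=False also counts the missing entries.
counts = data["garage_type"].value_counts(dropna=False)
print(counts)

# Labels that are actually observed, ignoring the missing entries.
observed_labels = set(data["garage_type"].dropna().unique())
```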


4. Performing Imputation


[Image: Performing Imputation]
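The imputation itself can be sketched with pandas' fillna. The data and column names are hypothetical, and writing the result to a new column (here garage_type_imputed) is a choice made for this sketch so the original stays available for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, repeated so the snippet is self-contained.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})

# Replace every missing entry with the new label "Missing".
data["garage_type_imputed"] = data["garage_type"].fillna("Missing")

print(data[["garage_type", "garage_type_imputed"]])
```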


5. Verifying change of the data


[Image: Verifying change of the data]
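To verify the change, one option is to compare the label counts after imputation and confirm that no missing values remain. This sketch uses the same hypothetical toy data. If matplotlib is installed, calling after.plot.bar() would draw the counts as a bar chart.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, repeated so the snippet is self-contained.
data = pd.DataFrame({
    "garage_type": ["Attached", "Detached", np.nan, "Attached", "BuiltIn",
                    "Attached", np.nan, "Detached", "Attached", "Attached"],
    "price": [200, 150, 90, 210, 300, 190, 85, 160, 220, 205],
})
data["garage_type_imputed"] = data["garage_type"].fillna("Missing")

# Counts per label after imputation; "Missing" now appears as a category.
after = data["garage_type_imputed"].value_counts()
print(after)

# Number of missing values left in the imputed column.
remaining_missing = int(data["garage_type_imputed"].isnull().sum())
print("Missing values left:", remaining_missing)
```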


We might not be able to see any bar for the "Missing" value, because its count is very small compared to the other values.

There are also many Python libraries that can perform Missing Category Imputation directly in a single line of code. We will cover that part in a separate article.

Summary


In this Quick Note, we studied a well-known imputation technique, Missing Category Imputation. We looked at the key points, advantages and limitations of the method, along with some basic coding in Python to achieve Missing Category Imputation.

That's all from here. Until then... this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.