
Imputation Using SimpleImputer



Welcome back, friends..!!! 

So far we have seen quite a few techniques that we can use for imputing missing values in a dataset.

We have studied the theory behind them and seen basic code for each. But, as mentioned, there are libraries that implement these techniques for us. So, we are going to study one such library for imputation: scikit-learn (sklearn).

So let's begin and get our hands dirty and learn these libraries. 

*Please note: the theory is already covered in previous articles, so we will move directly to using the library for each method.
For demo purposes, we will be using the COVID-19 cases dataset. 

Mean or Median Imputation 

1. Importing the Libraries and Data

Importing the Libraries and Data

We are using only a few selected columns for this imputation demo.

We are using the SimpleImputer class from the sklearn.impute module to perform the imputation.
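Since the original code is not reproduced here, the following is only a minimal sketch of this step. The column names `new_cases` and `new_deaths` and all values are made up for illustration; they are not taken from the actual COVID-19 dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for the COVID-19 data (invented columns and values).
# With the real file, one would typically load selected columns with
# pd.read_csv(<path>, usecols=[...]) instead.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

print(df.head())
```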

2. Checking the data.

Checking the data.

As we can see, most of the columns have missing data. Let's check the percentage of missing values in each of these columns.
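One common way to get these numbers is sketched below, again on a small made-up DataFrame (hypothetical columns and values, so the percentages differ from those of the real dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

# Count of missing values per column.
missing_count = df.isnull().sum()

# isnull() gives True/False, so the column mean is the fraction missing.
missing_pct = df.isnull().mean() * 100

print(missing_count)
print(missing_pct)
```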

Percentage Missing Data

Most of the columns have around 3% missing data, and a few have around 14%.

3. Performing Median Imputation 

For this, we first need to create a SimpleImputer object, specifying the technique (strategy) that we want to use.
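A minimal sketch of creating and fitting a median imputer, assuming a small made-up DataFrame (the column names and values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

# strategy="median" tells SimpleImputer to fill gaps with each column's median.
imputer = SimpleImputer(strategy="median")

# fit() only learns the per-column medians; it does not change df.
imputer.fit(df)
```

By default SimpleImputer treats `np.nan` as the missing-value marker, which matches how pandas stores missing numbers.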

Performing Median Imputation

4. Checking the values. 

Next, we need to verify that the medians calculated by SimpleImputer match the medians we get directly from the data.

After fitting, SimpleImputer stores an array of medians, in the same order as our columns.
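The fitted values live in the `statistics_` attribute, so the check could look roughly like this (made-up data and hypothetical column names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

imputer = SimpleImputer(strategy="median")
imputer.fit(df)

# Medians learned by SimpleImputer, in column order.
sk_medians = imputer.statistics_

# Medians computed by pandas (NaN values are skipped by default).
pd_medians = df.median().values

print(sk_medians)
print(pd_medians)
print(np.allclose(sk_medians, pd_medians))
```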

Checking the values

SimpleImputer has worked well: the medians are exactly the same for all the variables.

5. Imputing the values and verifying the data.

Lastly, we need to impute the values in our original dataset and check their value.
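A sketch of this step on made-up data (hypothetical columns and values). Note that `transform` returns a NumPy array, so the result is wrapped back into a DataFrame here to keep the column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians and fills the gaps in one call.
imputed_array = imputer.fit_transform(df)

# Rebuild a DataFrame so column names and the index are preserved.
df_imputed = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)

# Verify that no missing values remain.
print(df_imputed.isnull().sum())
print(df_imputed)
```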

Imputing the values and verifying the data.

We can see that it has worked quite well: we were able to impute the values for all the variables in one go, with just 2-3 lines of code.

6. Performing Mean Imputation

As with the median, we create a SimpleImputer object here too, but with 'mean' as the strategy.

Performing Mean Imputation

7. Checking the values. 

Similarly, we will be verifying the Mean values. 
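The mean workflow can be sketched the same way, with only the strategy changed. The data below is made up (hypothetical columns and values).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

# Same class as before; only the strategy changes.
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(df)

# Compare the learned means with the ones pandas computes.
print(mean_imputer.statistics_)
print(df.mean().values)

# Fill the gaps and confirm that nothing is missing afterwards.
df_mean = pd.DataFrame(mean_imputer.transform(df),
                       columns=df.columns, index=df.index)
print(df_mean.isnull().sum())
```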

Checking the values.

8. Imputing the values and verifying the data.

Imputing the values and verifying the data.

Again, it has worked quite well: we were able to impute the values for all the variables in one go, with just 2-3 lines of code.

Arbitrary Value Imputation

Let's move ahead and learn about another technique, Arbitrary Value Imputation, using SimpleImputer.

The first two steps (importing and checking the data) are the same as in the previous technique, so we proceed directly to imputation (step 3).

3. Performing Arbitrary Imputation  


Performing Arbitrary Imputation

Here we have used 'constant' as the strategy. In addition, we need to provide the constant to impute through the fill_value argument.
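A minimal sketch of arbitrary value imputation, assuming made-up data and an arbitrary value of 999 (both the columns and the chosen constant are hypothetical; pick a value that cannot be confused with real observations):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values.
df = pd.DataFrame({
    "new_cases": [10.0, 20.0, np.nan, 40.0, 50.0],
    "new_deaths": [1.0, np.nan, 3.0, 4.0, 100.0],
})

# strategy="constant" fills every gap with fill_value (999 is an assumption).
const_imputer = SimpleImputer(strategy="constant", fill_value=999)

df_const = pd.DataFrame(const_imputer.fit_transform(df),
                        columns=df.columns, index=df.index)

print(df_const)
print(df_const.isnull().sum())
```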

4. Imputing the values and verifying the data.


Imputing the values and verifying the data

Frequent Category Imputation

Let's move ahead and learn about another technique, Frequent Category Imputation, using SimpleImputer.

The first two steps (importing and checking the data) are again the same as before, so we proceed directly to step 3.

3. Finding Most Frequent Value


Before performing the Frequent Category Imputation, let's check out the 'Most Frequent' value for each variable/column. 
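One way to look up the most frequent value per column is pandas' `mode()`. The sketch below uses made-up categorical columns (`continent` and `tests_units` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Hypothetical categorical data with missing values.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", np.nan, "Europe", "Asia"],
    "tests_units": ["samples", np.nan, "people", "people", "people"],
}, dtype=object)

# mode() returns one row per tied mode; the first row holds the top value
# of each column (missing values are ignored by default).
most_frequent = df.mode().iloc[0]

print(most_frequent)
```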

Finding Most Frequent Value


4. Performing Frequent Category Imputation


Performing Frequent Category Imputation


Here, in strategy, we have used 'most_frequent' for imputation. It is the same as finding the 'Mode' for each variable.
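A sketch of frequent category imputation on made-up categorical data (hypothetical columns and values), including a check that the learned values match the column modes:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data with missing values.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", np.nan, "Europe", "Asia"],
    "tests_units": ["samples", np.nan, "people", "people", "people"],
}, dtype=object)

# strategy="most_frequent" fills each gap with that column's mode.
freq_imputer = SimpleImputer(strategy="most_frequent")
imputed_array = freq_imputer.fit_transform(df)

# The learned modes, in column order.
print(freq_imputer.statistics_)

df_freq = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)
print(df_freq)
print(df_freq.isnull().sum())
```

This strategy works for both string and numeric columns, which is why it suits categorical variables.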


5. Verifying the Imputation


Verifying the Imputation

Summary

In this Quick Note, we studied 'SimpleImputer', a class from the 'sklearn.impute' module of the well-known scikit-learn library. Using it, we can impute missing values in a few lines of code. We also saw how SimpleImputer lets us perform the major imputation techniques very easily, mostly by changing a single argument, 'strategy'.

If you are still not sure about the basics, advantages, limitations, or how and when to use Imputation Techniques, read it HERE

Also, please note that Missing Category Imputation can be done the same way as Arbitrary Value Imputation.
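As a rough sketch of that idea, the 'constant' strategy can fill categorical gaps with a label such as 'Missing'. The data and column names below are made up, and the label itself is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical data with missing values.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", np.nan, "Europe", "Asia"],
    "tests_units": ["samples", np.nan, "people", "people", "people"],
}, dtype=object)

# Same 'constant' strategy as arbitrary value imputation, but with a
# string label so missing entries become their own category.
missing_imputer = SimpleImputer(strategy="constant", fill_value="Missing")

df_missing = pd.DataFrame(missing_imputer.fit_transform(df),
                          columns=df.columns, index=df.index)

print(df_missing)
```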

That's all from here. Until next time... This is Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
