Welcome back, friends!
So far we have seen quite a few techniques that we can use for imputing missing values in a dataset.
We have studied the theory behind them and seen basic code for each technique. But, as mentioned, there are libraries that we can use to implement these techniques. So, we are going to study one such library for imputation: scikit-learn (sklearn).
So let's begin, get our hands dirty, and learn this library.
Please note: the theory is already covered in previous articles, so we will move directly to using the library for each method.
For demo purposes, we will be using the COVID-19 cases dataset.
Mean or Median Imputation
1. Importing the Libraries and Data
We are using only a few selected columns to demonstrate imputation.
We are using the SimpleImputer class from sklearn.impute to perform the imputation.
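A minimal sketch of this step might look like the following. The small DataFrame stands in for the COVID-19 cases file, and its column names (age, days_in_hospital) are assumptions made up for illustration, not the columns of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in for a few numeric columns of the COVID-19 cases data.
# The column names and values are assumptions, not taken from the real file.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

print(df.head())
```

With the real file, the DataFrame would typically come from pd.read_csv with only the columns of interest selected.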
2. Checking the data.
As we can see, most of the columns have missing data. Let's check the percentage of missing values in each of these columns.
[Figure: Percentage Missing Data]
Most of the columns have around 3% missing data, and a few have around 14%.
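One common way to get these percentages is to take the mean of the boolean mask returned by isnull(). The sketch below uses a tiny made-up DataFrame (the column names are assumptions), so its percentages will differ from the 3% and 14% seen on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

# isnull() gives True/False per cell; the column mean of that mask is the
# fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct = df.isnull().mean() * 100

print(missing_pct)
```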
3. Performing Median Imputation
For this, we first need to create a SimpleImputer object, specifying the strategy that we want to use.
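As a rough sketch, creating and fitting the imputer could look like this. The DataFrame is again a made-up stand-in with assumed column names; fit() only learns the medians and does not change the data yet.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

# Create the imputer with the median strategy and learn one median per column.
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(df)
```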
4. Checking the values.
Next, we need to verify whether the medians calculated by SimpleImputer are correct.
After fitting, SimpleImputer stores the medians in its statistics_ attribute, as an array in the order of our columns.
SimpleImputer has worked well: the medians are exactly the same for all the variables.
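One way to do this check is to compare the fitted statistics_ array against the medians that pandas computes. This is only a sketch on made-up data with assumed column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

imputer = SimpleImputer(strategy="median").fit(df)

# statistics_ holds one learned value per column, in column order.
print(imputer.statistics_)

# pandas skips NaN by default, so df.median() should give the same numbers.
print(df.median())

medians_match = np.allclose(imputer.statistics_, df.median().to_numpy())
print(medians_match)
```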
5. Imputing the values and verifying the data.
Lastly, we need to impute the values in our original dataset and check the result.
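A possible sketch of this step is shown below, on made-up data with assumed column names. By default transform() returns a NumPy array, so here it is wrapped back into a DataFrame to keep the column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the medians and fills the missing cells in one call.
imputed_array = imputer.fit_transform(df)
df_imputed = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)

# Verify that no missing values remain.
print(df_imputed.isnull().sum())
print(df_imputed)
```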
6. Performing Mean Imputation
As with the median, we create a SimpleImputer object here too, but with 'mean' as the strategy.
7. Checking the values.
Similarly, we will verify the mean values.
8. Imputing the values and verifying the data.
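The three mean-imputation steps can be sketched together as follows. As before, the DataFrame and its column names are assumptions made up for illustration; only the strategy argument changes compared with the median case.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

# Step 6: create and fit the imputer with the mean strategy.
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(df)

# Step 7: compare the learned means with the means pandas computes.
print(mean_imputer.statistics_)
print(df.mean().to_numpy())

# Step 8: fill the missing cells and confirm nothing is left missing.
df_mean = pd.DataFrame(mean_imputer.transform(df), columns=df.columns, index=df.index)
print(df_mean.isnull().sum())
```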
We can see that it has worked quite well: we were able to impute the values for all the variables in one go, with just 2-3 lines of code.
Arbitrary Value Imputation
3. Performing Arbitrary Imputation
4. Imputing the values and verifying the data.
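With SimpleImputer, arbitrary value imputation is usually done with the 'constant' strategy plus a fill_value. The sketch below assumes 99 as the arbitrary value and uses made-up data with assumed column names; the value used in the original demo may well be different.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical stand-in data; column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 28.0, 63.0],
    "days_in_hospital": [5.0, 12.0, np.nan, 3.0, np.nan],
})

# Step 3: the 'constant' strategy replaces every missing cell with fill_value.
# 99 is an assumed arbitrary value chosen only for this sketch.
arbitrary_imputer = SimpleImputer(strategy="constant", fill_value=99)

# Step 4: impute and verify that the missing cells now hold the chosen value.
df_arbitrary = pd.DataFrame(
    arbitrary_imputer.fit_transform(df), columns=df.columns, index=df.index
)
print(df_arbitrary)
print(df_arbitrary.isnull().sum())
```

The arbitrary value is normally picked so that it lies outside the normal range of the variable, which keeps the imputed rows easy to spot.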
Frequent Category Imputation
3. Finding Most Frequent Value
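For categorical columns, the most frequent value can be found with pandas before imputing. The sketch below uses a made-up DataFrame; the column names (gender, province) and their values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical stand-in data; names and values are assumptions.
df_cat = pd.DataFrame(
    {
        "gender": ["male", "female", np.nan, "male", "male"],
        "province": ["Hubei", np.nan, "Hubei", "Guangdong", np.nan],
    },
    dtype=object,
)

# mode() ignores missing values by default; the first row holds the most
# frequent category of each column.
most_frequent = df_cat.mode().iloc[0]
print(most_frequent)

# value_counts() shows the full frequency table for a single column.
print(df_cat["gender"].value_counts())
```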
4. Performing Frequent Category Imputation
5. Verifying the Imputation
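Steps 4 and 5 can be sketched together with the 'most_frequent' strategy, which also works on string columns. The data and column names below are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical stand-in data; names and values are assumptions.
df_cat = pd.DataFrame(
    {
        "gender": ["male", "female", np.nan, "male", "male"],
        "province": ["Hubei", np.nan, "Hubei", "Guangdong", np.nan],
    },
    dtype=object,
)

# Step 4: learn the most frequent category per column and fill the gaps.
cat_imputer = SimpleImputer(strategy="most_frequent")
df_cat_imputed = pd.DataFrame(
    cat_imputer.fit_transform(df_cat), columns=df_cat.columns, index=df_cat.index
)

# Step 5: the learned categories are in statistics_, and no cell is missing.
print(cat_imputer.statistics_)
print(df_cat_imputed.isnull().sum())
print(df_cat_imputed)
```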