Imputation Using Feature Engine

Welcome Back, Folks... !!!

Missing data is one of the most unavoidable issues in our datasets, quietly waiting to wreck our final Machine Learning models. That is why, when we build a Machine Learning model, most of the time goes into cleaning, analysing and preparing the dataset.

Here we will focus on imputing missing data, which is a difficult, manual and time-consuming job. In our previous articles, we studied imputation and the various techniques that can make our life easier. To reduce the time spent imputing variables, there are a few Python libraries that can automate the imputation task to some extent. We have already studied one of them, scikit-learn's SimpleImputer, in previous articles. Here, we will focus on a new library, Feature Engine.

What is Feature Engine?

Feature Engine is an open-source Python library for data preprocessing in Data Science and Machine Learning. Its name comes from the term Feature Engineering, which means modifying or transforming existing variables, or creating new ones from the given dataset, with the aim of getting complete and interpretable data that can be used for Machine Learning.

Feature Engine comes with a bundle of transformers for data preprocessing. A few of the things we can use the library for are:

1. Missing Data Imputation
2. Discretisation
3. Categorical Encoding
4. Outliers Handling
5. Variable Selection & Transformation

Features of Feature Engine

The main reason we are introducing this library is that it is easy to use and quick to give results. We need only a few lines of code to perform any of the above actions. Apart from this, like Machine Learning models, the library uses 'fit' & 'transform' methods to perform each action, which makes it familiar to the user and helps speed up the task.

Enough of the intro, let's jump ahead and learn how to use it.

Getting Ready

Unlike the modules in Python's standard library, Feature Engine has to be installed before we can use it.

1. Installation.

We can install this library directly using either the 'pip' or the 'conda' command.

## conda installation

conda install -c conda-forge feature_engine

## pip installation

pip install feature-engine

## Jupyter Notebook installation

%pip install feature-engine

Installing feature-engine

In this section we will focus only on the Imputation part of the library and cover all the code required for it.

So, without wasting much time let's begin the journey...

Mean/Median Imputation

Mean or Median Imputation is a simple 2-step process.

1. Importing the Libraries and Data

First, we need to import the MeanMedianImputer class from the feature_engine library (in recent releases it lives in the feature_engine.imputation module).

Importing the Libraries and Data

2. Performing Imputation

2a. Creating Object

As with any other Python class, we need to create a (parameterised) MeanMedianImputer object before using it for further processing.

Creating MeanMedianImputer Object for Mean Imputation

2b. Fitting the Data

The next step is to 'fit' the Imputer to the data. This step calculates the value (the mean or median of each variable) that will be used for imputation.

Fitting the Data

2c. Verifying the Mean/Median

Once we have 'fit' the MeanMedianImputer, we can check the values that are going to be used for imputing the variables.

Verifying the Mean

2d. Transforming and Verifying

Lastly, we use 'transform' to impute the data, and then verify that the imputation was performed correctly.

Transforming and Verifying

Similarly, we can perform Imputation using the median. We just need to set imputation_method='median' when creating the object.

Median Imputation using MeanMedianImputer

Missing Indicator

1. Importing the Libraries and Data

Just like MeanMedianImputer, we need to import AddMissingIndicator from feature_engine.

Importing the Libraries and Data

2. Adding Missing Indicators

This step works the same way as with MeanMedianImputer: we create the object, then fit & transform the data.

Adding Missing Indicators

3. Verifying the data

Now, the last thing to do here is to verify the new indicator columns.

Verifying the data

We can notice here that new variables were created (<VariableName>_na), signifying whether the column has missing data for a particular row or not (1 - missing, 0 - not missing).

Random Sample Imputation

1. Importing the Libraries and Data

Just like the previous imputation techniques, we need to import RandomSampleImputer from feature_engine.

Importing the Libraries and Data

2. Imputing Random Samples

This step works the same way as in the previous techniques: we create the object, then fit & transform the data.

Imputing Random Samples

** We use 'random_state' to fix the seed used for selecting the values. If we do not set it, we get different imputed values every time we execute the code. The random state is just an integer seed; we can choose any value for it.

3. Verifying the data

Now, the last thing to do here is to verify the Imputation.

Verifying the data

We can clearly notice here that the variables have been imputed with values drawn at random from the observed values of each variable.

Missing Category Imputation

1. Importing the Libraries and Data

Just like the previous imputation techniques, we need to import CategoricalImputer from feature_engine.

Importing the Libraries and Data

2. Imputing Categorical Variables

The point to note here is that, when we do not pass a list of variables, it automatically identifies the Categorical variables in the complete dataset and uses only these variables for Imputation.

3. Verifying the data

Now, the last thing to do here is to verify the Imputation.

Verifying the data

We can notice that only the Categorical variables have been imputed with 'Missing', whereas the Numerical variables still have Missing values.

Frequent Category Imputation

1. Importing the Libraries and Data

This remains the same as for Missing Category Imputation.

2. Imputing Frequent Categorical Variables

Here the scenario is a bit different: we should pass only the Categorical variables, and only those that have a limited number of distinct categories.

To use CategoricalImputer as a Frequent Category Imputer, we need to specify imputation_method='frequent'.

Imputing Frequent Categorical Variables

3. Verifying the data

Now, the last thing to do here is to verify the Imputation.

Verifying the data

End Tail Imputation

1. Importing the Libraries and Data

Importing the Libraries and Data

2. Performing End Tail Imputation

Performing End Tail Imputation

EndTailImputer also has a few parameters that can be used to control how the values are imputed.

Parameters:-

a. imputation_method --- it takes one of 3 values: 'gaussian', 'iqr' or 'max'.

b. tail --- it takes values 'left' or 'right'.

c. fold --- the factor by which the standard deviation, the IQR or the maximum value is multiplied (depending on imputation_method).

d. variables --- a list of variables to be used for Imputation.

3. Verifying Data

Verifying End of Tail Imputation

Arbitrary Value Imputation

1. Importing the Libraries and Data

2. Performing Arbitrary Number Imputation

As the name suggests, this technique is used only for Numerical variables, and we need to specify the 'Arbitrary Number' that we want to use for Imputation.

Performing Arbitrary Number Imputation

3. Verifying the Imputation

Verifying the Imputation

Summary

In this Quick Note, we studied the popular 'feature_engine' library, which lets us impute missing values in a few lines of code. We also saw how, with 'feature_engine', we can perform the major Imputation techniques very easily, just by using different imports.

Refer to the below chart for easy reference of these techniques.

Feature Engine Techniques

If you are still not sure about the basics, advantages and limitations of Imputation Techniques, or how and when to use them, read about them HERE.

That's all from Here. Until Then... This is Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
