
Imputation Using Feature Engine


Welcome Back, Folks... !!!

Missing Data is one of the most unavoidable issues; it rests peacefully in our datasets, waiting to undermine our final Machine Learning models. That is why, when it comes to building a Machine Learning model for our requirement, the majority of our time goes into Cleaning, Analysing and Preparing the dataset for the final model.

We will be focusing here on Imputing Missing Data, which is indeed a difficult, manual & time-consuming job. In this regard, in our previous articles, we studied Imputation and the various techniques that can be used to ease our life. To avoid, or rather reduce, the time spent imputing the variables, there are a few Python libraries that can be used to automate the Imputation task to some extent. We have already studied one of them, scikit-learn's SimpleImputer(), in previous articles. Here, we will be focusing on a new library, Feature Engine.

What is Feature Engine? 

Feature Engine is an open-source Python library used for Data Preprocessing in the field of Data Science and Machine Learning. Its name comes from the term Feature Engineering, which means engineering variables, i.e. modifying/transforming existing variables or creating new ones from the given dataset, with the sole motive of getting complete and interpretable data that can be used for Machine Learning.

Feature Engine comes with a bundle of features that are required for preprocessing data. A few of the tasks for which we can use the Feature Engine library are:

1. Missing Data Imputation
2. Discretisation 
3. Categorical Encoding
4. Outliers Handling
5. Variable Selection & Transformation

Features of Feature Engine


The main thing that made us introduce this library is that it is easy to use and quick to provide results. We need only a few lines of code to perform any of the above actions using this library. Apart from this, like scikit-learn's Machine Learning models, this library also uses the 'fit' & 'transform' methods to perform the said actions. That makes it more familiar for the user & helps speed up the task.

Enough of the intro, let's jump ahead and learn how to use it. 

Getting Ready

Unlike the libraries that ship with Python, we first need to install this library before we can use it.

1. Installation. 

We can install this library directly using the 'pip' or 'conda' command.

## conda installation

conda install -c conda-forge feature_engine

## pip installation 

pip install feature-engine

## Jupyter Notebook installation

!pip install feature-engine




We will be focusing only on the Imputation part of the library in this section & will cover all the code required for each technique.

So, without wasting much time let's begin the journey... 

Mean/Median Imputation

A simple two-step process achieves the desired Mean or Median Imputation.

1. Importing the Libraries and Data

First, we need to import the MeanMedianImputer class from the feature_engine library.

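A minimal sketch of this step, assuming feature_engine version 1.0 or later (where the imputers live in the feature_engine.imputation module); the file name is a placeholder for our own dataset:

# pandas for the data, MeanMedianImputer for the imputation
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

# load the dataset ('our_dataset.csv' is a placeholder file name)
df = pd.read_csv('our_dataset.csv')

# check how many values are missing in each column
print(df.isnull().sum())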


2. Performing Imputation 

    2a. Creating Object 

        As with any other class-based tool, we need to create a (parameterised) MeanMedianImputer object before using it for further processing.

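A sketch of the object creation for Mean Imputation, assuming two numeric placeholder columns, 'Age' and 'Fare', that contain missing values:

# create the imputer, configured to use the mean
# 'Age' and 'Fare' are placeholder column names
mean_imputer = MeanMedianImputer(
    imputation_method='mean',
    variables=['Age', 'Fare']
)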


     2b. Fitting the Data

        The next step is to 'fit' the data to the Imputer. This step calculates the values that will be used for the imputation.

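Continuing the sketch, fitting only learns the means; it does not change df yet:

# learn the mean of each listed variable from the data
mean_imputer.fit(df)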

    2c. Verifying the Mean/Median 

        Once we have 'fit' the data to the MeanMedianImputer, we can check the values that are going to be used for imputing the variables.

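After fitting, the learned values sit in the imputer_dict_ attribute, so verifying them is one line:

# dictionary mapping each variable to its learned mean
# (the actual numbers depend on the dataset)
print(mean_imputer.imputer_dict_)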

    2d. Transforming and Verifying

        Lastly, we use 'transform' to impute the data, and we can then verify whether the imputation was performed correctly.

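A sketch of the final step, continuing from the fitted imputer above:

# replace the missing values with the learned means
df_transformed = mean_imputer.transform(df)

# verify: the imputed columns should report zero missing values
print(df_transformed[['Age', 'Fare']].isnull().sum())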

Similarly, we can perform Imputation using the Median. We just need to set imputation_method='median' while declaring the object.

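The same steps for Median Imputation would look like this, again with the placeholder columns:

# identical workflow, only imputation_method changes
median_imputer = MeanMedianImputer(
    imputation_method='median',
    variables=['Age', 'Fare']
)
df_median = median_imputer.fit_transform(df)

# the learned medians
print(median_imputer.imputer_dict_)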


Missing Indicator

1. Importing the Libraries and Data

Just like MeanMedianImputer, we need to import AddMissingIndicator from feature_engine.

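The corresponding import, from the same module as before:

from feature_engine.imputation import AddMissingIndicator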

2. Adding Missing Indicators

This step remains the same as for MeanMedianImputer: we need to create the object, then fit & transform the data.

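A minimal sketch, reusing the df loaded earlier; 'Age' and 'Cabin' are placeholder columns with missing values:

# add a binary indicator column for each listed variable
indicator = AddMissingIndicator(variables=['Age', 'Cabin'])

# fit learns which variables to flag, transform adds the new columns
df_indicated = indicator.fit_transform(df)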

3. Verifying the data

Now, the last thing to do here is to verify the Imputation. 

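To verify, we can simply look at the newly added columns:

# the indicator columns carry the '_na' suffix
print(df_indicated.filter(like='_na').head())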

We can notice here that new variables were created (<VariableName>_na), indicating whether the column has missing data for a particular row or not (1 = missing, 0 = not missing).

Random Sample Imputation

1. Importing the Libraries and Data


Just like the previous imputation techniques, we need to import RandomSampleImputer from feature_engine.

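The corresponding import:

from feature_engine.imputation import RandomSampleImputer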


2. Imputing Random Samples


This step remains the same as for the previous techniques: we need to create the object, then fit & transform the data.

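A minimal sketch, again reusing df and the placeholder columns:

# draw replacement values at random from the observed values of each
# variable; random_state is an arbitrary seed for reproducibility
random_imputer = RandomSampleImputer(
    variables=['Age', 'Fare'],
    random_state=29
)
df_random = random_imputer.fit_transform(df)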


** We use 'random_state' to fix the randomness used for selecting the values; if we do not set a random state, then every time we execute the code we get different imputation values. The random state is just a seed number; we can select any integer value for it.

3. Verifying the data


Now, the last thing to do here is to verify the Imputation. 

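Verifying is the usual missing-value count:

# no missing values should remain in the imputed columns
print(df_random[['Age', 'Fare']].isnull().sum())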


We can clearly notice here that the variables have been imputed with random values drawn from the same variables.

Missing Category Imputation

1. Importing the Libraries and Data


Just like the previous imputation techniques, we need to import CategoricalImputer from feature_engine.

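The corresponding import:

from feature_engine.imputation import CategoricalImputer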


2. Imputing Categorical Variables


The point to note here is that it automatically identifies the Categorical variables from the complete dataset and uses only these variables for Imputation.
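A minimal sketch; with no 'variables' argument, the imputer picks up the categorical (object) columns on its own:

# missing entries in the categorical variables are replaced
# with the default label 'Missing'
cat_imputer = CategoricalImputer(imputation_method='missing')
df_cat = cat_imputer.fit_transform(df)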



3. Verifying the data


Now, the last thing to do here is to verify the Imputation. 

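To verify, assuming a placeholder categorical column 'Cabin':

# the new 'Missing' label shows up among the categories,
# while the numerical columns keep their NaNs
print(df_cat['Cabin'].value_counts().head())
print(df_cat.isnull().sum())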


We can notice that only the Categorical variables have been imputed with 'Missing', whereas the Numerical variables still have missing values.

Frequent Category Imputation

1. Importing the Libraries and Data


This remains the same as for Missing Category Imputation.
 


2. Imputing Frequent Categorical Variables


Here the scenario is a bit different: we should provide only Categorical variables, and ideally variables that have a limited number of distinct categories.

And to use CategoricalImputer as a Frequent Category Imputer, we need to specify imputation_method='frequent'.

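A minimal sketch, with 'Embarked' as a placeholder categorical column that has few distinct categories:

# impute with the most frequent category instead of a 'Missing' label
frequent_imputer = CategoricalImputer(
    imputation_method='frequent',
    variables=['Embarked']
)
df_frequent = frequent_imputer.fit_transform(df)

# the learned mode for each variable
print(frequent_imputer.imputer_dict_)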

 

3. Verifying the data

Now, the last thing to do here is to verify the Imputation.

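Verifying, with the same placeholder column:

# no missing values remain; the count of the most frequent
# category has grown accordingly
print(df_frequent['Embarked'].isnull().sum())
print(df_frequent['Embarked'].value_counts())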


End Tail Imputation

1. Importing the Libraries and Data


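The corresponding import:

from feature_engine.imputation import EndTailImputer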


2. Performing End Tail Imputation



EndTailImputer also has a few parameters that can be used to control how the values are imputed; a short usage sketch follows the list below.

Parameters:-

    a. imputation_method --- takes 3 values: 'gaussian', 'iqr' & 'max'.
    b. tail --- takes the value 'left' or 'right'.
    c. fold --- the factor to multiply the std, the IQR or the maximum by when calculating the replacement value.
    d. variables --- a list of variables to be used for Imputation.
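A short usage sketch with these parameters, reusing df and the numeric placeholder columns; with the Gaussian method and tail='right', missing values are replaced by mean + fold * std:

# right-tail Gaussian imputation: NaN -> mean + 3 * std
tail_imputer = EndTailImputer(
    imputation_method='gaussian',
    tail='right',
    fold=3,
    variables=['Age', 'Fare']
)
df_tail = tail_imputer.fit_transform(df)

# the tail values used for the imputation
print(tail_imputer.imputer_dict_)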

3. Verifying Data


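Verifying:

# the imputed columns should have no missing values left, and their
# maxima should now sit at (or beyond) the learned tail values
print(df_tail[['Age', 'Fare']].isnull().sum())
print(df_tail[['Age', 'Fare']].max())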

Arbitrary Value Imputation

1. Importing the Libraries and Data
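As with the previous techniques, we import the class from the same module:

from feature_engine.imputation import ArbitraryNumberImputer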




2. Performing Arbitrary Number Imputation


As the name suggests, it is used only for Numerical variables, and we need to specify the 'Arbitrary Number' that we want to use for the Imputation.

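A minimal sketch, with 999 as the (freely chosen) arbitrary number and the usual placeholder columns:

# replace missing values in the listed numeric columns with 999
arbitrary_imputer = ArbitraryNumberImputer(
    arbitrary_number=999,
    variables=['Age', 'Fare']
)
df_arbitrary = arbitrary_imputer.fit_transform(df)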


3. Verifying the Imputation


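Verifying:

# no missing values remain, and the rows that were missing now hold 999
print(df_arbitrary[['Age', 'Fare']].isnull().sum())
print((df_arbitrary[['Age', 'Fare']] == 999).sum())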


Summary


In this Quick Note, we studied the popular 'feature_engine' library, with which we can impute missing values in just a few lines of code. We also saw how, by using 'feature_engine', we were able to perform the major Imputation techniques very easily, just by using different imports.

Refer to the below chart for easy reference of these techniques. 

Feature Engine Techniques


If you are still not sure about the basics, advantages, limitations, or how and when to use Imputation Techniques, read about them HERE.

That's all from here. Until then... this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
