Multivariate Imputation by Chained Equations (MICE)

We have already studied many techniques for missing data imputation, and most of them can be used in a final production-ready model. With imputation, however, there is always room for improvement, because we can never be sure that the imputed values are correct. To improve the imputation, we can use multiple imputation, i.e. generating more than one candidate value for each missing entry and then combining them, by averaging or some other rule, to get the most suitable value.

We have already seen one technique that draws on the other variables in the dataset, i.e. KNN Imputation, which uses the K-Nearest Neighbours algorithm to find the most suitable value. Techniques of this kind are known as multivariate imputation. Now we would like to introduce a more flexible technique that has become one of the principal methods for missing data imputation: MICE (Multivariate Imputation by Chained Equations).

Multivariate Imputation by Chained Equations, better known as MICE, imputes the missing values through a series of steps. It has been described as follows:

MICE has emerged in the statistical literature as one principled method of addressing missing data. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g. continuous or binary) as well as complexities such as bounds or survey skip patterns.

Without further delay, let's dive into the MICE process.

Assumption

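Like many other imputation techniques, MICE assumes that the data are Missing At Random (MAR), i.e. the probability that a value is missing depends only on the observed data and not on the missing value itself.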

Process

The MICE technique is divided into a series of steps, described below.

To begin with, let's create dummy data for a better understanding of the process.

Dummy Missing Data

Here, we have created dummy hotel data with the guest ID, age, gender and the rent paid. Notice that the following values are missing:

=> Age for Guest 102.
=> Gender for Guest 103.
=> Rent for Guest 101.

Step 1:- Impute the missing values using a simple imputation technique such as mean, median, mode or random imputation.

We have used random values for demo purposes.

MICE Step 1 Simple Imputation

We have imputed the missing values (marked in green) using arbitrary values.
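As a rough sketch of Step 1, the snippet below fills each gap with a simple placeholder. The table values are hypothetical (the article's figure is not reproduced here), gender is assumed to be coded as 0/1, and mean/mode placeholders are used instead of the arbitrary values chosen for the demo.

```python
import numpy as np

# Hypothetical stand-in for the dummy hotel table (guests 101-105).
# Columns: age, gender (assumed coding: 0 = female, 1 = male), rent.
data = np.array([
    [25.0,   1.0,    np.nan],   # guest 101: rent missing
    [np.nan, 0.0,    1200.0],   # guest 102: age missing
    [30.0,   np.nan, 1500.0],   # guest 103: gender missing
    [35.0,   1.0,    1800.0],   # guest 104
    [28.0,   1.0,    1300.0],   # guest 105
])

missing_mask = np.isnan(data)   # remember where the gaps were for later steps
filled = data.copy()

# Numeric columns (age, rent): use the mean of the observed values.
for col in (0, 2):
    filled[missing_mask[:, col], col] = np.nanmean(data[:, col])

# Categorical column (gender): use the most frequent observed value.
observed_gender = data[~missing_mask[:, 1], 1]
values, counts = np.unique(observed_gender, return_counts=True)
filled[missing_mask[:, 1], 1] = values[np.argmax(counts)]

print(filled)
```

Keeping `missing_mask` around matters: the later steps need to know which cells were originally missing so that only those cells are ever overwritten.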

Step 2:- Now, we remove the imputed values for one of the variables, i.e. the placeholder values from Step 1 are set back to missing for that variable.

MICE Step 2

Here, we have reversed the imputation done in Step 1 for the Age variable only.

Step 3:- Now, a regression model is used to predict the missing "Age" values from the other two variables, i.e. Gender & Rent. The model is trained on the rows where Age is observed.

MICE Step 3

The values marked in the "Red" box will be used to predict the missing values in the "Age" column.

Step 4:- Replace the values. In this step, the predictions obtained from the regression model are used to fill the missing entries in the "Age" column.

MICE Step 4

Here, the Age "20" is predicted using the values from the "Gender & Rent" columns.
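Steps 2 to 4 for the Age column could look roughly like the sketch below. It assumes a hypothetical table that has already been through Step 1 and uses scikit-learn's `LinearRegression` as the regression model; the article does not say which regressor was used, and the numbers here will not reproduce the "20" shown in the figure.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical table after Step 1. Columns: age, gender (0/1), rent.
filled = np.array([
    [25.0, 1.0, 1450.0],   # rent is a Step 1 placeholder
    [29.5, 0.0, 1200.0],   # age is a Step 1 placeholder
    [30.0, 1.0, 1500.0],   # gender is a Step 1 placeholder
    [35.0, 1.0, 1800.0],
    [28.0, 1.0, 1300.0],
])
age_missing = np.array([False, True, False, False, False])

# Step 2: set the Age placeholder back to missing.
filled[age_missing, 0] = np.nan

# Step 3: fit Age ~ Gender + Rent on the rows where Age is observed.
X_train = filled[~age_missing][:, [1, 2]]
y_train = filled[~age_missing, 0]
model = LinearRegression().fit(X_train, y_train)

# Step 4: replace the missing Age with the model's prediction.
filled[age_missing, 0] = model.predict(filled[age_missing][:, [1, 2]])

print(filled[age_missing, 0])
```

Note that the Gender and Rent placeholders from Step 1 are still used as predictors here; they get their own turn in the following steps.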

Step 5:- Repeat Steps 2-4. Now that we have imputed the values for one variable, Steps 2, 3 and 4 are repeated for each remaining variable with missing values, i.e. Gender & Rent in our case.

MICE Step 5

Steps 2, 3 and 4 were repeated for each variable until we obtained a complete dataset with all imputed values (marked in red).

Step 6:- Repeat again. Yes, that's the motto of this step. Here we run Steps 2-4 over every variable once more, this time using the newly imputed values as predictors, to impute the missing values again.

MICE Step 6

After repeating the whole process a number of times, we obtain the best-fit values for our missing data. More precisely, these are stable values: even after repeating the steps a few more times, the imputed values won't vary much. At that point we can say the imputation has converged and we have the best available estimates for the missing values.
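Putting the six steps together, a minimal hand-rolled version of the loop might look like this. It is a sketch under several assumptions, not a reference implementation: the function name `chained_imputation`, the parameters `max_rounds` and `tol`, and the data are all hypothetical, and every column (including the 0/1 gender code) is modelled with linear regression purely to keep the example short. A classifier would suit a categorical column better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def chained_imputation(data, max_rounds=10, tol=1e-3):
    """Impute np.nan entries of a 2-D float array by round-robin regression."""
    missing = np.isnan(data)
    filled = data.copy()

    # Step 1: start from simple placeholders (column means).
    column_means = np.nanmean(data, axis=0)
    for col in range(data.shape[1]):
        filled[missing[:, col], col] = column_means[col]

    rounds_run = 0
    for _ in range(max_rounds):
        previous = filled.copy()
        # Step 5: visit every column that has missing values.
        for col in range(data.shape[1]):
            rows = missing[:, col]
            if not rows.any():
                continue  # nothing to impute in this column
            others = [c for c in range(data.shape[1]) if c != col]
            # Steps 2-3: ignore this column's placeholders, fit on observed rows.
            model = LinearRegression()
            model.fit(filled[~rows][:, others], filled[~rows, col])
            # Step 4: overwrite the placeholders with the predictions.
            filled[rows, col] = model.predict(filled[rows][:, others])
        rounds_run += 1
        # Step 6: stop once a full pass barely changes the imputed values.
        if np.max(np.abs(filled - previous)) < tol:
            break
    return filled, rounds_run


# Hypothetical dummy table. Columns: age, gender (0/1), rent.
data = np.array([
    [25.0,   1.0,    np.nan],
    [np.nan, 0.0,    1200.0],
    [30.0,   np.nan, 1500.0],
    [35.0,   1.0,    1800.0],
    [28.0,   1.0,    1300.0],
])

imputed, rounds = chained_imputation(data)
print(imputed.round(2))
print("rounds run:", rounds)
```

The stopping rule compares the table before and after each full pass, which mirrors the idea above that the imputed values should stop changing once the process has settled. On a table this small the loop may simply run for all `max_rounds` passes.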

That was all for the theory part of this technique. Let's move ahead and look at its practical implementation.

Code:-

Python's scikit-learn library has a built-in class for performing MICE-style imputation, i.e. "IterativeImputer". It is still experimental and available only in scikit-learn 0.21 and higher, so we need to enable it explicitly (by importing enable_iterative_imputer from sklearn.experimental) before we can use it.

1. Importing Libraries

Importing Libraries

2. Importing Dataset.

Importing Dataset

3. Initializing the Imputer

Imputer declaration

4. Performing Imputation

MICE transform

5. Verifying the Data

Verifying the Imputation
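The five steps above can be sketched together as follows. The original notebook and dataset are not shown here, so the DataFrame below is a hypothetical stand-in, and `max_iter=10` and `random_state=0` are illustrative choices rather than the article's settings.

```python
# 1. Importing libraries. The experimental flag must be imported
#    before IterativeImputer itself.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# 2. "Importing" the dataset: a hypothetical stand-in built inline.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 30.0, 35.0, 28.0],
        "gender": [1.0, 0.0, np.nan, 1.0, 1.0],   # assumed 0/1 coding
        "rent": [np.nan, 1200.0, 1500.0, 1800.0, 1300.0],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="guest_id"),
)

# 3. Initializing the imputer.
imputer = IterativeImputer(max_iter=10, random_state=0)

# 4. Performing the imputation. fit_transform returns a NumPy array by
#    default, so it is wrapped back into a DataFrame.
imputed_df = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

# 5. Verifying the data: count missing values before and after.
print(df.isna().sum())
print(imputed_df.isna().sum())
```

The guest ID is kept as the index rather than as a column so that it is not used as a predictor. On a dataset this small scikit-learn may emit a convergence warning; it does not prevent the imputed table from being returned.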

Summary

To conclude, we studied a more accurate technique for imputing missing data, i.e. Multivariate Imputation by Chained Equations.

This technique relies on iterative imputation: each variable with missing values is modelled from the other variables, and the process is repeated until the imputed values stabilise.

In this article, we studied the definition and practical implementation of Multivariate Imputation by Chained Equations.

That's all from here. Until next time... This is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.

References

https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
