
End of Tail Imputation



End of Tail Imputation is another important imputation technique. It was developed as an enhancement of the Arbitrary Value Imputation technique, where the biggest problem is selecting the arbitrary value to impute for a variable. Choosing that value by hand is hard, and with a large dataset it becomes even harder, because the choice has to be repeated for every variable/column.

So, to overcome the problem of selecting the arbitrary value every time, End of Tail Imputation was introduced, where the value to impute is selected automatically from the tail of each variable's distribution.

Now the question is: how do we select the value, and how do we impute it?

There are a few simple rules for selecting the value:

  • If the variable is normally distributed, we can use the mean plus or minus 3 times the standard deviation.
  • If the variable is skewed, we can use the Inter Quartile Range (IQR) proximity rule, i.e. a value that lies beyond the first or third quartile by a multiple of the IQR (1.5 and 3 are common choices).
  • We can also take the min/max value and multiply it by a constant such as 2 or 3.
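
The three rules above can be sketched as one small helper. This is only an illustration, not a library function: the name `end_of_tail_value`, its parameters (`rule`, `tail`, `factor`) and the default factor of 3 are all assumptions made for this example.

```python
import numpy as np
import pandas as pd


def end_of_tail_value(series, rule="gaussian", tail="right", factor=3.0):
    """Return one end-of-tail value for a numeric Series (NaNs are ignored)."""
    s = series.dropna()
    if rule == "gaussian":
        # Roughly normal variable: mean +/- factor * standard deviation.
        offset = factor * s.std()
        return s.mean() + offset if tail == "right" else s.mean() - offset
    if rule == "iqr":
        # Skewed variable: step beyond the quartiles by factor * IQR.
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        return q3 + factor * iqr if tail == "right" else q1 - factor * iqr
    if rule == "minmax":
        # Multiply the observed extreme by a constant.
        extreme = s.max() if tail == "right" else s.min()
        return extreme * factor
    raise ValueError(f"unknown rule: {rule!r}")


# Hypothetical toy values, used only to exercise the helper.
ages = pd.Series([20.0, 30.0, 40.0, np.nan])
for rule in ("gaussian", "iqr", "minmax"):
    print(rule, end_of_tail_value(ages, rule=rule))
```

Which tail to use is a judgment call: the right tail pushes the imputed values above the observed range, the left tail pushes them below it.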

Now let's have a look at the assumptions that we need to keep in mind and the limitations of this technique. After that, we will get our hands dirty with some code.

Assumptions


The only assumption that we make here is that the data are not missing at random, i.e. they are Missing Not At Random (MNAR).

Another point to keep in mind before performing this imputation is that we must split the data into train and test sets beforehand and select the value for imputation from the train set only. This selected value must then be used for imputation in both the train and test sets.
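
A minimal sketch of that train/test discipline is given below. The data frame is a hypothetical stand-in built by hand, the 80/20 split via `DataFrame.sample` is just one convenient way to split, and the mean + 3 standard deviations rule is assumed for the tail value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with two missing ages.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0]}
)

# Split first, so the test rows never influence the imputation value.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# Learn the end-of-tail value from the train set only.
tail_value = train["Age"].mean() + 3 * train["Age"].std()

# Apply the same value to both sets.
train_imputed = train.assign(Age=train["Age"].fillna(tail_value))
test_imputed = test.assign(Age=test["Age"].fillna(tail_value))

print(tail_value)
print(train_imputed["Age"].isna().sum(), test_imputed["Age"].isna().sum())
```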

Another key point to note is that, because the rules rely on the mean, standard deviation, quartiles or min/max value, this method can be used only for numerical variables.

Advantages


Now let's have a look at the advantages of this technique:
  • It is easy to implement.
  • We can get the complete dataset very quickly.
  • We can use this technique in Production.
  • This technique can capture the importance of "missingness" if it exists.

Disadvantages


It also has a few limitations:
  • It can lead to distortion in the original variable distribution.
  • It can also distort the original variance of the dataset.
  • The covariance with the remaining variables of the dataset will also be distorted.
  • It can only be used for Numerical variables.
  • True outliers may be masked in the distribution.

Code


1. Importing the Libraries, Data and checking the data.
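
The original code for this step is not reproduced here, so the block below is only a stand-in. It imports pandas and NumPy and builds a tiny hypothetical data frame by hand; the column names `Age` and `Fare` and all the values are assumptions chosen for illustration. With a real dataset we would load the file with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: Age has missing values.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0],
        "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46, 11.13, 30.07, 26.55, 8.05],
    }
)

# Quick checks: first rows, column types and shape.
print(df.head())
print(df.dtypes)
print(df.shape)
```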




2. Checking percentage missing values in the data frame.
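
One common way to get the percentage of missing values per column is to take the mean of the boolean mask returned by `isnull()`. The sketch below uses the same kind of hypothetical hand-built frame, not the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0],
        "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46, 11.13, 30.07, 26.55, 8.05],
    }
)

# isnull() gives True/False per cell; the column mean of that mask is the
# fraction of missing values, so multiplying by 100 gives a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```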




3. Calculating Mean and Standard Deviation
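
A sketch of this step, assuming the variable is roughly normal and we impute at the right tail. The data is hypothetical, and `tail_value` is a name introduced for this example. Note that pandas skips missing values by default and uses the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0]}
)

# NaNs are ignored by mean() and std(); std() is the sample std (ddof=1).
age_mean = df["Age"].mean()
age_std = df["Age"].std()

# End-of-tail value at the right tail: mean + 3 * standard deviation.
tail_value = age_mean + 3 * age_std
print(age_mean, age_std, tail_value)
```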




4. Performing Imputation
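
The imputation itself can be a single `fillna` call. In this sketch the imputed values go into a new `Age_Impute` column so the original `Age` column stays available for comparison; the data is again a hypothetical stand-in.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0]}
)

# End-of-tail value: mean + 3 * standard deviation of the observed ages.
tail_value = df["Age"].mean() + 3 * df["Age"].std()

# Keep the original column and write the imputed version next to it.
df["Age_Impute"] = df["Age"].fillna(tail_value)
print(df)
```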




5. Checking the data for Age & Age_Impute columns in the data frame
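
To check the result, it helps to look only at the rows where `Age` was missing and confirm that `Age_Impute` holds the end-of-tail value there. A small sketch on hypothetical data (the name `imputed_rows` is an assumption for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, imputed as in the previous step.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0]}
)
df["Age_Impute"] = df["Age"].fillna(df["Age"].mean() + 3 * df["Age"].std())

# Rows that were originally missing, shown side by side with the imputed column.
imputed_rows = df.loc[df["Age"].isna(), ["Age", "Age_Impute"]]
print(imputed_rows)

# Missing-value count per column, before and after.
print(df[["Age", "Age_Impute"]].isna().sum())
```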



6. Verifying change of variance of the data
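
A simple way to verify the change is to compare `var()` on the original and imputed columns. The sketch below uses hypothetical data, so the numbers it prints will not match the article's dataset, but the direction of the effect is the same: values placed far out in the tail inflate the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, imputed at mean + 3 * standard deviation.
df = pd.DataFrame(
    {"Age": [22.0, 38.0, np.nan, 35.0, 54.0, np.nan, 27.0, 14.0, 58.0, 20.0]}
)
df["Age_Impute"] = df["Age"].fillna(df["Age"].mean() + 3 * df["Age"].std())

# var() ignores NaNs, so the first figure is the variance of observed ages only.
var_before = df["Age"].var()
var_after = df["Age_Impute"].var()
print("Variance before imputation:", var_before)
print("Variance after imputation: ", var_after)
```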




Thus, we can see that the variance has changed significantly for the imputed column, so extra caution should be taken while selecting the imputation technique.

There are also Python libraries that can perform End of Tail Imputation directly in a single line of code. We will cover that part in a separate article.

Summary


In this Quick Note, we studied a well-known imputation technique, End of Tail Imputation. We looked at the assumptions, advantages and disadvantages of the method, and at basic Python code to perform End of Tail Imputation.

That's all from here. Until then... this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
