
Mean or Median Imputation




To understand Mean or Median Imputation, we first need to revise the concepts of Mean & Median. It then becomes easy to see why this is such a widely used imputation method and to identify its issues.

We have already covered Missing Data and defined Imputation & its basics in previous articles.

What is Mean? 

The Mean is the arithmetic average of a set of numbers, which is why it is also called the Average or Arithmetic Average. Finding it is simple: we add all the given values, whether positive or negative, and then divide the total by the number of observations.

\text{Average/Mean} = \frac{\text{Sum of all the observations}}{\text{No. of observations}}
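As a quick illustration of the formula, here is a minimal Python sketch. The list `values` is made-up data, chosen only to show that negative values are added with their sign.

```python
# Hypothetical observations, including a negative value
values = [4, -2, 10, 8]

# Mean = sum of all observations / number of observations
mean = sum(values) / len(values)
print(mean)  # 5.0
```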

What is Median? 


The Median is simply the "middle value". When we arrange the given data in ascending order (small to big) or descending order (big to small), the value that lies exactly in the centre of the arrangement is termed the "Median".

The rule for finding the Median changes slightly depending on whether the number of observations is odd or even. With an odd count there is a single middle value; with an even count there are two middle values, and the Median is their average.


\mathrm{Med}(X) = \begin{cases} X[\frac{n+1}{2}] & \text{if n is odd} \\ \frac{X[\frac{n}{2}] + X[\frac{n}{2}+1]}{2} & \text{if n is even} \end{cases}



X = ordered list of values in the data set, with positions counted from 1.
n = number of values in the data set.
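The odd/even rule can be sketched in a few lines of Python. The function name `median` and the sample lists are our own for illustration; Python lists are indexed from 0, so the positions are shifted by one compared with the usual 1-based textbook notation.

```python
def median(values):
    """Return the middle value of a non-empty list of numbers."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: one middle value (1-based position (n + 1) / 2)
        return ordered[mid]
    # Even count: average of the two middle values
    # (1-based positions n / 2 and n / 2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 5]))      # 5
print(median([7, 1, 5, 3]))   # 4.0
```

In practice we would normally rely on `statistics.median` from the standard library or on pandas, which follow the same rule.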


What is Mean or Median Imputation?


Mean or Median Imputation means imputing the missing values with the mean or median value of the column/variable. 

The process is simple: we compute the mean or median (as per our requirement) of a particular variable/column and replace all the missing values in that column with the computed value.

Now let's look at the assumptions we need to keep in mind and the limitations of this technique. After that, we will get our hands dirty with some code.

Assumptions & Key points to remember


Before using this technique, we need to keep a few things in mind:- 
  • Data is Missing Completely At Random (MCAR).
  • The values that are missing are likely to be similar to those present in the dataset.
  • The mean or median is calculated only from the train set, and the same value is then used to fill the test set.
  • Missing data is not more than 5% of the dataset. 
  • If the variable is skewed, the median is a better option.
  • If the values are normally distributed, the mean and median will be approximately the same, so either can be used.
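Two of the points above, computing the statistic on the train set only and preferring the median for skewed data, can be sketched as follows. This is a minimal sketch assuming pandas and NumPy are available; the `income` column, its values and the 7/3 row split are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed variable: one very large value (250) among small ones
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 31.0, 250.0, np.nan, 33.0, 34.0, np.nan]
})

# Simple positional split, standing in for a real train/test split
train = df.iloc[:7].copy()
test = df.iloc[7:].copy()

# Learn the statistics from the train set only
train_mean = train["income"].mean()
train_median = train["income"].median()
print("train mean:", train_mean, "| train median:", train_median)

# The variable is skewed, so we fill with the median,
# and reuse the SAME train value for the test set
fill_value = train_median
train["income"] = train["income"].fillna(fill_value)
test["income"] = test["income"].fillna(fill_value)
print(test)
```

The single large value pulls the train mean well above the median here, which is why the median is the safer fill value for a skewed variable.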

Advantages


The plus points of using this technique are:- 
  • It is easy to implement.
  • We can get the complete dataset very quickly.
  • This imputation technique can be used in the Production Model.

Limitations


It also has a few limitations attached to it:- 
  • It can lead to distortion in the original variable distribution.
  • It can also distort the original variance of the variable, typically shrinking it.
  • The more missing values we have in our dataset, the greater the distortion. 
  • The covariance with the remaining variables of the dataset will also be distorted.
  • It can only be used for Numerical variables.
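The covariance point can be checked with a small simulation. This is only a sketch on synthetic data: the variables `x` and `y`, the 30% missing rate and the random seed are our own assumptions, not taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise
x = rng.normal(50, 10, 1000)
y = 2 * x + rng.normal(0, 5, 1000)
df = pd.DataFrame({"x": x, "y": y})

# Remove about 30% of x completely at random (MCAR)
mask = rng.random(1000) < 0.3
df.loc[mask, "x"] = np.nan

# Mean imputation of x
df["x_mean"] = df["x"].fillna(df["x"].mean())

# Series.cov ignores rows where either value is missing
cov_before = df["x"].cov(df["y"])
cov_after = df["x_mean"].cov(df["y"])
print("covariance before imputation:", cov_before)
print("covariance after imputation: ", cov_after)
```

Every imputed row sits exactly at the mean of `x`, so it adds nothing to the covariance sum while still increasing the row count. That is why the covariance after imputation comes out smaller.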

Code:- 


Importing Libraries and Data
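The original notebook and dataset are not reproduced here, so the sketch below builds a small made-up data frame instead. The column names `age` and `fare` and all values are hypothetical; with a real file we would typically load the data with `pd.read_csv` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age":  [22, 38, np.nan, 35, 54, np.nan, 27, 14, 58, 20],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86, 8.46, 21.07, 30.07, 26.55, 8.05],
})

print(df.head())
```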


Checking percentage missing values in dataframe
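A minimal way to get the percentage of missing values per column, again on a made-up data frame (the `age` and `fare` columns are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age":  [22, 38, np.nan, 35, 54, np.nan, 27, 14, 58, 20],
    "fare": [7.25, 71.28, 7.92, np.nan, 51.86, 8.46, 21.07, 30.07, 26.55, 8.05],
})

# isnull() marks missing cells as True; the column mean of those flags
# is the fraction of missing values, so multiplying by 100 gives a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

The toy data is deliberately missing far more than 5% of its values so that the effect of imputation is easy to see on ten rows.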


Calculating Mean & Median
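The sketch below computes the mean and median of a hypothetical `age` column and stores the imputed versions in new columns, so the original stays available for comparison. The column names `age_mean` and `age_median` are our own. For brevity the toy data is not split; on a real problem the two statistics would come from the train set only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, np.nan, 27, 14, 58, 20],
})

# pandas skips missing values by default when computing these statistics
age_mean = df["age"].mean()
age_median = df["age"].median()
print("mean:", age_mean, "| median:", age_median)  # mean: 33.5 | median: 31.0

# Keep the original column and add one imputed column per strategy
df["age_mean"] = df["age"].fillna(age_mean)
df["age_median"] = df["age"].fillna(age_median)
```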


Checking Imputed values
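To check the result, we can look only at the rows where the original value was missing and confirm that nothing is left unfilled. As before, the data and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, np.nan, 27, 14, 58, 20],
})
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Rows that were missing in the original column, next to their imputed values
imputed_rows = df.loc[df["age"].isnull(), ["age", "age_mean", "age_median"]]
print(imputed_rows)

# The imputed columns should contain no missing values at all
remaining_missing = df[["age_mean", "age_median"]].isnull().sum()
print(remaining_missing)
```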

Checking change in variance
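Finally, we can compare the variance of the variable before and after imputation. This is a sketch on the same made-up `age` values, not on the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    "age": [22, 38, np.nan, 35, 54, np.nan, 27, 14, 58, 20],
})
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Sample variance (ddof=1); missing values are skipped for the original column
var_original = df["age"].var()
var_mean = df["age_mean"].var()
var_median = df["age_median"].var()

print("original variance:      ", var_original)
print("after mean imputation:  ", var_mean)
print("after median imputation:", var_median)
```

Both imputed columns show a lower variance than the original, because the filled-in values all sit at a single central point.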




Summary


In this Quick Note, we studied a well-known Imputation Technique, i.e. Mean or Median Imputation. We looked at the assumptions, advantages and disadvantages of the method, and also at basic Python code to perform Mean or Median Imputation. 

That's all from here. Until next time... This is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
