What is Imputation?
Why is Imputation Important?
Now that we know the definition of imputation, the next questions are: why should we use it, and what happens if we don’t?
Here are the answers.
We use imputation because missing data can cause the following issues:
- Incompatible with most of the Python libraries used in Machine Learning:- Yes, you read that right. Most ML libraries (the most common being scikit-learn) have no provision for handling missing data automatically in the majority of their models, so passing them incomplete data can raise errors.
- Distortion in Dataset:- A large amount of missing data can distort a variable’s distribution, i.e. it can increase or decrease the relative share of a particular category in the dataset.
- Affects the Final Model:- Missing data can introduce bias into the dataset and lead the model to a faulty analysis.
Another reason, and the most important one, is that we want to restore the complete dataset. This matters mostly when we do not want to lose any (more) data because all of it is important, or when the dataset is not very big and removing part of it could have a significant impact on the final model.
Great! We now have the basic concepts of missing data and imputation. Next, let’s look at the different imputation techniques and compare them. But before we jump in, we have to know the types of data in our dataset.
Sounds strange? Don’t worry. Most data falls into 4 types: Numeric, Categorical, Date-time, and Mixed. These names are quite self-explanatory, so we will not describe them in depth here.
Fig 2:- Types of Data
Imputation Techniques
1. Complete Case Analysis (CCA)
This is a straightforward method of handling missing data: it simply removes every row that has a missing value, i.e. we keep only the rows where the data is complete. This method is also popularly known as “Listwise deletion”.
- Assumptions:-
- Data is Missing At Random(MAR).
- Missing data is completely removed from the table.
- Advantages:-
- Easy to implement.
- No data manipulation is required.
- Limitations:-
- Deleted data can be informative.
- Can lead to the deletion of a large part of the data.
- Can create a bias in the dataset if a large share of one particular category is deleted from it.
- The production model will not know what to do with Missing data.
- When to Use:-
- Data is MAR(Missing At Random).
- Good for Mixed, Numerical, and Categorical data.
- Missing data is not more than 5% – 6% of the dataset.
- The rows with missing data don’t carry much information, and removing them will not bias the dataset.
- Code:-
Fig 4:- CCA Code
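The original code figure is not reproduced here, so the following is only a minimal sketch of CCA with pandas, not the author’s own code. The DataFrame, its column names (`age`, `gender`, `fare`), and its values are hypothetical and exist purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0, 29.0],
    "gender": ["Male", "Female", None, "Male", "Female"],
    "fare": [7.25, 71.28, 8.05, np.nan, 13.00],
})

# Share of missing values per column, handy for checking the 5% - 6% guideline.
missing_share = df.isna().mean()

# Complete Case Analysis: keep only the rows with no missing value in any column.
df_cca = df.dropna(axis=0, how="any")

print(missing_share)
print(f"Rows before: {len(df)}, rows after CCA: {len(df_cca)}")
```

In this toy data every incomplete row is missing just one value, yet only 2 of the 5 rows survive, which shows how quickly CCA can delete a large part of a dataset.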
2. Arbitrary Value Imputation
This is an important imputation technique because it can handle both numerical and categorical variables. The idea is to group the missing values in a column and assign them a new value that lies far outside the range of that column. Typically we use values like 99999999 or -9999999 for numerical variables, and “Missing” or “Not defined” for categorical variables.
- Assumptions:-
- Data is not Missing At Random.
- The missing data is imputed with an arbitrary value that is neither an existing value in the dataset nor the mean, median, or mode of the data.
- Advantages:-
- Easy to implement.
- We can use it in production.
- It retains the information that a value was missing, in case that missingness is itself meaningful.
- Disadvantages:-
- Can distort original variable distribution.
- Arbitrary values can create outliers.
- Extra caution is required in selecting the Arbitrary value.
- When to Use:-
- When data is not MAR(Missing At Random).
- Suitable for all data types.
- Code:-
Fig 5:- Arbitrary Value Imputation Code
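As the original code figure is not available, here is a rough sketch of arbitrary value imputation with pandas rather than the author’s own implementation. The DataFrame and its columns are hypothetical; the fill values `-9999999` and `"Missing"` are taken from the examples mentioned above, and the constant names `NUMERIC_FILL` and `CATEGORY_FILL` are made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 29.0],
    "gender": ["Male", "Female", None, "Male", None],
})

# Assumed arbitrary values: they must not occur among the real observations.
NUMERIC_FILL = -9999999
CATEGORY_FILL = "Missing"

# Guard against picking a value that already exists in the column.
assert NUMERIC_FILL not in df["age"].dropna().values
assert CATEGORY_FILL not in df["gender"].dropna().values

# Replace missing entries while leaving the original DataFrame untouched.
df_imputed = df.copy()
df_imputed["age"] = df_imputed["age"].fillna(NUMERIC_FILL)
df_imputed["gender"] = df_imputed["gender"].fillna(CATEGORY_FILL)

print(df_imputed)
```

The two guard assertions reflect the “extra caution” point: if the chosen value already appears in the column, the imputed rows can no longer be told apart from real ones.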
3. Frequent Category Imputation
This technique replaces each missing value with the most frequent value of that variable, or in simple words, with the mode of that column. It is also referred to as Mode Imputation.
- Assumptions:-
- Data is missing at random.
- There is a high probability that the missing data looks like the majority of the data.
- Advantages:-
- Implementation is easy.
- We can obtain a complete dataset in very little time.
- We can use this technique in the production model.
- Disadvantages:-
- The higher the percentage of missing values, the greater the distortion.
- May lead to over-representation of a particular category.
- Can distort original variable distribution.
- When to Use:-
- Data is Missing at Random(MAR)
- Missing data is not more than 5% – 6% of the dataset.
- Code:-
Fig 6:- Frequent Category Imputation Code
Here we notice that “Male” was the most frequent category, so we used it to replace the missing data. We are now left with only 2 categories, i.e. Male and Female.
Thus, we can see that every technique has its advantages and disadvantages, and which one to use depends on the dataset and the situation.
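The mode-imputation step described above can be sketched with pandas as follows. This is an illustrative example under assumed data, not the code from the original figure: the `gender` column and its values are hypothetical and only chosen to mirror the Male/Female case.

```python
import pandas as pd

# Hypothetical toy column: values are made up to mirror the Male/Female example.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Male", None, "Male", None, "Female"],
})

# mode() ignores missing values by default and returns a Series,
# because several values can tie for the highest frequency; take the first one.
most_frequent = df["gender"].mode().iloc[0]

# Fill the gaps with the most frequent category, leaving the original untouched.
df_imputed = df.copy()
df_imputed["gender"] = df_imputed["gender"].fillna(most_frequent)

print("Most frequent category:", most_frequent)
print(df_imputed["gender"].value_counts())
```

In this toy column “Male” makes up 3 of the 5 observed values (60%) before imputation and 5 of 7 (about 71%) afterwards, which is the over-representation risk listed among the disadvantages.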
That’s all from here…
Happy Learning…