
Trimming the Outliers


Introduction

 The grass is not as green as it looks from the outside when it comes to Machine Learning or Data Science. Arriving at the perfectly hypothesised model is rarely straightforward, not because ML is not powerful, but because a long, tedious and repetitive stretch of cleaning, analysing and polishing the dataset comes first.

 One thing we need to take care of during this cleaning and improving process is "the outliers". The term sounds simple, but once outliers creep into the data they make it genuinely troublesome to handle.

 Still unsure what outliers are, how they are introduced, and how to identify them? Read about it here > Mystery of Outliers

 Let's begin with the first technique to handle outliers.

Trimming the Outliers

 The simplest and easiest way to handle a problem, the one we are taught from the very beginning, is to remove it.

 We will do the same with outliers: get rid of the problematic data points. This process of removing the observations that are most likely outliers is known as "trimming the outliers". Some geeks also like to call it "truncating the outliers".

This is a very simple technique, about as simple as pressing the "Delete" key on the keyboard. So, let's look at the advantages and disadvantages of this technique before we work through a practical demo.

Advantages

The advantages of this technique are straightforward:

  • It is quick to implement.
  • It is easy to implement.
  • It is both quick and easy to grasp.

Disadvantages

Coming to the negatives of this technique:

  • An outlier in one variable can still carry useful information for another variable.
  • We can end up removing a large chunk of the dataset if there are too many outliers.
  • We have to define the cutoff for what counts as an outlier ourselves.

Practical


We will use the Boston house-price dataset, which comes preloaded in the scikit-learn library (note that it was deprecated in scikit-learn 1.0 and removed in 1.2).


1. Loading the Libraries

Loading libraries

We import pandas, NumPy and scipy.stats for this tutorial: pandas for creating a DataFrame, NumPy for numeric processing and scipy.stats for calculating the z-score.
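Since the screenshot isn't reproduced here, the imports for the steps below look roughly like this (assuming pandas, NumPy and SciPy are installed):

```python
import numpy as np
import pandas as pd
from scipy import stats  # stats.zscore computes column-wise z-scores
```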


2. Loading the Boston Dataset

Loading Boston House Price dataset


To import the Boston dataset, we load it from sklearn.datasets and then create a DataFrame from it.
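A minimal loading sketch. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so on newer versions the dataset has to be fetched from elsewhere (e.g. OpenML); this sketch assumes an older scikit-learn:

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df["MEDV"] = boston.target  # median house value, the target column
print(df.shape)  # (506, 14)
```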

3. Calculating z-score


Next, we can use a method such as the z-score or the IQR to define the cutoff limit for outliers.

Calculating z-score
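As the original code is only a screenshot, here is a self-contained sketch of the z-score step on a small synthetic stand-in for the Boston frame (the column names are borrowed from the dataset, but the numbers are made up):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data with one obvious outlier injected.
rng = np.random.default_rng(42)
df = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 200),      # avg rooms per dwelling
                   "LSTAT": rng.normal(12.7, 7.1, 200)})  # % lower-status population
df.loc[0, "RM"] = 30.0  # inject an outlier

# Absolute z-score of every cell: distance from the column mean,
# in units of that column's standard deviation.
z = np.abs(stats.zscore(df.to_numpy()))
print(z[0, 0] > 3)  # True: the injected value crosses the usual cutoff of 3
```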



4. Trimming the data


Once we have decided on the cutoff limit (3 in our case), we trim the dataset based on that value.

Trimming the Boston data
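The trimming itself is a single boolean mask: keep only the rows whose z-scores are all below the cutoff. A sketch on the same kind of synthetic frame (made-up numbers, not the real Boston data):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
df = pd.DataFrame({"RM": rng.normal(6.3, 0.7, 200),
                   "LSTAT": rng.normal(12.7, 7.1, 200)})
df.loc[0, "RM"] = 30.0  # injected outlier

z = np.abs(stats.zscore(df.to_numpy()))

# Keep a row only if every column's |z| is under the cutoff of 3.
trimmed = df[(z < 3).all(axis=1)]

print(len(df), len(trimmed))  # the injected outlier row is gone
```

On the real Boston frame the same mask is applied to all 13 feature columns at once; any row with a single extreme value is dropped.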

 

5. Verifying


Once the trimming is done, we need to verify the resulting dataset.

Verifying the final data


We can see that around 100 rows have been dropped from the original dataset, which means we lost roughly 20% of the data just to avoid outliers.

Trimming the data is thus a risky task.

Summary


In this Quick Note, we studied a very common technique for handling outliers: trimming the outliers. We looked at the definition, advantages and disadvantages of the method, and performed a practical demo in Python to show the technique.

That's all from here. Until then... This is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
