
Outliers Capping


Introduction 

In the past few articles, we looked at outliers: what they are, how they are introduced, and a few techniques for handling them in a dataset.

Another widely used technique for handling outliers is capping. Capping means defining limits for a field.

Capping is similar to trimming, with one difference. When trimming, we used an IQR or z-score threshold and removed the values beyond it. With capping, we keep those rows and replace the outlying values with the limit, bringing them back into the range of our data.

Why Capping?

Capping is sometimes also referred to as censoring, because when we use it in data preprocessing we do not remove values; we replace any value higher than the capping value with the capping value itself.

Sounds confusing? It's simple: instead of trimming or removing the values above the limit, we set them to the limit.

Another advantage of capping is that it can be applied at both the upper and lower limits of the data.
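As a minimal sketch of the idea, the snippet below caps a small made-up series at both ends using pandas `Series.clip`. The ages and the limits of 18 and 60 are assumptions chosen only for illustration; they do not come from the dataset used in this article.

```python
import pandas as pd

# Hypothetical ages; 2 and 95 stand in for outliers.
ages = pd.Series([2, 22, 25, 31, 38, 44, 95])

# Illustrative limits, chosen only for this example.
lower_limit, upper_limit = 18, 60

# Values below the lower limit become 18, values above the upper limit become 60.
capped = ages.clip(lower=lower_limit, upper=upper_limit)

print(capped.tolist())  # [18, 22, 25, 31, 38, 44, 60]
```

No rows are dropped: the capped series has the same length as the original.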

We can perform capping/censoring in the following ways:

- Arbitrary Value
- Quantiles
- Gaussian Approximation
- IQR
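Apart from the arbitrary value, each method in the list derives its limits from the data. The sketch below shows one common way to compute those limits with pandas. The function name `capping_limits`, the sample values, and the fold values (0.05 for quantiles, 3 standard deviations for Gaussian, 1.5 times the IQR) are assumptions based on widely used conventions, not settings taken from this article.

```python
import pandas as pd

def capping_limits(values, method, fold):
    """Return (lower, upper) capping limits for a numeric Series.

    Hypothetical helper written for illustration only.
    """
    if method == "quantiles":
        # fold is the tail fraction, e.g. 0.05 gives the 5th and 95th percentiles.
        return values.quantile(fold), values.quantile(1 - fold)
    if method == "gaussian":
        # fold is the number of standard deviations around the mean.
        mean, std = values.mean(), values.std()
        return mean - fold * std, mean + fold * std
    if method == "iqr":
        # fold multiplies the inter-quartile range beyond Q1 and Q3.
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        return q1 - fold * iqr, q3 + fold * iqr
    raise ValueError(f"unknown method: {method}")

# Made-up sample with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

quantile_limits = capping_limits(values, "quantiles", 0.05)
gaussian_limits = capping_limits(values, "gaussian", 3)
iqr_limits = capping_limits(values, "iqr", 1.5)

print("quantiles:", quantile_limits)
print("gaussian: ", gaussian_limits)
print("iqr:      ", iqr_limits)
```

Once the limits are known, capping itself is the same in every case: values beyond a limit are replaced by that limit.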

So, without wasting any time, let's dive into a practical approach and get our hands dirty.

Arbitrary Value Capping

Here we pick an arbitrary value ourselves and use it to cap the data.

1. Importing the Libraries & Data


Importing Libraries & Data


2. Verifying the Data


The next thing we need to do is check the data.

Original Data



3. Using Arbitrary Capping


Time to import and initialize the arbitrary outlier capper from the feature_engine library.

Initializing Arbitrary capper


4. Verifying the capped data

Once we have capped the values using the Arbitrary capper, we need to check the values.

Final Data


Notice that the shape of our data has not changed; it is the same as when we began. The difference is that the maximum and minimum age have been capped to the values we specified while initializing the arbitrary capper.
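The four steps above can be approximated in plain pandas. This is only a sketch of the behaviour described, not the feature_engine implementation: the DataFrame, the `max_caps` and `min_caps` dictionaries, and the limits of 10 and 50 are all hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for the dataset used in the article.
df = pd.DataFrame({
    "age": [0.5, 19, 27, 35, 48, 62, 80],
    "fare": [7.2, 8.1, 13.0, 26.5, 52.0, 71.3, 512.3],
})

# Arbitrary limits per column, chosen by hand for illustration.
max_caps = {"age": 50}
min_caps = {"age": 10}

capped = df.copy()
for column, limit in max_caps.items():
    capped[column] = capped[column].clip(upper=limit)  # cap the upper end
for column, limit in min_caps.items():
    capped[column] = capped[column].clip(lower=limit)  # cap the lower end

# The shape is unchanged; only the extremes of "age" have moved.
print(df.shape, capped.shape)
print(capped["age"].min(), capped["age"].max())
```

Columns that are not listed in either dictionary, such as `fare` here, are left untouched.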

Winsorizing

A fancy name for a simple technique. Winsorizing means capping: limiting the upper and lower ends of the data to defined values, exactly as we did for arbitrary outlier capping. The difference is that instead of supplying arbitrary values ourselves, we use a statistical method to derive the limits used to cap, censor, or winsorize our data.

There are three different ways to perform winsorizing, all using the same class.

1. Importing the Libraries & Data

Importing Libraries & Data


We import Winsorizer from the feature_engine library and use the pre-loaded "Boston" dataset for the demo.

2. Verifying the Limits


Original Data



3. Performing Winsorization


Performing Winsorization


Here we need to select the capping method: 'iqr', 'quantiles', or 'gaussian'. We have used 'quantiles' in our example. Next, we need to define whether to cap the data at a single end or at both ends.
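To make the choice of ends concrete, here is a rough pandas sketch of quantile-based winsorizing with a selectable tail. The helper `winsorize_quantiles`, its `fold` and `tail` parameters, and the sample values are hypothetical and written for this illustration; they are not the library's own code.

```python
import pandas as pd

def winsorize_quantiles(values, fold=0.05, tail="both"):
    """Cap a numeric Series at its quantiles on the chosen end(s).

    Hypothetical helper; tail is "left", "right", or "both".
    """
    if tail not in ("left", "right", "both"):
        raise ValueError("tail must be 'left', 'right', or 'both'")
    # A limit of None tells clip to leave that end alone.
    lower = values.quantile(fold) if tail in ("left", "both") else None
    upper = values.quantile(1 - fold) if tail in ("right", "both") else None
    return values.clip(lower=lower, upper=upper)

# Made-up values standing in for a column such as 'RM'.
rooms = pd.Series([3.5, 5.8, 6.0, 6.2, 6.3, 6.5, 6.6, 6.9, 7.1, 8.8])

right_only = winsorize_quantiles(rooms, fold=0.05, tail="right")
both_ends = winsorize_quantiles(rooms, fold=0.05, tail="both")

print(right_only.min(), right_only.max())  # lower end untouched
print(both_ends.min(), both_ends.max())    # both ends pulled in
```

With `tail="right"` only the largest values are pulled in, while `tail="both"` also raises the smallest values to the lower quantile.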

4. Verifying the Transformation


Final Data


We can see that the values of the 'RM' column are capped based on quantiles. The values of the other columns are capped in the same way.

Final Data

Summary 

In this Quick Note, we studied a very common technique for handling outliers: capping. We looked at four different ways to perform capping. We also studied winsorization and how to perform it in a single line of code, and we worked through a practical example in Python.

That's all from here. Until then, this is the Quick Data Science Team signing off.

Comment below to get the complete notebook and dataset.
