
Feature Scaling -- Standardization



In our previous article we gave an overview of Feature Scaling: what it is and how we can use it to our benefit. Having covered the theory, let's move ahead and see the various ways to achieve Feature Scaling and how to implement them.

The first and most important technique is "Standardization", also known as "Z-Score Normalization".

The basic idea of this technique is to subtract the mean from each value and divide by the standard deviation. Doing so centres the data at zero (the mean of each feature becomes 0) with unit standard deviation.

Formula Used:- 


z = (x − μ) / σ

where x is an observation, μ is the mean of the feature, and σ is its standard deviation.

This formula is also known as Z-Score, hence the name Z-Score Normalization. 

What is Z-Score?

The internet defines it as:

A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values.

A Z-Score represents how many standard deviations a given observation lies from the mean, which also specifies the observation's location within the distribution. The sign of the z-score (+ or −) indicates whether the observation is above (+) or below (−) the mean.
 

Features Of Standardization


1. Preserves the shape of the original data distribution:- 

It means the original shape of our data distribution remains the same. We can see this in the example below, where the data has been centred around zero while its distribution remains unchanged.


Preserved original data distribution



2. Preserves the outliers:- 

Since the scaling only shifts by the mean and divides by the standard deviation, outliers are not removed or capped; they are preserved (scaled, but still extreme relative to the rest of the data).

3. Sets the mean to 0 (zero).


4. Sets the variance to 1.


5. Minimum and maximum values vary. 

It means the maximum and minimum values of each variable can differ and do not necessarily need to be the same across variables.

Practical


Time to get our hands dirty and implement it. 

We will be using the "Boston Housing Dataset" for demo purposes.

1. Importing the necessities:-


 
Importing Data and Libraries
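As a sketch, the setup could look like the following. Since load_boston has been removed from recent scikit-learn releases, a small synthetic DataFrame (with made-up values under the real Boston column names CRIM, TAX and AGE) stands in for the dataset here:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Boston Housing data: features with
# very different magnitudes, as in the real dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CRIM": rng.exponential(3.6, size=100),   # per-capita crime rate
    "TAX":  rng.uniform(187, 711, size=100),  # property-tax rate
    "AGE":  rng.uniform(3, 100, size=100),    # % of older units
})
print(df.shape)  # (100, 3)
```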



2. Getting Data Insights:- 


To get any meaningful insights from the data, we first need to be familiar with it: the number of rows/columns, the type of data, what each variable represents, their magnitudes, etc.

To get a rough idea of our data we use the .head() method.

Boston Data Overview


To learn more about the dataset, i.e. what each variable represents, we can use the .DESCR attribute.

Description of Boston House Data


To further get the mathematical details from the data, we can use the .describe() method. 

Mathematical Description of Boston House Data
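The inspection steps above can be sketched as follows; a tiny made-up DataFrame stands in here for the Boston data:

```python
import pandas as pd

# Hypothetical stand-in for the Boston data (CRIM and TAX are
# real Boston column names; the values here are illustrative).
df = pd.DataFrame({"CRIM": [0.006, 0.027, 0.027, 0.032],
                   "TAX": [296.0, 242.0, 242.0, 222.0]})

print(df.head())      # first rows: a rough look at the data
print(df.describe())  # count, mean, std, min, quartiles, max per column
print(df.shape)       # (rows, columns)
```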


3. Scaling the Data


The StandardScaler class from scikit-learn's preprocessing module is an implementation of Standardization, so we will use it directly here.

Implementing StandardScaler


Once, the data is scaled we can check the mean and standard deviation of our data. 

Mean and Standard Deviation
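A minimal sketch of this step, using a small made-up feature matrix in place of the Boston data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features with very different magnitudes (illustrative values).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)  # (x - mean) / std, column-wise

print(scaled_data.mean(axis=0))  # ~[0, 0]
print(scaled_data.std(axis=0))   # [1, 1]
```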


4. Verifying the Scaling


To verify the end result, we first need to convert the scaled_data to a pandas DataFrame.

Converting scaled data to dataframe

Next, we need to verify whether the data has been scaled, for which we use the .describe() method again.

Scaled Data:- 

Describing scaled data

We can clearly see here that the mean has been reduced to 0 and the standard deviation to 1.

Original Data:- 


Describing Original data
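The whole verification step can be sketched like this, again on a small stand-in dataset rather than the real Boston data:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-in data with Boston-style column names.
df = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3, 0.4],
                   "TAX": [200.0, 300.0, 400.0, 500.0]})

scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df.describe())  # mean row is ~0 for every column
print(df.describe())         # original means/stds are unchanged
```

Note that pandas' describe() reports the sample standard deviation (ddof=1), so on a tiny sample it reads slightly above 1 even though StandardScaler has set the population standard deviation to exactly 1.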


Summary

We have studied Standardization, the technique most commonly used for Feature Scaling. Here is a quick summary of the technique:- 

1. Centers the mean at 0.
2. Scales the variance to 1.
3. Preserves the shape of the original distribution.
4. The minimum and maximum values of the different variables may vary.
5. Preserves Outliers.
6. Good for algorithms that require features centred at zero.

Happy Learning... !!
