Posts

Feature Scaling -- Maximum Absolute Scaling

In previous articles, we read about Feature Scaling and two of the most important techniques used for feature scaling, i.e. Standardization & MinMaxScaling. Here we will see another feature scaling technique that can be used to scale the variables and is somewhat similar to the MinMaxScaling technique. This technique is popularly known as MaxAbsScaling or Maximum Absolute Scaling.

What is MaxAbsScaling? Maximum Absolute Scaling is the technique of scaling the data by its absolute maximum value. The logic used here is to divide each value by the absolute maximum value of each variable/column. Doing so will scale all the values to the range -1 to 1. It can be implemented easily in a few lines of code, as shown below in the practical section.

Note:- Scikit-learn recommends using this transformer on data that is centred at zero or on sparse data.

Formula Used:- X_scaled = X / max(|X|)

Features of MaxAbsScaling:- 1. Minimum and Maximum values are scaled between [-1, 1]…
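The excerpt cuts off before the practical section; as a stand-in, here is a minimal sketch of Maximum Absolute Scaling using scikit-learn's MaxAbsScaler (the sample DataFrame is invented for illustration, not the post's data):

# Minimal sketch of Maximum Absolute Scaling with scikit-learn.
# The DataFrame below is made-up sample data.
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

df = pd.DataFrame({"speed": [120, -40, 80, 200], "mileage": [30, 12, 18, 25]})

scaler = MaxAbsScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Each column is divided by its maximum absolute value,
# so every value now lies in [-1, 1].
print(scaled)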

Feature Scaling -- Min Max Scaling

In our previous article, we read about Feature Scaling and the most common technique used to perform feature scaling, i.e. Standardization. Another important and commonly used technique is "Min-Max Scaling" or "Normalization". As the name suggests, Min-Max Scaling is the technique where the variables are scaled based on their Minimum and Maximum values.

Formula Used:- X_scaled = (X - X_min) / (X_max - X_min)

Unlike Standardization, the mean is not used here. Rather, the Minimum and Maximum values of each variable are used to find the new scaled value. The logic used here is to subtract the Minimum value from each value and divide it by the difference between the Maximum and Minimum values.

Features of Min-Max Scaling: 1. Mean is not centred at 0:- Since in Min-Max scaling we use the Minimum and Maximum values to scale each variable separately, the mean may or may not end up centred at 0. We can see this in the below example, where the mean for all variables is greater than 0…
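The excerpt is truncated before the worked example; a minimal sketch of the technique with scikit-learn's MinMaxScaler (the sample DataFrame is invented for illustration):

# Minimal sketch of Min-Max Scaling with scikit-learn.
# The DataFrame below is made-up sample data.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"speed": [120, 40, 80, 200], "mileage": [30, 12, 18, 25]})

scaler = MinMaxScaler()  # default feature_range is (0, 1)
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# (X - X_min) / (X_max - X_min): each column now spans [0, 1].
# Note the means are greater than 0, matching the post's point that
# Min-Max Scaling does not centre the mean at 0.
print(scaled.mean())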

Feature Scaling -- Standardization

In our previous article/blog we had an overview of Feature Scaling. We saw what Feature Scaling is and how we can use it to our benefit. Having covered the theory related to Feature Scaling, let's move ahead and see the various ways to achieve Feature Scaling and how we can implement them. The very first and most important technique is "Standardization", also known as "Z-Score Normalization". The basic idea of this technique is to subtract the mean from each value and divide it by the standard deviation. Doing so will centre the data at zero with unit standard deviation.

Formula Used:- Z = (X - μ) / σ

This formula is also known as the Z-Score, hence the name Z-Score Normalization.

What is Z-Score? The internet defines it as:- A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values…
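The implementation section is not shown in this excerpt; a minimal sketch of Standardization with scikit-learn's StandardScaler (the sample DataFrame is invented for illustration):

# Minimal sketch of Standardization (Z-score) with scikit-learn.
# The DataFrame below is made-up sample data.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"speed": [120, 40, 80, 200], "mileage": [30, 12, 18, 25]})

scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Z = (X - mean) / std: each column now has (approximately) zero mean
# and unit standard deviation (StandardScaler uses the population std).
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))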

Feature Scaling

Let's begin with a famous saying... "Five Fingers are Never Equal". Yes, we have heard it a lot, and it's true in every case, even in Data Science and Machine Learning... The very first step in the journey of Data Science is Data Collection, and this is where we knowingly or unknowingly collect data that differs in size, units etc., which makes the data varied and inconsistent. If we collect vehicle data, we might have the top speed in MPH, distance covered in KM, dimensions of the vehicle in CM/Inch, Model No. with no unit, etc.

Sample data for Feature Scaling

Thus, when we take this type of raw data and pass it directly to our Machine Learning algorithms, it will give inconsistent results, as the machine understands only numbers and not the units. So, it might give more weightage to the length of the car (1300 mm) than to the mileage of the car (30 kmpl).

What is Feature Scaling? We have seen some quick info about the problem statement, now let's…
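To make the "more weightage" point concrete, here is a small sketch with invented numbers showing how the large-magnitude feature dominates a Euclidean distance (the metric behind algorithms such as k-NN) before scaling:

# Sketch (with invented numbers) of unscaled magnitudes dominating
# a Euclidean distance comparison between two cars.
import numpy as np

# [length in mm, mileage in kmpl]
car_a = np.array([1300.0, 30.0])
car_b = np.array([1320.0, 12.0])

# Raw distance: the small 20 mm length gap still outweighs the
# large relative mileage gap, purely because mm values are bigger.
print(np.linalg.norm(car_a - car_b))  # ~26.9, driven mostly by length

# After min-max scaling both features to [0, 1] (bounds assumed here),
# the mileage difference dominates instead.
length = (np.array([1300.0, 1320.0]) - 1200) / (1500 - 1200)
mileage = (np.array([30.0, 12.0]) - 5) / (35 - 5)
a = np.array([length[0], mileage[0]])
b = np.array([length[1], mileage[1]])
print(np.linalg.norm(a - b))  # ~0.60, now carried by the mileage gap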

Discretisation -- Decision Tree

One of the favourite approaches in Data Science is using tree-based algorithms to find & predict values. Trees are so popular in this field because they work over binary answers {Yes/No, 1/0, True/False}, and when we can draw a clear line between Yes and No, it becomes easier to analyse things. Thus, in the field of data, when we have way too much inflow of data, it is always preferable to get clear-cut answers to our questions. So, why not use the same technique for Discretisation...!!!
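The excerpt stops before the how-to; one common way to realise the idea, sketched here under the assumption of a single numeric feature and a binary target (both invented), is to fit a shallow decision tree and treat the leaf each sample lands in as its bin:

# Sketch of decision-tree discretisation: fit a shallow tree on one
# feature, then use the leaf index of each sample as its bin label.
# The data below is randomly generated for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=500).reshape(-1, 1)                # continuous feature
y = (x.ravel() + rng.normal(0, 10, size=500) > 50).astype(int)  # binary target

# A small max_depth keeps the number of leaves (bins) small.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(x, y)

# Each leaf index becomes a discrete bin; the tree picks the cut
# points that best separate the target, instead of fixed widths.
bins = tree.apply(x)
print(np.unique(bins))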

Discretisation -- Equal Frequency

We have studied a few techniques commonly used for the process of Discretisation or binning. Here we discuss another important binning technique: dividing the data into equal-sized groups, i.e. the total data is divided into groups/bins, each containing an equal amount of data. The important part to note here is that the width of each bin may differ in this case, i.e. one bin may span 0-5 while another may span 70-100.
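A minimal sketch of equal-frequency binning using pandas' qcut (the data is randomly generated for illustration):

# Minimal sketch of equal-frequency binning with pandas.qcut.
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(42).exponential(20, size=1000))

# qcut splits on quantiles, so each of the 4 bins holds ~250 values,
# while the bin widths can differ a lot (a narrow first bin and a
# very wide last one, as the post describes).
bins = pd.qcut(values, q=4)
print(bins.value_counts())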

Discretisation -- Equal Widths

Discretisation or Binning is the process of dividing the data into equal intervals or bins. Yes, we have studied this definition, but how can we do it? This question keeps running through our minds... Sit back and relax, we have got you covered... Here we are going to learn a simple and easy technique that we also use in our daily life: dividing the data into equal intervals, i.e. we divide the data into N equal groups of the same width (gaps).

Equal Width Binning Example

Let's understand it better using the above example (image). Here we had values ranging from 0-300, quite a large range for doing any analysis and visualisation. Thus, we decided to divide the data into equal widths of 20 (bins of 20), i.e. 0-20, 21-40, 41-60* and so on. *We used 21, 41, 61, ... because we wanted to make clear that each range is inclusive of its upper limit. Doing so, we were able to group the data into 15 bins, which not only made it easy to visualize but also helps in analysing the data better and…
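A minimal sketch of equal-width binning with pandas' cut, mirroring the example's 0-300 range and 15 bins of width 20 (the data itself is randomly generated for illustration):

# Minimal sketch of equal-width binning with pandas.cut.
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(7).uniform(0, 300, size=1000))

# cut divides the full range into equal-width intervals; the count
# per bin may differ, but every bin spans the same width (20 here).
bins = pd.cut(values, bins=15)
print(bins.value_counts().sort_index())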

Discretisation

The process of converting analogue or continuous variables/data into discrete variables/data is known as Discretisation. Discretisation transforms continuous variables into discrete variables by creating a set of contiguous intervals that span the range of the variable's values. Discretisation is also called binning, where "bin" is an alternative name for an interval. Not sure why we are introducing this term here..?? Let's find out in the next section.

Data Visualization — IPL Data Set (Part 2)

Welcome to the 3rd post in the Data Visualization series, on one of the most loved/followed topics in India: the IPL (Indian Premier League), 2008–2020 (Part 2). In Part 1 we did an analysis based on the Teams; here we will be doing analysis based on all the other fields and try to cover some very interesting and unique analyses.

Overview of the Data Set: descriptions of the columns of IPL Dataset 1 and IPL Dataset 2, followed by a quick check of the data in these columns.

Let's begin with some visualization and find the top 10 players by analysing the number of MoM (Man of the Match) awards achieved.

MoM Awards

The above graph shows the top 10 players of IPL with the most Man of the Match awards… and guess what… it's none other than our Mr. 360 (ABD) with 23 awards, followed by The Universe Boss (Gayle 333) with 22 awards, roHIT MAN of India with 18, and Warner and Captain Cool (MSD) with 17 each. Just a random thought of checking…
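A hedged sketch of how such a chart could be produced; the CSV file name and the player_of_match column are assumptions based on the commonly used Kaggle IPL matches dataset, not confirmed by the post:

# Sketch of the top-10 Man of the Match bar chart.
# File name and column name are assumptions; adjust to the actual files.
import pandas as pd
import matplotlib.pyplot as plt

matches = pd.read_csv("IPL Matches 2008-2020.csv")

# Count awards per player and keep the ten highest.
top10 = matches["player_of_match"].value_counts().head(10)

top10.plot(kind="bar", title="Top 10 players by MoM awards")
plt.ylabel("MoM awards")
plt.tight_layout()
plt.show()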