Posts

Showing posts from 2021

Discretisation -- Decision Tree

One of the most popular approaches in Data Science is using tree-based algorithms to predict values. Trees are so popular in this field because every split asks a binary question (Yes/No, 1/0, True/False), and when we can draw a clear line between Yes and No, the data becomes much easier to analyse. Thus, when we have far too much data flowing in, it is always preferable to get clear-cut answers to our questions. So, why not use the same technique for Discretisation?

Discretisation -- Equal Frequency

We have studied a few techniques commonly used for Discretisation, or binning. Here we discuss another important binning technique: dividing the data into groups of equal size, i.e. the total data is divided into groups/bins that each contain the same number of observations. The important part to note is that the widths of the bins may differ in this case, i.e. one bin can span 0-5 while another spans 70-100.
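
As a rough illustration of the idea above, here is a minimal sketch in plain Python. The function name `equal_frequency_bins` and the sample `ages` list are made up for this example and do not come from the original post. It sorts the values and slices them into groups of (almost) equal size, so the bin widths fall out of the data rather than being chosen up front.

```python
def equal_frequency_bins(values, n_bins):
    """Split values into n_bins groups holding (almost) the same number of items."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical sample data: dense at the low end, sparse at the high end.
ages = [1, 2, 3, 4, 5, 18, 25, 33, 41, 70, 85, 100]
bins = equal_frequency_bins(ages, 3)

for b in bins:
    print(f"{b[0]}-{b[-1]}: {len(b)} values")
```

Every bin ends up with four values, yet the first bin covers only 1-4 while the last covers 41-100. Note that this simple slicing can place identical values in two neighbouring bins; in practice a library routine such as pandas `qcut` is the more common choice.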

Discretisation -- Equal Widths

Discretisation, or binning, is the process of dividing the data into intervals or bins. Yes, we have studied this explanation, but how can we actually do it? This question keeps running through our minds... Sit back and relax, we have got you covered. Here we are going to learn a simple technique that we also use in our daily life: dividing the data into equal intervals, i.e. we divide the data into N groups of the same width (gap). [Figure: Equal Width Binning Example] Let's understand it better using the above example (image). Here we had values ranging from 0-300, quite a large range for any analysis or visualisation. Thus, we decided to divide the data into equal widths of 20 (bins of 20), i.e. 0-20, 21-40, 41-60* and so on. * We used 21, 41, 61, ... because we wanted to make clear that each range is inclusive of its upper limit. Doing so, we were able to group the data into 15 bins, which not only made it easy to visualise but also helps in analysing the data better and
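
The 0-300 example above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `equal_width_bin` is hypothetical, the values are assumed to be integers, and the labels follow the 0-20, 21-40, 41-60 convention described in the excerpt, with the upper limit inclusive.

```python
import math


def equal_width_bin(value, width=20, lower=0):
    """Return the (low, high) label of the bin holding an integer `value`."""
    if value < lower:
        raise ValueError("value lies below the first bin")
    if value <= lower + width:
        # The first bin also contains the lower edge itself: 0-20.
        return (lower, lower + width)
    # Upper limits are inclusive, so 40 belongs to 21-40 and 41 to 41-60.
    index = math.ceil((value - lower) / width) - 1
    return (lower + index * width + 1, lower + (index + 1) * width)


# Collect the distinct bin labels for every integer from 0 to 300.
labels = sorted({equal_width_bin(v) for v in range(0, 301)})

print(len(labels))            # number of bins
print(labels[:3], labels[-1]) # first three bins and the last one
```

Counting the distinct labels confirms the 15 bins mentioned in the excerpt, running from 0-20 up to 281-300.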

Discretisation

The process of converting analogue or continuous variables/data into discrete variables/data is known as Discretisation. It works by creating a set of contiguous intervals that span the range of the variable's values. Discretisation is also called binning, where "bin" is an alternative name for an interval. Not sure why we are introducing and referring to this term here? Let's find out in the next section.

Data Visualization — IPL Data Set (Part 2)

Welcome to the 3rd post in the Data Visualization series, on one of the most loved and followed topics in India: IPL (Indian Premier League), Part 2, 2008–2020. In Part 1 we did an analysis based on the teams; here we will analyse all the other fields and try to cover some interesting and unique angles. Overview of the Data Set [Figure: Description of columns of IPL Dataset -1] [Figure: Description of columns of IPL Dataset -2] Let's begin by checking the data in these columns... [Figure: IPL Data Set 1 Overview] [Figure: IPL Data Set 2 Overview] Let's begin with some visualization and find the top 10 players by the number of MoM (Man of the Match) awards achieved. [Figure: MoM Awards] The above graph shows the top 10 IPL players with the most Man of the Match awards, and guess what, it's none other than our Mr. 360 (ABD) with 23 awards, followed by The Universe Boss (Gayle 333) with 22 awards, roHIT MAN of India with 18, and Warner and Captain Cool (MSD) with 17 each. Just a random thought of checkin

Data Visualization — IPL Data Set (Part 1)

Welcome to the 2nd post in the Data Visualization series, on one of the most loved and followed topics in India: IPL (Indian Premier League), Part 1. In this post, we will focus on various analyses based on the teams. Overview of the Data Set [Figure: Description of columns of IPL Dataset] Let's begin by checking the data in these columns. [Figure: IPL Data Set 1 Overview] [Figure: IPL Data Set 2 Overview] Moving to the most interesting part: visualizing the dataset and its relations. Let's begin by having a look at the total wins by each team since 2008... [Figure: Team VS No. of Match Wins] The above bar chart shows the top 5 teams with the most match wins across all the seasons. Surprisingly, RCB is among the top 5 yet still hasn't won any IPL season. Now let's have a closer look at these wins. [Figure: Team VS Wins based on runs/wickets] The above bar chart is a detailed version of the previous graph, showing the top 5 teams with the most wins, split into wins achieved batting first and batting second. Blue Bars represent the wins

Data Visualization — Netflix Data Set

Welcome to the first post in the Data Visualization series, on one of the best pastimes and sources of entertainment for people around the globe: Netflix. We will go through the dataset and get an overview of the content present on Netflix. Let's have an overview of the dataset. [Figure: Description of columns of Netflix Dataset] Moving on, let's look at the data present in these columns. [Figure: Netflix Data Overview] Let's move forward and start visualizing the data to get some insights. Firstly, let's see the number of shows of each type. [Figure: Number of shows based on types] From the above graph, we can see that we have data on around 5400 movies and 2400 TV shows. It indicates that the number of movies released on Netflix is higher than the number of TV shows, so we can say Netflix is closer to a cinema hall than to a TV set. Now let's have a look at the countries producing the most shows for Netflix. Top 20 co

Decision Tree Encoding

Introduction: A Decision Tree is a flowchart-like structure in which each internal node represents a condition on an attribute with a binary output (e.g. Head or Tail in a coin flip). It has nodes and branches, where a node represents a condition and its branches represent the outcomes. Decision trees are very helpful in predicting the binary outcome of an action. They can be used not only for building predictive models but also for Imputation, Encoding, etc. In the case of Variable Encoding, the variables are encoded based on the predictions of the Decision Tree: a single feature and the target variable are used to fit a decision tree, then the values of the original dataset are replaced with the predictions from the Decision Tree.
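
To make the fit-then-replace idea concrete, here is a minimal sketch in plain Python. It is not the post's own implementation: it hand-rolls the simplest possible tree (a single split, often called a decision stump) instead of using a library, and the names `fit_stump` and `encode` and the `ages`/`bought` data are invented for illustration. In practice a library tree such as scikit-learn's `DecisionTreeClassifier` or `DecisionTreeRegressor` would normally be used.

```python
def fit_stump(xs, ys):
    """Fit a depth-1 tree on one feature: one threshold, one mean target per side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors when each side predicts its own mean.
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def encode(xs, stump):
    """Replace every original value with the tree's prediction for it."""
    threshold, left_mean, right_mean = stump
    return [left_mean if x <= threshold else right_mean for x in xs]


# Hypothetical data: one feature (age) and a binary target (bought or not).
ages = [18, 22, 25, 31, 45, 52, 60, 67]
bought = [0, 0, 0, 1, 1, 1, 1, 1]

stump = fit_stump(ages, bought)
encoded = encode(ages, stump)
print(stump)
print(encoded)
```

After encoding, the eight distinct ages collapse into just two values, one per leaf of the tree, which is exactly the replacement step described above. A deeper tree would simply produce more leaves and therefore more distinct encoded values.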

Rare Label Encoding

Introduction: Till now we have seen many techniques for encoding categorical variables, each with its own strengths. But let me put up a question first, before diving into another new technique. Ques.: Suppose we have around 50 different values for a variable, a few with a very high frequency and some with very little representation. Which technique are you going to use for encoding here, and why? Please share your answers below in the comment section. Even if you don't know the correct answer, please give it a try. By engaging yourself you will definitely learn more. DO NOT MOVE AHEAD TILL YOU HAVE THOUGHT ABOUT/COMMENTED AN ANSWER. So now, continuing to our topic. Rare Label Encoding is a technique used to group values together and assign them a common "Rare" label if they have very little representation compared to the other values. Let's have an example to understand it better. Suppose we have a dataset of 100
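
Independently of the post's own example, the grouping step can be sketched in plain Python as follows. Everything specific here is an assumption made for illustration: the function name `rare_label_encode`, the 5% cut-off, the label text "Rare", and the made-up `cities` data.

```python
from collections import Counter


def rare_label_encode(values, min_share=0.05, rare_label="Rare"):
    """Replace categories seen in less than `min_share` of the rows with one label."""
    counts = Counter(values)
    total = len(values)
    # Categories that are frequent enough to keep their own label.
    frequent = {v for v, c in counts.items() if c / total >= min_share}
    return [v if v in frequent else rare_label for v in values]


# Hypothetical column: two dominant categories and three rare ones.
cities = ["Delhi"] * 100 + ["Mumbai"] * 88 + ["Pune"] * 6 + ["Goa"] * 4 + ["Agra"] * 2

encoded = rare_label_encode(cities)
print(Counter(encoded))
```

The three low-frequency cities are merged into a single "Rare" category, while the two dominant ones keep their names. The threshold is a judgement call and usually needs tuning per dataset.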