
Posts

Showing posts from October, 2021

Decision Tree Encoding

  Introduction A Decision Tree is a flowchart-like structure in which each internal node represents a condition on an attribute with binary outcomes (e.g. Head or Tail in a coin flip). It is made of nodes and branches, where a node represents a condition and the branches represent its outcomes. Decision trees are very helpful in predicting the binary outcome of an action, and they can be used not only for building predictive models but also for Imputation, Encoding, etc. In the case of Variable Encoding, the variable is encoded based on the predictions of the Decision Tree: a single feature and the target variable are used to fit a decision tree, and then the values in the original dataset are replaced with the predictions from that tree.
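Below is a minimal sketch of this idea using pandas and scikit-learn. The column names ('city', 'price'), the toy data and the tree depth are assumptions made purely for illustration, not something taken from the post.

```python
# A minimal sketch of decision tree encoding (assumed column names and toy data).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "city":  ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai", "Pune"],
    "price": [10, 20, 12, 8, 22, 9],          # target variable
})

# Step 1: give each category a temporary integer code so the tree can split on it.
codes = df["city"].astype("category").cat.codes.to_frame()

# Step 2: fit a shallow decision tree on the single feature vs the target.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(codes, df["price"])

# Step 3: replace the original categories with the tree's predictions.
df["city_encoded"] = tree.predict(codes)
print(df)
```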

Rare Label Encoding

  Introduction So far we have seen many techniques for encoding categorical variables, all with impressive capabilities and performance. But let me put up a question before diving into another new technique. Question: Suppose we have around 50 different values for a variable, a few with a very high frequency of representation and some with very little representation. Which technique would you use for encoding here, and why? Please share your answer below in the comment section. Even if you don't know the correct answer, please give it a try; by engaging yourself you will definitely learn more. DO NOT MOVE AHEAD UNTIL YOU HAVE THOUGHT ABOUT OR COMMENTED AN ANSWER. Now, continuing with our topic: Rare Label Encoding is a technique that groups values together and assigns them a common "Rare" label if they have very little representation compared to the other values. Let's have an example to understand it better. Suppose we have a dataset of 100
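Here is a minimal sketch of the grouping step in pandas. The column name ('brand'), the 5% frequency threshold and the toy data are illustrative assumptions.

```python
# A minimal sketch of rare label encoding with pandas.
import pandas as pd

s = pd.Series(["Tata"] * 40 + ["Honda"] * 35 + ["Jaguar"] * 20 +
              ["Lexus"] * 3 + ["Bugatti"] * 2, name="brand")

freq = s.value_counts(normalize=True)     # share of each label in the data
rare_labels = freq[freq < 0.05].index     # labels below the chosen threshold

# Keep frequent labels as-is, replace the rest with a common "Rare" label.
encoded = s.where(~s.isin(rare_labels), "Rare")
print(encoded.value_counts())             # Tata 40, Honda 35, Jaguar 20, Rare 5
```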

Ordinal Encoding

  Introduction When we talk about encoding, one thing that usually comes to mind is: why can't we simply write down all the values of a variable in a list and assign them numbers 1, 2, 3, 4... and so on, just like we did in our childhood while playing? The answer is YES! We can do it, and in fact that is exactly what we are going to do here. Ordinal Encoding means encoding the categorical variables with ordinal numbers like 1, 2, 3, 4, etc. The numbers can either be assigned arbitrarily or be based on some quantity such as the target mean. Arbitrary Ordinal Encoding: the ordinal numbers are assigned to the categories arbitrarily. Mean Ordinal Encoding: the ordinal numbers are assigned to the categories based on their target mean value (just like we did in Mean/Target Encoding).
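A minimal sketch of both flavours in pandas; the column names ('size', 'sales') and the data are made up for illustration.

```python
# A minimal sketch of arbitrary and mean ordinal encoding (assumed toy data).
import pandas as pd

df = pd.DataFrame({
    "size":  ["S", "M", "L", "S", "L", "M", "L"],
    "sales": [10, 20, 35, 12, 40, 18, 30],     # target
})

# Arbitrary ordinal encoding: integer codes that carry no target information
# (pandas assigns them from the sorted category labels).
df["size_arbitrary"] = df["size"].astype("category").cat.codes

# Mean ordinal encoding: rank the categories by their target mean, then number them.
order = df.groupby("size")["sales"].mean().sort_values().index
mapping = {cat: rank for rank, cat in enumerate(order, start=1)}
df["size_mean_ordinal"] = df["size"].map(mapping)
print(df)
```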

One Hot Encoding

  Introduction One of the most famous, most talked-about and most common methods when it comes to categorical variable encoding is "One Hot Encoding". We have all seen or heard of this method somewhere in our DS journey, and it is often shown in Data Science and Machine Learning videos. So, what makes this technique so special that everyone likes it? One Hot Encoding means encoding each category of a categorical variable with a separate binary variable, i.e. 1s and 0s only, such that 1 represents that the value is present and 0 that it is absent. As many new columns are added as there are distinct values in the variable, each indicating whether that value is present or not. Let's have an example to understand it better.  Dummy One Hot Encoding
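A minimal sketch with pandas, assuming a made-up 'colour' column:

```python
# A minimal sketch of one hot encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({"colour": ["Red", "Blue", "Grey", "Red", "Grey"]})

# One new binary column per distinct value: 1 if present in that row, 0 otherwise.
one_hot = pd.get_dummies(df["colour"], prefix="colour", dtype=int)
print(pd.concat([df, one_hot], axis=1))
```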

Mean Encoding or Target Encoding

  Introduction A technique that is most commonly used anywhere and everywhere is the 'Mean'. The first thing that comes to the mind of a Data Scientist on seeing huge data is "calculate the mean". So, why not use the same technique here as well and try to encode our categorical variables using the mean? This technique of encoding a categorical variable with the mean is known as "Mean Encoding" or "Target Encoding". It is called Target Encoding because the mean for each value of the variable is calculated from the target values. Let's have an example to understand it better. Suppose we have a variable of car brands and another variable containing the mileage of the cars. If a car from Tata has a mileage of 50, the value Tata is encoded as 0.5; another car from Honda with a mileage of 30 is assigned/encoded as 0.3.  Dummy Mean Encoding
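A minimal sketch of the core step in pandas follows; the column names ('brand', 'mileage') echo the toy example above and the data is made up. The sketch keeps the raw per-brand target means, whereas the dummy example above additionally rescales them (50 becomes 0.5, 30 becomes 0.3).

```python
# A minimal sketch of mean / target encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({
    "brand":   ["Tata", "Honda", "Tata", "Honda", "Jaguar"],
    "mileage": [50, 30, 48, 32, 20],           # target
})

# Replace each brand with the mean of the target for that brand.
target_means = df.groupby("brand")["mileage"].mean()
df["brand_encoded"] = df["brand"].map(target_means)
print(df)
```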

Count Frequency Encoding

Introduction The first method that is most commonly used for Categorical Variable Encoding is "Count Frequency Encoding". This method replaces each category either with its count of occurrences or with its percentage share of the total. Let's see an example to understand it better: Dummy Count Frequency Encoding. Here we have created dummy data of 6 car companies and the colour of their most-sold car on the left-hand side, while on the right-hand side we can see the same list of cars but with the categorical variable, i.e. colour, encoded using the Count Frequency Encoder, both by count and by percentage. Since 2 companies, Tata and Jaguar, had Grey as their most sold colour, when encoding by count they both got the value 2, denoting that the value was repeated twice in the dataset and both had the same value.
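A minimal sketch of both variants in pandas; the companies and colours are made up to mirror the dummy example, not copied from it.

```python
# A minimal sketch of count frequency encoding with pandas (illustrative data).
import pandas as pd

df = pd.DataFrame({"company": ["Tata", "Jaguar", "Honda", "Lexus", "Kia", "MG"],
                   "colour":  ["Grey", "Grey", "Red", "Blue", "Red", "White"]})

counts = df["colour"].value_counts()                    # raw counts
shares = df["colour"].value_counts(normalize=True)      # percentage share

df["colour_count"] = df["colour"].map(counts)           # Grey -> 2, Blue -> 1, ...
df["colour_share"] = df["colour"].map(shares)           # Grey -> 0.333..., etc.
print(df)
```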

Variable Encoding

Introduction Computers are one of the best creations of human beings. They are so powerful and useful that what was once a luxury item has now become so common that it can be seen everywhere: in watches, cars, spaceships and more. They have become so ubiquitous that imagining life without them is like going back to the 'Stone Age'. These computerised systems may be great, but they have one serious limitation: they work only on Numerical Data, more specifically Binary Data, i.e. 1s and 0s. Yet the data we see around us can be Numerical, Alphabetical, Categorical, Visual, Audible and more. Now, coming to the point: whether it is Machine Learning, Data Science, Deep Learning or Artificial Intelligence, all of these work on data, i.e. they use data to deliver results. But as we know, datasets are, or can be, a mixture of Numerical, Alphabetical and Categorical data (let's ignore Audio and Visual data for now). Dealing with Numerical data is not an issue with computers

Encoding

  Welcome to another series of Quick Reads... This series of Quick Reads focuses on another major step in the process of Data Preprocessing, i.e. Variable Encoding. We will study everything, from what Variable Encoding is to which techniques we use, with their strengths and shortcomings, together with a practical demo, all in our series of Quick Reads. Trust us, when we say Quick Reads we truly mean teaching and explaining some heavy Data Science concepts in the time it takes to cook our 'Maggie'.

  INDEX
  1. What is Variable Encoding?
  2. Techniques used for Variable Encoding
      2.1 Count Frequency Encoding
      2.2 Mean/Target Encoding
      2.3 One Hot Encoding
      2.4 Ordinal Encoding
      2.5 Rare Label Encoding
      2.6 Decision Tree Encoding

Indexes in SQL

What are Indexes in SQL?  Indexes in SQL are like the index page of a book. The index holds information about the various chapters in the book along with their page numbers, so whenever we need a particular chapter we look it up in the index, take its page number and jump directly to that page. Similarly, in databases we can think of tables as huge books, with each row as a new chapter. Tables can be enormous, with lakhs and lakhs of rows, and finding a particular row not only becomes hard but, beyond a certain point, practically impossible. To overcome this, the simple concept of indexing is used: indexes help us find the exact row within seconds, no matter how huge the table is. Thus, by borrowing a simple concept from our books and applying it to our databases, we can significantly reduce the time a query takes to return the desired result set.
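As a small, hedged illustration, here is a sketch using Python's built-in sqlite3 module; the table ('books'), its columns and the data are made up.

```python
# A minimal sketch showing an index at work in SQLite (illustrative table and data).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, pages INTEGER)")
cur.executemany("INSERT INTO books (title, pages) VALUES (?, ?)",
                [(f"Chapter {i}", i * 10) for i in range(1, 10_001)])

# Without an index this filter scans the whole table;
# with the index the engine can jump straight to the matching rows.
cur.execute("CREATE INDEX idx_books_title ON books (title)")

cur.execute("EXPLAIN QUERY PLAN SELECT * FROM books WHERE title = ?", ("Chapter 42",))
print(cur.fetchall())   # the plan should mention 'USING INDEX idx_books_title'
conn.close()
```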

Windows Functions in SQL

Introduction Data and more data: all we see around us is a world completely surrounded by data, and whether it is huge or small doesn't matter much. Everywhere we look, data is running our world. Some legends have rightly said, "Data is the fuel of the future", and if we sit down and analyse the world around us right now, we can see how true those words are. From the mobile phones we use to televisions, roads, vehicles and so on, everything is using data, and using it extensively. From recommender systems to automated machines, everything consumes the data we once generated and now delivers services through it. But such a huge amount of data gives birth to many problems, like storage and analysis. Though we have traditional technologies like SQL and newer technologies like Hive, Spark and Pig to handle this Big Data, it's important to know how they work. Let me also tell you an amazing fact: these techs use SQL under the hood and deliver

Hive Default Partition

  Introduction We have already studied two different partitioning techniques in Hive, Dynamic and Static Partitioning. In Static Partitioning we define the partitions manually, whereas in Dynamic Partitioning the partitions are assigned dynamically based on some column's values. But if we look at and analyse real data, we know it is never pure; it always carries impurities: some values are missing, some are not formatted properly, some are simply random, and so on. So when we create partitions, a problem arises: where do we assign the NULL values (assuming we even created partitions for the wrong values)? Do we need to create a separate partition for them, and if so, what would its condition be? And in the case of dynamic partitioning, where we do not control the partitions, how are we going to deal with NULL data?
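For context, here is a minimal sketch (plain HQL carried in a Python string) of a dynamic-partition insert; the table and column names ('sales', 'staging_sales', 'sale_country') are assumptions for illustration. In Hive, rows whose partition column is NULL are routed to a special default partition whose name comes from hive.exec.default.partition.name (out of the box, __HIVE_DEFAULT_PARTITION__).

```python
# A minimal sketch of a dynamic-partition insert (assumed table/column names).
hql = """
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE sales PARTITION (sale_country)
SELECT id, amount, sale_country
FROM staging_sales;
-- rows with sale_country = NULL land in sale_country=__HIVE_DEFAULT_PARTITION__
"""
print(hql)
```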

Outliers Capping

Introduction In the past few articles we have looked at outliers: what they are, how they are introduced, and a few techniques for handling them in our dataset. Another widely used technique for handling outliers is capping the data. Capping means defining limits for a field. Capping is similar in spirit to trimming the dataset, but with a difference: while trimming, we used the IQR or z-score and dropped values that fell outside some IQR- or z-score-based limit, whereas here, instead of removing values from the dataset, we convert the outliers and pull them back within the limits, or range, of our data.
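A minimal sketch of IQR-based capping in pandas; the column name ('income'), the toy values and the 1.5 multiplier are illustrative assumptions.

```python
# A minimal sketch of capping outliers to IQR-based limits (illustrative data).
import pandas as pd

s = pd.Series([21, 23, 22, 25, 24, 26, 120, 23, 22, -40], name="income")

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Instead of dropping the outliers (trimming), pull them back to the limits.
capped = s.clip(lower=lower, upper=upper)
print(pd.concat({"original": s, "capped": capped}, axis=1))
```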

Internal VS External

Introduction Hive may not be the first term that pops into our mind when we talk about Hadoop and Big Data, but it is definitely a term, a tool, a tech that everyone discusses as they proceed on their Big Data journey, and never parts with afterwards. In simple words, if we want to explain what Hive is to a new data science enthusiast in a single line, we can say "Hive is the SQL for big data". Why? Because it is used to manage huge volumes of structured data, to query and analyse that data, and it sits on top of Hadoop. Hive is a data warehouse infrastructure used to process, query and analyse structured data in Hadoop; structured data being data with a definite structure, i.e. a table format. It is designed to feel similar to SQL, with a similar interface and similar queries. The difference is that, like the other Hadoop technologies in the ecosystem, Hive performs the required work through Map-Reduce: SQL-like queries in Hive, written in HQL (Hive Query Language), are executed as Map-Reduce jobs
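Since the post's title contrasts internal (managed) and external tables, here is a minimal sketch, again as HQL carried in Python strings, of how the two are created. The table names, columns and HDFS path are assumptions for illustration.

```python
# A minimal sketch contrasting a managed (internal) table with an external one
# (assumed table names, columns and HDFS path).
managed_table = """
CREATE TABLE employees_internal (
    id   INT,
    name STRING
)
STORED AS ORC;      -- data is owned by Hive; DROP TABLE deletes the data too
"""

external_table = """
CREATE EXTERNAL TABLE employees_external (
    id   INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/employees/';   -- data stays in HDFS; DROP TABLE removes only metadata
"""
print(managed_table, external_table)
```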