
Posts

Showing posts from August, 2021

Missing Indicator Imputation

Welcome back, friends! We are back with another imputation technique, one that is a bit different from the techniques we have studied so far and serves an important role that we have knowingly or unknowingly been skipping. We studied many techniques like Mean/Median, Arbitrary Value, CCA, Missing Category, End of Tail, and Random Sample. All of these were good enough to impute the missing values, but most of them failed to mark/flag the observations that had missing values and were imputed. Thus, we bring you the Missing Indicator technique, designed with the sole purpose of marking the observations that have (or had) a missing value. It is mostly used together with one of the previously covered imputation techniques. In simple terms, in this technique we use an additional column/variable to maintain a flag (a binary 0/1 or true/false value)…
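As a rough illustration of the idea, here is a minimal stdlib-only sketch (the function name is ours, not from the article): impute the missing entries and, in a second column, flag which observations were originally missing.

```python
def add_missing_indicator(values, fill_value):
    """Impute missing entries and flag which rows were originally missing.

    Returns (imputed, indicator), where indicator[i] is 1 if values[i]
    was missing and 0 otherwise.
    """
    indicator = [1 if v is None else 0 for v in values]
    imputed = [fill_value if v is None else v for v in values]
    return imputed, indicator

ages = [25, None, 31, None, 40]
imputed, flag = add_missing_indicator(ages, fill_value=31)
# imputed -> [25, 31, 31, 31, 40]
# flag    -> [0, 1, 0, 1, 0]
```

In practice the `fill_value` would come from one of the earlier techniques (mean, median, arbitrary value, and so on); the indicator column preserves the information that those rows were imputed.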

Random Sample Imputation

Till now we have seen techniques that were applicable either to numerical or to categorical variables, but not both. So we would like to make you familiar with a new technique that can easily be used for both numerical and categorical variables. Random Sample Imputation is widely used for both. Do not confuse it with Arbitrary Value Imputation; it may seem similar by name, but it is totally different. In terms of the principle used for imputation, it is closer to the Mean/Median/Mode Imputation techniques: like them, it preserves the statistical parameters of the original variable's distribution for the missing data. Now let's go ahead and look at the assumptions we need to keep in mind, the advantages and the limitations of this technique; after that we will get our hands dirty with some code.
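A minimal stdlib-only sketch of the principle (function name ours): each missing entry is replaced by a random draw from the observed values, which works identically for numbers and categories.

```python
import random

def random_sample_impute(values, seed=42):
    """Replace each missing entry with a random draw from the observed values.

    Works for numerical and categorical data alike; fixing the seed keeps
    the imputation reproducible.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

cities = ["Delhi", None, "Pune", "Delhi", None]
print(random_sample_impute(cities))
```

Because the fills are drawn from the variable's own observed values, the imputed column keeps roughly the same distribution as the original.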

Missing Category Imputation

Till now, we have seen imputation techniques that could only be used for numerical variables and said nothing about categorical variables/columns. So now we are going to discuss a technique that is mostly used for imputing categorical variables. Missing Category Imputation is the technique in which we add an additional category, such as "Missing", for the missing values in the variable/column. In simple terms, we do not take on the load of predicting or calculating a value (as we did for Mean/Median or End of Tail Imputation); we simply put "Missing" as the value. Now, a doubt may arise: if we are only replacing the value with "Missing", why is it said that this method is for categorical variables only? Here is the answer: we can use it for numerical variables too, but since we can't introduce a categorical value into a numerical variable/column, we would need to introduce some numerical value that is unique for the variable…
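The categorical case is simple enough to show in a few lines; a stdlib-only sketch (function name ours):

```python
def impute_missing_category(values, label="Missing"):
    """Replace missing entries in a categorical variable with an explicit label."""
    return [label if v is None else v for v in values]

colours = ["red", None, "blue", None]
print(impute_missing_category(colours))  # ['red', 'Missing', 'blue', 'Missing']
```

No prediction or calculation is involved; missingness itself simply becomes one more category.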

End of Tail Imputation

End of Tail Imputation is another important imputation technique. It was developed as an enhancement of, and to overcome the problems in, the Arbitrary Value Imputation technique. In Arbitrary Value Imputation the biggest problem was selecting the arbitrary value to impute for a variable; this made things hard for the user, and in the case of a large dataset it became even more difficult to select an arbitrary value every time, for every variable/column. So, to overcome this, End of Tail Imputation was introduced, where the value is selected automatically from the variable's distribution. Now the questions are: how do we select the value, and how do we impute it? There is a simple rule for selecting the value: if the variable is normally distributed, we use the mean plus/minus 3 times the standard deviation; if the variable is skewed, …
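A stdlib-only sketch of the selection rule (function name ours). The excerpt above is cut off before stating the rule for skewed variables, so the skewed branch below uses the common IQR convention (Q3 + 3 × IQR) as an assumption, not necessarily the article's exact rule:

```python
import statistics

def end_of_tail_value(values, skewed=False):
    """Pick the imputation value from the far end of the variable's distribution.

    Normal-ish variable: mean + 3 * standard deviation.
    Skewed variable:     Q3 + 3 * IQR (common convention; assumed here).
    """
    observed = [v for v in values if v is not None]
    if not skewed:
        return statistics.mean(observed) + 3 * statistics.stdev(observed)
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return q3 + 3 * (q3 - q1)

heights = [150, 160, None, 170, 180, 190]
fill_value = end_of_tail_value(heights)  # mean + 3 * stdev of the observed values
```

Once computed, `fill_value` is used exactly like an arbitrary value: every missing entry in the column receives it.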

Imputation Techniques

Welcome to a series of articles about Imputation techniques. We will be publishing small articles (Quick Notes) about the various imputation techniques, their advantages, disadvantages, when to use them, and the coding involved. Not sure what Imputation is, what Missing Data is, or why they are important? Click on the links to know more about them.

1. Mean or Median Imputation
2. End of Tail Imputation
3. Missing Category Imputation
4. Random Sample Imputation
5. Missing Indicator Imputation
6. Mode Imputation
7. Arbitrary Value Imputation
8. Complete Case Analysis (CCA)

Python libraries used for quick & easy imputation:

9. SimpleImputer
10. Feature Engine
11. Multi-Variate Imputation

Mean or Median Imputation

To understand Mean or Median Imputation, we first need to revise the concepts of mean and median. Then it will be easy to see why this is such a widely used imputation method, and to identify its issues. We already covered Missing Data and the basics of Imputation in previous articles. What is the Mean? The mean is simply the arithmetic average of a set of numbers, which is why it is also referred to as the Average or Arithmetic Average. Finding it is quite simple: we add all the given values (keeping their signs, + or −) and then divide the total by the number of observations.

    Average/Mean = (Sum of all the observations) / (Number of observations)
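The formula above translates directly into a short stdlib-only sketch of the technique (function name ours): compute the mean (or median) of the observed values and fill every missing entry with it.

```python
import statistics

def mean_median_impute(values, strategy="mean"):
    """Fill missing entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)   # sum of observations / count
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

print(mean_median_impute([2, None, 4]))                  # [2, 3, 4]
print(mean_median_impute([1, None, 3, 100], "median"))   # [1, 3, 3, 100]
```

The second call hints at why the median matters: with an outlier like 100 in the column, the median (3) is a far more typical fill value than the mean would be.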

Partitioning in HIVE - Learning by Doing

Partitioning in Hive: we studied the theory behind partitioning in Hive in our previous article. Time to get our hands dirty now. We will follow the pattern below for the coding part:

1. Hadoop installation.
2. Hive installation.
3. Static partitioning. (The theory is covered in the previous article.)
4. Dynamic partitioning. (The theory is covered in the previous article.)

By now, we should have Hadoop and Hive installed and running.

Partitioning in Hive

What is Partitioning? In simple words, partitioning is the process of dividing something into sections or parts, with the aim of making it easier to understand and manage. In our everyday routine, too, we use this concept to ease our tasks and save time, but we do it so naturally that we hardly notice how. Let's look at an example to get familiar with the concept. Suppose we have a deck of cards and need to fetch the "Jack of Spades". There are two ways to accomplish this task. We can turn over every card one by one, starting from the top or bottom, until we reach our card. Or we can group the deck by suit, i.e. clubs, hearts, spades, diamonds; then, as soon as we hear "Spades", we know which group to look in, cutting our work down to a quarter. This grouping of data by a specific category reduces our work and saves energy, time and effort. Defining it in technical terms…

Getting Acquainted with NewSQL

Introduction: NewSQL is a relatively new database management system. It is so nascent that it is still not listed as a proper DBMS, because its rules and conventions are still unclear. To understand NewSQL, we need to know why a new database was needed when we already had two great and successful database families (SQL & NoSQL). SQL is the most widely used and most preferred database of all time; its ACID properties set it apart and rank it above the others. NoSQL is another rising database family that has recently gained the limelight due to the rise of big data technologies and the need to store enormous volumes of documents/data coming from different sources. We can read more about the difference between the two in SQL VS NoSQL.

SQL VS NoSQL

What is SQL? Structured Query Language, better known as SQL, is one of the most commonly used querying languages in the world. Ever since its inception, demand for SQL has kept growing, and it has found its way everywhere from small startups to mammoth companies. One of the biggest factors in its popularity is that much of the SQL software on the market is open-source and easy to grasp and install. What is NoSQL? Non-SQL, or NoSQL, is used for querying data outside traditional SQL databases, thanks to its distributed architecture. NoSQL is also known as "Not Only SQL" because it can still store the data a traditional SQL database stores; only the way of storing differs. NoSQL databases also differ in their data models: the most common are document, key-value, column, and graph. Let's dive deeper and try to find the dissimilarities between the two.

ExploriPy -- Newer ways to Exploratory Data Analysis

Introduction: ExploriPy is yet another Python library for Exploratory Data Analysis. It caught our attention because it is quick and easy to implement, and its basics are simple to grasp. Moreover, the visuals it produces are self-explanatory and graspable by any new user. The most interesting part, which we can't resist mentioning, is the easy grouping of variables into different sections; this makes our data more straightforward to understand and analyze. The four major sections presented are:

Null Values
Categorical VS Target
Continuous VS Target
Continuous VS Continuous

The Explorer of Data Sets -- Dora

Exploring a dataset is both fun and tedious, but it is an inevitable step of the machine learning journey. The challenge is always the correctness, completeness and timeliness of the analysis. To overcome these issues a lot of libraries exist, each with its own advantages and disadvantages. We have already discussed a few of them (Pandas profiling, dtale, autoviz, lux, sweetviz) in previous articles. Today we would like to present a new library for Exploratory Data Analysis: Dora. Calling it only an EDA library would not do it justice, as it not only helps explore the dataset but also helps adjust the data for modelling purposes.