
Posts

Showing posts with the label outliers

Feature Scaling -- Standardization

In our previous article we had an overview of Feature Scaling: what it is and how we can use it to our benefit. Having covered the theory, let's move ahead and look at the various ways to achieve Feature Scaling and how we can implement them.  The first and most important technique is "Standardization", also known as "Z-Score Normalization".  The basic idea of this technique is to subtract the mean from each value and divide the result by the standard deviation. Doing so centres the data around its mean with unit standard deviation.  Formula used: z = (x − μ) / σ, where μ is the mean and σ is the standard deviation. This formula is also known as the Z-score, hence the name Z-Score Normalization.  What is a Z-score? The internet defines it as: "A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values…"
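To make this concrete, here is a minimal sketch of standardization in Python; the column names and sample values are made up for illustration and are not from the article's dataset:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; any numeric columns would do
df = pd.DataFrame({"age": [22, 35, 58, 41, 27],
                   "salary": [21000, 55000, 98000, 64000, 30000]})

# Manual standardization: z = (x - mean) / std
z_manual = (df - df.mean()) / df.std(ddof=0)

# The same transformation with scikit-learn's StandardScaler
scaler = StandardScaler()
z_sklearn = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Both give columns with (approximately) zero mean and unit standard deviation
print(z_sklearn.mean().round(3))
print(z_sklearn.std(ddof=0).round(3))
```

Either route works; the scikit-learn scaler is handy when the same transformation has to be reapplied to new data via `scaler.transform`.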

Outliers Capping

Introduction  In the past few articles we have looked at outliers: what they are, how they are introduced, and a few techniques for handling them in our dataset.  Another widely used technique for handling outliers is capping the data. Capping means defining limits for a field.  Capping is similar to trimming the dataset, but with one difference: while trimming we used an IQR or z-score threshold and removed the data that fell outside it, here, instead of trimming or removing values from the dataset, we convert the outliers and bring them back within the limits (range) of our data.
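As a rough illustration, here is a minimal sketch of IQR-based capping in Python; the column name, sample values and the 1.5 × IQR rule are assumptions for the example, not necessarily what the full article uses:

```python
import pandas as pd

# Hypothetical data with one obvious outlier
df = pd.DataFrame({"salary": [21000, 25000, 27000, 30000, 32000, 250000]})

q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Instead of dropping the outlying row, pull it back to the nearest limit
df["salary_capped"] = df["salary"].clip(lower=lower, upper=upper)
print(df)
```

Note that the row count stays the same; only the extreme value is moved to the boundary.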

Trimming the Outliers

Introduction  The grass is not as green as it looks from the outside when it comes to Machine Learning or Data Science. Designing a perfectly hypothesised model is rarely straightforward, not because ML is not powerful, but because a long, tedious and repetitive round of cleaning, analysing and polishing the dataset is involved.  One thing we need to take care of in this cleaning and improving process is "the outliers". The term sounds simple, but outliers make the data troublesome to handle once they creep into it.  Still unaware of outliers, how they are introduced and how to identify them? Read it here > Mystery of Outliers <   Let's begin with the first technique to handle outliers: trimming, i.e. removing the outlying records entirely, as sketched below.
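A minimal sketch of trimming with the IQR rule follows; the column name, values and 1.5 × IQR fences are illustrative assumptions rather than the article's exact example:

```python
import pandas as pd

# Hypothetical data with a single extreme value
df = pd.DataFrame({"height_cm": [150, 160, 165, 170, 172, 175, 240]})

q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Trimming removes the offending rows entirely (contrast with capping, which keeps them)
trimmed = df[df["height_cm"].between(lower, upper)]
print(trimmed)
```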

Outliers

Introduction  Machine Learning, Data Science, Data Analytics and the like are terms that are all the hype in the current world, and every individual is drawn towards these fancy fields, not only because there is high demand for these technologies but also because of the things we can achieve with them.  Data, the next-generation fuel for industries, has seen a huge surge in importance in the past few decades, because with data we can achieve all the super-intelligence kind of stuff: knowing our customers better at scale, predicting future events, and building intelligent systems.  Thus, as we harness the power of data, more and more industries are trying to capture as much data as they can to enhance their products and services. Hence, the demand for technologies and jobs dealing with data is on the rise, and this rising demand is attracting more and more individuals.  But with rising new ways to capture data…

EDA Techniques

We had a look at the basics of EDA in our previous article, EDA - Exploratory Data Analysis. Now let's move ahead and look at how we can automate the process and at the various APIs available for it. We will focus on 7 major libraries; these are our personal favourites and the ones we prefer to use most of the time.  For each library we will cover the install, load, and analyse parts separately: D-tale, Pandas-Profiling, Lux, Sweetviz, Autoviz, ExploriPy, Dora.
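As a taste of what the load-and-analyse step looks like, here is a minimal sketch with Pandas-Profiling (the package has since been renamed ydata-profiling); the file names are placeholders, not from the article:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # newer versions: from ydata_profiling import ProfileReport

df = pd.read_csv("data.csv")                     # placeholder dataset
profile = ProfileReport(df, title="EDA Report")  # one-line automated EDA
profile.to_file("eda_report.html")               # self-contained HTML report
```

The other libraries follow a similar pattern: load a DataFrame, hand it to the library, and get an interactive report or widget back.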

Defining, Analyzing, and Implementing Imputation Techniques

  What is Imputation? Imputation is a technique for replacing missing data with some substitute value so as to retain most of the data/information in the dataset. These techniques are used because removing data from the dataset every time is not feasible: it can reduce the size of the dataset to a large extent, which not only raises concerns about biasing the dataset but can also lead to incorrect analysis. [Fig 1: Imputation] Not sure what missing data is, how it occurs, and what its types are? Have a look HERE to know more about it. Let's understand the concept of Imputation from Fig 1 above. In that image, the missing data is shown in the left table (marked in red), and by using imputation techniques we have filled in the missing values in the right table (marked in yellow), without reducing the actual size of the dataset. Notice that we have increased the column count, which is possible in imputation (adding a "Missing" category imputation)…
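As a rough sketch of the idea, assuming a tiny made-up table (not the one from Fig 1), mean imputation for a numeric column and a "Missing" category for a categorical column could look like this:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with missing values in a numeric and a categorical column
df = pd.DataFrame({"age": [25, None, 31, 40, None],
                   "city": ["Pune", "Delhi", None, "Delhi", "Mumbai"]})

# Numeric column: fill missing values with the column mean
df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()

# Categorical column: add an explicit "Missing" category instead of dropping rows
df["city"] = df["city"].fillna("Missing")
print(df)
```

Every row is kept, and no information is discarded; the trade-off is that the substituted values are estimates, not observations.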

Missing Data -- Understanding The Concepts

  Introduction  Machine Learning seems like a big, fascinating term that attracts a lot of people towards it, and knowing all that we can achieve through it makes our sci-fi imagination jump to another level. No doubt it is a great field: we can achieve everything from automated reply systems to house-cleaning robots, from recommending a movie or a product to helping detect disease. Most of the things we see today have already started using ML to better themselves. Though building a model is quite easy, the most challenging task is preprocessing the data and filtering out the data of use. So here I am going to address one of the biggest and most common issues we face at the start of the journey of making a good ML model, which is The Missing Data. Missing data can cause many issues and can lead to wrong predictions from our model, which makes it look like the model failed and has to be started over. If I have to explain it in simple terms, data is like the fuel of our model…