
Posts

Showing posts with the label preprocessing

Trimming the Outliers

Introduction  The grass is not always as green on the inside as it looks from the outside when it comes to Machine Learning or Data Science. Arriving at a well-performing model is rarely straightforward, not because ML is not powerful, but because a long, tedious, and repetitive process of cleaning, analysing, and polishing the dataset is involved.  One thing we need to take care of during this cleaning and improving process is "the outliers". The term sounds simple, but outliers make the data troublesome to handle once they creep into it.  Still unaware of what outliers are, how they are introduced, and how to identify them? Read about it here: > Mystery of Outliers <  Let's begin with the first technique to handle outliers.
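As a minimal sketch of trimming, one common rule (the IQR rule, used here purely as an illustration; the post's own technique may differ) drops any point outside 1.5 interquartile ranges of the middle 50% of the data:

```python
import pandas as pd

# Illustrative dataset: mostly moderate values plus two obvious outliers.
df = pd.DataFrame({"salary": [32, 35, 38, 40, 41, 43, 45, 47, 500, 620]})

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["salary"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

trimmed = df[(df["salary"] >= lower) & (df["salary"] <= upper)]
print(trimmed["salary"].tolist())  # the two extreme values are dropped
```

Trimming discards the flagged rows outright; capping (clipping values to the bounds instead of dropping them) is a gentler alternative when every row must be kept.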

Outliers

Introduction  Machine Learning, Data Science, Data Analytics, and similar terms are all the hype in the current world, and individuals are drawn toward these fancy fields not only because the technologies are in high demand but also because of what we can achieve with them.  Data, the next-generation fuel for industries, has seen a huge surge in importance over the past few decades, because with data we can achieve all the "super-intelligence" kind of capabilities: knowing our customers better at scale, predicting future events, and building intelligent systems.  Thus, as we harness the power of data, more and more industries are trying to capture as much data as they can to enhance their products and services. Hence, the demand for technologies and jobs dealing with data is rising, and this rising demand is attracting more and more individuals. But with rising new ways to capture data, i…

ExploriPy -- Newer ways to Exploratory Data Analysis

Introduction  ExploriPy is yet another Python library for Exploratory Data Analysis. This library caught our attention because it is quick and easy to implement, and its basics are simple to grasp. Moreover, the visuals it provides are self-explanatory and graspable by any new user.  The part we can't resist mentioning is the easy grouping of variables into different sections, which makes our data more straightforward to understand and analyze. The four major sections presented are:

- Null Values
- Categorical vs Target
- Continuous vs Target
- Continuous vs Continuous
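To make the grouping idea concrete, here is a rough, library-agnostic sketch in plain pandas (this is not ExploriPy's actual API, and the column names and target are made up for illustration) that sorts columns into the same kinds of sections:

```python
import pandas as pd
import numpy as np

# Toy dataset with a mix of nulls, categorical and continuous columns.
df = pd.DataFrame({
    "age": [25, 32, None, 47],
    "city": ["NY", "LA", "NY", None],
    "income": [50.0, 64.5, 58.2, 71.0],
    "churn": [0, 1, 0, 1],  # hypothetical target column
})

target = "churn"

# Section 1: columns with null values.
null_counts = df.isnull().sum()

# Sections 2-3: categorical vs target, continuous vs target.
categorical = [c for c in df.columns
               if df[c].dtype == object and c != target]
continuous = [c for c in df.select_dtypes(include=np.number).columns
              if c != target]

print("Null values:\n", null_counts[null_counts > 0])
print("Categorical vs Target:", [(c, target) for c in categorical])
print("Continuous vs Target:", [(c, target) for c in continuous])
```

ExploriPy renders each of these sections as ready-made visuals in its report; the sketch above only shows how the variables get bucketed.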

Automatic Visualization with AutoViz

We have discussed Exploratory Data Analysis (EDA) and have also seen a few powerful libraries we can use extensively for it. EDA is a key step in Machine Learning, as it provides the starting point for our Machine Learning task. But there are plenty of issues with traditional data-analysis techniques, and many new libraries are coming onto the market to rectify them. One such API is AutoViz, which provides quick and easy visualization along with some insights about the data.

A Sweet Way to Exploratory Data Analysis --- Sweetviz

Another day, another beautiful library for Exploratory Data Analysis (EDA). Having studied some great EDA libraries like Lux, D-Tale, and pandas-profiling, we are back with another great API, 'SWEETVIZ', which you can use for your Data Science project. Introduction  It is an open-source Python library, still in the development phase, that already has some great features to offer, which makes it our choice to bring to you. Its sole purpose is to visualise and analyse data quickly. The best feature of this API is the option to compare two datasets, i.e. we can compare and analyse the test vs training data together. And that's not all; it's just the start. Let's dive deeper and see what more it has to offer us.
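Sweetviz produces its comparison as an interactive HTML report; as a minimal stand-in (plain pandas, not Sweetviz's API, with synthetic data made up for illustration), the idea of putting train and test distributions side by side can be sketched as:

```python
import pandas as pd
import numpy as np

# Synthetic train/test splits with slightly different distributions.
rng = np.random.default_rng(0)
train = pd.DataFrame({"age": rng.normal(40, 10, 200)})
test = pd.DataFrame({"age": rng.normal(42, 12, 100)})

# Side-by-side summary statistics, analogous to Sweetviz's paired panels.
comparison = pd.concat(
    {"train": train["age"].describe(), "test": test["age"].describe()},
    axis=1,
)
print(comparison.round(2))
```

Seeing the two splits in one table makes distribution drift between train and test immediately visible, which is exactly the insight the Sweetviz comparison report automates.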

D-Tale -- One Stop Solution for EDA

D-Tale is a recently launched (Feb 2020) tool for Exploratory Data Analysis. Built with Flask (back-end) and React (front-end), it provides a powerful analysing and visualizing tool.  D-Tale is a Graphical User Interface platform that is not only quick and easy to understand but also great fun to use. It comes packed and loaded with features that reduce the manual work Data Engineers/Scientists put into analysing and understanding the data, and it removes the load of hunting for the many different libraries used in EDA.  Let's have a look at some features that make it so amazing:
1. Seamless Integration -- D-Tale integrates seamlessly with multiple Python/IPython notebooks and terminals, so we can use it with almost any IDE of our choice.
2. Friendly UI -- The Graphical User Interface provided by D-Tale is simple and easy to understand, so anybody can get comfortable with it and start working right away.
3. Support of multiple Py…

Defining, Analyzing, and Implementing Imputation Techniques

What is Imputation?  Imputation is a technique for replacing missing data with a substitute value so as to retain most of the data/information in the dataset. These techniques are used because removing rows every time data is missing is not feasible: it can shrink the dataset to a large extent, which not only raises concerns about biasing the dataset but also leads to incorrect analysis. Fig 1: Imputation.  Not sure what missing data is, how it occurs, and what its types are? Have a look HERE to know more about it. Let's understand the concept of imputation from Fig 1. In the image, the missing data is represented in the left table (marked in red), and by using imputation techniques we have filled in the missing values in the right table (marked in yellow), without reducing the actual size of the dataset. Notice that we have increased the column count, which is possible in imputation (adding a "Missing" category).
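A minimal sketch of two of these techniques in pandas (the column names and values are made up for illustration): mean imputation for a numerical column, and the "Missing" extra-category approach mentioned above for a categorical one:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "city": ["NY", None, "LA", "NY"],
})

# Numerical column: replace NaN with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical column: add an explicit "Missing" category instead of
# dropping the row, so the dataset keeps its original size.
df["city"] = df["city"].fillna("Missing")

print(df)
```

Both fills keep all four rows, whereas dropping incomplete rows would have discarded a quarter of this tiny dataset.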