
Posts

Showing posts from September, 2021

Trimming the Outliers

Introduction  When it comes to Machine Learning or Data Science, the grass is not as green as it looks from outside. A perfect hypothesised model is rarely achievable, not because ML lacks power, but because building one involves long, tedious and repetitive work of cleaning, analysing and polishing the dataset.  One thing we need to take care of during this cleaning and improving process is "the outliers". The term has a simple meaning, but outliers make data troublesome to handle once they are introduced into it.  Still unaware of what outliers are, how they are introduced and how to identify them? Read about it here: Mystery of Outliers.  Let's begin with the first technique to handle outliers.

Outliers

Introduction  Machine Learning, Data Science, Data Analytics and the like are terms that are in vogue right now, and many people are drawn toward these fields, not only because the technologies are in high demand but also because of what can be achieved with them.  Data, the next-generation fuel for industries, has seen a huge surge in importance over the past few decades, because it makes possible things such as knowing our customers better at scale, predicting future events and building intelligent systems.  As we learn to harness the power of data, more and more industries are trying to capture as much of it as they can to enhance their products and services. Hence, the demand for technologies and jobs dealing with data is on the rise, and this rising demand is attracting more and more individuals.  But with rising new ways to capture data, i

Multi-variate Imputation of Chained Equation

We have already studied many techniques for Missing Data Imputation. The majority of the techniques we studied are used, or can be used, in a final production-ready model. But with imputation there is always room for improvement, because we can never be sure whether the imputed values are correct. Thus, to improve the imputation, we use Multiple Imputation, i.e. using more than one way to predict the values and then taking the average, or using some other method, to get the most suitable value.  We have already seen a technique using similar logic, KNN Imputation, which uses the K-Nearest Neighbour algorithm to find the most suitable value. These techniques are better known as "Multi-Variate Imputation". Now, we would like to introduce you to a newer and better technique, which has become a principal technique for Missing Data Imputation, known as MICE (Multi-variate Imputation of Chained Equation).  Multi-variate Imputation of Chained Equa
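The multiple-imputation idea in the excerpt above (predict a missing value in more than one way, then average the predictions) can be sketched roughly as follows. This is only a minimal illustration and not MICE itself: the function name `impute_by_averaging`, the choice of mean and median as the two estimators, and the sample `ages` list are all assumptions made for the example.

```python
from statistics import mean, median


def impute_by_averaging(values):
    """Fill None entries with the average of several simple estimates.

    Hypothetical helper: mean and median stand in for the "more than
    one way to predict the values" mentioned in the excerpt.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("need at least one observed value")

    # Two independent estimates of the missing value.
    estimates = [mean(observed), median(observed)]

    # Combine the estimates by averaging them.
    fill = mean(estimates)

    return [fill if v is None else v for v in values]


# Made-up sample data with one missing entry.
ages = [20, 22, None, 30, 48]
print(impute_by_averaging(ages))  # [20, 22, 28.0, 30, 48]
```

Here the observed mean is 30 and the observed median is 26, so the missing entry is filled with their average, 28. A chained-equations approach would instead model each incomplete column from the other columns and repeat the process over several rounds.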

HDFS Commands Cheat Sheet

HDFS is the main hub of the Hadoop ecosystem, responsible for storing large data sets, both structured and unstructured, across various nodes and for maintaining the metadata in the form of log files. To work with such a system, we need to be well versed in, or at least aware of, the common commands and processes that ease our task. To that end, we have consolidated some of the most commonly used HDFS commands that one should know to work with HDFS. To begin with, we need to go through the checklist below.  1. Install Hadoop 2. Run Hadoop -- we can use the 'start-all.cmd' command or start directly from the Hadoop directory.  3. Verify Hadoop services -- we can check that Hadoop is up and running using the command below: jps  Great! Now we are ready to execute and learn the commands.  **Note: These commands are case-sensitive. Take special care with capital and small letters while writing the commands. 1. version -- this command is used to know the versi