Showing posts with the label data handling

EDA Techniques

We covered the basics of EDA in our previous article, EDA - Exploratory Data Analysis. Now let's move ahead and look at how we can automate the process and at the various APIs available for it. We will focus on seven major libraries: our personal favourites, which we prefer to use most of the time. For each library we will cover the install, load, and analyse steps separately: D-Tale, Pandas-Profiling, Lux, Sweetviz, AutoViz, ExploriPy, and Dora.
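Before diving into the individual libraries, it helps to see what they all automate. A minimal sketch of the kind of overview these tools generate in one click, using plain pandas (the dataset here is a toy stand-in, and the Pandas-Profiling one-liner in the comment assumes that package is installed):

```python
import pandas as pd

# A toy dataset standing in for whatever file you are exploring.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Pune", "Delhi", "Pune", None],
})

# The kind of summary the EDA libraries below produce automatically:
overview = {
    "shape": df.shape,
    "dtypes": df.dtypes.astype(str).to_dict(),
    "missing": df.isna().sum().to_dict(),
}
print(overview)

# With Pandas-Profiling (assumes the package is installed), the rough
# equivalent is a single call that renders a full HTML report:
#     from pandas_profiling import ProfileReport
#     ProfileReport(df).to_file("report.html")
```

Each of the seven libraries wraps this sort of summary (and much more) behind one or two calls.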

D-Tale -- One Stop Solution for EDA

D-Tale is a recently launched (Feb 2020) tool for Exploratory Data Analysis. It is built with Flask (back-end) and React (front-end), providing a powerful analysis and visualisation tool. D-Tale is a Graphical User Interface platform that is not only quick and easy to understand but also great fun to use. It comes packed with so many features that it reduces the manual work Data Engineers/Scientists spend analysing and understanding data, and removes the burden of hunting for multiple different EDA libraries. Let's have a look at some features that make it so amazing: 1. Seamless Integration: D-Tale integrates seamlessly with python/ipython notebooks and terminals, so we can use it with almost any IDE of our choice. 2. Friendly UI: the Graphical User Interface provided by D-Tale is simple and easy to understand, so anybody can get comfortable with it and start working right away. 3. Support of multiple Py…
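To make the "quick and easy" claim concrete, a minimal sketch of wrapping a DataFrame in D-Tale (the launch calls sit in comments because they assume the `dtale` package is installed and start a local Flask server):

```python
import pandas as pd

# Any DataFrame works; D-Tale wraps it in a browser-based grid UI.
df = pd.DataFrame({"sales": [120, 340, 90], "region": ["N", "S", "E"]})
print(df.shape)

# Launching the UI (assumes `dtale` is installed; starts a local server):
#     import dtale
#     d = dtale.show(df)   # returns a handle to the running instance
#     d.open_browser()     # opens the grid in your default browser
```

From that grid you can sort, filter, chart, and describe columns without writing any further code.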

SQL --- Structured Query Language

What is SQL? Structured Query Language, also known as SQL, is the database language and one of the most famous and in-demand technologies. It was developed specifically for database management, i.e. creating a database, inserting and updating records, managing access, and retrieving data. SQL is mostly used with Relational Database Management Systems. Its demand increases every single day: as data grows, so do the demand and need for SQL. It is used by web developers, data analysts, data engineers, and in every other field where we need to store and retrieve data. One of the main reasons SQL is gaining popularity is that it is simple, easy, quick, and powerful. Another reason is that the most commonly used version of SQL (MySQL) is open-source (free). Yet another great feature of SQL is that it is a non-procedural language (explained in the next section).
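The create/insert/retrieve cycle described above can be sketched end-to-end with Python's built-in `sqlite3` module (an in-memory database here; the same SQL statements work on MySQL and other RDBMSs, and the toy table and names are illustrative only):

```python
import sqlite3

# In-memory database; the identical SQL works on MySQL, PostgreSQL, etc.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table, insert records, then retrieve data from it.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Asha", "Pune"), ("Ravi", "Delhi")],
)
conn.commit()

# Non-procedural: we declare WHAT we want, not HOW to fetch it.
rows = cur.execute("SELECT name FROM users WHERE city = 'Pune'").fetchall()
print(rows)
conn.close()
```

Note how the `SELECT` states only the desired result; the database engine decides the retrieval strategy, which is what "non-procedural" means here.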

Defining, Analyzing, and Implementing Imputation Techniques

What is Imputation? Imputation is a technique for replacing missing data with a substitute value, so as to retain most of the data/information in the dataset. These techniques are used because removing data from the dataset every time is not feasible and can reduce the size of the dataset to a large extent, which not only raises concerns about biasing the dataset but also leads to incorrect analysis. Fig 1: Imputation. Not sure what missing data is, how it occurs, and its types? Have a look HERE to know more about it. Let's understand the concept of imputation from Fig 1 above. In the image, I have tried to represent the missing data in the left table (marked in red); using imputation techniques we have filled the missing values in the right table (marked in yellow), without reducing the actual size of the dataset. Notice that we have increased the column count, which is possible in imputation (adding a "Missing" category imputation).
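The two fills described above — a substitute value for a numeric column and an explicit "Missing" category for a categorical one — can be sketched with pandas (the toy columns are illustrative, not from the figure's actual data):

```python
import numpy as np
import pandas as pd

# Toy dataset with gaps in a numeric and a categorical column.
df = pd.DataFrame({
    "salary": [30000.0, np.nan, 45000.0, 52000.0],
    "dept": ["HR", "IT", None, "IT"],
})

# Numeric column: replace missing values with the column mean.
df["salary"] = df["salary"].fillna(df["salary"].mean())

# Categorical column: add an explicit "Missing" category, as in Fig 1.
df["dept"] = df["dept"].fillna("Missing")

print(df)
```

Both fills keep all four rows, so the dataset's size is preserved exactly as the figure illustrates.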

Missing Data -- Understanding The Concepts

Introduction. Machine Learning seems to be a big, fascinating term that attracts a lot of people, and knowing all we can achieve through it makes our sci-fi imagination jump to another level. No doubt it is a great field: we can achieve everything from an automated reply system to a house-cleaning robot, from recommending a movie or a product to helping detect disease. Most of the things we see today have already started using ML to better themselves. Though building a model is quite easy, the most challenging task is preprocessing the data and filtering out the data of use. So, here I am going to address one of the biggest and most common issues we face at the start of the journey of making a good ML model: Missing Data. Missing data can cause many issues and can lead to wrong predictions from our model, making it look like our model failed and we have to start over again. If I have to explain it in simple terms, data is like the Fuel of our Mo…

Spark — How to install in 5 Steps in Windows 10

An easy-to-follow guide for installing Spark on Windows 10. Image taken from Google Images.

1. Prerequisites

Hardware Requirement
* RAM: min. 8 GB; if you have an SSD in your system, 4 GB RAM will also work.
* CPU: min. quad-core, with at least 1.80 GHz.

Software Requirement
* JRE 1.8 (offline installer for JRE)
* Java Development Kit 1.8
* Software for un-zipping, like 7Zip or WinRar
* I will be using 64-bit Windows for the process; please check and download the version (x86 or x64) supported by your system for all the software.
* MySQL Query Browser

Hadoop
* I am using Hadoop-2.9.2; you can also use any other STABLE version of Hadoop.
* If you don't have Hadoop, you can refer to installing it from Hadoop: How to install in 5 Steps in Windows 10.

Download Spark Zip
* I am using Spark 3.1.1; you can also use any other STABLE version of Spark.
* The latest release of Spark is 3.1.2 (shown in the image below), released in June '21.

Fig 1: Download Spark-3.1.2
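Spark and Hadoop installs usually rely on a few environment variables being set before anything will launch. A small, hypothetical helper to check which conventional ones are still missing (the variable names `JAVA_HOME`, `HADOOP_HOME`, and `SPARK_HOME` are the customary ones for these installs; the helper itself is not part of any Spark tooling):

```python
import os

# Conventional environment variables a Spark-on-Windows setup relies on.
REQUIRED = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")

def missing_prereqs(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]

print(missing_prereqs())
```

Running this after each install step gives a quick sanity check that the step actually registered with the environment.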