Exploring a dataset is both fun and tedious, but it is an unavoidable step in any machine learning project. The challenge is always to analyze the data correctly, completely, and on time.
Many libraries exist to address these challenges, each with its own advantages and disadvantages. We have already discussed a few of them (Pandas profiling, dtale, autoviz, lux, sweetviz) in previous articles. Today, we would like to present another library for Exploratory Data Analysis: Dora.
Calling it only an EDA library would not do it justice, as it not only helps explore the dataset but also helps prepare the data for modelling.
Introduction
Dora is an open-source library, so it is freely available to everyone. Unlike the other libraries we discussed earlier, Dora offers more than dataset exploration and visualization.
Features offered by Dora:
- Read Configuration & Data
- Clean the Data
- Feature Selection & Extraction
- Data Visualization
- Validating Model
- Data Versioning
There is a lot to cover, so let's dive straight into the installation.
Installation
We will be using Jupyter Notebook throughout; you may use any other IDE of your choice.
## conda installation
conda install dora
## pip installation
pip install Dora
## Jupyter Notebook installation (the leading "!" runs the command in the shell from a notebook cell)
!pip install Dora
Let's grab a coffee while it installs.
Once the library is installed, we might be asked to restart the kernel (in the case of Jupyter Notebook) for the changes to take effect.
Great, we have installed the Dora library and are good to go with some examples that will help us understand it better.
Getting Started
Read Configuration & Data
Reading Data Directly from Dora
Reading data using Pandas
Cleaning Data
To get clean data ready for our machine learning model, we need to perform two basic steps: impute the missing data and scale the data. Both are handled well by this library.
Creating Dummy Data
Imputing Missing Values
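To make the idea concrete, here is a minimal plain-Python sketch of imputation, assuming the common strategy of filling each gap with its column mean. It is not Dora's implementation; the helper name `impute_with_column_means` and the sample rows are hypothetical and chosen only for illustration.

```python
from statistics import mean


def impute_with_column_means(rows):
    """Return a copy of rows with every None replaced by its column mean."""
    n_cols = len(rows[0])
    # Means are computed over the observed (non-None) values only.
    # A column with no observed values would raise StatisticsError here.
    col_means = [
        mean(row[c] for row in rows if row[c] is not None)
        for c in range(n_cols)
    ]
    return [
        [col_means[c] if value is None else value for c, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical dummy data with two missing entries.
data = [
    [1, 2, 100],
    [2, None, 200],
    [1, 6, None],
]
imputed = impute_with_column_means(data)
print(imputed)
```

The missing value in the second column is filled with the mean of 2 and 6, and the one in the third column with the mean of 100 and 200. The original `data` list is left unchanged.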
Scaling the inputs
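Scaling can be sketched the same way. The example below assumes standardization (each input column shifted to mean 0 and divided by its population standard deviation) and leaves the output column untouched; whether Dora uses exactly this formula is an assumption. The helper name `standardize_inputs` and the `output_index` parameter are hypothetical.

```python
from statistics import mean, pstdev


def standardize_inputs(rows, output_index):
    """Standardize every column except output_index to mean 0 and std 1."""
    n_cols = len(rows[0])
    scaled = [list(row) for row in rows]  # copy so the input is not mutated
    for c in range(n_cols):
        if c == output_index:
            continue  # the target column is not an input, so leave it alone
        column = [row[c] for row in rows]
        mu, sigma = mean(column), pstdev(column)
        for r, value in enumerate(column):
            # A constant column has sigma == 0; map it to 0.0 instead of dividing by zero.
            scaled[r][c] = (value - mu) / sigma if sigma else 0.0
    return scaled


# Hypothetical data: column 0 is the output, columns 1 and 2 are inputs.
data = [
    [1, 2, 100],
    [2, 4, 200],
    [1, 6, 150],
]
scaled = standardize_inputs(data, output_index=0)
for row in scaled:
    print(row)
```

After scaling, the two input columns are on the same scale even though their raw values differ by two orders of magnitude, which is what many models expect.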
Feature Selection & Extraction
Feature selection is the process of keeping the most important features and discarding the less important ones.
Removing a Feature
One Hot Encoding
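As a rough illustration of what one-hot encoding does, the plain-Python sketch below replaces a categorical column with one 0/1 column per category. It is not Dora's code: the helper name `one_hot_encode`, the `column=category` naming of the new columns, and the sample rows are all assumptions made for the example.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each dict row with one 0/1 column per category."""
    # Sorting makes the order of the new columns deterministic.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with a categorical column "D".
rows = [
    {"A": 1, "D": "left"},
    {"A": 4, "D": "right"},
    {"A": 7, "D": "left"},
]
encoded = one_hot_encode(rows, "D")
for row in encoded:
    print(row)
```

Each row ends up with exactly one of the new columns set to 1, so a model that only accepts numeric inputs can still use the categorical information.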
Creating new variable
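Creating a new variable usually means deriving a column from an existing one with a small function. The sketch below shows the idea in plain Python; the helper `add_derived_feature`, its parameters, and the column names are hypothetical and do not reflect Dora's own signature.

```python
def add_derived_feature(rows, source, new_name, mapper):
    """Return copies of the dict rows with new_name = mapper(row[source]) added."""
    return [{**row, new_name: mapper(row[source])} for row in rows]


# Hypothetical rows; the new column "twoC" doubles the existing column "C".
rows = [{"A": 1, "C": 0}, {"A": 4, "C": 1}, {"A": 7, "C": 0}]
with_new = add_derived_feature(rows, "C", "twoC", lambda x: x * 2)
for row in with_new:
    print(row)
```

Passing the transformation as a function keeps the helper generic: the same call works for a log transform, a bucketed value, or any other per-row computation.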
Data Visualization
We may also use the dora.explore() command to plot graphs for all columns at once.
Model Validation
There are a lot of steps involved in creating and validating a model. The first step is to split our dataset into two parts, namely the train set and the test set.
Dora provides a simple function to do so. By default, we get an 80/20 split of training and test data.
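For readers who want to see what an 80/20 split involves, here is a minimal plain-Python sketch. It is not Dora's function: the helper name `split_train_test`, the `train_fraction` and `seed` parameters, and the random shuffling before the cut are assumptions made for illustration.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=None):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of ten rows, represented here by their indices.
rows = list(range(10))
train, test = split_train_test(rows, seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing models trained on the same partition. Every row lands in exactly one of the two sets.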