The Explorer of Data Sets -- Dora

Exploring a dataset is both fun and tedious, but it is an unavoidable step in any machine learning project. The challenge is always to analyze the data correctly, completely and on time.

Many libraries exist to address these challenges, each with its own advantages and disadvantages. We have already discussed a few of them (Pandas profiling, dtale, autoviz, lux, sweetviz) in previous articles. Today, we would like to present another library for Exploratory Data Analysis (EDA): Dora.

Calling it only an EDA library would not do it justice: it not only helps explore the dataset but also helps prepare the data for modelling.

Introduction

Dora is an open-source library, so it is freely available to everyone. Unlike the other libraries we discussed earlier, Dora offers features beyond exploring the dataset and creating visuals.

Features offered by Dora:

Read Configuration & Data
Clean the Data
Feature Selection & Extraction
Data Visualization
Validating Model
Data Versioning

There is a lot to cover, so let's dive straight into the installation.

Installation

We will be using Jupyter Notebook throughout; you may use any other IDE of your choice.

## conda installation

conda install dora

## pip installation

pip install Dora

## Jupyter Notebook installation

pip install Dora

Let's grab a coffee by the time it gets installed.

Installing Dora

Once the library is installed, we might be asked to restart the kernel (in the case of Jupyter Notebook) for the changes to take effect.

Great, we have installed the Dora library and are good to go with some examples that will help us understand it better.

Getting Started

Read Configuration & Data

We can read data in two different ways: either directly with Dora's built-in function, or with pandas (the traditional way).

Reading Data Directly from Dora

Reading data using Pandas
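As a minimal sketch of the pandas route, the snippet below loads a small CSV into a DataFrame. The CSV text and its column names (A, B, C, D) are made up for illustration, and an in-memory string stands in for a file path so the example runs anywhere. How the resulting DataFrame is handed over to Dora is not shown here.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """A,B,C,D
1,2,0,left
4,,1,right
7,8,2,left
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Quick sanity checks: size, column types, and missing values per column.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
```

With a real file, the io.StringIO(...) argument would simply be replaced by the path to the CSV.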

Cleaning Data

To get clean data ready for our machine learning model, we need to perform two basic steps: impute the missing data and scale the data. Both are handled well by this library.

Creating Dummy Data

Imputing Missing Values
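To make the idea concrete, here is a rough pandas sketch of imputation. It assumes mean imputation (filling each gap with the average of its column), which is one common strategy and not necessarily the one Dora applies. The DataFrame is dummy data invented for the example.

```python
import pandas as pd

# Dummy data with one missing value in column B (None becomes NaN).
df = pd.DataFrame({"A": [1.0, 4.0, 7.0], "B": [2.0, None, 8.0]})

# Replace every missing value with the mean of its own column.
imputed = df.fillna(df.mean())

print(imputed)
```

Mean imputation keeps the column average unchanged, but it does shrink the variance, so it is worth checking how many values were filled before relying on it.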

Scaling the inputs
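Scaling can be sketched in plain pandas as well. The example below assumes standardization (zero mean, unit variance per column); other schemes such as min-max scaling exist, and this is an illustration of the concept rather than Dora's own implementation. The data is made up.

```python
import pandas as pd

# Dummy numeric data on different scales.
df = pd.DataFrame({"A": [1.0, 4.0, 7.0], "B": [20.0, 50.0, 80.0]})

# Standardize each column: subtract its mean, divide by its
# population standard deviation (ddof=0).
scaled = (df - df.mean()) / df.std(ddof=0)

print(scaled)
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates just because of its units.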

Feature Selection & Extraction

Feature selection is the process of keeping the most important features and discarding the less important ones.

Removing a Feature
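In plain pandas, removing a feature amounts to dropping a column. The sketch below uses a made-up column named "useless" and is only meant to show the effect, not the Dora call itself.

```python
import pandas as pd

# Dummy data; the "useless" column carries no information.
df = pd.DataFrame({"A": [1, 4, 7], "useless": [0, 0, 0]})

# drop returns a new DataFrame and leaves the original untouched.
reduced = df.drop(columns=["useless"])

print(list(reduced.columns))
```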

One Hot Encoding
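One-hot encoding turns a categorical column into one indicator column per category. A minimal sketch with pandas' get_dummies follows; the column D and its values are invented for the example, and this is an equivalent in pandas rather than Dora's function.

```python
import pandas as pd

# Dummy data with one categorical column.
df = pd.DataFrame({"A": [1, 4, 7], "D": ["left", "right", "left"]})

# Replace column D with one indicator column per category.
encoded = pd.get_dummies(df, columns=["D"])

print(encoded)
```

Each row now has exactly one "hot" indicator among the D_ columns, which is the form most models expect for categorical inputs.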

Creating new variable
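Creating a new variable usually means deriving a column from an existing one. The pandas sketch below doubles a made-up column C; the column names and the transformation are purely illustrative.

```python
import pandas as pd

# Dummy data.
df = pd.DataFrame({"A": [1, 4, 7], "C": [0, 1, 2]})

# Derive a new column by applying a function to an existing column.
df["C_doubled"] = df["C"].apply(lambda value: value * 2)

print(df)
```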

Data Visualization

We can also use the dora.explore() command to plot graphs for all columns.

Model Validation

There are a lot of steps involved in creating and validating a model. The first step is to split our dataset into two different parts, namely the train set and the test set.

Dora provides a simple function to do so. By default, we get an 80/20 split of training and test data.
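To illustrate what an 80/20 split does, here is a small sketch in plain Python. The helper train_test_split and its parameters are hypothetical names written for this example; it is not Dora's function, only the same idea: shuffle the rows, then cut at 80%.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into train and test lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten dummy rows, represented here by their indices.
rows = list(range(10))
train, test = train_test_split(rows)

print(len(train), len(test))
```

Shuffling before cutting matters: if the data is sorted (by date or by label, say), a plain head/tail cut would give train and test sets that look very different.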


Once we have our train and test datasets, the next task is to create a model and fit it to the training data.

some_model.fit(X_train, y_train)

Once the model is fitted, i.e. trained, we need to evaluate it on the test set.

some_model.score(X_test, y_test)
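Here some_model stands for any estimator that follows the fit/score convention. As a self-contained sketch, the toy MajorityClassifier below (a hypothetical class written for this example, not part of Dora or any library) shows the contract: fit learns from the training data, and score reports accuracy on data the model has not seen.

```python
from collections import Counter


class MajorityClassifier:
    """Toy baseline: always predicts the most common training label."""

    def fit(self, X, y):
        # "Training" here is just remembering the most frequent label.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

    def score(self, X, y):
        """Accuracy: fraction of predictions that match the true labels."""
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


# Made-up train and test data.
X_train, y_train = [[1], [2], [3], [4]], ["a", "a", "a", "b"]
X_test, y_test = [[5], [6]], ["a", "b"]

some_model = MajorityClassifier().fit(X_train, y_train)
accuracy = some_model.score(X_test, y_test)
print(accuracy)
```

A baseline like this is also a useful yardstick: a real model should score clearly higher on the test set than always guessing the majority label.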

Data Versioning

Versioning is the process of creating checkpoints while performing large data manipulations, so that we don't have to start over from the beginning if we make a mistake.

There are various ways to perform versioning. Some people prefer creating copies of the data at various points; others prefer storing it somewhere on the system, such as a database or a file, which they can refer to at a later stage. But all of this is tedious, because we have to remember every checkpoint we created and keep them all in the same format.

To overcome this issue, we can use Dora's built-in functions for an easier life and quick recovery of data.

dora.snapshot("Checkpoint Name")

dora.use_snapshot("Checkpoint Name")

We can use the dora.snapshot method to create a snapshot of the data and dora.use_snapshot to return to any particular checkpoint.
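The idea behind named checkpoints can be sketched in a few lines of plain Python. The SnapshotStore class below is a hypothetical stand-in written for this example; it mirrors the snapshot / use_snapshot naming from above but says nothing about how Dora implements it internally.

```python
import copy


class SnapshotStore:
    """Keeps named deep copies of a dataset so it can be restored later."""

    def __init__(self, data):
        self.data = data
        self._snapshots = {}

    def snapshot(self, name):
        # Deep copy, so later edits to self.data cannot alter the checkpoint.
        self._snapshots[name] = copy.deepcopy(self.data)

    def use_snapshot(self, name):
        # Restore a copy, so the stored checkpoint stays reusable.
        self.data = copy.deepcopy(self._snapshots[name])


store = SnapshotStore({"A": [1, 4, 7], "B": [2, 5, 8]})
store.snapshot("raw")

# A destructive change we later regret.
del store.data["B"]

store.use_snapshot("raw")
print(store.data)
```

Copying on both save and restore is the key design choice: it costs memory, but it guarantees that a checkpoint can be reused any number of times.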

Summary

We covered the basics of a new library for EDA, including how to install and use it. Compared to other Python libraries used for EDA, Dora holds a strong position, as it offers a lot more than simple data visualization. Its feature engineering support also makes it a better choice.

So, what are you waiting for? Go ahead, download it, start playing with it, and share your views, issues and suggestions.

That's all from here. Until next time, this is the Quick DataScience Team, providing a quick and easy guide to another data science topic.
