
The Explorer of Data Sets -- Dora


Exploring a dataset is both fun and tedious, but it is an unavoidable step in any Machine Learning journey. The challenge is always to analyse the data correctly, completely and on time.

Many libraries exist to help with this, each with its own advantages and disadvantages. We have already discussed a few of them (Pandas profiling, dtale, autoviz, lux, sweetviz) in previous articles. Today, we would like to present another library for Exploratory Data Analysis: Dora.


Calling it only an EDA library would not do it justice, as it not only helps explore the dataset but also helps prepare the data for modelling.


Introduction


Dora is an open-source library, so it is freely available to everyone. Unlike the other libraries we discussed earlier, Dora offers some extra features beyond exploring the dataset and creating visuals.


Features offered by Dora:


  • Read Configuration & Data
  • Clean the Data
  • Feature Selection & Extraction
  • Data Visualization
  • Validating Model
  • Data Versioning


Too much to discuss, so let's dive directly into the installation.


Installation


We will be using Jupyter Notebook throughout; you may use any other IDE of your choice.


## conda installation

conda install dora


## pip installation 

pip install Dora


## Jupyter Notebook installation

%pip install Dora



Let's grab a coffee by the time it gets installed.


Installing Dora

Once the library is installed, we might be asked to restart the kernel (in the case of Jupyter Notebook) for the changes to take effect.


Great, we have installed the Dora library and are good to go with some examples that will help us understand it better.


Getting Started



Read Configuration & Data


We can read data in two different ways: either directly using Dora's built-in function or using pandas (the traditional way).
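Whichever route we take, the loader needs two things: the table itself and the name of the output column. The snippet below is a minimal standard-library sketch of that idea, not Dora's or pandas' own code; the CSV content and the column names A, B and C are hypothetical.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk.
raw_csv = io.StringIO(
    "A,B,C\n"
    "1,2,0\n"
    "4,5,1\n"
    "7,8,0\n"
)

# Read every row as a dict keyed by the header names.
csv_rows = list(csv.DictReader(raw_csv))

# Treat column "C" as the output (target) and everything else as inputs.
output_column = "C"
targets = [float(row[output_column]) for row in csv_rows]
inputs = [
    {name: float(value) for name, value in row.items() if name != output_column}
    for row in csv_rows
]

print(inputs[0], targets[0])
```

To read a real file, we would swap the `io.StringIO` object for `open("some_file.csv", newline="")`, where the file name is again just a placeholder.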

Reading Data Directly from Dora





Reading data using Pandas






Cleaning Data

To get clean data ready for our Machine Learning model, we need to perform two basic steps: impute the missing data and scale the data. Both are handled very well by this library.



Creating Dummy Data





Imputing Missing Values
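To make the idea concrete, here is a small plain-Python sketch of mean imputation, one common way to fill gaps. It only illustrates the concept; the helper `impute_mean` and the sample column are hypothetical and are not part of Dora.

```python
def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [value for value in column if value is not None]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if value is None else value for value in column]


# Hypothetical column with two missing entries.
ages = [20.0, None, 40.0, None]
imputed_ages = impute_mean(ages)
print(imputed_ages)  # [20.0, 30.0, 40.0, 30.0]
```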




Scaling the inputs
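Scaling usually means bringing every input column onto a comparable range. The sketch below shows standardization (zero mean, unit variance), which is one common choice; we are assuming that flavour here for illustration. The `standardize` helper and the sample values are hypothetical.

```python
import statistics


def standardize(column):
    """Shift a column to zero mean and rescale it to unit variance."""
    mean = statistics.mean(column)
    std = statistics.pstdev(column)  # population standard deviation
    if std == 0:
        # A constant column carries no spread; map it to all zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical input column.
heights = [150.0, 160.0, 170.0]
scaled_heights = standardize(heights)
print(scaled_heights)
```

After this step the middle value sits at 0 and the other two are symmetric around it, so columns measured in different units become comparable.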





Feature Selection & Extraction


Feature selection is the process of selecting the most important features and discarding the less important ones.


Removing a Feature
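Conceptually, removing a feature just means dropping one column from every row. A minimal plain-Python sketch, with a hypothetical helper and hypothetical column names (this is not Dora's implementation):

```python
def remove_feature(rows, name):
    """Return new rows without the given column; the input is left untouched."""
    return [{key: value for key, value in row.items() if key != name} for row in rows]


# Hypothetical table where column "D" is not useful.
table = [{"A": 1, "D": 9}, {"A": 2, "D": 8}]
trimmed = remove_feature(table, "D")
print(trimmed)  # [{'A': 1}, {'A': 2}]
```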




One Hot Encoding
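One hot encoding replaces a categorical column with one 0/1 column per category. The following plain-Python sketch shows the transformation itself; the helper name, the `name=category` column naming and the sample rows are our own assumptions, not Dora's output format.

```python
def one_hot_encode(rows, name):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[name] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != name}
        for category in categories:
            new_row[f"{name}={category}"] = 1 if row[name] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical rows with a categorical "color" column.
items = [
    {"size": 1, "color": "red"},
    {"size": 2, "color": "blue"},
    {"size": 3, "color": "red"},
]
encoded_items = one_hot_encode(items, "color")
print(encoded_items[0])  # {'size': 1, 'color=blue': 0, 'color=red': 1}
```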




Creating new variable
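A new variable is typically computed row by row from the columns we already have. Here is a small sketch of that pattern in plain Python; the `add_feature` helper, the column names and the BMI example are hypothetical and only illustrate the idea.

```python
def add_feature(rows, new_name, func):
    """Return new rows with an extra column computed from each existing row."""
    return [{**row, new_name: func(row)} for row in rows]


# Hypothetical data: derive a body-mass-index column from height and weight.
people = [
    {"height_m": 2.0, "weight_kg": 80.0},
    {"height_m": 1.5, "weight_kg": 45.0},
]
with_bmi = add_feature(people, "bmi", lambda row: row["weight_kg"] / row["height_m"] ** 2)
print(with_bmi[0]["bmi"])  # 20.0
```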






Data Visualization






We may also use the dora.explore() command to plot graphs for all columns.





Model Validation


There are a lot of steps involved in creating and validating a model. The first step is to split our dataset into two different parts, namely the train set and the test set.


Dora provides a simple function to do so. By default, we get an 80/20 split of training and test data.
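To see what an 80/20 split does, here is a minimal plain-Python sketch. It is not Dora's function: the helper name, the `train_fraction` and `seed` parameters and the sample data are all hypothetical.

```python
import random


def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of ten samples.
samples = list(range(10))
train_set, test_set = split_train_test(samples)
print(len(train_set), len(test_set))  # 8 2
```

Shuffling before cutting matters when the file is sorted in some way; without it, the test set could end up containing only one kind of row.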




Once we have our train and test datasets, the next task is to create a model and fit it to the training data.

some_model.fit(X_train, y_train)

Once the model is fitted, i.e. trained, we need to test it on the test set.

some_model.score(X_test, y_test)
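Here `some_model` stands for any object that exposes `fit` and `score`. As a rough illustration of that interface, the sketch below defines a deliberately naive, hypothetical model that always predicts the mean of the training targets and scores itself with R^2. The class and the data are made up for this example.

```python
class MeanRegressor:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]

    def score(self, X, y):
        """Coefficient of determination (R^2) on the given data."""
        predictions = self.predict(X)
        y_mean = sum(y) / len(y)
        ss_res = sum((target - pred) ** 2 for target, pred in zip(y, predictions))
        ss_tot = sum((target - y_mean) ** 2 for target in y)
        if ss_tot == 0:
            # Constant targets: perfect only if every prediction matches.
            return 1.0 if ss_res == 0 else 0.0
        return 1 - ss_res / ss_tot


# Hypothetical train and test data.
X_train, y_train = [[1], [2], [3], [4]], [2.0, 4.0, 6.0, 8.0]
X_test, y_test = [[5], [6]], [4.0, 6.0]

some_model = MeanRegressor()
some_model.fit(X_train, y_train)
test_score = some_model.score(X_test, y_test)
print(test_score)  # 0.0
```

A score of 0.0 tells us this toy model does no better than predicting the average, which is exactly what it does. A real model should score higher on the test set.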


Data Versioning 

Versioning is the process of creating checkpoints while we perform large manipulations, so that we don't have to start over from the beginning in case of mistakes.

There are various ways to perform versioning. Some people prefer creating a copy of the data at various points; a few prefer storing it somewhere on the system, such as a DB or a file, which they can refer to at a later stage. But all this is tedious, as it means remembering every checkpoint that was created and keeping them all in the same format.

To overcome this issue, we can use Dora's built-in functions for an easy life and quick recovery of data.

dora.snapshot("Checkpoint Name")
dora.use_snapshot("Checkpoint Name")

We can use the dora.snapshot method to create a snapshot of the data and dora.use_snapshot to return to any particular checkpoint.
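The checkpoint idea itself is easy to picture in plain Python. The class below is a hypothetical sketch that borrows the two method names from above for readability; it is not how Dora implements them. It simply stores deep copies of the data under a name.

```python
import copy


class SnapshotStore:
    """Keeps named deep copies of a dataset so earlier states can be restored."""

    def __init__(self, data):
        self.data = data
        self._snapshots = {}

    def snapshot(self, name):
        # Deep copy so later edits to self.data cannot alter the checkpoint.
        self._snapshots[name] = copy.deepcopy(self.data)

    def use_snapshot(self, name):
        # Restore a copy, keeping the stored checkpoint itself reusable.
        self.data = copy.deepcopy(self._snapshots[name])


# Hypothetical dataset: take a checkpoint, make a mistake, then roll back.
store = SnapshotStore([{"A": 1, "B": 2}])
store.snapshot("raw")
del store.data[0]["B"]      # an edit we later regret
store.use_snapshot("raw")
print(store.data)  # [{'A': 1, 'B': 2}]
```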

Summary

We tried to understand the basics of a new library for EDA, how to install it and how to use it. Compared to other Python libraries used for EDA, Dora holds a strong position, as it offers a lot more than simple data visualization. The feature engineering part also makes it a better choice.


So, what are you waiting for? Go ahead, download it, start playing with it, and share your views, issues and suggestions.


That's all from here. Until next time, this is the Quick DataScience Team, providing a quick and easy guide to another DataScience topic.
