
Pandas Profiling -- A Unique Way to Do Data Analysis




Pandas Profiling is an open-source Python library. It eases the initial stage of data analysis by providing a quick and easy way to analyse a dataset.

It is also considered a major EDA (exploratory data analysis) library: it creates visuals, graphs and data profiling reports within seconds, in just one line of code.

It saves a lot of the time that is usually spent visualizing and understanding the data. It extends the pandas data frame with a method that creates a report for quick and easy data analysis.

Installation 

"The best way to learn is by doing."

So, let's open a Jupyter Notebook (you may use your favourite IDE) and begin by installing the library.

Code:- 

## conda installation (the package is available on the conda-forge channel)

conda install -c conda-forge pandas-profiling

## pip installation

pip install pandas-profiling

## Jupyter Notebook installation (run inside a notebook cell)

!pip install pandas-profiling


It takes a few minutes to install. Let's grab a coffee in the meantime.




Once the library is installed, we might be asked to restart the kernel (in the case of Jupyter Notebook) for the changes to take effect.

Great, we have installed the pandas profiling library. We are now good to go with some pandas profiling examples, which will help us understand it better.

Loading the Library

Once installed, the next step is to use the library and explore it.

There are 2 different ways to load the pandas profile report, but before that, we need to import our dataset.

Code:- 

*Please Note:- We use the Titanic dataset as our first dataset for analysis.

Importing pandas profiling library and dataset

Creating Pandas Profile
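
A minimal sketch of these two steps could look like the following. It assumes the Titanic data is saved locally as `titanic.csv` (a hypothetical file name) and that the package is importable as `pandas_profiling`. Newer releases are published under the name `ydata-profiling`, in which case the import would be `ydata_profiling` instead. The helper names `load_dataset` and `create_profile` are ours, chosen only for illustration.

```python
import pandas as pd


def load_dataset(csv_path):
    """Read a CSV file (path or file-like object) into a pandas DataFrame."""
    return pd.read_csv(csv_path)


def create_profile(df, title="Titanic Dataset Report"):
    """Build a profile report for df.

    The import sits inside the function so this snippet can be loaded
    even on a machine where pandas-profiling is not installed yet.
    """
    from pandas_profiling import ProfileReport  # assumes pandas-profiling is installed

    return ProfileReport(df, title=title)


# Typical notebook usage (the file name is an assumption):
# df = load_dataset("titanic.csv")
# profile = create_profile(df)
```

The resulting `profile` object is what the viewing commands in the next step are called on.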


Once we have imported the dataset and created the profile, we can view the profile report in the following ways:- 

1. Through Widget:- 

Code:-       profile.to_widgets()

Pandas Profile in widget

Interesting, isn't it? Please wait, I will explain the pandas profile report in the next section.

2. Through iframe:- 

Code:-   profile.to_notebook_iframe()        OR       profile    OR     df.profile_report()

We can use any of the above commands to view the report in an iframe. 

Pandas Profile in iframe

  

Now, we may wonder what the difference between these two ways is. The short answer is that the content is the same and only the layout differs. The iframe renders the whole report on one page, so we need to scroll to each section, whereas the widget splits it into sub-sections, which are easier to analyse without much scrolling.
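
Since both views show the same report, the choice can be wrapped in a small helper. This is only a sketch: `show_report` and its `use_widgets` flag are hypothetical names of ours, while `to_widgets()` and `to_notebook_iframe()` are the two methods shown above.

```python
def show_report(profile, use_widgets=True):
    """Render a profile report inside a Jupyter notebook.

    use_widgets=True  -> widget view, split into sub-sections
    use_widgets=False -> full report embedded in a scrollable iframe
    """
    if use_widgets:
        return profile.to_widgets()
    return profile.to_notebook_iframe()
```

Calling `show_report(profile)` gives the widget view, and `show_report(profile, use_widgets=False)` gives the iframe.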

Explaining the Profiling Report

The profiling report is compact yet detailed. It is divided into 6 major sections. A basic overview of each section is given below:- 

 

Sections in Profiling report

1. Overview:- 

The first section of the pandas profile report (also known as the df report) is the Overview section, which has 3 subsections: Overview, Warnings, and Reproduction.

The Overview subsection describes the basic details of the dataset, such as the number of variables, the number of observations (rows), cells with missing values, duplicate rows, and the counts of numerical and categorical variables.
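
To get a feel for what these numbers mean, here is a rough pandas-only sketch that computes similar figures by hand. It is not how the library does it internally; `dataset_overview` is a hypothetical helper and the small frame is made-up data.

```python
import pandas as pd


def dataset_overview(df):
    """Return headline numbers similar to those in the Overview subsection."""
    numeric_cols = df.select_dtypes(include="number").columns
    return {
        "variables": df.shape[1],
        "observations": df.shape[0],
        "missing_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "numeric_variables": len(numeric_cols),
        "other_variables": df.shape[1] - len(numeric_cols),
    }


# Tiny made-up frame, used only to show the shape of the result.
sample = pd.DataFrame(
    {"Age": [22.0, 38.0, None, 22.0], "Sex": ["male", "female", "female", "male"]}
)
print(dataset_overview(sample))
```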

Overview Subsection

The Warnings subsection lists warnings about high cardinality, high correlation, missing data, zeros and uniform distributions. This saves the time otherwise spent checking each variable individually and gives a basic idea of how to proceed further in our analysis.
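
The idea behind such warnings can be sketched with plain pandas. The helper below is hypothetical, covers only three of the checks (missing data, cardinality and zeros), and both thresholds are illustrative assumptions rather than the library's own defaults.

```python
import pandas as pd


def simple_warnings(df, missing_threshold=0.2, cardinality_threshold=50):
    """Flag columns with many missing values, many distinct categories, or zeros.

    Both thresholds are assumptions made for this sketch.
    """
    messages = []
    for name in df.columns:
        column = df[name]
        is_numeric = pd.api.types.is_numeric_dtype(column)
        missing_share = column.isna().mean()
        if missing_share > missing_threshold:
            messages.append(f"{name}: {missing_share:.0%} missing")
        if not is_numeric and column.nunique() > cardinality_threshold:
            messages.append(f"{name}: high cardinality ({column.nunique()} distinct)")
        if is_numeric and (column == 0).any():
            messages.append(f"{name}: contains {int((column == 0).sum())} zeros")
    return messages
```

Running it on a data frame returns a list of plain-text messages, one per flagged column and check.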

Warning Subsection

The third subsection is Reproduction, which holds data about the pandas profiling report itself: the time it took to generate the report, the library version used, and an option to download the configuration JSON.

Reproduction Subsection

2. Variables:- 

The second section of the report is dedicated to analysing each variable separately.

Variables Section

Here we can see all the columns of our dataset and expand the section of the variable we want to analyse. I am expanding and analysing the "AGE" variable to show some interesting features of this section.

Variable Analysis

Notice the red arrow in the above image. Yes, that is a toggle button, which hides some interesting analysis for us.

Toggle Variable Analysis

On clicking the 'Toggle Details' button, we get 4 subsections for each variable, which are themselves further subdivided (we will not dive into that much detail here).

First is the Statistics section, which provides all the stats for that particular variable (shown in the image above), i.e. quantiles (Q1 & Q3), mean, median, mode, kurtosis, skewness, minimum and maximum values, variance, standard deviation etc.

The second section shows a histogram of the values and their frequencies. Similar data can also be seen in the third section (Common Values), where the frequency of each value is shown separately.

Histogram & Common Values

The last section (Extreme Values) shows the minimum and maximum values of that variable in the dataset and the frequency of each.
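
For comparison, most of what these subsections display for a numeric variable can be reproduced with a few pandas calls. The following is a sketch under our own naming (`describe_variable` and `top_n` are hypothetical), not the library's implementation.

```python
import pandas as pd


def describe_variable(series, top_n=5):
    """Collect statistics, common values and extreme values for one numeric column."""
    clean = series.dropna()
    stats = {
        "min": clean.min(),
        "q1": clean.quantile(0.25),
        "median": clean.median(),
        "q3": clean.quantile(0.75),
        "max": clean.max(),
        "mean": clean.mean(),
        "mode": clean.mode().iloc[0],
        "variance": clean.var(),
        "std": clean.std(),
        "skew": clean.skew(),
        "kurtosis": clean.kurt(),
    }
    # Frequency of each value, most frequent first (the "Common Values" view).
    common_values = clean.value_counts().head(top_n)
    # Smallest and largest observed values (the "Extreme Values" view).
    extremes = {
        "smallest": clean.nsmallest(top_n).tolist(),
        "largest": clean.nlargest(top_n).tolist(),
    }
    return stats, common_values, extremes
```

For the Titanic data this would be called as `describe_variable(df["Age"])`, assuming the column is named `Age` in the file being used.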

3. Interactions:- 

This is another interesting section of the pandas profiling report. It lists the variables, and we can pick any two of them to plot a graph of the relationship between them.

Graphs in Interaction Section

4. Correlations:- 

A separate section, apart from the "Warning" subsection, is provided to visualize the correlations among the variables. It covers 5 different correlation measures: Pearson's, Kendall's, Spearman's, Phik and Cramér's V. Also, if we want to know about each correlation in detail, we can use the 'Toggle' button marked by the red arrow.
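
Some of these coefficients can also be computed directly with pandas, which is handy for checking a value seen in the report. The sketch below uses `DataFrame.corr` for Pearson and Spearman; `correlation_matrices` is a hypothetical helper. Phik and Cramér's V are not part of pandas and are left out.

```python
import pandas as pd


def correlation_matrices(df, methods=("pearson", "spearman")):
    """Return one correlation matrix per method, using numeric columns only.

    pandas also accepts "kendall", but to our knowledge that method relies
    on SciPy being installed, so it is not in the defaults here.
    """
    numeric = df.select_dtypes(include="number")
    return {method: numeric.corr(method=method) for method in methods}
```

Each entry of the returned dictionary is a square data frame indexed by column name, for example `correlation_matrices(df)["pearson"]`.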


Correlations among variables


5. Missing Values:- 

This is the 5th section of our report. It gives deep insight into the missing data with the help of 4 different graphs: a count bar chart, a matrix, a heatmap and a dendrogram.
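
The numbers behind these graphs are simply counts of missing cells per column. A small pandas sketch, with the hypothetical helper name `missing_summary`, shows how to tabulate them:

```python
import pandas as pd


def missing_summary(df):
    """Count and share of missing values per column, worst columns first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({"missing": counts, "share": counts / len(df)})
    return summary.sort_values("missing", ascending=False)
```

The result has one row per column of the dataset, so the columns that need attention appear at the top.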

Missing Values report

6. Sample:- 

This is nothing but the df.head() and df.tail() of the dataset, i.e. its first and last rows.
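
The pandas equivalent is a one-liner each way. In the sketch below, `first_and_last_rows` is a hypothetical helper and the default of 10 rows is an assumption made for illustration.

```python
import pandas as pd


def first_and_last_rows(df, n=10):
    """Return the first n and the last n rows of df as two data frames."""
    return df.head(n), df.tail(n)
```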

Dataset Sample


That's a huge lot of features in a small package. But let's have a look at some of the disadvantages it has as of now; we hope they will be removed soon.

Disadvantages

Here is an overview of the disadvantages of the pandas-profiling library. In our analysis, we found these 3 major disadvantages:- 

1. Allows data frames only:- This is one of the major drawbacks of this library, as it accepts only a pandas data frame as input for analysis. 

2. Huge datasets:- Another big issue with this library is performance. It works well for small datasets, but as the size of the data increases, generating the profiling report becomes significantly slower. The only choice left is to run it on a part of the data.

3. Extension of describe:- df.describe() is the most basic pandas data frame function used for an initial analysis of the data. This library can be thought of as merely an extension of it, as it mostly provides the information already covered by basic pandas data frame functions like df.head(), df.tail(), df.describe() etc.
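
For the second drawback, running the report on a part of the data can be done with a reproducible random sample. This is a minimal sketch: `sample_for_profiling`, the 10,000-row cap and the seed are all assumptions chosen for illustration.

```python
import pandas as pd


def sample_for_profiling(df, max_rows=10_000, seed=42):
    """Return df unchanged if it is small enough, else a reproducible random sample."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)
```

The sampled frame can then be profiled as before, for example with `sample_for_profiling(df).profile_report()`. Recent releases of the library also offer a `minimal=True` option on the report that switches off the most expensive computations; it is worth checking the documentation of the installed version.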

Summary

We talked about the pandas profiling library, which is used for exploratory data analysis. We covered how to install and use it, and briefly discussed the report and its sections.

Apart from that, we also discussed the disadvantages that limit its use. Even so, it remains a good tool for basic exploratory data analysis.

That's all from here... Until next time, this is the Quick DataScience Team providing a quick and easy guide/insight into another data science topic.





