
EDA: Exploratory Data Analysis




Exploratory Data Analysis (EDA) is the practice of describing, analyzing and investigating a dataset. It is used by most data scientists and engineers, and by anyone who works with data or wants to analyze it.

That said, it includes most of us: at some point we all deal with data and, without realizing it, carry out an initial analysis, which in technical terms is called "Exploratory Data Analysis".


Here is a formal definition of EDA:-


In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. 


Still confused about how every one of us uses this process? Let me explain it with a simple example...

Suppose you and your group plan to have lunch at a restaurant. As soon as we hear "lunch" and "restaurant", our minds start listing all the places we know. Then, as someone in the group mentions a dish or a drink they would like to have, our minds start mapping dishes to restaurants, and we can picture each place and recall how its food tastes.

 

The mapping and visualization we just did is Exploratory Data Analysis. Beyond this, EDA also includes presenting our point or data with graphs, plots, charts, visuals and text in such a systematic and organized way that the result is eye-catching and self-explanatory.


Great! We are now clear on the basics of "EDA".


Overview of EDA

Let's take an overview of EDA and try to understand it clearly, quickly and easily.


History:- Yes, this piece is important. To know about the future of anything, we should know about its past.

So, the term Exploratory Data Analysis (EDA) was coined by John W. Tukey in his 1977 book, Exploratory Data Analysis, to encourage statisticians to explore the data and formulate hypotheses that could lead to new data collection and experiments.

Focus Area:- While doing Exploratory Data Analysis, we need to keep the following points in mind:-


  • it should be able to highlight important variables
  • it should define the structure of the dataset
  • it should detect any anomalies in our dataset, such as outliers and wrong values
  • it should maximize the insights drawn from the dataset
  • it should be self-explanatory
  • it should be systematic and organized
  • it should test the assumptions on which our further analysis will be based
  • it should provide a basis for further data collection (surveys, experiments, etc.)


It seems like a lot, but in short, EDA should be the base on which data scientists can carry out their further analysis and build models.
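To make the "detect anomalies like outliers" point concrete, here is a minimal sketch of one common check, the 1.5 x IQR rule (often called Tukey's fences). It is only one possible way to flag outliers, not the only one. The function name `iqr_outliers` and the `bills` data are made up for illustration, and the sketch uses nothing but Python's standard library.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside Q1 - k*IQR and Q3 + k*IQR."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical column: lunch bills, with one suspicious entry.
bills = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
print(iqr_outliers(bills))  # prints [95]
```

A flagged value is not automatically a wrong value. The check only tells us where to look more closely.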

Need:- We need EDA because blindly starting to work on a dataset and trying to build a model is not the way to do the job. The process is to understand the data: to know the variables, the data and its source, what we want to analyze, what information we can get from the data, and how we can use it efficiently so that we can create a model quickly and easily. To achieve all this, we need to run a few basic checks by creating visuals, describing the data and carrying out other activities that help us understand the data well.


To summarize, creating the final ML model is just a small part of the journey; the major portion is preprocessing and understanding the data. So, the better we can describe our dataset, the better our chances of creating a good model.


Process:- EDA is not a technique but an approach: an attitude and a philosophy about how a data analysis should be carried out.


Classification of EDA

The EDA approach can be cross-classified in the following ways:-


  1. Graphical 
  2. Non-Graphical

And


  1. Univariate
  2. Multivariate


Let's have a quick review and define these techniques.


P.S. An elaborate discussion of these techniques is given in separate articles, and a link to each is at the end of the corresponding section.


1. Graphical Techniques:- These are the most important part of the EDA approach, because when we talk about EDA, the first thing that comes to mind is creating visuals like graphs, plots, charts, maps, trees, etc. to describe our dataset. These techniques and the ways to implement them are described here in this article.
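As a rough illustration of a graphical technique, the sketch below draws a histogram of a single column as plain text. In practice a plotting library would normally be used; this version sticks to the standard library so it runs anywhere. The function name `text_histogram` and the `ages` data are hypothetical, and the sketch assumes whole-number values.

```python
from collections import Counter

def text_histogram(values, bin_width):
    """Group integer values into equal-width bins and return one bar per bin."""
    # Map each value to the start of its bin, e.g. 34 -> 30 for bin_width=10.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<8}{'#' * counts[start]}")
    return lines

# Hypothetical column: ages of survey respondents.
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 58]
for line in text_histogram(ages, bin_width=10):
    print(line)
```

Empty bins are simply skipped in this sketch, which is fine for a quick look but would hide gaps in a sparse dataset.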

 

2. Non-Graphical Techniques:- These are often overlooked as part of the EDA approach, but they are an important part of it. Unlike graphical techniques, they do not use visuals to analyze the dataset; they focus on describing and analyzing the data with mathematical techniques such as summarizing the dataset, statistical analysis, theoretical analysis, etc.
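A small sketch of what "summarizing the dataset" can look like for one numeric column is shown below. The choice of statistics is an assumption made for illustration, as are the function name `summarize` and the `scores` data. Only Python's standard `statistics` module is used.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical column: exam scores.
scores = [55, 61, 64, 70, 70, 73, 78, 85, 92]
for name, value in summarize(scores).items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick non-graphical hint about skew: when they differ a lot, a few extreme values are probably pulling the mean.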


3. Univariate Technique:- This method analyzes the data one variable (column) at a time. It is important because it helps us focus on a single variable and understand it in depth.


4. Multivariate Technique:- This method analyzes multiple variables (columns) at the same time. It is important because it helps us analyze a particular variable in relation to the other variables in our dataset.


Thus, combining these techniques gives the final list of techniques used in EDA:-
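One simple way to look at a variable "concerning other variables" is to measure how two columns move together. The sketch below computes the Pearson correlation coefficient by hand, using only the standard library. The function name `pearson` and the two example columns are hypothetical, and correlation is just one of many multivariate checks.

```python
import math

def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length columns."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length with at least 2 rows")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns: hours studied and exam score for five students.
hours = [1, 2, 3, 4, 5]
score = [52, 58, 65, 70, 79]
print(round(pearson(hours, score), 3))
```

A value near +1 or -1 suggests a strong linear relationship, while a value near 0 suggests little linear relationship. It says nothing about cause and effect.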


  1. Univariate Non-Graphical
  2. Multivariate Non-Graphical
  3. Univariate Graphical
  4. Multivariate Graphical


Conclusion

To summarize, we have studied EDA: what it is, why it is important, and the techniques used in it.


To know more about the EDA techniques and the different libraries used, Read Here.

