EDA
EDA (Exploratory Data Analysis) is the technique of defining, analyzing and investigating a dataset. It is used by most data scientists and engineers, and by anyone who works with data or wants to analyze it.
That said, it covers most of us: at any point in time we are dealing with data, and we unknowingly do an initial analysis of it, which in technical terms is referred to as "Exploratory Data Analysis".
Here is a formal definition of EDA:-
In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
Still confused about how every one of us uses this process? Let me explain it with a simple example.
Suppose you and your group plan to have lunch at a restaurant. As soon as we hear "lunch" and "restaurant", our mind starts creating a list of all the known places. Next, as someone in the group starts talking about a dish or a drink that they would like to have, our mind starts mapping the dishes to the restaurants, and we can visualize each place and the taste of its food.
The mapping and visualization that we just did is Exploratory Data Analysis. Apart from this, EDA also includes explaining our point or data with graphs, plots, charts, visuals and text, in such a systematic and organized way that it becomes eye-catching and self-explanatory.
Great! We are getting clear on the basics of "EDA".
Overview of EDA
Let's have an overview of EDA and try to understand it more clearly, quickly and easily.
The term Exploratory Data Analysis (EDA) was coined by John W. Tukey in his book Exploratory Data Analysis (1977) to encourage statisticians to explore the data and formulate hypotheses that could lead to new data collection and experiments.
Focus Area:- While doing Exploratory Data Analysis, we need to keep the following points in mind:
- it should be able to highlight important variables
- it should define the structure of the dataset
- it should detect any anomalies in our dataset, like outliers, wrong values etc.
- it should be able to maximize the insights drawn or to be drawn from the dataset
- it should be self-explanatory
- it should be systematic and organized
- it should test any assumptions made on which our further analysis will be based
- it should provide a basis for further data collection (surveys, experiments etc.)
This may seem like too much, but in short, EDA should be a base on which data scientists can build their future analysis and models.
To summarize, creating the final ML models is just a small part of the journey; the major portion is data preprocessing and understanding the data. So, the better we can describe our dataset, the better our chances of creating a good model.
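To make a couple of the focus points above more concrete, here is a minimal sketch that checks the structure of a dataset and flags possible outliers. It uses only the Python standard library. The toy `rows` data and the helper names `describe_structure` and `iqr_outliers` are made up for illustration, and the 1.5 x IQR fence is just one common rule of thumb, not the only way to detect anomalies.

```python
import statistics

# Hypothetical toy dataset: each row is one restaurant visit (illustrative values only).
rows = [
    {"restaurant": "A", "bill": 12.5, "rating": 4},
    {"restaurant": "B", "bill": 15.0, "rating": 5},
    {"restaurant": "A", "bill": 11.0, "rating": 4},
    {"restaurant": "C", "bill": 14.0, "rating": 3},
    {"restaurant": "B", "bill": 13.5, "rating": None},  # missing value
    {"restaurant": "A", "bill": 12.0, "rating": 5},
    {"restaurant": "C", "bill": 13.0, "rating": 4},
    {"restaurant": "C", "bill": 95.0, "rating": 4},     # suspiciously large bill
]


def describe_structure(rows):
    """Map each column to (sorted type names seen, number of missing values)."""
    structure = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        types = sorted({type(v).__name__ for v in values if v is not None})
        missing = sum(v is None for v in values)
        structure[column] = (types, missing)
    return structure


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


structure = describe_structure(rows)
outliers = iqr_outliers([row["bill"] for row in rows])

print(structure)  # column types and missing-value counts
print(outliers)   # bills that fall outside the IQR fences
```

With this toy data, the `rating` column reports one missing value and the bill of 95.0 is the only value flagged as an outlier. Whether such a value is a real anomaly or a valid data point is still something we have to decide by looking at the data.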
Classification of EDA
The EDA approach can be cross-classified in the following ways:-
- Graphical
- Non-Graphical
And
- Univariate
- Multivariate
Let's have a quick review and define these techniques.
1. Graphical Techniques:- These are the most important part of the EDA approach, because when we talk about EDA, the first thing that comes to mind is creating visuals like graphs, plots, charts, maps, trees etc. to describe our dataset. This technique and the ways to implement it are defined here in this article.
2. Non-Graphical Techniques:- These are often overlooked as part of the EDA approach, but they are an important part of it. Unlike graphical techniques, they do not use visuals to analyze the dataset; they focus on describing and analyzing the data with mathematical techniques such as summarizing the dataset, statistical analysis, theoretical analysis etc.
3. Univariate Techniques:- These focus on analyzing one variable (column) at a time. They are important because they help us focus on a single variable and get to know it in depth.
4. Multivariate Techniques:- These focus on analyzing multiple variables (columns) at the same time. They are important because they help us analyze a particular variable in relation to the other variables in our dataset.
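As a rough illustration of the non-graphical side, the sketch below summarizes a single column (univariate) and then measures how two columns move together (multivariate) using the Pearson correlation coefficient. The column names `hours_studied` and `exam_score`, their values, and the helper names are all hypothetical and chosen only for this example.

```python
import math
import statistics

# Hypothetical columns (illustrative values only).
hours_studied = [1.0, 2.0, 3.0, 4.0, 5.0]
exam_score = [52.0, 58.0, 65.0, 70.0, 78.0]


def univariate_summary(values):
    """Univariate non-graphical: basic summary statistics for one column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson_correlation(xs, ys):
    """Multivariate non-graphical: Pearson correlation between two columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


summary = univariate_summary(hours_studied)
r = pearson_correlation(hours_studied, exam_score)

print(summary)
print(round(r, 3))
```

A correlation close to 1 or -1 suggests a strong linear relationship between the two columns, while a value near 0 suggests little or no linear relationship.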
Thus, on combining these techniques, we get the final list of techniques used in EDA as follows:-
- Univariate Non-Graphical
- Multivariate Non-Graphical
- Univariate Graphical
- Multivariate Graphical
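For a feel of the univariate graphical category, here is a small sketch that draws a text histogram of one column. In practice a plotting library would normally be used for this; plain text is used here only to keep the example self-contained. The `ages` values and the `text_histogram` helper are hypothetical.

```python
from collections import Counter

# Hypothetical column of ages (illustrative values only).
ages = [21, 23, 25, 31, 34, 35, 36, 38, 42, 47, 55]


def text_histogram(values, bin_width):
    """Return one line per non-empty bin, with one '#' per value in that bin."""
    # Map each value to the lower edge of its bin, then count values per bin.
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [
        f"{start}-{start + bin_width - 1}: {'#' * bins[start]}"
        for start in sorted(bins)
    ]


lines = text_histogram(ages, bin_width=10)
print("\n".join(lines))
```

Each line of the output is one bin of the column, so the shape of the distribution (where most values sit, where they thin out) can be read off at a glance, which is exactly what a univariate graphical technique is meant to show.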
Conclusion
To summarize, we have studied what EDA is, why it is important and the techniques that are used in it.
To know more about EDA techniques and the different libraries used, Read Here.