QuickDataScience | Quick & Easy Data Science

Showing posts with the label big data

Data Visualization — IPL Data Set (Part 2)

Welcome to the third post in the Data Visualization series, on one of the most loved and followed topics in India: the IPL (Indian Premier League), Part 2, covering 2008–2020. In Part 1 we analysed the data by team; here we analyse the remaining fields and try to cover some interesting and unusual analyses.

Overview of the Data Set
Description of columns of IPL Dataset 1
Description of columns of IPL Dataset 2

Let's begin by checking the data in these columns...

IPL Data Set 1 Overview
IPL Data Set 2 Overview

Let's start the visualization by finding the top 10 players by number of MoM (Man of the Match) awards.

MoM Awards

The graph above shows the top 10 IPL players by number of Man of the Match awards. And guess what: it's none other than our Mr. 360 (ABD) with 23 awards, followed by The Universe Boss (Gayle 333) with 22, the roHIT MAN of India with 18, and Warner and Captain Cool (MSD) with 17 each. Just a random thought of checkin...
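The top-10 count described above can be sketched in a few lines of Python. This is only an illustration: the rows are made-up placeholders rather than the real IPL data, and the column name `player_of_match` is an assumption about how the dataset labels the award.

```python
from collections import Counter

# Made-up sample rows standing in for the matches table.
# The column name "player_of_match" is an assumption, not taken from the post.
matches = [
    {"match_id": 1, "player_of_match": "Player A"},
    {"match_id": 2, "player_of_match": "Player B"},
    {"match_id": 3, "player_of_match": "Player A"},
    {"match_id": 4, "player_of_match": "Player C"},
    {"match_id": 5, "player_of_match": "Player A"},
    {"match_id": 6, "player_of_match": "Player B"},
]


def top_award_winners(rows, n=10):
    """Return the n players with the most awards as (player, count) pairs."""
    counts = Counter(row["player_of_match"] for row in rows)
    return counts.most_common(n)


top_players = top_award_winners(matches, n=10)
for player, awards in top_players:
    print(f"{player}: {awards}")
```

The resulting (player, count) pairs are what a bar chart like the one described would plot.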

Data Visualization — IPL Data Set (Part 1)

Welcome to the second post in the Data Visualization series, on one of the most loved and followed topics in India: the IPL (Indian Premier League), Part 1. In this post we focus on analyses based on the teams.

Overview of the Data Set
Description of columns of IPL Dataset

Let's begin by checking the data in these columns.

IPL Data Set 1 Overview
IPL Data Set 2 Overview

Now for the most interesting part: visualizing the dataset and its relationships. Let's begin with a look at the total wins by each team since 2008...

Team VS No. of Match Wins

The bar chart above shows the top 5 teams by number of match wins across all seasons. Surprisingly, RCB is among the top 5 yet still hasn't won an IPL season. Now let's take a closer look at these wins.

Team VS Wins based on runs/wickets

The bar chart above is a more detailed version of the previous graph: it shows the top 5 teams by wins, split into wins achieved batting first and wins achieved batting second. Blue Bars represent the wins...

Bucketing in Hive

Today, we are dealing with the big problem of Big Data: a huge amount of data is generated every second. Storing that much data is an issue in itself, one that is managed using various SQL, NoSQL and now NewSQL databases. But a problem remains: if we store the data exactly as it is generated, it becomes difficult to query. So there was a need for a technique that splits the data at the time of storing, making it both faster to access and easier to store. Hive was introduced to address the storage and management of Big Data, and it provides concepts like Partitioning and Bucketing to solve the problem of storing and querying huge datasets.
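To make the idea of splitting data at storage time concrete, here is a minimal Python sketch of bucketing. It assumes an integer `user_id` column and a bucket count of 4, both hypothetical, and assigns each row to a bucket by taking the value modulo the bucket count. It illustrates the general hash-and-modulo idea only and does not reproduce Hive's own hashing or file layout.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count chosen for this example


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket from the bucketing-column value (integer modulo)."""
    return user_id % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group rows into buckets keyed by bucket number."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row["user_id"], num_buckets)].append(row)
    return dict(buckets)


# Made-up rows: user_id 1..8 with an arbitrary amount column.
rows = [{"user_id": uid, "amount": uid * 10} for uid in range(1, 9)]
buckets = bucketize(rows)

# A lookup for one user_id only has to scan a single bucket,
# not the whole dataset.
wanted = 6
candidates = buckets[bucket_for(wanted)]
match = [row for row in candidates if row["user_id"] == wanted]
print(f"scanned {len(candidates)} of {len(rows)} rows, found {match}")
```

Because the bucket is computed from the value itself, the same function that placed a row at write time also tells a query where to look at read time.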

Partitioning in HIVE - Learning by Doing

We studied the theory of Partitioning in Hive in our previous article. Time to get our hands dirty now. The coding part follows this pattern:
1. Hadoop Installation.
2. Hive Installation.
3. Static Partitioning. {The theory part is covered in the previous article.}
4. Dynamic Partitioning. {The theory part is covered in the previous article.}
Hopefully you have Hadoop and Hive installed and running.

Partitioning in Hive

What is Partitioning? In simple words, partitioning is the process of dividing something into sections or parts to make it easier to understand and manage. We use this concept in our everyday routine to simplify tasks and save time, but we do it so instinctively that we hardly notice. Let's look at an example to get familiar with the concept. Suppose we have a deck of cards and need to fetch the "Jack of Spades". There are two ways to accomplish this:
1. Turn over every card one by one, starting from the top or bottom, until we reach our card.
2. Group the deck by suit, i.e. clubs, hearts, spades, diamonds. Now, as soon as we hear "Spades", we know which group to look in, cutting the search to a quarter of the deck.
Grouping the data by a specific category reduced our work and saved energy, time and effort. Defining in Technical Term...
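The card example can be sketched in Python by counting how many cards get turned over in each approach. The deck here is built in a fixed order so the counts are reproducible. A shuffled deck would give different numbers, but the spades pile never holds more than 13 cards against 52 in the full deck.

```python
SUITS = ["clubs", "hearts", "spades", "diamonds"]
RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]

# An unshuffled 52-card deck, ordered suit by suit.
deck = [(rank, suit) for suit in SUITS for rank in RANKS]
target = ("Jack", "spades")


def cards_checked(cards, wanted):
    """Count how many cards are turned over before the wanted one is found."""
    for checked, card in enumerate(cards, start=1):
        if card == wanted:
            return checked
    raise ValueError("card not in pile")


# Way 1: go through the whole deck from the top.
full_scan = cards_checked(deck, target)

# Way 2: group by suit first, then search only the matching pile.
by_suit = {suit: [card for card in deck if card[1] == suit] for suit in SUITS}
partitioned_scan = cards_checked(by_suit[target[1]], target)

print(f"full deck: {full_scan} cards, spades pile only: {partitioned_scan} cards")
```

The grouping step costs one pass over the deck, but it is paid once at storage time, while every later lookup by suit benefits from it.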

SQL --- Structured Query Language

What is SQL? Structured Query Language, also known as SQL, is the language of databases and one of the most popular and in-demand technologies. It was developed specifically for database management, i.e. creating a database, inserting and updating records, managing access and retrieving data. SQL is mostly used with Relational Database Management Systems. Demand for it keeps growing: as the volume of data increases, so does the need for SQL. It is used by web developers, data analysts, data engineers, and in every other field where data has to be stored and retrieved. One of the main reasons SQL is gaining popularity is that it is simple, easy, quick, and powerful. Another is that MySQL, one of the most commonly used SQL databases, is open source (free). Another great feature of SQL is that it is a non-procedural language (explained in the next section).
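The operations listed above (creating a table, inserting, updating and retrieving records) can be tried without installing a database server. The sketch below uses Python's built-in `sqlite3` module with an in-memory SQLite database as a stand-in for MySQL. The `employees` table and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Create a table (hypothetical schema for illustration).
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)

# Insert records.
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Asha", "Engineering"), ("Ravi", "Analytics"), ("Meera", "Engineering")],
)

# Update a record.
conn.execute("UPDATE employees SET dept = ? WHERE name = ?", ("Analytics", "Asha"))

# Retrieve data: the query states WHAT is wanted, not HOW to find it.
rows = conn.execute(
    "SELECT name FROM employees WHERE dept = ? ORDER BY name", ("Analytics",)
).fetchall()
print(rows)

conn.close()
```

The `SELECT` statement only describes the desired result, and the database decides how to scan and sort the rows, which is what "non-procedural" refers to.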

Spark — How to install in 5 Steps in Windows 10

An easy-to-follow guide for installing Spark on Windows 10.

1. Prerequisites

Hardware Requirement
* RAM: Min. 8GB; if you have an SSD in your system, then 4GB RAM would also work.
* CPU: Min. quad-core, with at least 1.80GHz

JRE 1.8: Offline installer for JRE

Java Development Kit: 1.8

Software for unzipping, like 7-Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version supported by your system (x86 or x64) for all the software.

Hadoop
* I am using Hadoop-2.9.2; you can also use any other STABLE version of Hadoop.
* If you don't have Hadoop, you can install it by following Hadoop: How to install in 5 Steps in Windows 10.

MySQL Query Browser

Download Spark Zip
* I am using Spark 3.1.1; you can also use any other STABLE version of Spark.
* The latest release of Spark is 3.1.2 (shown in the image below), released in June 2021.

Fig 1:- Download Spark-...

SQOOP — How to install in 5 Steps in Windows 10

An easy-to-follow guide for installing SQOOP on Windows 10.

1. Prerequisites

Hardware Requirement
* RAM: Min. 8GB; if you have an SSD in your system, then 4GB RAM would also work.
* CPU: Min. quad-core, with at least 1.80GHz

JRE 1.8: Offline installer for JRE

Java Development Kit: 1.8

Software for unzipping, like 7-Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version supported by your system (x86 or x64) for all the software.

Hadoop
* I am using Hadoop-2.9.2; you can also use any other STABLE version of Hadoop.
* If you don't have Hadoop, you can install it by following Hadoop: How to install in 5 Steps in Windows 10.

MySQL Query Browser

Download SQOOP zip
* I am using SQOOP-1.4.7; you can also use any other STABLE version of SQOOP.

Fig 1:- Download Sqoop 1.4.7

Hive — How to install in 5 Steps in Windows 10

An easy-to-follow guide for installing Hive on Windows 10.

1. Prerequisites

Hardware Requirement
* RAM: Min. 8GB; if you have an SSD in your system, then 4GB RAM would also work.
* CPU: Min. quad-core, with at least 1.80GHz

JRE 1.8: Offline installer for JRE

Java Development Kit: 1.8

Software for unzipping, like 7-Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version supported by your system (x86 or x64) for all the software.

Hadoop
* I am using Hadoop-2.9.2; you can also use any other STABLE version of Hadoop.
* If you don't have Hadoop, you can install it by following Hadoop: How to install in 5 Steps in Windows 10.

MySQL Query Browser

Download Hive zip
* I am using Hive-3.1.2; you can also use any other STABLE version of Hive.

Fig 1:- Download Hive-3.1.2

PIG: How to install in 5 Steps in Windows 10

An easy-to-follow guide for installing PIG on Windows 10.

1. Prerequisites

Hardware Requirement
* RAM: Min. 8GB; if you have an SSD in your system, then 4GB RAM would also work.
* CPU: Min. quad-core, with at least 1.80GHz

JRE 1.8: Offline installer for JRE

Java Development Kit: 1.8

Software for unzipping, like 7-Zip or WinRAR
* I will be using 64-bit Windows for the process; please check and download the version supported by your system (x86 or x64) for all the software.

Hadoop
* I am using Hadoop-2.9.2; you can also use any other STABLE version of Hadoop.
* If you don't have Hadoop, you can install it by following Hadoop: How to install in 5 Steps in Windows 10.

MySQL Query Browser

Download PIG zip
* I am using PIG-0.17.0; you can also use any other STABLE version of Apache Pig.

Fig 1:- Download PIG-0.17.0