Partitioning in Hive

What is Partitioning?

In simple words, Partitioning is the process of dividing something into sections or parts to make it easier to understand and manage.

We use this concept in our everyday routine too, to simplify tasks and save time. But we do it so instinctively that we hardly notice how we did it.

Let's see an example and get familiar with the concept.

Suppose we have a deck of cards and need to find the "Jack of Spades". There are two ways in which we can accomplish this task.

We can turn over the cards one by one, starting from the top or the bottom, until we reach our card.
We can group the deck by suit, i.e. clubs, hearts, spades, diamonds. Now, as soon as we hear "Spades", we know which group to look in, which cuts the cards we have to search to about a quarter of the deck.

Grouping our data by a specific category reduces our work and saves energy, time and effort.

Defining in Technical Terms

Partitioning means dividing our data into simple logical parts based on the value of one or more partitioning keys. It eliminates the headache of creating, managing and accessing smaller tables separately: Hive splits the table's data into sub-directories in the backend, but for the user it remains a single table through which we can access and manage the data.
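As a minimal sketch of this definition, the HiveQL below declares a table partitioned on one key. The table and column names (cards, card_rank, suit) are hypothetical, borrowed from the deck example, and the directory layout shown in the comments assumes Hive's usual key=value naming under the table's location, which depends on your installation.

```sql
-- Hypothetical table for illustration: one logical table, partitioned by suit.
CREATE TABLE IF NOT EXISTS cards (
  card_rank STRING            -- e.g. 'Jack', 'Queen', '7'
)
PARTITIONED BY (suit STRING)  -- partition key: kept in the directory name, not in the data files
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Each partition becomes a sub-directory under the table's directory, e.g.
--   <table location>/cards/suit=clubs/
--   <table location>/cards/suit=spades/

-- Users still query a single table; suit behaves like an ordinary column.
SELECT card_rank, suit FROM cards;
```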

Importance of Partitioning

Big Data is the term used for the technologies that handle humongous amounts of data. This data is generated by us daily through social media, online searches, e-shopping, CCTV, or other means, and can run into petabytes. To store and manage these high volumes of data, HDFS is used. At this scale, it becomes hard to query or perform any action over the data.

So, Hive was introduced as a querying tool for Big Data with its own query language, Hive Query Language (HQL), which breaks a SQL-like query down into MapReduce tasks. But with very high volumes of data, MapReduce jobs that scan the whole dataset also become too slow, so the concept of breaking data down into smaller parts, called Partitioning, was introduced.

Enough of technical explanation. Let's explain it in simpler terms.

In the card example above, the first approach takes considerably more time than the second.

Thus, with the simple act of dividing the deck by suit, we were able to find the card faster. Now, imagine the same scenario with petabytes of data: querying the required data would take much longer. So, we use partitioning to divide the data and query it faster.

In Bullets, the importance of Partitioning is:-

Reduced time
Better performance

Types of Partitioning

Partitioning is of two types:-

Static Partitioning
Dynamic Partitioning

Static Partitioning

It means we "statically" create/add the partitions on our own and insert data into each of them individually.
This method is preferred over "Dynamic Partitioning" for loading large data, as it saves load time.
It also allows us to alter the partitions, and we can get the partition column value from the filename without reading the whole file.
To limit the data that goes into a partition, we can use the 'where' clause.
This method can be used for both Hive managed tables and external tables.
It is done in "Strict Mode", which can be set in hive-site.xml or, for the current session, with the command below:

set hive.exec.dynamic.partition.mode=strict;
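The points above can be sketched in HiveQL roughly as follows. This is an illustration under assumptions, not a definitive recipe: the tables cards and cards_staging, their columns, and the file path /tmp/spades.csv are all hypothetical.

```sql
-- Hypothetical partitioned table (names are illustrative only).
CREATE TABLE IF NOT EXISTS cards (card_rank STRING)
PARTITIONED BY (suit STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- 1. Create a partition ourselves, before any data exists for it.
ALTER TABLE cards ADD IF NOT EXISTS PARTITION (suit = 'hearts');

-- 2. Load a file straight into a named partition.
--    The path is a placeholder; the file is assumed to hold only card_rank values,
--    because the suit comes from the PARTITION clause, not from the file.
LOAD DATA LOCAL INPATH '/tmp/spades.csv'
INTO TABLE cards PARTITION (suit = 'spades');

-- 3. Or insert from a staging table, using WHERE to pick this partition's rows.
--    cards_staging(card_rank STRING, suit STRING) is assumed to exist already.
INSERT INTO TABLE cards PARTITION (suit = 'clubs')
SELECT card_rank FROM cards_staging WHERE suit = 'clubs';

-- Partitions we manage by hand can also be dropped by hand.
ALTER TABLE cards DROP IF EXISTS PARTITION (suit = 'hearts');
```

In every statement the partition value is spelled out as a literal, which is what makes the partitioning "static".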

Dynamic Partitioning

In this method, data is loaded from a non-partitioned table into a partitioned table.
Since data has to be loaded from a single table and partitioned at run time, it takes more time to load.
It is preferred when we have huge data stored in a table and don't know in advance how many partitions to create. Since Hive creates the partitions for us at load time, we do not define or alter them one by one as we do in Static Partitioning.
This method can be used for both Hive managed tables and external tables.
It is done in "Non-Strict Mode", which can be set in hive-site.xml or, for the current session, with the commands below:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

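Unlike Static Partitioning, we do not need a 'where' clause to route rows into partitions: Hive places each row according to the value of its partition column. A 'where' clause in the select can still be used to filter which rows are loaded.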
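A minimal sketch of a dynamic partition load is shown below. The table and column names are hypothetical. The two session properties are the ones Hive uses to enable dynamic partition inserts and to allow all partition columns to be dynamic.

```sql
-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition=true;
-- nonstrict: every partition column may be resolved at run time.
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Hypothetical non-partitioned source table.
CREATE TABLE IF NOT EXISTS cards_staging (card_rank STRING, suit STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical partitioned target table.
CREATE TABLE IF NOT EXISTS cards (card_rank STRING)
PARTITIONED BY (suit STRING);

-- No literal value is given for suit: Hive creates one partition per
-- distinct suit found in the SELECT output.
INSERT INTO TABLE cards PARTITION (suit)
SELECT card_rank, suit   -- the partition column must come last
FROM cards_staging;

-- List the partitions Hive created.
SHOW PARTITIONS cards;
```

Because the partition values come from the data itself, one statement can populate any number of partitions, at the cost of extra work during the load.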

Advantages

Partitioning distributes the load horizontally.
Query execution is faster because less data has to be scanned.
Only the relevant sub-directory needs to be searched for an entry, not the entire dataset.
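To illustrate the last point, the sketch below contrasts a query that filters on the partition column with one that does not. The cards table is hypothetical, and EXPLAIN DEPENDENCY is used on the assumption that your Hive version supports it.

```sql
-- Hypothetical table partitioned by suit.
CREATE TABLE IF NOT EXISTS cards (card_rank STRING)
PARTITIONED BY (suit STRING);

-- Filter on the partition column: only the suit=spades sub-directory is read.
SELECT card_rank
FROM cards
WHERE suit = 'spades' AND card_rank = 'Jack';

-- No filter on the partition column: every partition has to be scanned.
SELECT suit
FROM cards
WHERE card_rank = 'Jack';

-- Lists the tables and partitions a query would read, without running it.
EXPLAIN DEPENDENCY
SELECT card_rank FROM cards WHERE suit = 'spades';
```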

Disadvantages

There is a possibility of creating too many sub-directories, especially when the partition key has many distinct values.
Not well suited to queries, such as a group by, that need data from all partitions. Eg:- we have created partitions by month, but require data aggregated by year.
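Both drawbacks can be sketched with a hypothetical sales table partitioned by month. The table, its columns, and the limit values are illustrative assumptions, not recommendations.

```sql
-- Hypothetical table: one partition per month, e.g. sale_month = '2023-01'.
CREATE TABLE IF NOT EXISTS sales (amount DOUBLE)
PARTITIONED BY (sale_month STRING);

-- Yearly totals: no single partition can answer this,
-- so every monthly sub-directory is read.
SELECT substr(sale_month, 1, 4) AS sale_year,
       SUM(amount)              AS total_amount
FROM sales
GROUP BY substr(sale_month, 1, 4);

-- See how many partitions (sub-directories) the table has accumulated.
SHOW PARTITIONS sales;

-- Hive caps how many partitions one dynamic insert may create;
-- the values below are examples only.
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;
```

Choosing a partition key that matches the most common filters, and that has a modest number of distinct values, helps avoid both problems.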

Enough of theory. Next, we will see the code behind it.

In the meantime, let's install Hadoop and Hive on our local system for practice.
