QuickDataScience | Quick & Easy Data Science

Showing posts with the label partitioning

Hive Default Partition

Introduction

We have already studied two partitioning techniques in Hive: static and dynamic partitioning. In static partitioning we define the partitions manually, whereas in dynamic partitioning the partition values are assigned automatically from the values of a column. But real data is never pure. It always carries impurities: some values are missing, some are badly formatted, some are random, and so on. So when we create partitions, a question arises: where do the NULL values go? (I am assuming here that we have also created partitions for the wrong values.) Do we need to create a separate partition for them, and if so, what condition would select the NULL values? And with dynamic partitioning, where we do not control the partitions, how do we deal with NULL data?
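To make the question concrete, here is a minimal plain-Python sketch of what happens to a NULL during dynamic partitioning. It is not Hive code: the function `dynamic_partition` and the sample rows are hypothetical. The one Hive-specific detail is the partition name `__HIVE_DEFAULT_PARTITION__`, which is the default value of the `hive.exec.default.partition.name` setting and is where Hive places rows whose partition column is NULL or an empty string.

```python
from collections import defaultdict

# Default value of Hive's hive.exec.default.partition.name setting.
DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def dynamic_partition(rows, column):
    """Group rows by the value of `column` (hypothetical helper, not a Hive API).

    Rows whose value is missing (None) or an empty string are routed to
    the default partition instead of being dropped.
    """
    partitions = defaultdict(list)
    for row in rows:
        value = row.get(column)
        key = DEFAULT_PARTITION if value is None or value == "" else str(value)
        partitions[key].append(row)
    return dict(partitions)


# Made-up sample data: one row has no country.
rows = [
    {"name": "Asha", "country": "IN"},
    {"name": "Bob", "country": "US"},
    {"name": "Chen", "country": None},
    {"name": "Dana", "country": "IN"},
]

partitions = dynamic_partition(rows, "country")
for key in sorted(partitions):
    print(key, [r["name"] for r in partitions[key]])
```

The row with the missing country is not lost and does not need a hand-written condition: it simply lands in its own catch-all partition next to the regular ones.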

Partitioning in HIVE - Learning by Doing

We covered the theory behind partitioning in Hive in our previous article. Time to get our hands dirty now. The coding part follows this order:

1. Hadoop installation.
2. Hive installation.
3. Static partitioning (the theory is covered in the previous article).
4. Dynamic partitioning (the theory is covered in the previous article).

Hopefully Hadoop and Hive are already installed and running.

Partitioning in Hive

What is Partitioning?

In simple words, partitioning is the process of dividing something into sections or parts so that it becomes easier to understand and manage. We use this concept in our everyday routine too, to ease our tasks and save time, but we do it so instinctively that we hardly notice. Let's look at an example to get familiar with the concept. Suppose we have a deck of cards and need to fetch the "Jack of Spades" from it. There are two ways to accomplish this task:

1. We can turn over every card one by one, starting from the top or the bottom, until we reach our card.
2. We can first group the deck by suit, i.e. clubs, hearts, spades, diamonds. Now, as soon as we hear "Spades", we know which group to look in, which cuts the search to about a quarter of the deck.

Grouping the data by a specific category reduced our work and saved energy, time and effort. Defining in Technical Term...
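The card-deck example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming an unshuffled deck ordered by suit and then rank; the helper names `find_by_scan` and `partition_by_suit` are made up for this sketch.

```python
from itertools import product

SUITS = ["Clubs", "Diamonds", "Hearts", "Spades"]
RANKS = ["Ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "Jack", "Queen", "King"]

# An unshuffled 52-card deck: all Clubs first, then Diamonds, and so on.
deck = [(rank, suit) for suit, rank in product(SUITS, RANKS)]


def find_by_scan(cards, target):
    """Turn cards over one by one; return how many cards were inspected."""
    for inspected, card in enumerate(cards, start=1):
        if card == target:
            return inspected
    raise ValueError(f"{target} is not in the given cards")


def partition_by_suit(cards):
    """Group the cards into one pile per suit."""
    piles = {}
    for rank, suit in cards:
        piles.setdefault(suit, []).append((rank, suit))
    return piles


target = ("Jack", "Spades")

# Way 1: scan the whole deck.
full_scan = find_by_scan(deck, target)

# Way 2: pick the "Spades" pile first, then scan only that pile.
piles = partition_by_suit(deck)
pruned_scan = find_by_scan(piles["Spades"], target)

print("cards inspected without grouping:", full_scan)
print("cards inspected with grouping:   ", pruned_scan)
```

With a shuffled deck the exact counts would differ, but the grouped search never has to look at more than the 13 cards of one suit, while the ungrouped search may have to go through all 52.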