Hive Default Partition

Introduction

We have already studied two different partitioning techniques in Hive, Dynamic & Static Partitioning. If we look at these techniques, in Static Partitioning we define the partitions manually whereas in Dynamic Partitioning the values are assigned dynamically based on some column values.

But if we look and analyse the data, we know data is never pure, it always has some impurities in it like some of the data will be missing, some are not formatted properly, some are random, etc etc. Thus, in such cases when we create partitions a problem arises where to assign the NULL values (I am supposing that we created partitions for wrong values also), do we need to create a separate partition for it, then what would be the condition for NULL values, in case of dynamic partition, when we do not have control over the partitions how are we going to deal with NULL data.

Hive Default Partition

Believe me, the problems that I have stated above are not mere simple problems, a NULL value can destroy the complete analysis, models, and any such process performed over them. Thus, it was a big problem not just for us but even for the HIVE Designers.

To overcome this issue they devised a simple way to handle these NULL values, considering them to be unavoidable, they came up with the logic of creating a partition by default for NULL values and gave it the name of HIVE_DEFAULT_PARTITION.

In the case of Dynamic Partitioning, whenever Hive encounters a NULL value, then and there it creates a partition, HIVE_DEFAULT_PARTITION, where it assigns all the rows having NULL value for the partitioned column. Whereas in the case of Static Partitioning, the partitions are defined by us, so while segregating the data into these partitions, Hive keeps a check for the NULL values and creates a separate partition, HIVE_DEFAULT_PARTITION, for them.

Now, we might think if this functionality is offered by default or do we need to change any configuration to achieve the same. The answer is, this is offered by default in Hive and we do not change or touch any configuration.

So, without wasting any single minute, let's learn the practical approach for the same.

Practical

Let's check it first with dynamic partition.

1. Creating a simple Hive table

Firstly, we need to create simple hive data.

CREATE TABLE IF NOT EXISTS <Table Name> (<Column1 DataType>, <Column2 DataType>,<Column3 DataType>,... )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS <file format>;

Hive table in HDFS

CREATE TABLE IF NOT EXISTS QDS_default_partition
(prsnID String,prsnName String, prsnAge String, prsnCity String, prsnSal String)
Row format delimited
fields terminated by ','
stored as textfile;

2. Insert/Load the Raw Data.

LOAD DATA LOCAL INPATH <Raw Data File Path> INTO TABLE <Table Name>;

LOAD DATA LOCAL INPATH 'D:\Shashank\Study\hive\customershive.txt' INTO table QDS_default_partition;

We can verify the data using the Select statement(as below)

SELECT from the table

select * from QDS_default_partition;

3. Creating a Partitioned Table

CREATE TABLE IF NOT EXISTS <Table Name> (<Column1 DataType>, <Column2 DataType>,<Column3 DataType>,... )
PARTITIONED BY (<Partition Column DataType>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS <file format>;

Hive Dynamic table in HDFS

CREATE TABLE IF NOT EXISTS QDS_dynamic_default_partition
(prsnID String,prsnName String, prsnAge String, prsnSal String)
PARTITIONED BY (prsnCity String);

4. Changing Hive Setting

This is a very important step, if missed can cause the error and won't create dynamic partitions.

Here, we need to change the default "Strict" mode of Hive to "NonStrict" and set Dynamic partitioning to "True".

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;

Changing Mode in Hive

I have attached the property tag for the same present in hive-site.xml

<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>strict</value>
<description>
In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions.
In nonstrict mode all partitions are allowed to be dynamic.
</description>
</property>

5. Loading data in Partitioned table

Now, we simply need to load the data in our dynamic table and it will be partitioned on its own.

INSERT OVERWRITE TABLE <Partition Table Name>
PARTITION(<Partition Column>)
SELECT Column1 ,Column2, Column3,Partition Column
FROM <Raw Data Table>;

Hive Default Partition

We can clearly see that hive, as promised, has created a default partition for us. If we go ahead and see the data in it, it will be having 2 records of Dhoni and Raina, as they had NULL in cities.

Great..!!! we were able to create a default partition in the hive for NULL values.

Summary

We learned the code involved in creating default partitions in Hive.

Its right said:- "Practice Makes a Man Perfect"

So don't just read practice along with us and let us know the issues and challenges if you face any.

Until Then... This is Quick Data Science Team signing off. See you with another article.

QuickDataScience | Quick & Easy Data Science

Search This Blog

Hive Default Partition

Introduction

Hive Default Partition

Practical

1. Creating a simple Hive table

2. Insert/Load the Raw Data.

3. Creating a Partitioned Table

4. Changing Hive Setting

5. Loading data in Partitioned table

Summary

Labels

Comments

Post a Comment