Skip to main content

Bucketing in HIVE - Learning by Doing


< Previous  Bucketing in Hive


We studied the theory part involved in Bucketing in Hive in our previous article. Time to get our hands dirty now. 

We will be following the below pattern for the Coding part:- 

1. Hadoop Installation.

2. Hive Installation. 

Hope we have installed, and have Hadoop and Hive running.

As already discussed, there are two(2) ways to performing Bucketing. We will be discussing code for both in detail separately.  

We will be using the "World Happiness" dataset for demonstrating the Bucketing.

1. Bucketing with Partitioning

A. First, we need to create a table and load data into it.

CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;

Creating Table in Hive
Creating Table in Hive


B. Now we need to LOAD the entire dataset into our table. 


LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;

Loading Data in Hive Table
Loading Data in Hive Table


    We can also verify the data from HDFS.

Verifying Hive Data Load in HDFS
Verifying Hive Data Load in HDFS


C. Next we need to create a table in which we can load our Partitioned & Bucketed data.

CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
PARTITIONED BY(<Partitioning_Column_Name  DataType>)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;

Creating Hive with Partitioning & Bucketing
Creating Hive with Partitioning & Bucketing

D.   Before performing actual Bucketing, we need to change few properties of Hive. 

        set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;

**Note :- Since, we are using the concept of Dynamic Partitioning, we need to set the properties for same.

Setting hive properties for Bucketing
Setting hive properties for Bucketing


E. Loading Data into Bucketing Table. 

The last step is to load the data into our table from the table we created in starting with full data. 

INSERT OVERWRITE TABLE <Table Name>
PARTITION (<Partition Column>)
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;

**Note:- Partition Column has to be the last column in the SELECT columns.

Loading Data into Bucketing Table
Loading Data into Bucketing Table

Now we can verify if the partitions & Buckets were created successfully or not. 

HDFS directory showing partitions in hive table
HDFS directory showing partitions in hive table

HDFS directory showing Buckets in hive table
HDFS directory showing Buckets in hive table

We can see here, we have first created partitions based on 'Age_Group' and then 10 buckets have been created inside each partition.

2. Bucketing without Partitioning

A.  First, we need to create a table and load data into it.

CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;

Creating Table in Hive
Creating Table in Hive

B. Now we need to LOAD the entire dataset into our table. 

LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;

Loading Data in Hive Table
Loading Data in Hive Table

C. Creating a table without the Partition by clause

CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;

Creating Hive with  Bucketing
Creating Hive with  Bucketing

D. Changing properties

        set hive.enforce.bucketing = true;

Setting hive properties for Bucketing
Setting hive properties for Bucketing

E. Loading Data into Bucketing table

The last step is to load the data into our table from the table we created in starting with full data. 

INSERT OVERWRITE TABLE <Table Name>
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;

HDFS directory showing Buckets in hive table
HDFS directory showing Buckets in hive table

Summary

We learned the code involved behind the bucketing in Hive. We saw different ways and operations that we can perform over the bucketing. 

Its right said:-   "Practice Makes a Man Perfect" 

So don't just read, practice along with us and let us know the issues and challenges if you face any. 

Until Then... This is Quick Data Science Team signing off. See you with another article. 

Comments