Bucketing in HIVE - Learning by Doing

We studied the theory part involved in Bucketing in Hive in our previous article. Time to get our hands dirty now.

We will be following the below pattern for the Coding part:-

Hope we have installed, and have Hadoop and Hive running.

As already discussed, there are two(2) ways to performing Bucketing. We will be discussing code for both in detail separately.

We will be using the "World Happiness" dataset for demonstrating the Bucketing.

1. Bucketing with Partitioning

A. First, we need to create a table and load data into it.

CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;

Creating Table in Hive

B. Now we need to LOAD the entire dataset into our table.

LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;

Loading Data in Hive Table

We can also verify the data from HDFS.

Verifying Hive Data Load in HDFS

C. Next we need to create a table in which we can load our Partitioned & Bucketed data.

CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
PARTITIONED BY(<Partitioning_Column_Name DataType>)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;

Creating Hive with Partitioning & Bucketing

D. Before performing actual Bucketing, we need to change few properties of Hive.

set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;

**Note :- Since, we are using the concept of Dynamic Partitioning, we need to set the properties for same.

Setting hive properties for Bucketing

E. Loading Data into Bucketing Table.

The last step is to load the data into our table from the table we created in starting with full data.

INSERT OVERWRITE TABLE <Table Name>
PARTITION (<Partition Column>)
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;

**Note:- Partition Column has to be the last column in the SELECT columns.

Loading Data into Bucketing Table

Now we can verify if the partitions & Buckets were created successfully or not.

HDFS directory showing partitions in hive table

HDFS directory showing Buckets in hive table

We can see here, we have first created partitions based on 'Age_Group' and then 10 buckets have been created inside each partition.

2. Bucketing without Partitioning

A. First, we need to create a table and load data into it.

CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;

Creating Table in Hive

B. Now we need to LOAD the entire dataset into our table.

LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;

Loading Data in Hive Table

C. Creating a table without the Partition by clause

CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;

Creating Hive with Bucketing

D. Changing properties

set hive.enforce.bucketing = true;

Setting hive properties for Bucketing

E. Loading Data into Bucketing table

The last step is to load the data into our table from the table we created in starting with full data.

INSERT OVERWRITE TABLE <Table Name>
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;

HDFS directory showing Buckets in hive table

Summary

We learned the code involved behind the bucketing in Hive. We saw different ways and operations that we can perform over the bucketing.

Its right said:- "Practice Makes a Man Perfect"

So don't just read, practice along with us and let us know the issues and challenges if you face any.

Until Then... This is Quick Data Science Team signing off. See you with another article.

QuickDataScience | Quick & Easy Data Science

Search This Blog

Bucketing in HIVE - Learning by Doing

1. Bucketing with Partitioning

A. First, we need to create a table and load data into it.

B. Now we need to LOAD the entire dataset into our table.

C. Next we need to create a table in which we can load our Partitioned & Bucketed data.

D. Before performing actual Bucketing, we need to change few properties of Hive.

E. Loading Data into Bucketing Table.

2. Bucketing without Partitioning

A. First, we need to create a table and load data into it.

B. Now we need to LOAD the entire dataset into our table.

C. Creating a table without the Partition by clause

D. Changing properties

E. Loading Data into Bucketing table

Summary

Labels

Comments

Post a Comment