We studied the theory part involved in Bucketing in Hive in our previous article. Time to get our hands dirty now.
We will be following the below pattern for the Coding part:-
Hope we have installed, and have Hadoop and Hive running.
As already discussed, there are two(2) ways to performing Bucketing. We will be discussing code for both in detail separately.
We will be using the "World Happiness" dataset for demonstrating the Bucketing.
1. Bucketing with Partitioning
A. First, we need to create a table and load data into it.
CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;
Creating Table in Hive |
B. Now we need to LOAD the entire dataset into our table.
LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;
Loading Data in Hive Table |
We can also verify the data from HDFS.
Verifying Hive Data Load in HDFS |
C. Next we need to create a table in which we can load our Partitioned & Bucketed data.
CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
PARTITIONED BY(<Partitioning_Column_Name DataType>)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;
Creating Hive with Partitioning & Bucketing |
D. Before performing actual Bucketing, we need to change few properties of Hive.
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode=nonstrict;
**Note :- Since, we are using the concept of Dynamic Partitioning, we need to set the properties for same.
Setting hive properties for Bucketing |
E. Loading Data into Bucketing Table.
The last step is to load the data into our table from the table we created in starting with full data.
INSERT OVERWRITE TABLE <Table Name>
PARTITION (<Partition Column>)
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;
**Note:- Partition Column has to be the last column in the SELECT columns.
Loading Data into Bucketing Table |
Now we can verify if the partitions & Buckets were created successfully or not.
HDFS directory showing partitions in hive table |
HDFS directory showing Buckets in hive table |
We can see here, we have first created partitions based on 'Age_Group' and then 10 buckets have been created inside each partition.
2. Bucketing without Partitioning
A. First, we need to create a table and load data into it.
CREATE TABLE IF NOT EXISTS <Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;
Creating Table in Hive |
B. Now we need to LOAD the entire dataset into our table.
LOAD DATA LOCAL INPATH <File Path>
INTO TABLE <Table Name>;
Loading Data in Hive Table |
C. Creating a table without the Partition by clause
CREATE TABLE IF NOT EXISTS <Partition & Bucketing Table Name>
(
<Column1 DataType>,
<Column2 DataType>,
<Column3 DataType>, ...
)
CLUSTERED BY (<Bucketing_Column_Name) SORTED BY (<Sorting_Column_Name) INTO <N Number of Buckets> BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY <Delimiter>
STORED AS <file format>;
Creating Hive with Bucketing |
D. Changing properties
set hive.enforce.bucketing = true;
Setting hive properties for Bucketing |
E. Loading Data into Bucketing table
The last step is to load the data into our table from the table we created in starting with full data.
INSERT OVERWRITE TABLE <Table Name>
SELECT loColumn1 ,Column2, Column3,...,Partition Column
FROM <Base Table>;
HDFS directory showing Buckets in hive table |
Summary
We learned the code involved behind the bucketing in Hive. We saw different ways and operations that we can perform over the bucketing.
Its right said:- "Practice Makes a Man Perfect"
So don't just read, practice along with us and let us know the issues and challenges if you face any.
Until Then... This is Quick Data Science Team signing off. See you with another article.
Comments
Post a Comment