QuickDataScience | Quick & Easy Data Science


Showing posts with the label Bucketing in Hive

Bucketing in HIVE - Learning by Doing

We covered the theory behind Bucketing in Hive in our previous article. Time to get our hands dirty now. We will follow the pattern below for the coding part:

1. Hadoop Installation.
2. Hive Installation.

We assume both are installed and that Hadoop and Hive are running. As already discussed, there are two (2) ways to perform Bucketing. We will discuss the code for each separately and in detail, using the "World Happiness" dataset to demonstrate Bucketing.

1. Bucketing with Partitioning

A. First, we need to create a table to load the data into.

CREATE TABLE IF NOT EXISTS <Table Name> (
  <Column1 DataType>,
  <Column2 DataType>,
  <Column3 DataType>,
  ...
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS <file format>;

[Figure: Creating Table in Hive]

B. Now we need to LOAD the entire dataset into our table.

LOAD DATA LOCAL INPATH <File Path> INTO TABLE ...
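To make the template in steps A and B concrete, here is a minimal HiveQL sketch. The table name `happiness_raw`, the four columns, and the file path are all hypothetical placeholders chosen for illustration; they are not taken from the original post, and the column list would need to match the actual CSV being loaded.

```sql
-- Step A (sketch): a plain staging table for comma-separated text.
-- Table name and columns are assumed for illustration only.
CREATE TABLE IF NOT EXISTS happiness_raw (
  country         STRING,
  region          STRING,
  happiness_rank  INT,
  happiness_score DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Step B (sketch): copy a file from the local filesystem into the table.
-- The path is a placeholder; point it at your own copy of the dataset.
LOAD DATA LOCAL INPATH '/path/to/world_happiness.csv'
INTO TABLE happiness_raw;

-- Quick sanity check that rows arrived.
SELECT * FROM happiness_raw LIMIT 5;
```

`TEXTFILE` is used here because `LOAD DATA` only moves the file into the table's directory without converting it, so the table's storage format has to match the file as it exists on disk.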

Bucketing in Hive

Today, we face the core problem of Big Data: a huge amount of data is generated every second. Storing it is handled by various SQL, NoSQL and now NewSQL databases. But a problem remains: if we store the data exactly as it is generated, querying such a large volume becomes difficult. Thus, there was a need for a technique that splits the data at the time of storing, making the data not only faster and easier to access but also easier to store. Hive was introduced to address the storage and management of Big Data, and it provides concepts like Partitioning and Bucketing to make storing and querying huge datasets manageable.