
Posts


Hive Default Partition

Introduction

We have already studied two different partitioning techniques in Hive: Dynamic and Static Partitioning. In Static Partitioning we define the partitions manually, whereas in Dynamic Partitioning the partition values are assigned automatically based on the values of a column.

If we analyse real data, however, we find it is never pure: some values are missing, some are not formatted properly, some are simply random noise. So when we create partitions, a problem arises: where should rows with NULL partition values go (assuming we also created partitions for the wrong values)? Do we need a separate partition for them, and if so, what would its condition be? And in Dynamic Partitioning, where we do not control the partitions directly, how do we deal with NULL data at all?
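As a rough illustration of the problem and of Hive's built-in answer to it, here is a minimal sketch. The table and column names (sales_staging, country, etc.) are assumptions for illustration only. When a dynamic-partition insert meets a NULL in the partition column, Hive places those rows into the special fallback partition named by the hive.exec.default.partition.name property (by default __HIVE_DEFAULT_PARTITION__):

-- Hypothetical staging table holding raw, possibly dirty data.
CREATE TABLE IF NOT EXISTS sales_staging (
  order_id INT,
  amount   DOUBLE,
  country  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Partitioned target table.
CREATE TABLE IF NOT EXISTS sales_partitioned (
  order_id INT,
  amount   DOUBLE
)
PARTITIONED BY (country STRING);

-- Enable dynamic partitioning.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Rows whose country is NULL end up in the default partition,
-- i.e. the directory country=__HIVE_DEFAULT_PARTITION__.
INSERT OVERWRITE TABLE sales_partitioned PARTITION (country)
SELECT order_id, amount, country FROM sales_staging;

The name used for this fallback partition can be changed through the hive.exec.default.partition.name property, which is what the rest of this post explores.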

Bucketing in HIVE - Learning by Doing

Bucketing in Hive

We studied the theory behind Bucketing in Hive in our previous article. Time to get our hands dirty now.

We will be following the below pattern for the coding part:
1. Hadoop Installation.
2. Hive Installation.

Hopefully Hadoop and Hive are installed and running. As already discussed, there are two (2) ways of performing Bucketing; we will discuss the code for both in detail separately. We will be using the "World Happiness" dataset for demonstrating Bucketing.

1. Bucketing with Partitioning

A. First, we need to create a table and load data into it.

CREATE TABLE IF NOT EXISTS <Table Name> (
  <Column1 DataType>,
  <Column2 DataType>,
  <Column3 DataType>,
  ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS <file format>;

Creating Table in Hive

B. Now we need to LOAD the entire dataset into our table.

LOAD DATA LOCAL INPATH <File Path> INTO TABLE <Table Name>;
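To make these templates concrete, here is a minimal sketch of bucketing combined with partitioning. The table names, column names, bucket count and file path are assumptions chosen for illustration; they are not necessarily the exact ones used with the World Happiness dataset in this post.

-- Hypothetical raw table holding the whole CSV (column names are assumed).
CREATE TABLE IF NOT EXISTS happiness_raw (
  country         STRING,
  region          STRING,
  happiness_rank  INT,
  happiness_score DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/world_happiness.csv' INTO TABLE happiness_raw;

-- Target table: partitioned by region, each partition split into 4 buckets on country.
CREATE TABLE IF NOT EXISTS happiness_bucketed (
  country         STRING,
  happiness_rank  INT,
  happiness_score DOUBLE
)
PARTITIONED BY (region STRING)
CLUSTERED BY (country) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Older Hive versions need bucketing enforced explicitly; dynamic partitioning
-- fills the region partitions from the SELECT itself.
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE happiness_bucketed PARTITION (region)
SELECT country, happiness_rank, happiness_score, region
FROM happiness_raw;

Each bucket becomes a separate file inside its partition directory, so queries that filter on the partition column and sample or join on the bucketing column only need to read a fraction of the data.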