Skip to main content

Internal VS External


Introduction

Hive may not be the first term that pops in our mind when we talk about Hadoop & Big Data, but it is definitely a term, a tool, a tech that everyone discusses as they proceed in their Big Data journey & never parts from it. 

In simple words, if we want to explain What is Hive to any new data science enthusiast, in a single line we can say "Hive is the SQL for big data".  Why? Because it is used to manage huge structured data, query and analyse that data & it sits on top of Hadoop. 

Hive is a data warehouse infrastructure, that is used to process, query and analyse the structured data in Hadoop. Structured data, the data having definite structure, i.e. table format. It is designed similar to SQL, similar interface, similar queries. The difference is that like other Hadoop techs in the ecosystem implements Map-Reduce to perform the required task, similarly, traditional SQL queries in the hive, known as HQL(Hive Query Language), implements Map-Reduce jobs to fetch/analyse the data stored across the Hadoop cluster.

Unlike SQL, when creating & storing data in the hive, it provides us with two options for creating a table namely:- Internal Table & External Table. So, let's dive deeper and look at what they are and what differentiates them. 

 

Internal Table

The internal table also referred to as Managed Table, is the default table created by the hive, whenever we create a table. It is called Internal Table because these tables are stored, by default inside the warehouse directory of HDFS. Default HDFS warehouse directory is /user/hive/warehouse.

This default warehouse location can be changed by setting the desired path in hive-site.xml under the hive.metastore.warehouse.dir property. 

It is also known as "Managed Table" because these tables are managed by the hive itself. This simply means, Hive not only manages the data of these tables but also the metadata of these tables is managed by Hive. Thus, any external Hadoop utility or outside hive is required to access these tables, it won't be possible and thus the need for External Table. 

External Table

To create an External table we need to specify it during the create statement using the "EXTERNAL" keyword. As the name suggests these tables are not created inside the Hive and can be easily accessed outside Hive by any Hadoop utility like Pig etc. 

Being an external table, the hive does not have control over it and can be stored at a location that can be accessed by any other utilities. Still, Hive owns the metadata of the table and manages it. 

Let's move ahead and see the differences between the two types of Hive tables.

Internal VS External





We have summarized the differences between Hive Internal & External table in the image above. 

Let's have a detailed overview

1. Data Loading:-  


As already discussed, the first and foremost difference between the two is storage. In the case of the Internal Table, Both the data & table are stored inside the Hive warehouse and managed by it. Whereas in the case of the External table data is NOT loaded in the Hive warehouse. 

Practical:- 


a. Creating Internal Table 

CREATE TABLE QDSinternalTable(id INT, Name STRING, age INT, city STRING, salary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; 

Creating Internal Table
Creating Internal Table


b. Loading Data in Internal Table

LOAD DATA LOCAL INPATH 'D:/Shashank/Study/hive/customershive.txt' INTO TABLE QDSinternalTable;

Loading Data into Internal Table
Loading Data into Internal Table

c. Verifying Data

Once, the data is loaded we can verify its path from HDFS.

Data in Internal table
Data in Internal table

 Data can also be verified using the below query. 

SELECT * FROM QDSinternalTable;

Selecting data from internal table
Selecting data from internal table



We can notice here, we haven't specified any path in the HDFS directory for loading or creating tables, still, data is loaded and created in the HDFS warehouse directory.

d. Describing Table

DESCRIBE FORMATTED QDSinternalTable;

Describing Internal Table
Describing Internal Table


We can see Table Type for Internal Table is Managed Table. 

Now, before proceeding for External Table, we would like to Load data in an external HDFS directory

hdfs dfs -put D:/Shashank/Study/hive/customershive.txt /QDS_test

Loading Data in HDFS
Loading Data in HDFS


Data in HDFS
Data in HDFS



e. Creating External Table 

CREATE EXTERNAL TABLE QDSexternalTable(id INT, Name STRING, age INT, city STRING, salary STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/QDS_test';

Creating External Table
Creating External Table


We have used "EXTERNAL" keyword data in CREATE statement for the external table. 

f. Describing External Table

DESCRIBE FORMATTED QDSexternalTable;

Describing External Table
Describing External Table

Table type here is EXTERNAL TABLE. 

g. Verifying Data

Since data is loaded in HDFS already, we need to check if we are getting the desired resultset from the SELECT statement.

SELECT * FROM QDSexternalTable;

Select external table
Select external table


2. Truncating the Table


The truncate command is used to delete the data from the table without disturbing the structure of the table. Since data of the EXTERNAL table is not managed by Hive, it is not possible to truncate the external table. Whereas, we can easily truncate the INTERNAL table.

TRUNCATE TABLE QDSexternalTable;

Truncate External Table
Truncate External Table


3. Dropping Table


When we drop the tables, i.e. Internal & External table, Hive drops the table and data for the Internal table but can drop only the table structure for the External table, data, in this case, remains safe,i.e. not dropped. 

a. Dropping Internal Table 

DROP TABLE QDSinternalTable;

Drop Internal Table
Drop Internal Table


Dropped Internal Table Data in HDFS
Dropped Internal Table Data in HDFS



b. Dropping External Table

DROP TABLE QDSexternalTable;

Drop External Table
Drop External Table

External Table data is not dropped
External Table data is not dropped


4. ACID 

The external table does not support ACID properties, that's because data is present in HDFS. And only Hive supports ACID properties, not HDFS. Thus, it makes external tables ineligible for ACID compliance. Whereas Internal tables happily comply with ACID properties. 

5. Caching

Query caching is a useful way to reduce the data processing time but unfortunately, this capability is only available for internal tables. We can not use caching for external table queries.

6. Accessibility & Management

In the case of the Internal Table, Hive has full control over the data & metadata of the table and does not allow any outside entity to access it. Whereas in the case of the External Table, Hive only owns the Metadata of the table and not the data, thus making the data to be accessible outside the Hive.

Summary

We have studied about two different types of tables supported by Hive, i.e. Internal & External tables.
We studied What they are, how to create them and the basic differences between them. We have seen how to use them and learned the practical aspects as well.  

So don't just read, practice along with us and let us know the issues and challenges if you face any. 

Until Then... This is Quick Data Science Team signing off. See you with another article. 


 

Comments

  1. I appreciate you taking the time and effort to share your knowledge. This material proved to be really efficient and beneficial to me. Thank you very much for providing this information. Continue to write your blog.

    Data Engineering Services 

    Machine Learning Solutions

    Data Analytics Solutions

    Data Modernization Services

    ReplyDelete
    Replies
    1. Thanks Oliver, loved to hear that you like our blog and a regular reader. Just checked your services and have to say they are quite good, looking forward to collaborate sometime soon.

      Delete
  2. I appreciate you taking the time and effort to share your knowledge. This material proved to be really efficient and beneficial to me. Thank you very much for providing this information. Continue to write your blog.

    Data Engineering Services 

    Machine Learning Solutions

    Data Analytics Solutions

    Data Modernization Services

    ReplyDelete
    Replies
    1. Thanks Aaron, loved to hear that you like our blog and a regular reader. Just checked your services and have to say they are quite good, looking forward to collaborate sometime soon.

      Delete

Post a Comment