Hive Insert Into vs Insert Overwrite

In Previous chapter we learned about HIVE SHOW PARTITION and today lets check out the difference between Hive Insert Into vs Insert Overwrite. We will also discuss the impact on both Hive Partitioned and Non-Partitioned tables in the blog below.

Simply put Insert Into command appends the rows in the existing table whereas Insert Overwrite as the name suggests overwrites the data in the table. . For instance, if the table has 2 rows and we INSERT INTO 3 rows then the table will have 5 rows in total. And if the table has 2 rows and we INSERT OVERWRITE 3 rows , then the old 2 rows are deleted and the new 3 rows are inserted. Therefore a total 3 rows will be left in the table.

Before discussing the difference further ,lets first understand the syntax and how to insert data into Hive tables. To know more about how to create Hive Internal or External tables you can visit this article.

Table of Contents

Different Methods to Insert Data into Hive Table

Let us assume we have a table named ‘dummy’ which has 2 records in it . We will first INSERT INTO the table 3 records and then we will INSERT OVERWRITE the table with 3 records in all the below methods.

Insert Using Load Command

We can use LOAD DATA INPATH command to load a file in HDFS location into a Hive Table. The complete syntax of this is as follows

LOAD DATA INPATH ‘hdfs_path’ INTO TABLE table_nm
LOAD DATA INPATH ‘hdfs_path’ OVERWRITE INTO TABLE table_nm

Here the ins.csv file has 3 records.

--Appends the data in Dummy Table  
LOAD DATA INPATH '/home/path/ins.csv' INTO TABLE Dummy;

--Overwrites the data in Dummy Table  
LOAD DATA INPATH '/home/path/ins.csv' OVERWRITE INTO TABLE Dummy;

Insert Using Values Clause

If you need to insert couple of records into the table this method would be the easiest to use. The syntax is as follows

INSERT INTO TABLE table_nm VALUES [val1,val2,..valn]
INSERT OVERWRITE TABLE table_nm VALUES [val1,val2,..valn]

This method would normally be used for test purpose or to quickly check your logic by inserting few records into the table.

--Appends the data in Dummy Table
INSERT INTO table dummy values("Riko","Ponting"),("Mark","Henry"),("Dwane","Hon")

--Overwrites the data in Dummy Table 
INSERT OVERWRITE table dummy values("Riko","Ponting"),("Mark","Henry"),("Dwane","Hon")

Insert Using Select Clause

This is the most common way you will be using to insert data into table. For instance, you fetch the data from one table and insert into another. The syntax is as follows

INSERT INTO TABLE dest_table Select * from src_table
INSERT OVERWRITE TABLE dest_table Select * from src_table

Here the src_table table has 3 records.

--Appends the data in Dummy Table
INSERT INTO TABLE dummy Select * from dummy_src 

--Overwrites the data in Dummy Table 
INSERT OVERWRITE TABLE dummy Select * from dummy_src

No matter which method we use, after we INSERT INTO the ‘dummy’ table, it will have 6 records in total . And when we use INSERT OVERWRITE the ‘dummy’ table will have 3 records.

Insert Into vs Insert Overwrite on Table with Partitions

Insert Into : Using this we insert the data in specific table partition. We don’t remove the existing records rather we only insert the new records. We do not touch the data of remaining partition.

Insert Overwrite: Using this we overwrite the data on the specific partition. In other words, We remove all the data in the specific partition and then insert new records. We do not touch the data of remaining partition.

Conclusion

So today we learnt

Different methods to insert data into Hive Table
Difference between Hive Insert Into vs Insert Overwrite on Partition and Non Partition Table

🙂 kudos for learning something new 🙂