Using Hive Partition you can divide a table horizontally into multiple sections. This division happens based on a partition key which is just a column in your Hive table. Through out this lesson we will understand various aspects of Hive Partition.
Table of Contents
why use Partition in Hive
To understand this first lets look at a scenario. Lets say there is a multinational bank name ABC_BANK which spans across multiple countries. Now we have a table which contains information of new customers named ‘new_cust‘. Lets say you want to find out count of new customers from ‘USA’ . Hive does this as below
- Search across all the countries and filter records of ‘USA’
- Count the number of new customers from ‘USA’
This will give the correct output but can we optimize this so that Hive fetches record faster. What if Hive already new where records belonging to USA is present so it didn’t have to go through all the countries records. This can be achieved using Hive Partition.
Lets convert the country column present in ‘new_cust’ table into a Hive partition column. What this would do is it will create a partition [which is basically a folder] for each country and move its related data into it. So next time when we run the query to fetch new customer from USA or any other country, Hive would know that it needs to look inside that particular partition/folder and fetch the relevant data, Hence reducing the overall time spent and improving the performance.
It may be hard to understand this, but in later part of this lesson I will show you exactly what happens when you create a partition on a table with screen shot so that you can visualize better.
how to create partition in hive table
You can create partition on a Hive table using Partitioned By clause. The columns can be partitioned on an existing table or while creating a new Hive table. Both internal/managed and external table supports column partition. The basic syntax to partition is as below
create [external ]table tbl_nm (col1 datatyape , col2 datatype ..) Partitioned By (coln datatype);
create partition on hive managed table
Hive by default created managed/internal tables and we can create the partitions while creating the table. You can read more about Hive managed table here. When a table is created internally a folder is created in HDFS with the same name , inside which we store all the data, When you create partition columns Hive created more folders inside the parent table folder and then stores the data . This could be confusing so lets check an example.
Lets create a table named int_test, which contains customer id and customer name and state from which the customer belongs. And we will create a partition column ‘Country’.
create table int_test ( cust_id int, cust_nm string, state string ) partitioned by (country string) //HDFS /user/hive/warehouse/int_test alter table int_test add if not exists partition(country = 'USA') alter table int_test add if not exists partition(country = 'INDIA') //HDFS /user/hive/warehouse/int_test /user/hive/warehouse/int_test/country=INDIA /user/hive/warehouse/int_test/country=USA
You see above that when you create a table a folder names ‘int_test’ got created in HDFS. Next when you add partitions [USA , INDIA] those become new folders created inside the table folder [int_test]. When you insert data the data will reside in their respective partition.
Also note that you can create partition on multiple column, like you can create partition on Country and State and
create partition on hive external table
You can create partition on Hive External table same as we did for Internal Tables. The only difference is when you drop a partition on internal table the data gets dropped as well, but when you drop a partition on external table the data remains as is. You can learn more about Hive External Table here.
types of Hive Partition
Till now we have learned how to create partition on Hive table and now we will learn the different types of Hive Partition. Basically there are two types Static Partition and Dynamic Partition.
hive static partition
In Hive static Partition we manually specify the partition in which the data needs to be inserted. Before inserting you need to set the property ‘set hive.mapred.mode = strict‘ . Also as the entire data gets inserted at one go hence this is way faster than dynamic partition.
Lets insert data into int_test table which we had created earlier and load the data into country ‘CANADA’.
insert into int_test partition(country = 'CANADA') Select 111 , 'Mark Anthony' , 'Ottawa' union all Select 112 , 'Nancy Hugh' , 'Ottawa' //HDFS / user/ hive/ warehouse/ int_test/ country=CANADA/ 000000_0 //Data 111,Mark Anthony,Ottawa 112,Nancy Hugh,Ottawa
Here you see that a Partition named ‘Canada’ is created and data is inserted into this . You can also create another Partition ‘Norway’ and insert data into it as well. One thing you notice is that we didn’t have to specify the Partition column in the Select.
In this method the insertion is fast as we are dumping the entire data, but the process is slow as you can insert data into 1 partition each time.
Note: When you use Insert Into the is added into any existing data in the partition. But when you use Insert Overwrite you delete the existing data in the partition and insert the new data.
hive dynamic partition
Using Hive Dynamic Partition you can create and insert data into multiple Partitions. You don’t have to specify the Partition names before hand, you just need to specify the column which acts as the partition and Hive will create a partition for each unique value in the column.
The configuration you need to enable is
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Lets check an example
insert into int_test partition(country) Select 111 , 'Mark Anthony' , 'Ottawa', 'Canada' union all Select 112 , 'Nancy Hugh' , 'California' , 'USA' union all Select 113 , 'Will Doughlas' , 'Michigan' , 'USA' union all Select 114 , 'Balwas Singh' , 'Delhi' , 'INDIA' //HDFS /user/hive/warehouse/int_test/country=Canada /user/hive/warehouse/int_test/country=INDIA /user/hive/warehouse/int_test/country=USA
In the above example 3 partitions got created dynamically. This is useful when you you want to insert a data containing multiple partitions into a table in one go. Of course this will be slower than static partition as the compiler needs to figure out where each row belongs. One more difference is , unlike Static Partition we have to mention the partition column value in the select statement.
add , rename & drop Hive Partition
You can add ,rename and drop a Hive Partition in an existing table. This gives us the flexibility to make changes to the table without dropping and creating and loading the table again.
Add Partition
In the table Int_Test we already have couple of country partitions. What if we want to add some more country partitions manually ex:- Dubai and Nepal. The syntax is as below
alter table tbl_nm [add if not exists] partition( col_nm =’value’ , …..) location ‘loc’
alter table int_test add if not exists partition(country = 'Dubai') alter table int_test add if not exists partition(country = 'Nepal')
Rename Partition
Using this you can rename an existing Hive Partition value. Lets see an example where we change the Partition Value USA to United States of America. Below is the syntax to rename a Hive Partition
alter table table_name PARTITION (col = ‘value’) RENAME TO PARTITION (col = ‘new_value’);
alter table int_test partition(country = 'USA') rename to partition(country = 'United States of America')
Drop Partition
Dropping Hive Partition is pretty straight forward just remember that when you drop partition of an internal table then the data is deleted but when you drop from an external table the data remains as it is in the external location. The syntax is as below
alter table tbl_nm drop if exists partition (col = ‘value’ , …..)
alter table int_test drop if exists partition (country = 'Canada')
Cons of using Hive Partition
Hive partition is a very powerful feature but like every feature we should know when to use and when to avoid.
- If all the queries we are running is on the complete data set then there is not point in partitioning the data as every time we will process all the records.
- Partition columns should be picked for the column which is frequently used in where clause .
- The column we choose to partition should have more number of unique data. If for example instead of using Country column to partition we partition on Customer column , then thousands of partitions will be created which will be a pain for metastore and also for query processing.
Conclusion
So today we learnt
- why we should have partitioned column in Hive Table
- how to create partition column in Hive Internal and External Table
- types of hive partitions. STATIC & DYNAMIC
- how to add, rename and drop partition columns
:) kudos for learning something new 🙂