Many of the datatypes you find in Relational Databases can be found in Hive as well. They are called “Primitive Datatypes“. Along with these , Hive also has “Collection Datatypes” which we will discuss later in this blog. Lets check all the Hive Data Types one by one .
HIVE Primitive Data Types
The different categories of Primitive Data Types are as follows
- Numeric Data Types
- String Data Types
- Date/Time Data Types
- Miscellaneous Data Types
Numeric Data Types
Different Numeric Data Types supported in Hive are
- TINYINT (1-byte signed integer, from -128 to 127)
- SMALLINT (2-byte signed integer, from
- INT/INTEGER (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
- BIGINT (8-byte signed integer, from
- FLOAT (4-byte single precision floating point number)
- DOUBLE (8-byte double precision floating point number)
- DOUBLE PRECISION (This is an alias of Double)
- DECIMAL (precision ,scale)
- NUMERIC (same as DECIMAL from Hive 3.0)
Points to keep in mind for Numeric Datatype:
- When we provide a number , Hive by default considers it to be of Integer Data Type.
- When the number is bigger than Integer Range, Hive automatically considers it as BigInt.
- If we want to specify that the number is of different Type then we need to assign Post Fix.
- To specify that its TINYINT , post fix the Integer with Y. Ex 20Y.
- To specify that its SMALLINT , post fix the Integer with S. Ex 20S.
- To specify that its BIGINT , post fix the Integer with L. Ex 20L.
- The default datatype of DECIMAL is (10,0) when precision and scale are not defined. And max precision allowed is 38.
String Data Types
Hive supports 3 types of String Datatypes CHAR ,VARCHAR ,STRING.
- CHAR – Similar to other programming languages this is of fixed length. If you define char(10) and the input value is of 6 characters then the remaining 4 will be filled with spaces.
- VARCHAR – This works similar to CHAR but instead of fixed length this takes variable length. If you define varchar(10) and the input value is of 6 characters then only 6 bytes will be used and no additional space will be blocked. If the string length is greater than the precision, then the value is truncated and displayed.
- STRING – Strings are expressed with single quotes ( ‘ ‘ ) or double quotes ( ” ” ). The maximum length of String is debatable but i found a good answer in StackOverflow for this which suggest it could be 2 GB.
Date/Time Data Types
Hive supports 3 types TIMESTAMP , DATE and INTERVAL.
- TIMESTAMP – Supports UNIX timestamp with optional nanosecond precision. ex: 2020-011-21 08:46:05.296
- If input is of type Integer ,it is interpreted as UNIX timestamp in seconds
- If input is of type Floating Point ,it is interpreted as UNIX timestamp in seconds with decimal precision
- I f input is of type String, it follows java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” (9 decimal place precision)
- DATE – It specifies the date in YEAR / MONTH /DATE format as YYYY-MM-DD. It does not store any time value.
Note that you can convert Timestamp and String to Date. And also Date to Timestamp and String.
Miscellaneous Data Types
Hive supports 2 miscellaneous data types Boolean and Binary.
- Boolean – Accepts TRUE or FALSE.
- Binary – This stores array of bytes.
HIVE Complex Data Types
Hive supports 3 types of Complex Data Types STRUCT , MAP and ARRAY. They are also know as collection or nested datatypes. They can store multiple values in a single row/column . These data types are not supported by most of the relation databases.
Just like other programming languages it is a collection of elements of similar data type. The elements are maintained in an index , you can retrieve the value like column_name[index_num].
Lets say we have a file with 2 columns, Name and Score. The Score column contains values separated by $
Now lets create a table and display the first score of each player.
create table game ( name string , score array<int> ) row format delimited fields terminated by ',' collection items terminated by '$'; select name,score as first from game; name,first Smith,23 Kevin,45
You see that we were able to fetch the first score by passing the index to column name. score. Also note that we have to specify the separator in Array column using collection items terminated by.
This is a collection of named fields where each field can be of any primitive datatype. Fields in Struct can be accessed using Dot(.) operator. Column_name.Field_name
Lets say the data is as below. You can see that the first column contains the Player name, and next column contains 2 information Game name and Score.
In a single column we have 2 different types of data which needs to be stored using different Datatypes. In such case we use Struct Datatype.
create table game ( name string , gamescore struct<game_name:string,score:int> ) row format delimited fields terminated by ',' collection items terminated by '$'; select name,gamescore.game_name as game_name,gamescore.score as score from game; name,game_name,score Smith,Carrom,3 Kevin,Chess,6
Just like Array we have to specify the separator in STRUCT column using collection items terminated by
It is a collection of Key Value pairs. Here the values can be accessed by providing the column_name[keys].
Lets say we have data as below. The first column is Name field, and second column contains key value pairs of <subject:marks>
When we need to store the data with such key value pairs, we can use Map Data type.
create table marks ( name string , mark map<string,int> ) row format delimited fields terminated by ',' collection items terminated by '$' map keys terminated by ':'; select name,mark[Physics] as Physics_Mark from marks; name,Physics_Mark Smith,37 Kevin,87
Just like Array and STRUCTS we have to specify the separator in MAP column using collection items terminated by and also map keys terminated by.
In the next chapter we will learn about