Hive - Order By vs Sort By vs Distribute By vs Cluster By

Lets understand the difference with the help of examples. Lets create a table Department having Name and DeptId.

ORDER BY :
Defn: It guarantees global ordering, but the demerit is all data is pushed through into one reducer. This is an issue when dealing with large datasets . The output is a single sorted data. Therefore strict mode, hive makes it compulsory to use LIMIT with ORDER BY to reduce the burden from reducer.

Ordering: It guarantees global ordering.

set mapred.reduce.tasks=2;
ex: Select * from department order by deptid
Name    DeptId
abh      5
abv      10
poi      13
dec      15
pin      20

Output: The final output is completely sorted

SORT BY :
Defn: Data is sent to N reducers . Before sending the data they are sorted based on the sort column, but this does not guarantee that the final output from all the reducers will be a sorted one. The sort order is dependent on column type , which means if the column is numeric type then the sort order is numeric and if the column is string type then the sort order is lexicographical type.

Ordering: The data in each reducer can have over lapping ranges.

set mapred.reduce.tasks=2;
ex: Select * from department sort by deptid
Name    DeptId
poi      13 
dec      15
pin      20
abh      5
abv      10

Output: 5 records were sent to 2 reducers. Now deptid 5,15,20 was sorted in 1st reducer and deptid 10,13 were sorted in 2nd reducer. As you can see the data is ordered in each reducer but the over all odering is missing.

DISTRIBUTE BY :
Defn: It ensures each of N reducers gets non-overlapping ranges of x i.e same values in a distribute by column go to the same reducer, but doesn’t sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.

Ordering: Data is not ordered.

set mapred.reduce.tasks=2;
ex: Select * from department sort by deptid
Name    DeptId
dec      15
poi      13 
pin      20
abh      5
abv      20

Output: The output will not be sorted. 15,13 was sent to 1st reducer and deptid 20,5,20 was sent to 2nd reducer.

DISTRIBUTE BY + SORT BY:
We can use a combination of DISTRIBUTE BY + SORT BY. In this the data will first get distributed to reducers and then the data will be sorted in respective reducers.

ex: Select * from department distribute by deptid sort by name
Name    DeptId
poi      13 
dec      15
abh      5
abv      10
pin      13

Output: dec,poi,pin was sent to 1st reducer deptid 13,13,15 were sorted.Similarly abh,abv was sent to 2nd reducer and deptid 5,10 were sorted.

CLUSTER BY :
Defn: This is basically(DISTRIBUTE BY plus SORT BY) .It ensures each of N reducers gets non-overlapping ranges(DISTRIBUTE BY), then sorts(SORT BY) by those ranges at the reducers.

Ordering: You end up with N or more sorted files with non-overlapping ranges. This also does not guarantee global sorting.

set mapred.reduce.tasks=2;
ex: Select * from department cluster by deptid
Name    DeptId
abh      5
poi      13 
dec      15
abv      10
pin      13

Output: dept id was distributed 5,13,13 went to reducer 1 and 10,25 went to reducer 2. And then data was sorted on the same dept id column.

Note: the difference between [DISTRIBUTE BY + SORT BY] and CLUSTER BY is , in 1st case we can distribute by column A and sort by column B but in 2nd case the Distribute and Sort happens on a single column A.

Hive – Order By vs Sort By vs Distribute By vs Cluster By

Leave a ReplyCancel reply

Leave a ReplyCancel reply

Discover more from UnderstandingBigData