Lets understand the difference with the help of examples. Lets create a table Department having Name and DeptId.
ORDER BY :
Defn: It guarantees global ordering, but the demerit is all data is pushed through into one reducer. This is an issue when dealing with large datasets . The output is a single sorted data. Therefore strict mode, hive makes it compulsory to use LIMIT with ORDER BY to reduce the burden from reducer.
Ordering: It guarantees global ordering.
set mapred.reduce.tasks=2; ex: Select * from department order by deptid Name DeptId abh 5 abv 10 poi 13 dec 15 pin 20
Output: The final output is completely sorted
SORT BY :
Defn: Data is sent to N reducers . Before sending the data they are sorted based on the sort column, but this does not guarantee that the final output from all the reducers will be a sorted one. The sort order is dependent on column type , which means if the column is numeric type then the sort order is numeric and if the column is string type then the sort order is lexicographical type.
Ordering: The data in each reducer can have over lapping ranges.
set mapred.reduce.tasks=2; ex: Select * from department sort by deptid Name DeptId poi 13 dec 15 pin 20 abh 5 abv 10
Output: 5 records were sent to 2 reducers. Now deptid 5,15,20 was sorted in 1st reducer and deptid 10,13 were sorted in 2nd reducer. As you can see the data is ordered in each reducer but the over all odering is missing.
DISTRIBUTE BY :
Defn: It ensures each of N reducers gets non-overlapping ranges of x i.e same values in a distribute by column go to the same reducer, but doesn’t sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges.
Ordering: Data is not ordered.
set mapred.reduce.tasks=2; ex: Select * from department sort by deptid Name DeptId dec 15 poi 13 pin 20 abh 5 abv 20
Output: The output will not be sorted. 15,13 was sent to 1st reducer and deptid 20,5,20 was sent to 2nd reducer.
DISTRIBUTE BY + SORT BY:
We can use a combination of DISTRIBUTE BY + SORT BY. In this the data will first get distributed to reducers and then the data will be sorted in respective reducers.
ex: Select * from department distribute by deptid sort by name Name DeptId poi 13 dec 15 abh 5 abv 10 pin 13
Output: dec,poi,pin was sent to 1st reducer deptid 13,13,15 were sorted.Similarly abh,abv was sent to 2nd reducer and deptid 5,10 were sorted.
CLUSTER BY :
Defn: This is basically(DISTRIBUTE BY plus SORT BY) .It ensures each of N reducers gets non-overlapping ranges(DISTRIBUTE BY), then sorts(SORT BY) by those ranges at the reducers.
Ordering: You end up with N or more sorted files with non-overlapping ranges. This also does not guarantee global sorting.
set mapred.reduce.tasks=2; ex: Select * from department cluster by deptid Name DeptId abh 5 poi 13 dec 15 abv 10 pin 13
Output: dept id was distributed 5,13,13 went to reducer 1 and 10,25 went to reducer 2. And then data was sorted on the same dept id column.
Note: the difference between [DISTRIBUTE BY + SORT BY] and CLUSTER BY is , in 1st case we can distribute by column A and sort by column B but in 2nd case the Distribute and Sort happens on a single column A.