Both the Spark distinct() and dropDuplicates() functions help in removing duplicate records. One additional advantage of dropDuplicates() is that you can specify the columns to be used in the deduplication logic. We will see the use of both with a couple of examples.

Spark distinct() Function

The Spark distinct() function is by default applied on all the columns of the dataframe. If you need to apply it only on specific columns, you first need to select those columns. Let's check an example.

Create a dataframe with Name, Age and Height columns.

// In spark-shell the needed implicits are already in scope; in a standalone
// application, add `import spark.implicits._` before calling toDF.
val dfTN = Seq(("Smith",23,5.3),("Rashmi",27,5.8),("Smith",23,5.3),("Payal",27,5.8)).toDF("Name","Age","Height")
dfTN.show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
| Smith| 23|   5.3|
| Payal| 27|   5.8|
+------+---+------+

As you can see, the record (Smith, 23, 5.3) is repeated twice. Let's remove this duplicate using the Spark distinct() function.

dfTN.distinct.show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
| Payal| 27|   5.8|
|Rashmi| 27|   5.8|
+------+---+------+

You see that the duplicate record is now removed. But what if you wanted the distinct values of only the Age and Height columns and didn't care about the Name column? In that case we can first select the columns we need and then apply the distinct() function.

dfTN.select('Age,'Height).distinct.show()

+---+------+
|Age|Height|
+---+------+
| 23|   5.3|
| 27|   5.8|
+---+------+

You see that only 2 records were returned, with distinct Age and Height values. But what if you wanted the Name column as well? For example, for Age 27 and Height 5.8 we have 2 names, Rashmi and Payal, and we want to display any one of them. In such a case we can use the dropDuplicates() function.

Spark dropDuplicates() Function

The Spark dropDuplicates() function takes as arguments the column names on which the deduplication logic is to be applied. If no columns are passed, it works like the distinct() function.

Let's create the same dataframe as above and use dropDuplicates() on it.

val dfTN = Seq(("Smith",23,5.3),("Rashmi",27,5.8),("Smith",23,5.3),("Payal",27,5.8)).toDF("Name","Age","Height")
dfTN.show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
| Smith| 23|   5.3|
| Payal| 27|   5.8|
+------+---+------+

//Applying dropDuplicates() on the entire dataframe
dfTN.dropDuplicates().show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
| Payal| 27|   5.8|
|Rashmi| 27|   5.8|
+------+---+------+

As you see above, when dropDuplicates() is applied on the entire dataframe the output is the same as that of the distinct() function. Now let's apply the function on specific columns.

dfTN.dropDuplicates("Age","Height").show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
+------+---+------+

Here you see that along with the distinct Age and Height values, Spark has also returned the Name column. This is a good way of getting all the other columns in the dataframe apart from the ones on which we are applying the rule. But if you notice carefully, for Age=27 and Height=5.8 we had 2 options, "Rashmi" and "Payal", so how did Spark decide what to keep and what to let go?

I have struggled to get this answer myself. In some places it is mentioned that the first occurrence is kept and all later ones are ignored. Some people sort() the data into the order they need before applying dropDuplicates() to get the desired rows, but many argue that this is not strictly guaranteed, since the ordering is not necessarily preserved through the shuffle that deduplication involves. You can check the StackOverflow discussion on this as well.
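
If you need a deterministic result, one common alternative (a sketch of the pattern, not how dropDuplicates() itself works) is to rank the rows within each group using a window function and keep only the first row per group:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Rank rows within each (Age, Height) group by an explicit ordering
// (here Name ascending) and keep only the first row per group. The row
// that survives is defined by the window ordering, not by Spark's
// physical execution order.
val w = Window.partitionBy("Age", "Height").orderBy(col("Name").asc)

dfTN.withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")
  .show()

With Name ordered ascending, Payal would be kept for the (27, 5.8) group; change the window ordering to keep whichever row you actually want.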

dropDuplicates() error: type mismatch

A "type mismatch" error is something I have commonly faced while using the dropDuplicates() function, so let me show you when it comes up. I will be using the same dataframe that I created in the above examples.

dfTN.dropDuplicates(col("Age"),col("Height")).show()
error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String

dfTN.dropDuplicates('Age,'Height).show()
error: type mismatch;
 found   : Symbol
 required: String

What Spark is basically telling us is that we cannot pass instances of the Column class as parameters; the columns must be passed as Strings, i.e. as column names. Do not get confused here: I am not saying dropDuplicates() works only on columns with a String data type, that is not the point. I would suggest you check this article on Spark Column Types.

To resolve this issue we need to pass the column names as Strings, i.e. the column names inside double quotes, such as "Age" and "Height". Let's check it.

dfTN.dropDuplicates("Age","Height").show()

+------+---+------+
|  Name|Age|Height|
+------+---+------+
| Smith| 23|   5.3|
|Rashmi| 27|   5.8|
+------+---+------+
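
dropDuplicates() also accepts a Seq[String], which comes in handy when the column names are held in a collection rather than hard-coded (dedupCols below is just an illustrative name):

// Passing the dedup columns as a Seq[String] instead of as varargs.
val dedupCols = Seq("Age", "Height")
dfTN.dropDuplicates(dedupCols).show()

This returns the same two rows as above.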

Conclusion

So today we learnt:

  • What the distinct() function in Spark is and how to use it.
  • What the dropDuplicates() function in Spark is and how it differs from distinct().
  • How to resolve the "error: type mismatch" while using the Spark dropDuplicates() function.

🙂 kudos for learning something new 🙂
