Here we will create a function to check whether a DataFrame contains duplicate rows. Rather than stopping at a single method, we will look at multiple approaches.


Example 1

Here the function dupChk takes a DataFrame as input. df.count() returns the number of rows in the DataFrame, and df.dropDuplicates() returns a new DataFrame with duplicate rows removed. If the two counts match, the DataFrame has no duplicates.

import org.apache.spark.sql.DataFrame

object duplicateCheck {

  // Returns true when the DataFrame has no duplicates:
  // the row count is unchanged after dropDuplicates().
  def dupChk(df: DataFrame): Boolean = {
    df.count() == df.dropDuplicates().count()
  }

}
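As a quick illustration, here is a sketch of how the function might be called. The session setup, the sample data, and the column names are hypothetical, assuming Spark is on the classpath and run in local mode:

import org.apache.spark.sql.SparkSession

// Hypothetical local session for demonstration purposes.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dupCheckDemo")
  .getOrCreate()
import spark.implicits._

// One DataFrame with a repeated row, one without.
val withDups = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")
val noDups   = Seq(("a", 1), ("b", 2)).toDF("key", "value")

duplicateCheck.dupChk(withDups)  // false: the row ("a", 1) appears twice
duplicateCheck.dupChk(noDups)    // true: every row is unique

Note that this approach triggers two full passes over the data (two counts), which matters on large datasets.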

Example 2

Here dupChk again takes a DataFrame as input. The idea is to group by all the columns and check whether any group has a count greater than 1. If it does, the DataFrame has duplicates; otherwise it does not. To achieve this we must pass every column to the groupBy function, so we first build dfCol, an Array[Column] derived from the DataFrame's column names. Next we use the agg function to count the number of records per group. Finally, a where condition keeps only the groups whose count is greater than 1.

We then check whether the resulting DataFrame is empty. If it is empty, there were no duplicates.

import org.apache.spark.sql.{Column, DataFrame, Dataset, Row}
import org.apache.spark.sql.functions.{col, count, lit}

def dupChk(df: DataFrame): Boolean = {

  // Turn every column name into a Column so we can group by all of them.
  val dfCol: Array[Column] = df.columns.map(c => col(c))

  // Count rows per group and keep only groups that occur more than once.
  val dfGrp: Dataset[Row] = df.groupBy(dfCol: _*)
    .agg(count(lit(1)).as("cnt"))
    .where(col("cnt") > 1)

  // An empty result means no row occurred more than once, i.e. no duplicates.
  dfGrp.isEmpty
}
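To see what the intermediate grouped DataFrame looks like, here is a sketch with hypothetical data, assuming a local SparkSession named spark is already available:

import spark.implicits._

// ("a", 1) appears twice, so it forms the only duplicated group.
val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("key", "value")

// The grouped-and-filtered step keeps only duplicated groups; for this
// data it would contain the single row (key = "a", value = 1, cnt = 2).
dupChk(df)  // false: at least one group has cnt > 1

Unlike Example 1, this version can short-circuit via isEmpty rather than counting the whole DataFrame twice, which is often cheaper on large data.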

In this blog we explored multiple methods of checking whether a DataFrame has duplicates.

🙂 kudos for learning something new 🙂
