In this blog we will create a Spark function to check the count of nulls in each column. There could be a scenario where we need to find the number of values such as null, "NA", or "" in each column. This can help us analyse the quality of the data. Let us see how we can achieve this.

First, we need to create a function that defines all the conditions we want to check. For example, we can check whether a column value is null, "NA", or an empty string. Below we have created a checks function which performs all these tests; in case you need more checks, you can add them here. Note that we must compare a Column with Spark's === operator: Scala's == compares the Column object itself with the string and always returns false.

def checks(c: Column) = {
      isnull(c) || c === "" || c === "NA"
    }
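To see why === matters here, note that Scala's == on a Column yields a plain Boolean, while === builds a Column expression that Spark can evaluate row by row. A minimal sketch of the difference, assuming spark-sql is on the classpath (the object name is illustrative):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object EqualityPitfall {
  def main(args: Array[String]): Unit = {
    // Scala's == compares the Column object with the String, so it is always false
    val wrong: Boolean = col("title") == "NA"
    println(wrong) // prints false: a plain Boolean, not a row-wise check

    // === builds an expression that Spark evaluates against each row
    val right: Column = col("title") === "NA"
    println(right)
  }
}
```

This is why the buggy version compiles without error yet silently drops the "NA" and empty-string checks from the generated plan.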

Next we will create the actual function which counts the number of null / "NA" / "" values in each column.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, isnull, sum}

def nullsCountInEachColumn(df: DataFrame): DataFrame = {

    def checks(c: Column) = {
      isnull(c) || c === "" || c === "NA"
    }

    val colsToChk: Array[Column] =
      df.columns.map(c => sum(checks(col(c)).cast("int")).as(s"Count of $c"))
    df.select(colsToChk: _*)

  }

The above function takes a DataFrame as input, maps over its columns, and passes each column to the checks function. The cast to int turns the boolean check into 0/1 so that sum gives us the count, and the alias is applied to the whole sum so that each output column gets a readable name. If we print each expression in colsToChk (for example with colsToChk.foreach(println)), the output would be:

sum(CAST((((show IS NULL) OR (show = '')) OR (show = 'NA')) AS INT)) AS `Count of show`
sum(CAST((((type IS NULL) OR (type = '')) OR (type = 'NA')) AS INT)) AS `Count of type`
sum(CAST((((title IS NULL) OR (title = '')) OR (title = 'NA')) AS INT)) AS `Count of title`
sum(CAST((((director IS NULL) OR (director = '')) OR (director = 'NA')) AS INT)) AS `Count of director`
sum(CAST((((cast IS NULL) OR (cast = '')) OR (cast = 'NA')) AS INT)) AS `Count of cast`
sum(CAST((((country IS NULL) OR (country = '')) OR (country = 'NA')) AS INT)) AS `Count of country`
sum(CAST((((date_added IS NULL) OR (date_added = '')) OR (date_added = 'NA')) AS INT)) AS `Count of date_added`
sum(CAST((((release_year IS NULL) OR (release_year = '')) OR (release_year = 'NA')) AS INT)) AS `Count of release_year`
sum(CAST((((rating IS NULL) OR (rating = '')) OR (rating = 'NA')) AS INT)) AS `Count of rating`
sum(CAST((((duration IS NULL) OR (duration = '')) OR (duration = 'NA')) AS INT)) AS `Count of duration`
sum(CAST((((listed_in IS NULL) OR (listed_in = '')) OR (listed_in = 'NA')) AS INT)) AS `Count of listed_in`
sum(CAST((((description IS NULL) OR (description = '')) OR (description = 'NA')) AS INT)) AS `Count of description`

When we run the complete function, the final output is a single row that shows, for each column, the count of null / "NA" / empty values.
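As a quick end-to-end sketch, here is how the function behaves on a small sample DataFrame. The data, SparkSession setup, and object name below are illustrative, not from the original post:

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, isnull, sum}

object NullCountsExample {
  def checks(c: Column): Column = isnull(c) || c === "" || c === "NA"

  def nullsCountInEachColumn(df: DataFrame): DataFrame = {
    val colsToChk: Array[Column] =
      df.columns.map(c => sum(checks(col(c)).cast("int")).as(s"Count of $c"))
    df.select(colsToChk: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("null-counts").getOrCreate()
    import spark.implicits._

    // hypothetical sample data: title has one null and one empty string,
    // country has one "NA"
    val df = Seq(
      (Option("Movie A"), "US"),
      (Option.empty[String], "NA"),
      (Option(""), "IN")
    ).toDF("title", "country")

    // expect: Count of title = 2, Count of country = 1
    nullsCountInEachColumn(df).show()
    spark.stop()
  }
}
```

Because the aggregation happens in a single select, Spark computes all the per-column counts in one pass over the data rather than one job per column.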

Also, if you want to check whether duplicates exist in a DataFrame, you can see here.

🙂 kudos for learning something new 🙂

