The data we normally deal with may not be clean. In such cases we may need to clean it by applying some logic. One common case is the presence of null values in rows. We can handle it by dropping the affected Spark DataFrame rows using the drop() function.

drop rows with null data in any column

It may so happen that you need to drop the entire row when any column value is null. In that case we can use the df.na.drop() or df.na.drop("any") function. Note that both give the same result.

Let's check this with an example.

val df1 = Seq(("Sam Mendis","23"),("Henry Ford",null),(null,null)).toDF("Name","Age")

//both give same result
df1.na.drop.show
df1.na.drop("any").show

+----------+---+
|      Name|Age|
+----------+---+
|Sam Mendis| 23|
+----------+---+

As you can see in the above example, we have two records in which at least one of the column values is null. Hence both of those records are dropped.

drop rows with null values in every single column

In the previous example we saw that a record gets dropped when any column value is null. But what if your requirement is different? What if you need to drop the row only when the entire record contains nothing but null values? You can achieve this using df.na.drop("all").

val df1 = Seq(("Sam Mendis","23"),("Henry Ford",null),(null,null)).toDF("Name","Age")

df1.na.drop("all").show

+----------+----+
|      Name| Age|
+----------+----+
|Sam Mendis|  23|
|Henry Ford|null|
+----------+----+

As you can see in the above example, only the third record, which has nulls in all columns, is dropped. The second row is retained because only its Age column is null.

drop rows when specific column has null values

Using this we can decide to drop rows only when specific columns have null values. The syntax is as follows: df.na.drop(Array("col_nm1","col_nm2"...)).

Note: Providing multiple columns doesn't mean that the row will be dropped only if null is present in all of the mentioned columns. It means that the row is dropped if any of the mentioned columns has a null value.
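To make this "any of the listed columns" behavior concrete, here is a small plain-Scala sketch (no Spark required) that mimics what drop does over a subset of columns. Nulls are modeled as Option values, and the function name dropIfAnyNull is just an illustrative label, not a Spark API:

```scala
// Sketch of the semantics of df.na.drop(Array("Age", "Height")):
// a row is KEPT only when every listed column is non-null,
// i.e. it is dropped if ANY listed column is null.
def dropIfAnyNull(
    rows: Seq[Map[String, Option[String]]],
    cols: Seq[String]): Seq[Map[String, Option[String]]] =
  rows.filter(row => cols.forall(c => row.getOrElse(c, None).isDefined))

val rows: Seq[Map[String, Option[String]]] = Seq(
  Map("Name" -> Some("Sam Mendis"), "Age" -> Some("23"), "Height" -> Some("5.3")),
  Map("Name" -> Some("Henry Ford"), "Age" -> None,       "Height" -> Some("5.6")),
  Map("Name" -> Some("Ford"),       "Age" -> Some("33"), "Height" -> None)
)

// Only the first row has both Age AND Height non-null, so only it survives.
println(dropIfAnyNull(rows, Seq("Age", "Height")).map(_("Name").get))
```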

Let's check this with an example.

val df1 = Seq(("Sam Mendis","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.na.drop(Array("Age","Height")).show

+----------+---+------+
|      Name|Age|Height|
+----------+---+------+
|Sam Mendis| 23|   5.3|
+----------+---+------+

In the above example we drop the row if there is a null value in either the Age column or the Height column. Hence we get only one record in the output: the second record has a null Age and the third record has a null Height.
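Spark also lets you combine the two ideas: df.na.drop("all", Seq("Age", "Height")) drops a row only when all of the listed columns are null. A plain-Scala sketch of that variant, using the same three rows plus a hypothetical "Mystery" row in which both listed columns are null:

```scala
// Sketch of the semantics of df.na.drop("all", Seq("Age", "Height")):
// a row is dropped only when EVERY listed column is null.
def dropIfAllNull(
    rows: Seq[Map[String, Option[String]]],
    cols: Seq[String]): Seq[Map[String, Option[String]]] =
  rows.filterNot(row => cols.forall(c => row.getOrElse(c, None).isEmpty))

val data: Seq[Map[String, Option[String]]] = Seq(
  Map("Name" -> Some("Sam Mendis"), "Age" -> Some("23"), "Height" -> Some("5.3")),
  Map("Name" -> Some("Henry Ford"), "Age" -> None,       "Height" -> Some("5.6")),
  Map("Name" -> Some("Ford"),       "Age" -> Some("33"), "Height" -> None),
  Map("Name" -> Some("Mystery"),    "Age" -> None,       "Height" -> None)
)

// Only the "Mystery" row has nulls in both Age and Height, so only it is dropped.
println(dropIfAllNull(data, Seq("Age", "Height")).map(_("Name").get))
```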

Conclusion

So today we learnt how to

  • drop rows if a null value is present in any column of a Spark DataFrame
  • drop rows only when all the column values in a row are null
  • drop rows when specific columns have null values

🙂  kudos for learning something new 🙂
