The data we deal with in practice is not always clean. In such cases we may need to clean it by applying some logic. One common case is the presence of null values in rows, which we can handle by dropping the affected Spark DataFrame rows using the drop() function.
drop rows with null data in any column
It may so happen that you need to drop the entire row when any column value is null. In that case we can use df.na.drop() or df.na.drop("any"). Note that both give the same result.
Let's check this with an example.
val df1 = Seq(("Sam Mendis","23"),("Henry Ford",null),(null,null)).toDF("Name","Age")

// both give the same result
df1.na.drop.show
df1.na.drop("any").show

+----------+---+
|      Name|Age|
+----------+---+
|Sam Mendis| 23|
+----------+---+
As you can see in the above example, we had two records in which at least one of the column values was null. Hence both of those records were dropped.
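The same behaviour can also be written with an explicit filter using the Column API, if you prefer to spell out the condition per column. This is a sketch assuming the same two-column df1 as above:

```scala
import org.apache.spark.sql.functions.col

// Equivalent to df1.na.drop("any"):
// keep only the rows where every column is non-null
val cleaned = df1.filter(col("Name").isNotNull && col("Age").isNotNull)
cleaned.show()
```

The na.drop variant scales better when the DataFrame has many columns, since you don't have to list each one.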
drop rows with null values in every single column
In the previous example we saw that a record gets dropped when any column value is null, but what if your requirement is different? What if you need to drop the row only when the entire record contains nothing but null values? You can achieve this using df.na.drop("all").
val df1 = Seq(("Sam Mendis","23"),("Henry Ford",null),(null,null)).toDF("Name","Age")

df1.na.drop("all").show

+----------+----+
|      Name| Age|
+----------+----+
|Sam Mendis|  23|
|Henry Ford|null|
+----------+----+
As you can see in the above example, only the 3rd record, which has nulls in all columns, is dropped. The second row is not dropped.
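na.drop also has an overload that takes a minimum count of non-null values per row. Requiring at least one non-null value gives the same result as drop("all"); a sketch, reusing the df1 defined above:

```scala
// drop(minNonNulls): keep only rows having at least that many non-null values.
// With 1, a row is dropped only when every column is null -- same as drop("all").
df1.na.drop(1).show()
```

Higher thresholds let you express rules like "keep a row only if most of its columns are populated".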
drop rows when specific column has null values
Using this, we can decide to drop rows only when specific columns have null values. The syntax is as follows: df.na.drop(Array("col_nm1","col_nm2",...)).
Note: Providing multiple columns doesn't mean that the row will be dropped only if null is present in all the mentioned columns. It means that the row is dropped if any of the mentioned columns has a null value.
Let's check this with an example.
val df1 = Seq(("Sam Mendis","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.na.drop(Array("Age","Height")).show

+----------+---+------+
|      Name|Age|Height|
+----------+---+------+
|Sam Mendis| 23|   5.3|
+----------+---+------+
In the above example, we are dropping rows that have null values in either the Age column or the Height column. Hence we get only one record in the output, as the 2nd record has a null Age and the 3rd record has a null Height.
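If instead you want the "all" semantics over a subset of columns, drop also accepts a how argument together with the column list. A sketch, reusing the three-column df1 from above:

```scala
// Drops a row only when BOTH Age and Height are null.
// With the sample data no row has nulls in both columns, so all 3 rows survive.
df1.na.drop("all", Seq("Age", "Height")).show()
```

This combines the per-column targeting of this section with the drop("all") behaviour from the previous one.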
Conclusion
So today we learnt how to
- drop rows if a null value is present in any column of the Spark DataFrame
- drop rows only when all the column values in a row are null
- drop rows when specific columns have null values
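The three variants, side by side (a recap sketch assuming a DataFrame df with nullable columns named col_nm1 and col_nm2):

```scala
df.na.drop()                           // or df.na.drop("any"): drop a row if ANY column is null
df.na.drop("all")                      // drop a row only if ALL columns are null
df.na.drop(Array("col_nm1","col_nm2")) // drop a row if any of the listed columns is null
```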
🙂 kudos for learning something new 🙂