In the previous chapter we learned about Spark DataFrame Actions, and today let's check out how to replace null values in a Spark DataFrame. It is really important to handle null values in a DataFrame if we want to avoid NullPointerExceptions. For this, the Spark DataFrame API provides the DataFrameNaFunctions class with a fill( ) function. In this post we will see various ways to use this function.

The DataFrameNaFunctions class also has a drop( ) function, which drops rows that contain null values. You can read more about it in this blog.
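As a quick illustration of drop( ) — a minimal sketch using placeholder column names, not part of the examples below — any row containing at least one null is removed:

```scala
scala> val names = Seq(("Sachin", null), ("David", "Kallis")).toDF("fnm", "lnm")

scala> names.na.drop.show
+-----+------+
|  fnm|   lnm|
+-----+------+
|David|Kallis|
+-----+------+
```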

Replace Null with specific value

Here we will see how we can replace all the null values in a DataFrame with a specific value using the fill( ) function. The syntax is simple: df.na.fill(&lt;value&gt;). Let's check this with an example.

Below we have created a DataFrame having 2 columns [fnm, lnm]. Some rows have null values. Now let us populate the default value “abc” everywhere we have null.

scala> import spark.implicits._

scala> val df = Seq(("David","Kallis"),("Sachin",null),(null,"Butler")).toDF("fnm","lnm")
df: org.apache.spark.sql.DataFrame = [fnm: string, lnm: string]

scala> df.show
+------+------+
|   fnm|   lnm|
+------+------+
| David|Kallis|
|Sachin|  null|
|  null|Butler|
+------+------+

scala> df.na.fill("abc").show
+------+------+
|   fnm|   lnm|
+------+------+
| David|Kallis|
|Sachin|   abc|
|   abc|Butler|
+------+------+

In the above output you can see that all the null values are replaced with “abc”. But what if we want to replace nulls only in a specific column — for example, only in the fnm column, leaving the nulls in the lnm column untouched? Can it be done? Well of course 🙂, let's see how.
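One thing worth noting: fill( ) with a string value only applies to string-typed columns, so numeric columns keep their nulls. The numeric overload, e.g. df.na.fill(0), handles those instead. A minimal sketch, assuming a hypothetical DataFrame with a numeric column:

```scala
// hypothetical DataFrame with a nullable numeric column
scala> val scores = Seq(("Sachin", None), ("David", Some(42))).toDF("name", "score")

scala> scores.na.fill(0).show
+------+-----+
|  name|score|
+------+-----+
|Sachin|    0|
| David|   42|
+------+-----+
```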

Replace Null in Specific Dataframe Column

In the above example we saw how the null values were replaced by a specific value in all columns. Here we will replace the null values of a specific column only. Let's use the same DataFrame created in the first example above. The syntax is simple: df.na.fill(&lt;value&gt;, Seq(&lt;col name&gt;)). Let's check this with an example.

scala> df.na.fill("abc",Seq("fnm")).show
+------+------+
|   fnm|   lnm|
+------+------+
| David|Kallis|
|Sachin|  null|
|   abc|Butler|
+------+------+

So from the output you can see that only the null values of the fnm column are replaced, while the nulls in the lnm column remain as is. OK, if we are clear so far, let's go one step further. Can we replace null values in the fnm column with “abc” and null values in the lnm column with “xyz”? Well of course 🙂, let's see how.

Replace Null in different Dataframe column with different values

So far we have understood how to replace null values in every column with an alternate value and how to replace null values in a specific column. Here we will learn how to replace nulls in different DataFrame columns with different values. For this we create a DataFrame with 3 columns [firstnm, middlenm, lastnm]. Our objective is to replace null values in the firstnm and middlenm columns with “abc” and null values in the lastnm column with “xyz”. The syntax is simple and is as follows:

df.na.fill(value: String, cols: Seq[String])

scala> val df = Seq(("Jack",null,"Bower"),(null,"Kito","Hawagi"),("Jecklyn","Ditto",null)).toDF("firstnm","middlenm","lastnm")
df: org.apache.spark.sql.DataFrame = [firstnm: string, middlenm: string ... 1 more field]

scala> df.show
+-------+--------+------+
|firstnm|middlenm|lastnm|
+-------+--------+------+
|   Jack|    null| Bower|
|   null|    Kito|Hawagi|
|Jecklyn|   Ditto|  null|
+-------+--------+------+

scala> df.na.fill("abc",Seq("firstnm","middlenm")).na.fill("xyz",Seq("lastnm")).show
+-------+--------+------+
|firstnm|middlenm|lastnm|
+-------+--------+------+
|   Jack|     abc| Bower|
|    abc|    Kito|Hawagi|
|Jecklyn|   Ditto|   xyz|
+-------+--------+------+
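Instead of chaining two fill( ) calls, the same result can be achieved in a single call with the Map-based overload, which takes a mapping from column name to replacement value:

```scala
scala> df.na.fill(Map("firstnm" -> "abc", "middlenm" -> "abc", "lastnm" -> "xyz")).show
+-------+--------+------+
|firstnm|middlenm|lastnm|
+-------+--------+------+
|   Jack|     abc| Bower|
|    abc|    Kito|Hawagi|
|Jecklyn|   Ditto|   xyz|
+-------+--------+------+
```

This form is handy when each column needs its own default, since it avoids one fill( ) call per value.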

🙂 kudos for learning something new 🙂
