In this post we will see how to make Spark escape double quotes in an input file. Having double quotes around a column value in a file is usually not an issue. The problem arises when the content inside the double quotes itself contains double quotes along with the file separator.

Let's look at an example. Below is the data we have in a file. You can see the Address column is in double quotes and also contains the file separator (a comma). If we read this data, the output comes out correctly.

Name,Address,BirthMonth,BirthYear
Mike Henry,"27 th cross, Baker Field , Manhatten",Jan,1978
Nicolas Mathhew,"28 th cross, Baker Field, Manhatten",Mar,1976

  import org.apache.spark.sql.DataFrame

  // Read the file, treating the first line as the header
  val rawDf1: DataFrame = spark.
    read.
    option("header","true").
    csv("C:\\Users\\path\\Name.txt")

  rawDf1.show(false)
+---------------+------------------------------------+----------+---------+
|Name           |Address                             |BirthMonth|BirthYear|
+---------------+------------------------------------+----------+---------+
|Mike Henry     |27 th cross, Baker Field , Manhattan|Jan       |1978     |
|Nicolas Matthew|28 th cross, Baker Field, Manhattan |Mar       |1976     |
+---------------+------------------------------------+----------+---------+
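
One thing worth noting in passing: since we did not supply a schema or enable inferSchema, Spark reads every column as a string. You can confirm this with printSchema (a small aside, not needed for the fix below):

  // All four columns come back as string because no schema was supplied
  // and inferSchema is off by default
  rawDf1.printSchema()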

But let's say the Address column also has double quotes inside the double quotes. The data would then look like the sample below. You can see Baker Field in the first row is now wrapped in double quotes, escaped in the file by doubling them ("").

Name,Address,BirthMonth,BirthYear
Mike Henry,"27 th cross, ""Baker Field"" , Manhatten",Jan,1978
Nicolas Mathhew,"28 th cross, Baker Field, Manhatten",Mar,1976

When we read the file now with the same code, we see the output below. As you can see, Manhattan has moved to the next column. This happens because Spark's CSV reader uses backslash (\) as the default escape character, so it does not recognize the doubled quotes ("") and ends the quoted field early.

+---------------+-----------------------------------+-----------+---------+
|Name           |Address                            |BirthMonth |BirthYear|
+---------------+-----------------------------------+-----------+---------+
|Mike Henry     |"27 th cross, ""Baker Field""      | Manhattan"|Jan      |
|Nicolas Matthew|28 th cross, Baker Field, Manhattan|Mar        |1976     |
+---------------+-----------------------------------+-----------+---------+
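
Before applying the fix, a quick sanity check can flag such shifted rows. This sketch is not from the original post; it simply assumes BirthYear should always be a four-digit year:

  import org.apache.spark.sql.functions.{col, not}

  // Flag rows where the shift broke the expected pattern
  // (assumes BirthYear is always a four-digit year)
  rawDf1.where(not(col("BirthYear").rlike("^\\d{4}$"))).show(false)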

To resolve this, we need to use the escape option while reading the file, setting the escape character to the double quote itself so that Spark treats "" inside a quoted field as a literal quote.

Syntax: option("escape", "\"")

  // Read again, this time telling Spark that a double quote escapes
  // a double quote inside a quoted field
  val rawDf1: DataFrame = spark.
    read.
    option("header","true").
    option("escape","\"").
    csv("C:\\Users\\path\\Name.txt")

  rawDf1.show(false)

+---------------+--------------------------------------+----------+---------+
|Name           |Address                               |BirthMonth|BirthYear|
+---------------+--------------------------------------+----------+---------+
|Mike Henry     |27 th cross, "Baker Field" , Manhattan|Jan       |1978     |
|Nicolas Matthew|28 th cross, Baker Field, Manhattan   |Mar       |1976     |
+---------------+--------------------------------------+----------+---------+

Now you can see we are getting the data correctly. Also notice that Baker Field correctly retains its double quotes, as expected.
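
The same convention matters when writing data back out. Here is a minimal sketch (the output path is a placeholder): pairing the quote and escape options with the double quote character makes Spark double any embedded quotes on write, reproducing the input format.

  // Write using the same quote/escape pair so embedded quotes are
  // doubled ("") in the output file (output path is a placeholder)
  rawDf1.write.
    option("header", "true").
    option("quote", "\"").
    option("escape", "\"").
    csv("C:\\Users\\path\\NameOut")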

🙂 kudos for learning something new 🙂

Also, check out my other blog post here, which explains how to create a function that returns the count of nulls in each column of a DataFrame.
