The Spark DataFrame API lets us read CSV files using spark.read.csv(). If a record in the CSV file spans multiple lines, it can be read using spark.read.option("multiLine", true). In this blog we will see how Spark reads a multiline CSV file with Scala.

What is a Multiline CSV File

If a row in a CSV file spans multiple lines, it is called a multiline CSV. This happens when a field contains the newline character "\n". Let's see an example below.

I have created a single-row file with three columns: first name, last name and address. The address column contains an address spread over multiple lines. The sample file is shown below.

firsntnm,lastnm,address
Jason,Heidi,"15th cross 
 South West Indiana"

Issue while Reading a Multiline CSV File in Spark

Now when we read the file in Spark, the row is split into 2 rows: everything after the newline character is treated as a new row. Check the example below.

val df = spark.
    read.
    option("header",true).
    csv("C:\\Users\\path\\multi_test.csv")

df.show

Output:
  +--------------------+------+-----------+
  |            firsntnm|lastnm|    address|
  +--------------------+------+-----------+
  |               Jason| Heidi|15th cross |
  | South West Indiana"|  null|       null|
  +--------------------+------+-----------+
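You can confirm the split by comparing the raw line count of the file with the parsed record count. A minimal sketch, assuming the same file path:

// Physical lines in the file: header + 2 lines of data.
spark.read.text("C:\\Users\\path\\multi_test.csv").count()   // 3

// Records parsed without the multiLine option: the single row became 2.
df.count()   // 2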

Spark Read CSV File using the multiLine Option

To fix this issue we need to pass option("multiLine", true) while reading the data. Note that when you use this option, Spark assumes that the multiline field is already enclosed in double quotes (""). You can also have the field quoted with a different character, as we will see below.

val df = spark.
    read.
    option("header",true).
    option("multiLine",true).
    csv("C:\\Users\\path\\multi\\multi_test.csv")

  +--------+------+--------------------------------+
  |firsntnm|lastnm|address                         |
  +--------+------+--------------------------------+
  |Jason   |Heidi |15th cross    South West Indiana|
  +--------+------+--------------------------------+

After using option("multiLine", true) you can see that the output comes out correctly.
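As a side note, the same options can also be passed together as a Map. A minimal sketch, equivalent to the reader above:

val df = spark.
    read.
    options(Map(
      "header"    -> "true",
      "multiLine" -> "true"
    )).
    csv("C:\\Users\\path\\multi\\multi_test.csv")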

Spark Read CSV with the multiLine Option and a Different Quote Character

Let's say that your file uses single quotes (') rather than double quotes. In this scenario we also need the quote option while reading the multiline file. Let's see this with an example.

Sample File:

firsntnm,lastnm,address
Jason,Heidi,'15th cross 
 South West Indiana'

Now to read such a file we need to add one more option, option("quote", "\'"). Once applied, you can see in the example below that the data is loaded correctly.

val df = spark.
    read.
    option("header",true).
    option("multiLine",true).
    option("quote","\'").
    csv("C:\\Users\\path\\multi_test.csv")

df.show(false)

  +--------+------+-------------------------------+
  |firsntnm|lastnm|address                        |
  +--------+------+-------------------------------+
  |Jason   |Heidi |15th cross   South West Indiana|
  +--------+------+-------------------------------+

Escape Characters inside a Multiline CSV File in Spark

Let's say that we have a file where the multiline field is enclosed in double quotes (""), and the field itself contains double quotes as well. What happens if we use only option("multiLine", true)?

Sample File:

firsntnm,lastnm,address
Jason,Heidi,"15th cross 
 South ""West"" Indiana"

Now simply reading this file with the multiLine option gives the incorrect result below.

  val df = spark.
    read.
    option("header",true).
    option("multiLine",true).
    csv("C:\\Users\\path\\multi_test.csv")

  df.show(false)

  +--------+------+-------------------------------------+
  |firsntnm|lastnm|address                              |
  +--------+------+-------------------------------------+
  |Jason   |Heidi |"15th cross   South ""West"" Indiana"|
  +--------+------+-------------------------------------+

You can see that the enclosing and embedded double quotes remain in the data. This is because Spark's CSV reader uses backslash (\) as the default escape character, while this file escapes quotes by doubling them. To remove them, use option("escape", "\"").

  val df = spark.
    read.
    option("header",true).
    option("multiLine",true).
    option("escape","\"").
    csv("C:\\Users\\path\\multi_test.csv")

  df.show(false)

  +--------+------+----------------------------------+
  |firsntnm|lastnm|address                           |
  +--------+------+----------------------------------+
  |Jason   |Heidi |15th cross    South "West" Indiana|
  +--------+------+----------------------------------+

Note that in this example we used the double quote character for both the quote and escape options. But this works with other characters as well. For example:

quote: @
escape: '

Sample File :

firsntnm,lastnm,address
Jason,Heidi,@15th cross 
 South ''West'' Indiana@

Code and output:

  val df = spark.
    read.
    option("header",true).
    option("multiLine",true).
    option("quote","@").
    option("escape","\'").
    csv("C:\\Users\\path\\multi_test.csv")

  df.show(false)

  +--------+------+----------------------------------+
  |firsntnm|lastnm|address                           |
  +--------+------+----------------------------------+
  |Jason   |Heidi |15th cross    South 'West' Indiana|
  +--------+------+----------------------------------+

Conclusion

Today we learnt how to read a CSV file using Spark with Scala when the data spans multiple lines. We also learnt how to use the quote and escape options.
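Putting it all together, a minimal recap sketch (adapt the path and the quote/escape characters to your file):

val df = spark.
    read.
    option("header", true).     // first line holds column names
    option("multiLine", true).  // let quoted fields span lines
    option("quote", "\"").      // character enclosing multiline fields
    option("escape", "\"").     // character escaping quotes inside a field
    csv("C:\\Users\\path\\multi_test.csv")

df.show(false)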

