In this blog we will understand how to read a JSON file using Spark and load it into a DataFrame. All the code examples are written in Scala. Remember that Spark automatically infers the schema while reading a JSON file, so we don't have to use option("inferSchema", true).
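All the snippets below assume an active SparkSession available as spark (spark-shell and notebooks create one for you). If you are running a standalone application, a minimal setup could look like the sketch below; the app name and local master are placeholders, not part of the original examples.

 import org.apache.spark.sql.SparkSession

 // hypothetical standalone setup; in spark-shell the `spark` value already exists
 val spark = SparkSession.builder()
   .appName("read-json-examples")   // placeholder app name
   .master("local[*]")              // local mode, just for trying things out
   .getOrCreate()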

read single line json file using spark scala

Before we discuss anything else, we first need to understand what a single line JSON file is. A file is said to have single line JSON when each JSON record is stored on a single line. In other words, the file is newline (\n) delimited. Let's see an example of such a file.

{"name":"Johnny", "age":35, "car":"Unknown"}
{"name":"Mark", "age":41, "car":"Unknown"}

To read this file into a DataFrame, Spark has a built-in JSON reader. The syntax is spark.read.json("path"). The code to read it is below.

 val df = spark.read.json("C:\\Users\\path\\json_test_files\\jsonexample.json")
 df.show

Output:
  +---+-------+------+
  |age|    car|  name|
  +---+-------+------+
  | 35|Unknown|Johnny|
  | 41|Unknown|  Mark|
  +---+-------+------+

Once the data is in DataFrame format you can apply all the usual DataFrame operations and get the desired result.
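For instance, a minimal sketch of such operations on the DataFrame above (the column names come from the sample file; the age threshold is only an illustration):

 import org.apache.spark.sql.functions.col

 // keep records with age greater than 38 and project two columns
 df.filter(col("age") > 38)
   .select("name", "car")
   .show()   // with the sample data this prints only Mark's row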

read multiline json file using spark scala

Similar to the single line JSON file, let's first understand what multiline JSON is. A file is said to be multiline when each JSON record spans multiple lines. Let's see an example of such a file.

[{"name":"Johnny", 
"age":35, 
"car":"Unknown"},
{"name":"Mark", 
"age":41, 
"car":"Unknown"}
]

If you read the above file using spark.read.json() you will get an error or a corrupt-record column, because by default Spark expects one JSON document per line. To resolve this you need to add the multiLine option.

spark.read.option("multiLine", true).json("path")

 val df = spark.read.option("multiLine",true).json("C:\\Users\\path\\json_test_files\\jsonexample.json")
 df.show

  Output:
  +---+-------+------+
  |age|    car|  name|
  +---+-------+------+
  | 35|Unknown|Johnny|
  | 41|Unknown|  Mark|
  +---+-------+------+

read multiple json file in a folder using spark scala

To read all the JSON files present inside a folder we use the same code as above; the only thing that changes is the path. Instead of including the file name in the path we provide the path only up to the folder. Let's check the code below.

 val df = spark.read.option("multiLine",true).json("C:\\Users\\path\\New folder\\excel_test_files\\")
 df.show

  Output:
  +---+-------+-------+
  |age|    car|   name|
  +---+-------+-------+
  | 35|Unknown| Johnny|
  | 41|Unknown|   Mark|
  | 23|Unknown|Phillip|
  | 23|Unknown|  Lucas|
  +---+-------+-------+

read specific json files in a folder using spark scala

To read specific JSON files inside a folder we pass the full paths of the files, comma separated. Let's say the folder has 5 JSON files but we need to read only 2. This is achieved by specifying the full path of each file, comma separated.

 val df = spark.read.option("multiLine",true)
   .json("C:\\Users\\anurag.das\\Desktop\\path\\jsonexample1.json",
         "C:\\Users\\anurag.das\\Desktop\\path\\jsonexample2.json"   )

 df.show

  Output:
  +---+-------+-------+
  |age|    car|   name|
  +---+-------+-------+
  | 35|Unknown| Johnny|
  | 41|Unknown|   Mark|
  | 23|Unknown|Phillip|
  +---+-------+-------+

user specified custom schema to read file

When we use spark.read.json(), Spark automatically infers the schema. But what if we want to provide a schema of our own? We can achieve this using StructType to define the schema beforehand.

Sample File:

[{"name":"Johnny",
 "age":35,
 "car":"Unknown"},
{"name":"Mark",
 "age":41,
 "car":"Unknown"}]

Let's first check the schema output when we let Spark infer the schema.

 val df = spark.read.option("multiLine",true)
   .json("C:\\Users\\anurag.das\\Desktop\\New folder\\excel_test_files\\jsonexample1.json")

 df.printSchema()

  root
  |-- age: long (nullable = true)
  |-- car: string (nullable = true)
  |-- name: string (nullable = true)

Schema Creation: remember to keep the same column names while creating the schema variable as are present in the JSON file.

import org.apache.spark.sql.types.{StructType, StringType, IntegerType}

val jsonSchema = new StructType()
    .add("name", StringType, true)
    .add("age", IntegerType, true)
    .add("car", StringType, true)

Let’s check the schema output when we specify the schema

 val df = spark.read.schema(jsonSchema).option("multiLine",true)
   .json("C:\\Users\\anurag.das\\Desktop\\New folder\\excel_test_files\\jsonexample1.json")

df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- car: string (nullable = true)

You can clearly see the age column was inferred as long by Spark, but with our custom schema it is now integer.
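As an aside, Spark 2.3 and later also accept a DDL-formatted schema string instead of a StructType. A minimal sketch, assuming the same file and column names as above:

 val df = spark.read
   .schema("name STRING, age INT, car STRING")   // DDL-style schema string
   .option("multiLine", true)
   .json("C:\\Users\\anurag.das\\Desktop\\New folder\\excel_test_files\\jsonexample1.json")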

read nested json file using spark scala

When a JSON file has other JSON objects nested inside it, it is known as nested JSON. Reading the file is easy, but converting it into a tabular format can be tricky. For this we may need to use the explode function. Let's see an example of the same.

Sample Nested Json File:

{ 
  "accounting" : [   
                     { "firstName" : "Johnny",  
                       "lastName"  : "Walk",
                       "age"       : 33 },

                     { "firstName" : "Henry",  
                       "lastName"  : "Smith",
                        "age"      : 33 }
                 ]
} 

FINAL OUTPUT:
+---+---------+--------+
|age|firstName|lastName|
+---+---------+--------+
| 33|   Johnny|    Walk|
| 33|    Henry|   Smith|
+---+---------+--------+

Below is the code to convert the above nested JSON into tabular data. First we read the JSON and check its output and schema.

 val df = spark.read.option("multiLine",true).json("C:\\Users\\path\\jsonexample3.json")
 df.show(false)
  +----------------------------------------+
  |accounting                              |
  +----------------------------------------+
  |[[33, Johnny, Walk], [33, Henry, Smith]]|
  +----------------------------------------+

  df.printSchema()
  root
  |-- accounting: array (nullable = true)
  |    |-- element: struct (containsNull = true)
  |    |    |-- age: long (nullable = true)
  |    |    |-- firstName: string (nullable = true)
  |    |    |-- lastName: string (nullable = true)

We have now read the file properly. On checking the printSchema output we see that the accounting column is an ARRAY. To split the values of an array into multiple rows we need to use EXPLODE.

  import org.apache.spark.sql.functions._   
  val df2 = df.withColumn("acc",explode(col("accounting")))
              .drop(col("accounting"))
  
  df2.show(false)
  +------------------+
  |acc               |
  +------------------+
  |[33, Johnny, Walk]|
  |[33, Henry, Smith]|
  +------------------+

  df2.printSchema()
  root
  |-- acc: struct (nullable = true)
  |    |-- age: long (nullable = true)
  |    |-- firstName: string (nullable = true)
  |    |-- lastName: string (nullable = true)

In the above code we:
1. imported org.apache.spark.sql.functions._ to use the explode and col functions.
2. exploded the column accounting and created a new column acc.
3. dropped the accounting column as it was no longer required.
4. on printing the schema, noticed that the datatype of the acc column is StructType. This can be easily read using select statements. We will see below how it can be done.

df2.select(col("acc.*")).show(false)
  //To get all columns
  +---+---------+--------+
  |age|firstName|lastName|
  +---+---------+--------+
  |33 |Johnny   |Walk    |
  |33 |Henry    |Smith   |
  +---+---------+--------+

  df2.select(col("acc.firstname"),col("acc.lastname")).show(false)
  //To get specific columns
  +---------+--------+
  |firstname|lastname|
  +---------+--------+
  |Johnny   |Walk    |
  |Henry    |Smith   |
  +---------+--------+
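
Putting it together, the explode and the struct flattening can also be written as a single chain. A minimal sketch reusing the same column names, with no new APIs involved:

 import org.apache.spark.sql.functions.{col, explode}

 // explode the accounting array, then flatten the resulting struct into top-level columns
 val flatDf = df.select(explode(col("accounting")).as("acc"))
                .select(col("acc.*"))
 flatDf.show(false)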

You can learn about the explode function in Hive in this blog post.

🙂 kudos for learning something new:)
