Today we will learn about Spark Lazy Evaluation: what it is, why it is required, how Spark implements it, and what its advantages are. We know that Spark is written in Scala, and Scala has an option to evaluate lazily [you can check the lesson here], but in Spark the execution is lazy by default.

What is Spark Lazy Evaluation

What Lazy Evaluation in Spark means is that Spark will not start executing a process until an ACTION is called. We know from previous lessons that a Spark program consists of TRANSFORMATIONS and ACTIONS. As long as we are only applying transformations on a dataframe/dataset/RDD, Spark does nothing. Once Spark sees an ACTION being called, it looks at all the transformations and creates a DAG [simply put, a list of all the operations that need to be performed].
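Here is a minimal sketch of that behaviour (spark-shell style, assuming the usual `spark` session and its implicits are already in scope; the column and value names are just placeholders):

val nums    = spark.range(1, 1000)                 // transformation: nothing executes yet
val doubled = nums.withColumn("twice", $"id" * 2)  // transformation: still nothing executes
doubled.count()                                    // ACTION: Spark builds the DAG and runs it now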

Let me explain using a simple example. One day my Mom asked me to buy some spices. So I went to the grocery store and bought them. After I returned she asked me to go to the vegetable store and buy a few vegetables. So I went and bought them. After returning, my Mom said she may not need some of the vegetables and asked me to return them. So I went back again. Now, how could this have worked differently? Instead of immediately going to the market to fetch the spices, I could have waited to get all the instructions from my Mom together, saving a lot of time and effort.

Similarly, because Spark waits until an Action is called, it can merge some transformations, skip unnecessary ones entirely, and prepare an optimal execution plan.

Lazy Evaluation Example

Here let's understand how Lazy Evaluation works using an example. Let's create a Dataframe with one column holding the values 1 to 100000.

val df1 = (1 to 100000).toList.toDF("col1")
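In spark-shell this works as-is. If you are running the code outside spark-shell, you will need a SparkSession and its implicits in scope first; a minimal setup sketch (the app name and master here are just placeholders) could look like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit   // lit is used in the scenarios below

val spark = SparkSession.builder().appName("LazyEvalDemo").master("local[*]").getOrCreate()
import spark.implicits._                     // brings in .toDF on local Scala collections

val df1 = (1 to 100000).toList.toDF("col1")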

Below are the two scenarios that we will run using this Dataframe.

Scenario 1

  • Add a column to the dataframe named “col2”.
  • Print the output of the dataframe.

Scenario 2

  • Add a column to the dataframe named “col2”.
  • Drop the column “col2”.
  • Print the output of the dataframe.

Now when we look at both scenarios, ideally Scenario 2 should take more time as it involves both creating and dropping a column. But here comes the interesting part. Because Spark is lazy, it realizes that the creation of col2 adds no value and completely skips that step. And thanks to this laziness, the job runs faster. Let's check the proof below.

Proof 1: Using Timings

Executing Scenario 1

scala> :paste
// Entering paste mode (ctrl-D to finish)
val t0 = System.currentTimeMillis();
df1.withColumn("col2",lit(2)).show(false);
val t1 = System.currentTimeMillis();
println("Time Diff " + (t1 - t0));
// Exiting paste mode, now interpreting.
Time Diff 7097
t0: Long = 1592059495938
t1: Long = 1592059503035

Executing Scenario 2

scala> :paste
// Entering paste mode (ctrl-D to finish)
val t0 = System.currentTimeMillis();
df1.withColumn("col2",lit(2)).drop('col2).show(false)
val t1 = System.currentTimeMillis();
println("Time Diff " + (t1 - t0));
// Exiting paste mode, now interpreting.
Time Diff 6183
t0: Long = 1592059586374
t1: Long = 1592059592557

So here we see that Scenario 1, which has fewer steps, took 7097 milliseconds, while Scenario 2, which had more steps, took just 6183 milliseconds.

Proof 2: Using Physical Plans

Let’s check the plans of both scenarios.
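The plans below include GlobalLimit 21 / LocalLimit 21 nodes because they were captured from the show() query, which adds a limit internally. One way to print comparable plans yourself (a sketch, not necessarily how the output below was produced, so the exact nodes may differ slightly) is explain(true):

df1.withColumn("col2", lit(2)).explain(true)               // Scenario 1 plans
df1.withColumn("col2", lit(2)).drop("col2").explain(true)  // Scenario 2 plans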

Scenario 1 PLANS

== Analyzed Logical Plan ==
col1: string, col2: string
GlobalLimit 21
+- LocalLimit 21
   +- Project [cast(col1#3 as string) AS col1#54, cast(col2#49 as string) AS col2#55]
      +- Project [col1#3, 2 AS col2#49]
         +- Project [value#1 AS col1#3]
            +- LocalRelation [value#1]

== Optimized Logical Plan ==
LocalRelation [col1#54, col2#55]

== Physical Plan ==
LocalTableScan [col1#54, col2#55]

Scenario 2 PLANS

== Analyzed Logical Plan ==
col1: string
GlobalLimit 21
+- LocalLimit 21
   +- Project [cast(col1#3 as string) AS col1#76]
      +- Project [col1#3]
         +- Project [col1#3, 2 AS col2#71]
            +- Project [value#1 AS col1#3]
               +- LocalRelation [value#1]

== Optimized Logical Plan ==
LocalRelation [col1#76]

== Physical Plan ==
LocalTableScan [col1#76]

When we check the plans of Scenario 1 you can see that it is projecting both col1 and col2, and both columns are also present in the Optimized Logical Plan. But when we check the plans for Scenario 2, col2 appears only in the Analyzed Logical Plan; it is gone from the Optimized Logical Plan and the Physical Plan. What this means is that in Scenario 2 Spark simply skips the processing of col2.
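If you want to verify this programmatically rather than by reading the plan text, one possible sketch (not from the original post) is to inspect which columns survive in the optimized plan:

val scenario2 = df1.withColumn("col2", lit(2)).drop("col2")
val survivingCols = scenario2.queryExecution.optimizedPlan.output.map(_.name)
println(survivingCols)   // expected to show only col1; col2 has been pruned away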

Advantages of Spark Lazy Evaluation

  • Users can divide the entire work into smaller operations for easy readability and management, while Spark internally groups the transformations, reducing the number of passes over the data. In other words, if Spark can group two transformations into one, it only has to read the data once to apply both, rather than twice.
  • We saw in one of the examples above [Scenario 2] that Spark performs only the necessary computation, which results in faster execution.
  • Because the work is triggered only when the data is actually required, unnecessary overhead is avoided. Say a program does the following steps: (i) read a file, (ii) make a function call unrelated to the file, (iii) load the file into a table. If Spark read the file as soon as it met the first transformation, it would have to hold the data in memory while waiting for the unrelated function call to finish before loading it into the table. With lazy evaluation the read happens only when the load actually needs it, as sketched after this list.
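A rough sketch of that last point (the path, table name and helper function here are made up for illustration, assuming the usual `spark` session):

// Hypothetical placeholder for the unrelated work in step (ii)
def doSomethingUnrelated(): Unit = println("doing unrelated work")

val logs = spark.read.textFile("/tmp/some_input.txt")  // (i)  transformation: the file is not read yet
doSomethingUnrelated()                                 // (ii) runs without the file data sitting in memory
logs.write.saveAsTable("logs_table")                   // (iii) action: only now does Spark actually read the file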

Conclusion

So today we understood what Spark Lazy Evaluation is. We saw a few examples of Lazy Evaluation along with some proof, and we went through its advantages.


🙂  kudos for learning something new  🙂
