If we use an RDD several times in a program, it is recomputed from its lineage every time an action runs on it. This is a performance issue. To avoid it, we can store the computed RDD in memory or on disk so that the same data can be reused. This is achieved with cache and persist.
For example, let's create a DataFrame that contains the numbers 1 to 10:
val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: int]
At this point the DataFrame df does not contain any data. It only describes how the data will be produced when an action is called.
Next, let's take a count of the DataFrame:
df.count
res0: Long = 10
count is an action, so Spark first evaluates df and then counts the rows. This does not mean that the data has been stored: if another action is called, the data is computed again.
To keep the data around we can call df.cache(), but on its own this call does not compute or store anything. It only marks df to be cached. When the next action is called, e.g. df.count, the data is computed and cached, and later actions read it from the cache.
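The lazy behaviour of cache can be observed through the DataFrame's storage level. The following is a minimal sketch, assuming a local SparkSession created only for illustration (the object name, app name and master setting are not from the original text):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheIsLazy {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-is-lazy")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("num")

    // Before cache(): no storage level has been assigned.
    println(df.storageLevel == StorageLevel.NONE)

    // cache() only marks df for caching; it does not run a job.
    df.cache()
    println(df.storageLevel != StorageLevel.NONE)

    // The first action computes the data and populates the cache.
    println(df.count())

    // Later actions on df can be answered from the cached data.
    println(df.count())

    spark.stop()
  }
}
```

Whether an action was actually served from the cache can also be checked in the Storage tab of the Spark UI while the application is running.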
persist works the same way, which raises the first question.
How is persist different from cache?
When we say that the data is stored, we should ask where it is stored.
cache always uses a default storage level. For an RDD, cache() is the same as persist(MEMORY_ONLY), so the data is kept in memory only. For a DataFrame or Dataset, the default level is MEMORY_AND_DISK. persist can also take an explicit storage level, so it can store the data on disk or in off-heap memory as well.
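As a rough illustration of the difference, the sketch below prints the level that cache assigns to an RDD and to a DataFrame, and then uses persist with an explicit level. The local session and the object and variable names are assumptions made for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheVsPersist {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-vs-persist")
      .getOrCreate()
    import spark.implicits._

    // RDD: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val rdd = spark.sparkContext.parallelize(1 to 10)
    rdd.cache()
    println(s"RDD level after cache():       ${rdd.getStorageLevel}")

    // DataFrame: cache() uses the DataFrame default (documented as MEMORY_AND_DISK).
    val df = Seq(1, 2, 3, 4, 5).toDF("num")
    df.cache()
    println(s"DataFrame level after cache(): ${df.storageLevel}")

    // persist() lets us choose the level ourselves, here disk only.
    // Different data is used so this is not matched against the cached df.
    val onDisk = Seq(11, 12, 13).toDF("num")
    onDisk.persist(StorageLevel.DISK_ONLY)
    println(s"DataFrame level after persist: ${onDisk.storageLevel}")

    spark.stop()
  }
}
```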
What are the different storage options for persist?
The available storage levels are:
NONE (default, nothing is persisted)
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY (default for cache operation for RDDs)
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK (default for cache operation for DataFrames and Datasets)
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP
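As the names imply, MEMORY_ONLY stores the data only in memory and DISK_ONLY stores it only on disk. MEMORY_AND_DISK stores the data in memory and, when it does not all fit, writes the excess partitions to disk. The _SER levels keep the data in serialized form, which takes less space but costs more CPU time to read, and OFF_HEAP keeps serialized data in off-heap memory.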
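Each storage level is a combination of a few flags, which can be inspected directly on the StorageLevel objects. The sketch below prints them for some of the levels listed above; the helper name describe is hypothetical and exists only for this example:

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlags {
  // Hypothetical helper: formats the public flags of a storage level.
  def describe(name: String, level: StorageLevel): String =
    s"$name: memory=${level.useMemory}, disk=${level.useDisk}, " +
      s"offHeap=${level.useOffHeap}, deserialized=${level.deserialized}"

  def main(args: Array[String]): Unit = {
    val levels = Seq(
      "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
      "MEMORY_ONLY_SER"     -> StorageLevel.MEMORY_ONLY_SER,
      "DISK_ONLY"           -> StorageLevel.DISK_ONLY,
      "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
      "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
      "OFF_HEAP"            -> StorageLevel.OFF_HEAP
    )
    // No SparkSession is needed; the levels are plain constants.
    levels.foreach { case (name, level) => println(describe(name, level)) }
  }
}
```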
What does _2 mean in a storage level, e.g. MEMORY_ONLY_2?
It means that each partition is replicated on two cluster nodes.
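The replication factor is exposed as a field on the storage level, so the _2 variants can be compared with their base levels. A small sketch (the object name is an assumption for the example):

```scala
import org.apache.spark.storage.StorageLevel

object ReplicationSketch {
  def main(args: Array[String]): Unit = {
    val single = StorageLevel.MEMORY_ONLY
    val doubled = StorageLevel.MEMORY_ONLY_2

    // The two levels differ only in how many copies of each partition are kept.
    println(s"MEMORY_ONLY replication:   ${single.replication}")
    println(s"MEMORY_ONLY_2 replication: ${doubled.replication}")

    // The memory/disk flags are the same for both levels.
    println(single.useMemory == doubled.useMemory && single.useDisk == doubled.useDisk)
  }
}
```

Replication only has an effect when the cluster has more than one executor to hold the second copy.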
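Can the storage level be changed once assigned?
No, it cannot be changed while the data is persisted. For an RDD, calling persist with a different level throws an exception. To use a different level, call unpersist first and then persist again with the new level.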
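For RDDs this restriction can be seen directly. The sketch below is an illustration under the assumption of a local session; it tries to assign a second level, then removes the first level with unpersist before assigning a new one:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import scala.util.Try

object ChangeStorageLevel {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("change-storage-level")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 10)
    rdd.persist(StorageLevel.MEMORY_ONLY)

    // Assigning a different level to an already persisted RDD is rejected.
    val attempt = Try(rdd.persist(StorageLevel.DISK_ONLY))
    println(s"Changing the level failed: ${attempt.isFailure}")

    // After unpersist the RDD has no level, so a new one can be assigned.
    rdd.unpersist()
    rdd.persist(StorageLevel.DISK_ONLY)
    println(s"Level after unpersist and persist: ${rdd.getStorageLevel}")

    spark.stop()
  }
}
```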
Can we remove data stored using cache and persist?
Yes, we can do that by calling unpersist:
df.unpersist()
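The effect of unpersist can be checked through the storage level, which goes back to NONE. This is a minimal sketch, again assuming a local session created only for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session, used only for this sketch.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpersist-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("num")

    // Mark df for caching and run an action so the cache is filled.
    df.cache()
    df.count()
    println(s"Level while cached: ${df.storageLevel}")

    // blocking = true waits until the cached blocks have been removed.
    df.unpersist(blocking = true)
    println(s"Back to NONE: ${df.storageLevel == StorageLevel.NONE}")

    spark.stop()
  }
}
```

After unpersist, df is still usable; the next action simply recomputes it from the source.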
What happens when the source data changes? Will the cached data also change?
No. Spark assumes that the source data does not change during the entire operation, so it does not update the cached data.