If we use an RDD multiple times in our program, it is recomputed every time an action runs on it. This is a performance issue. To avoid it we store the output of the RDD in memory or on disk, so that the same data can be reused. This is achieved with cache and persist.

For example, let's create a DataFrame that contains the numbers 1 to 10:

val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: int]

The DataFrame df does not contain the data yet; it simply records that the data will be created when an action is called.
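You can see this laziness for yourself: explain prints the query plan without running a Spark job.

df.explain() // prints the physical plan; no job runs and no data is read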
Next, let's take a count of the DataFrame:

df.count
res0: Long = 10

Now count is an action, so the DataFrame df first reads the data and then counts it. This does not mean that df has stored the data so it can be read straight back if it is called again.

To make that happen we can call df.cache(), but on its own this still does nothing. It simply marks df so that when the next action is called, e.g. df.count, the data is fetched and cached for further use.
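Continuing with the df above, the whole sequence looks like this:

df.cache() // marks df for caching; nothing is computed yet
df.count   // first action: reads the data and populates the cache
df.count   // answered from the cache, no recomputation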

Persist does the same thing, which raises the first question.

How is Persist different from Cache

When we say that data is stored, we should ask where it is stored.
For RDDs, cache stores the data in memory only: rdd.cache() is the same as rdd.persist(MEMORY_ONLY). (For DataFrames and Datasets, cache defaults to MEMORY_AND_DISK.) Persist, however, lets you choose the storage level yourself, so the data can also be kept on disk or off-heap.
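For example, to keep df in memory and spill the rest to disk, pass a storage level explicitly:

import org.apache.spark.storage.StorageLevel

df.unpersist() // clear the level assigned by cache() earlier
df.persist(StorageLevel.MEMORY_AND_DISK)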

What are the different storage options for persist

Different types of storage levels are:
NONE (default)
DISK_ONLY
DISK_ONLY_2
MEMORY_ONLY (default for cache operation for RDDs)
MEMORY_ONLY_2
MEMORY_ONLY_SER
MEMORY_ONLY_SER_2
MEMORY_AND_DISK
MEMORY_AND_DISK_2
MEMORY_AND_DISK_SER
MEMORY_AND_DISK_SER_2
OFF_HEAP

As the name implies, MEMORY_ONLY stores the data only in memory; similarly, DISK_ONLY stores it only on disk, and MEMORY_AND_DISK stores it in memory and spills the excess to disk when it does not fit. The _SER variants store the data in serialized form, which saves space but costs extra CPU when reading.
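Each storage level is just a combination of flags, and you can inspect them directly in the shell:

StorageLevel.MEMORY_AND_DISK.useMemory    // true
StorageLevel.MEMORY_AND_DISK.useDisk      // true
StorageLevel.MEMORY_ONLY.useDisk          // false
StorageLevel.MEMORY_ONLY_SER.deserialized // false: kept as serialized bytes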

What is the _2 present in Storage Levels, ex: MEMORY_ONLY_2

The _2 means that each cached partition is replicated to two nodes, which makes the cache tolerant to the loss of one node at the cost of extra storage.
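You can confirm this from the level's replication field (using the StorageLevel import from above):

StorageLevel.MEMORY_ONLY_2.replication // 2
StorageLevel.MEMORY_ONLY.replication   // 1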

Can the Storage Levels be changed once assigned

No, Spark does not allow the storage level to be changed once it has been assigned. To switch levels you must unpersist first and then persist again with the new level.
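A short sketch with an RDD: re-assigning a level without unpersisting throws an UnsupportedOperationException.

val rdd = spark.sparkContext.parallelize(1 to 10)
rdd.persist(StorageLevel.MEMORY_ONLY)
// rdd.persist(StorageLevel.DISK_ONLY) // error: cannot change the level of an already-persisted RDD
rdd.unpersist()
rdd.persist(StorageLevel.DISK_ONLY) // fine once the old level is removed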

Can we remove data stored using cache and persist

Yes, we can, by calling unpersist:

df.unpersist
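By default unpersist is non-blocking. Pass blocking = true if you want the call to wait until all cached blocks are actually removed:

df.unpersist(blocking = true)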

What happens when the source data changes, will the cached data also change

No. Spark assumes that the source data will not change during the operation, and it does not update the cached data if it does.
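If you do need to pick up changed source data, drop the stale cache and rebuild it:

df.unpersist() // discard the stale cached copy
df.cache()
df.count // the next action re-reads the source and re-caches it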

Learn the difference between COALESCE and REPARTITION
