Before we understand the difference between Coalesce and Repartition we first need to understand what Spark Partition is.
Simply put Partitioning data means to divide the data into smaller chunks so that they can be processed in a parallel manner.

Using Coalesce and Repartition we can change the number of partition of a Dataframe.
Coalesce can only decrease the number of partition.
Repartition can increase and also decrease the number of partition.

Also,
Coalesce doesn’t do a full shuffle which means it does not equally divide the data into all partitions, it moves the data to nearest partition.
Repartition on the other hand divides the data equally into all partitions hence triggering a full shuffle. This is the reason Repartition is a costly operation.

Interview Q1> How can we find out the number of partition of a Dataframe

We can find using rdd.partitions.size

scala> val df = List(1,2,3,4,5,6,7,8,9,10,11,12).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: int]

scala> df.rdd.partitions.size
res3: Int = 2

Now using Coalesce let reduce the number of partitions in Dataframe

scala> val df1 = df.coalesce(1)
df1: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> df1.rdd.partitions.size
res6: Int = 1

Interview Q2> Can we increase the number of partitions using Coalesce

No , as you can see below even if we try to increase the number of partition the count remains same

scala> val df2 = df.coalesce(4)
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [num: int]

scala> df2.rdd.partitions.size
res8: Int = 2

Interview Q3> How to increase or decrease the number of partitions using Repartition

scala> df.rdd.partitions.size
res9: Int = 2

scala> df.repartition(1).rdd.partitions.size
res10: Int = 1

scala> df.repartition(4).rdd.partitions.size
res11: Int = 4

Interview Q4> How does shuffle work with Coalesce and Repartition

Coalesce doesnt do a full shuffle which means it does not equally divide the data into all partitions, it moves the data to nearest partition.
Which means
if we have 4 Partitions
Par1 a,b,c
Par2 d,e,f
Par3 g,h,i
Par4 j,k,l

And then we use coalesce to reduce the number of paritition to 2. Now the data could look like, depending on the location of parition
Par1 a,b,c,j,k,l
Par2 d,e,f,g,h,i
As you can see in the above case, data in Par1 and Par2 were not moved, hence reducing the overall data movement.

Repartition on the other hand divides the data equally into all partitions, so it pulls out all the data and divides the equally triggering a full shuffle. This is the reason Repartition is a costly operation.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from UnderstandingBigData

Subscribe now to keep reading and get access to the full archive.

Continue reading