Using Spark Union and UnionAll you can merge data of 2 Dataframes and create a new Dataframe. Remember you can merge 2 Spark Dataframes only when they have the same Schema. Union All is deprecated since SPARK 2.0 and it is not advised to use any longer. Lets check with few examples .

Note:- Union only merges the data between 2 Dataframes but does not remove duplicates after the merge.

SYNTAX of UNION in Spark Dataframe

The syntax is pretty straight forward
df1.union(df2)
where df1 and df2 are 2 dataframes with same schema.

Lets check this with an example.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith",23),("Payal",27)).toDF("Name","Age")
df1.union(df2).show

+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Rashmi| 27|
| Smith| 23|
| Payal| 27|
+------+---+

Here we created 2 dataframes and did a union operation on them. Notice that the duplicate records are not removed.

SYNTAX of UNION ALL in Spark Dataframe

Syntax of union all is similar to union.
df1.unionAll(df2)
This works similar to union.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith",23),("Payal",27)).toDF("Name","Age")

df1.unionAll(df2).show

+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Rashmi| 27|
| Smith| 23|
| Payal| 27|
+------+---+
warning: method unionAll in class Dataset is deprecated: use union()

Notice that as soon as you use unionAll you immediately get a warning that unionAll is deprecated and instead it suggests to use union.

Merge 2 Dataframes and Remove Duplicates

In case you need to remove the duplicates after merging them you need to use distinct or dropDuplicates after merging them.

Lets check an example below.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith",23),("Payal",27)).toDF("Name","Age")

df1.union(df2).distinct.show

+------+---+
|  Name|Age|
+------+---+
| Payal| 27|
|Rashmi| 27|
| Smith| 23|
+------+---+

Merge Multiple Dataframes

You can merge N number of dataframes one after another by using union keyword multiple times. We will see an example for the same. But what if there are 100’s of dataframes you need to merge . Will you be writing union as many times or is there a better way .

First lets create 3 dataframes that we need to merge.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith",23),("Payal",27)).toDF("Name","Age")
val df3 = Seq(("Kevin",29)).toDF("Name","Age")

Now the First method is to us union keyword multiple times to merge the 3 dataframes.

f1.union(df2).union(df3).show
+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Rashmi| 27|
| Smith| 23|
| Payal| 27|
| Kevin| 29|
+------+---+

Now in our Second method we will use reduce function with union to do the same. But first we need to create a sequence of all the dataframes that we need to merge.

//Create a Sequence of all dataframes which needs to be merged
val dfs = Seq(df1,df2,df3)
dfs.reduce(_ union _).show
+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Rashmi| 27|
| Smith| 23|
| Payal| 27|
| Kevin| 29|
+------+---+

Merge Dataframes with different Schema

We know that we can merge 2 dataframes only when they have the same schema. So the question is there a workaround to merge when the schema do not match? The answer is yes.

First Workaround is to append nulls to missing columns. You can see in the below example, while doing union I have introduced a new null column so that the schema of both table matches.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith"),("Payal")).toDF("Name")

df1.union(df2.select('Name , lit(null).as("age"))).show
+------+----+
|  Name| Age|
+------+----+
| Smith|  23|
|Rashmi|  27|
| Smith|null|
| Payal|null|
+------+----+

Second Workaround is to only select required columns from both table when ever possible. Suppose we only needed NAME column from both tables. Then we can select only that column and then merge them.

val df1 = Seq(("Smith",23),("Rashmi",27)).toDF("Name","Age")
val df2 = Seq(("Smith"),("Payal")).toDF("Name")

df1.select('Name).union(df2.select('Name)).show
+------+
|  Name|
+------+
| Smith|
|Rashmi|
| Smith|
| Payal|
+------+

Conclusion

So today we learnt

  • The syntax of Spark dataframe union and unionAll and how to use them.
  • How to merge dataframes and remove duplicates.
  • Various ways to merge multiple dataframes
  • Merge dataframes with different schemas.

🙂 kudos for learning something new  🙂

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from UnderstandingBigData

Subscribe now to keep reading and get access to the full archive.

Continue reading