We need to maintain the correct column order when inserting a Spark dataframe into a table. If we don't maintain the order, data can get inserted into the wrong columns. In this blog we will see an example of the issue and also the solution. The example code is written in Scala.

Issue

Let's first check the scenario in which this issue can occur. For this we will do the below steps:

1. Create a table Student with fnm, lnm, age columns
2. Create a dataframe containing the student data
3. Insert the dataframe into the Student table and check the output

//Student table creation
spark.sql("""Create table Student 
(fnm string, 
lnm string, 
age int)
""")

spark.sql("Select * from Student").show()
Output:
+---+---+---+
|fnm|lnm|age|
+---+---+---+
+---+---+---+

//Dataframe creation 
val df_nm = spark.sql("Select 32 as age, 'Henry' as fnm, 'Ford' as lnm")

//Insert the dataframe into the Student table
df_nm.write.insertInto("student")

spark.sql("Select * from Student").show()

Output:
+---+-----+----+
|fnm|  lnm| age|
+---+-----+----+
| 32|Henry|null|
+---+-----+----+

In the above output you can see that the data got inserted into the wrong columns. The column sequence in the table was [fnm, lnm, age] but the dataframe column sequence was [age, fnm, lnm]. Because insertInto resolves columns by position rather than by name, the age data got inserted into the fnm column.
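
You can see the mismatch directly by comparing the column order on both sides. A quick sketch using the standard columns method on the dataframe and on the table:

//Column order of the dataframe
df_nm.columns
//Array(age, fnm, lnm)

//Column order of the table
spark.table("Student").columns
//Array(fnm, lnm, age)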

Note: Do comment below on why the age column has null data.
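
As a hint, here is what happens when Spark tries to fit the string value 'Ford' into the integer age column. This is a minimal sketch, assuming Spark's default (non-ANSI) cast behavior, where an invalid cast produces null instead of an error:

spark.sql("Select cast('Ford' as int) as age").show()
Output:
+----+
| age|
+----+
|null|
+----+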

Solution

You may think the solution is simple: all we need to do is maintain the correct column order while creating the dataframe. But is it that simple? What if there were hundreds of columns? It would be a tedious task to look at the table and build the select list accordingly. Even if we achieve that, what if tomorrow the DDL of the table is changed and another 50 columns are added, inserted at random positions in the existing column sequence? You can see where I am going with this.

Our goal should be to write a solution where the insert is not impacted by the order in which the dataframe holds its columns. For this we can follow the below steps, implemented in the code that follows:

1. Create a dataframe of the final table into which we need to insert.
2. Using df.schema.fieldNames, get the list of columns in the table's order.
3. Convert the list of column names into a list of columns using the map function.
4. Before writing into the table, select the columns using df.select(cols:_*).write.insertInto(table).


//Create a dataframe of the final table into which we need to insert
val df_col = spark.sql("Select * from Student where 1=2")

//Using df_col.schema.fieldNames, get the list of columns in the table's order
val cols = df_col.schema.fieldNames
cols: Array[String] = Array(fnm, lnm, age)

//Convert the list of names into a list of columns using the map function
import org.apache.spark.sql.functions._
val final_cols = cols.map(c => col(c))
final_cols: Array[org.apache.spark.sql.Column] = Array(fnm, lnm, age)


//Before writing into the table, select the columns using final_cols:_*
df_nm.select(final_cols:_*).write.insertInto("student")
Output:
+-----+-----+----+
|  fnm|  lnm| age|
+-----+-----+----+
|Henry| Ford|  32|
+-----+-----+----+
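
Once you are comfortable with the steps, the same logic can be condensed into a single line. A minimal sketch, assuming the target table already exists so that its column order can be read with spark.table (it reuses the col function imported above):

//Select the dataframe columns in the table's order and insert
df_nm.select(spark.table("Student").columns.map(col): _*).write.insertInto("Student")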

Conclusion

So today we learnt how to maintain the correct column order during insert into a Spark table using Scala.

To learn more about other Spark topics do visit my blog.
