We need to maintain the correct column order when inserting a Spark dataframe into a table. If we don't maintain the order, data can end up in the wrong columns. In this blog we will look at an example of the issue and then at the solution. The example code is written in Scala.
Issue
Let's first look at the scenario in which this issue occurs. For this we will do the below steps:
1. Create a table Student with fnm, lnm, and age columns.
2. Create a dataframe containing the same columns in a different order.
3. Insert the dataframe into the Student table and check the output.
//Student table creation
spark.sql("""Create table Student (fnm string, lnm string, age int) """)
spark.sql("Select * from Student").show()

Output:
+---+---+---+
|fnm|lnm|age|
+---+---+---+
+---+---+---+

//Dataframe creation (note the column order: age, fnm, lnm)
val df_nm = spark.sql("Select 32 as age, 'Henry' as fnm, 'Ford' as lnm")

//Insert into the table
df_nm.write.insertInto("student")
spark.sql("Select * from Student").show()

Output:
+---+-----+----+
|fnm|  lnm| age|
+---+-----+----+
| 32|Henry|null|
+---+-----+----+
In the above output you can see that the data was inserted into the wrong columns. The column sequence in the table was [fnm, lnm, age], but the dataframe's column sequence was [age, fnm, lnm]. This resulted in the age data being inserted into the fnm column.
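To see the mismatch for yourself, you can compare the two column orders directly. A small sketch, using the dataframe and table created above:

//Compare the column order of the dataframe with that of the table
df_nm.columns                   //Array(age, fnm, lnm)
spark.table("student").columns  //Array(fnm, lnm, age)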
Note: Do comment below on why the age column ended up with null data.
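A hint, in case you want to check your answer: insertInto matches columns purely by position, so the string 'Ford' was written into the int column age. With ANSI mode off (the default in many Spark versions), casting a non-numeric string to int silently returns null, which you can verify with a quick sketch like this:

//Casting a non-numeric string to int yields null when ANSI mode is off
spark.sql("Select cast('Ford' as int) as age").show()

Output:
+----+
| age|
+----+
|null|
+----+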
Solution
You may think the solution is simple: all we need to do during the creation of the dataframe is maintain the correct column order. But is it that simple? What if there were hundreds of columns? It would be a tedious task to look at the table and build the select list accordingly. Even if we achieved that, what if tomorrow the DDL of the table changed and another 50 columns were added? And what if these 50 columns were inserted at random positions in the existing column sequence? You can see where I am going with this.
Our goal should be to write a solution where the insert is not impacted by the sequence in which the dataframe's columns appear. For this we can follow the steps below; the full code follows the list.
1. Create a dataframe of the final table into which we need to insert.
2. Using df.schema.fieldNames, get the list of column names in the table's order.
3. Convert the list of names into a list of columns using the map function.
4. Before writing into the table, select the columns in that order: df.select(cols:_*).write.insertInto(table).
//Create a dataframe of the final table into which we need to insert
val df_col = spark.sql("Select * from Student where 1=2")

//Using df.schema.fieldNames, get the list and sequence of columns
val cols = df_col.schema.fieldNames
cols: Array[String] = Array(fnm, lnm, age)

//Convert the list into a list of columns using the map function
import org.apache.spark.sql.functions._
val final_cols = cols.map(c => col(c))
final_cols: Array[org.apache.spark.sql.Column] = Array(fnm, lnm, age)

//Before writing into the table, select the columns using final_cols:_*
df_nm.select(final_cols:_*).write.insertInto("student")

//Note: the output below assumes the Student table was recreated after the earlier experiment
spark.sql("Select * from Student").show()

Output:
+-----+-----+----+
|  fnm|  lnm| age|
+-----+-----+----+
|Henry| Ford|  32|
+-----+-----+----+

Conclusion
So today we learnt how to maintain the correct column order while inserting a Spark dataframe into a table using Scala.
To learn more about other Spark topics do visit my blog.
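Bonus: if you do this kind of insert against many tables, you can wrap the steps above in a small reusable helper. Below is a minimal sketch; the name insertIntoOrdered is my own and not a Spark API, and it assumes the dataframe contains all of the target table's columns (matched by name).

//A reusable helper that reorders a dataframe's columns to match the target table before inserting
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def insertIntoOrdered(df: DataFrame, table: String): Unit = {
  //Read only the table's schema to get the column names in the table's order
  val tableCols = df.sparkSession.table(table).schema.fieldNames
  //Select the columns in that order, then insert positionally
  df.select(tableCols.map(col): _*).write.insertInto(table)
}

//Usage
insertIntoOrdered(df_nm, "student")

This is needed because insertInto resolves columns purely by position; the Spark documentation itself notes that, unlike saveAsTable, insertInto ignores column names.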