Using the Spark withColumn() function we can add, rename, derive, and split DataFrame columns, among other things. There are many operations that can be achieved using withColumn(), which we will check one by one with suitable examples. But first, let's create a DataFrame that we will modify throughout this tutorial.

val df1 = Seq(("Smith",23),("Monica",22)).toDF("Name","Age")
df1.show()

+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Monica| 22|
+------+---+

Throughout this page you will notice that I sometimes refer to a column as "column", 'column, or col("column"). These are different but equivalent ways of referencing a column.
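The interchangeable forms can be seen side by side in a minimal sketch (assuming you are in spark-shell, where the spark session and its implicits are available):

```scala
import org.apache.spark.sql.functions.col
import spark.implicits._ // enables the $"..." and 'symbol column syntax

// All of these select the same Age column:
df1.select(col("Age")).show() // col() function
df1.select($"Age").show()     // string interpolator
df1.select('Age).show()       // Scala symbol
```

In spark-shell, `spark.implicits._` is imported automatically, which is why the shorthand forms work out of the box.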

add new column to Spark dataframe

We can add a new column to the existing DataFrame using the withColumn() function. The function takes 2 parameters: i) the column name, ii) the value to be filled across all the existing rows.

df.withColumn("name", "value")

Let's add a new column Country to the Spark DataFrame and fill it with the default Country value 'USA'.

df1.withColumn("Country" , lit("USA")).show

+------+---+-------+
|  Name|Age|Country|
+------+---+-------+
| Smith| 23|    USA|
|Monica| 22|    USA|
+------+---+-------+

Notice that I have enclosed the value "USA" inside the lit() function. The reason is that withColumn() expects the 2nd parameter to be of type Column. The lit() function converts the String "USA" into a Column object. If we do not do this, Spark throws an error:

error: type mismatch; found : String("USA")
required: org.apache.spark.sql.Column
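lit() is not limited to Strings; it wraps any literal value into a Column object. A quick sketch (the column names Score and Active are just for illustration):

```scala
import org.apache.spark.sql.functions.lit

// lit() turns plain Scala literals into Column objects
val df2 = df1.withColumn("Score", lit(0))     // Int literal
val df3 = df2.withColumn("Active", lit(true)) // Boolean literal
df3.show
```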

replace column value in Spark dataframe

We can replace all or some of the values of an existing column of a Spark DataFrame. Loosely speaking, it works like an UPDATE in SQL. The syntax is similar to adding a new column:

df.withColumn("existing col name", "value")

replace value of all rows

By default, if we only provide the value, it is replaced in all rows of the DataFrame column.

val df1 = Seq(("Smith",23),("Monica",22)).toDF("Name","Age")
df1.withColumn("Name" , lit("Unknown")).show

+-------+---+
|   Name|Age|
+-------+---+
|Unknown| 23|
|Unknown| 22|
+-------+---+

replace value of some rows based on logic

Instead of replacing the values in all rows, we can selectively replace values based on some logic. Let's say in our DataFrame, if the Age is less than or equal to 22 the value should be "LESS", and if more than 22 it should be "MORE". Let's see how this can be done.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
df1.withColumn("Age" , when('Age <= 22 , "LESS").otherwise("MORE")).show

+------+----+
|  Name| Age|
+------+----+
| Smith|MORE|
|Monica|LESS|
+------+----+

If you want to keep the existing column and add this as a new column you can do that as below.

df1.withColumn("NewAge" , when('Age <= 22 , "LESS").otherwise("MORE")).show
+------+---+------+
|  Name|Age|NewAge|
+------+---+------+
| Smith| 23|  MORE|
|Monica| 19|  LESS|
+------+---+------+
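when() clauses can also be chained to bucket values into more than two groups. A minimal sketch (the column name AgeGroup and the labels are just for illustration):

```scala
import org.apache.spark.sql.functions.when

// Each when() is evaluated in order; otherwise() catches everything else
df1.withColumn("AgeGroup",
  when('Age < 20, "TEEN")
    .when('Age <= 25, "YOUNG")
    .otherwise("ADULT")).show
```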

change column datatype using Spark withColumn

We can change the datatype of a column using the Spark DataFrame withColumn() function. In our existing DataFrame the Age column is an Int; let's change its datatype to String.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
df1.withColumn("Age" , 'Age.cast("String")).schema
val df2=df1.withColumn("Age" , 'Age.cast("String"))
df2.show

+------+---+
|  Name|Age|
+------+---+
| Smith| 23|
|Monica| 19|
+------+---+

import org.apache.spark.sql.functions._
df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int]
df2: org.apache.spark.sql.DataFrame = [Name: string, Age: string]

You can see in the above code that the datatype of the Age column was changed from Int to String. Also notice that there was no impact on the existing data.

So here's a question for you: can we change the datatype of the Name column from String to Int?

The answer is both yes and no. Spark allows us to change the datatype from String to Int, but it shows null when we try to read the data. This is due to Spark's schema-on-read behavior: the cast is applied only when the data is read, and names like "Smith" cannot be parsed as integers, so they become null.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
val df2 = df1.withColumn("Name" , 'Name.cast("Int"))
df2.show

+----+---+
|Name|Age|
+----+---+
|null| 23|
|null| 19|
+----+---+

import org.apache.spark.sql.functions._
df1: org.apache.spark.sql.DataFrame = [Name: string, Age: int]
df2: org.apache.spark.sql.DataFrame = [Name: int, Age: int]

create new column from existing spark dataframe column

Using the Spark DataFrame withColumn() function you can create a new column from an existing column in the DataFrame. While creating the new column you can apply a desired operation.

To understand this with an example, let's create a new column called "NewAge" which contains the value of the Age column with 5 added to it.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
df1.withColumn("NewAge" , 'Age + 5 ).show

+------+---+------+
|  Name|Age|NewAge|
+------+---+------+
| Smith| 23|    28|
|Monica| 19|    24|
+------+---+------+

You can also create a new column using multiple existing columns in the DataFrame.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
df1.withColumn("NameAge" , concat(lit("The name is ") , 'Name , lit(" and the new age is ") , 'Age + 5) ).show(false)

+------+---+----------------------------------------+
|Name  |Age|NameAge                                 |
+------+---+----------------------------------------+
|Smith |23 |The name is Smith and the new age is 28 |
|Monica|19 |The name is Monica and the new age is 24|
+------+---+----------------------------------------+

rename dataframe column

You can rename one or multiple columns in a Spark DataFrame using the withColumnRenamed() function. Let's check this by changing the Name column to FullName and also Age to NewAge.

val df1 = Seq(("Smith",23),("Monica",19)).toDF("Name","Age")
df1.withColumnRenamed("Name" , "FullName").show
df1.withColumnRenamed("Name" , "FullName" ).withColumnRenamed("Age" ,"NewAge").printSchema

+--------+---+
|FullName|Age|
+--------+---+
|   Smith| 23|
|  Monica| 19|
+--------+---+

root
 |-- FullName: string (nullable = true)
 |-- NewAge: integer (nullable = false)

split one dataframe column into multiple columns

Using a combination of the withColumn() and split() functions we can split the data in one column into multiple columns. Let's create a DataFrame which has a full name, and split it into 2 columns, FirstName and LastName.

val df1 = Seq(("Sam Mendis"),("Henry Ford")).toDF("Name")

df1.withColumn("tmp_Name" , split('Name , " ")).select(
     'tmp_Name.getItem(0).as("FirstName"),
     'tmp_Name.getItem(1).as("LastName")).show

+---------+--------+
|FirstName|LastName|
+---------+--------+
|      Sam|  Mendis|
|    Henry|    Ford|
+---------+--------+

Conclusion

So today we learnt how to:

  • add a new column to existing dataframe
  • replace column value in Spark Dataframe
  • change datatype of a column
  • create new column from existing dataframe column
  • rename an existing dataframe column
split one dataframe column into multiple columns

🙂 kudos for learning something new 🙂
