In this post we will learn how to drop columns from a DataFrame using Spark with Scala. For this we will use the drop() method. We will see examples of dropping a single column and of dropping multiple columns at once.

First, let's create the dataset that we will use throughout this blog.

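 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

 // Sample data: (first name, last name, age)
 val student = Seq(("Mark","Antony",34),("Loreal","Lewis",36),("Kate","Hamilton",55))
 val studentRdd = spark.sparkContext.parallelize(student)
 val studentRow = studentRdd.map(r => Row(r._1,r._2,r._3))
 // Explicit schema; all three columns are nullable
 val schema = new StructType()
   .add("Fnm",StringType,true)
   .add("Lnm",StringType,true)
   .add("Age",IntegerType,true)

 val df_student = spark.createDataFrame(studentRow, schema)
 df_student.show(false)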

 +------+--------+---+
 |Fnm   |Lnm     |Age|
 +------+--------+---+
 |Mark  |Antony  |34 |
 |Loreal|Lewis   |36 |
 |Kate  |Hamilton|55 |
 +------+--------+---+

Drop a single column in a Spark DataFrame

To drop a single column from a Spark DataFrame, use the syntax below.

syntax: df.drop(col("col_nm"))

 import org.apache.spark.sql.functions._
 val df = df_student.drop(col("Age"))
 df.show()

 +------+--------+
 |   Fnm|     Lnm|
 +------+--------+
 |  Mark|  Antony|
 |Loreal|   Lewis|
 |  Kate|Hamilton|
 +------+--------+
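
As a side note, drop() also accepts the column name as a plain string, and asking it to drop a column that is not in the schema simply returns the DataFrame unchanged. Below is a minimal sketch of both behaviours. The object name DropByNameSketch, the helper sampleStudents and the column name "Salary" are made up for illustration, and the local SparkSession is only there so that the snippet runs on its own; in spark-shell the existing spark session would be used instead.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropByNameSketch {
  // Hypothetical helper: rebuilds the sample data from this post with toDF.
  def sampleStudents(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Mark", "Antony", 34), ("Loreal", "Lewis", 36), ("Kate", "Hamilton", 55))
      .toDF("Fnm", "Lnm", "Age")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark is on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-by-name-sketch")
      .getOrCreate()

    val dfStudent = sampleStudents(spark)

    // Drop by column name instead of by Column object.
    val withoutAge = dfStudent.drop("Age")
    println(withoutAge.columns.mkString(", ")) // Fnm, Lnm

    // "Salary" is not in the schema, so this is a no-op rather than an error.
    val unchanged = dfStudent.drop("Salary")
    println(unchanged.columns.mkString(", ")) // Fnm, Lnm, Age

    spark.stop()
  }
}
```

The no-op behaviour is convenient in reusable code, but it also means that a typo in a column name will not raise an error, so it is worth checking df.columns when a drop seems to have had no effect.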

Drop multiple columns in a Spark DataFrame

There are 2 ways in which multiple columns can be dropped from a DataFrame.

1. Create a list of the columns to be dropped and pass the list to the drop method with the : _* operator.
2. Pass the column names directly as separate, comma-separated string arguments.

Let's check examples of both methods:

Method 1:

 // : _* expands the list into the varargs that drop() expects
 val colList = List("fnm","lnm")
 val df = df_student.drop(colList: _*)
 df.show()

 +---+
 |Age|
 +---+
 | 34|
 | 36|
 | 55|
 +---+
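
The list passed in Method 1 does not have to be typed by hand. As a rough sketch, it can be derived from df.columns, for example by filtering on a naming pattern. The object DropByPatternSketch and the helper dropWhere are hypothetical names introduced only for this example, and the local SparkSession is an assumption so that the snippet is self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropByPatternSketch {
  // Hypothetical helper: drop every column whose name satisfies `shouldDrop`.
  def dropWhere(df: DataFrame)(shouldDrop: String => Boolean): DataFrame = {
    val toDrop = df.columns.filter(shouldDrop) // names of the columns to remove
    df.drop(toDrop: _*)                        // expand the array into varargs
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark is on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-by-pattern-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample data as in this post, built with toDF for brevity.
    val dfStudent = Seq(("Mark", "Antony", 34), ("Loreal", "Lewis", 36), ("Kate", "Hamilton", 55))
      .toDF("Fnm", "Lnm", "Age")

    // Both name columns end in "nm", so only Age remains.
    val onlyAge = dropWhere(dfStudent)(_.endsWith("nm"))
    println(onlyAge.columns.mkString(", ")) // Age

    spark.stop()
  }
}
```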

Method 2:

 // Column names passed directly as separate string arguments
 val df = df_student.drop("fnm","lnm")
 df.show()

 +---+
 |Age|
 +---+
 | 34|
 | 36|
 | 55|
 +---+
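
You may have noticed that both methods pass "fnm" and "lnm" in lowercase although the schema defines Fnm and Lnm. This works because Spark resolves column names case-insensitively by default, which is controlled by the spark.sql.caseSensitive setting. The sketch below contrasts the default with the case-sensitive setting. The object name and the sampleStudents helper are hypothetical, the local SparkSession is an assumption to keep the snippet self-contained, and the behaviour under the case-sensitive setting is best confirmed on your own Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DropCaseSensitivitySketch {
  // Hypothetical helper: rebuilds the sample data from this post with toDF.
  def sampleStudents(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("Mark", "Antony", 34), ("Loreal", "Lewis", 36), ("Kate", "Hamilton", 55))
      .toDF("Fnm", "Lnm", "Age")
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption: Spark is on the classpath).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-case-sensitivity-sketch")
      .getOrCreate()

    val dfStudent = sampleStudents(spark)

    // Default (spark.sql.caseSensitive = false): "fnm" matches the column "Fnm".
    println(dfStudent.drop("fnm", "lnm").columns.mkString(", ")) // Age

    // With case-sensitive resolution, "fnm" no longer names the column "Fnm",
    // so compare the printed columns with the line above.
    spark.conf.set("spark.sql.caseSensitive", "true")
    println(dfStudent.drop("fnm", "lnm").columns.mkString(", "))

    // Restore the default before stopping.
    spark.conf.set("spark.sql.caseSensitive", "false")
    spark.stop()
  }
}
```

Writing the names exactly as they appear in the schema avoids depending on this setting at all.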

Conclusion

So today we learned how to drop a single column or multiple columns from a Spark DataFrame. As a good practice, we should drop the columns we no longer need. Spark then has less data to carry through the later steps, which can make the query run faster.

To learn about dropping duplicate records from a DataFrame, please check this blog.

🙂 kudos for learning something new 🙂
