Today we will learn how to Select columns from a Spark Dataframe. While selecting we can show complete list of columns or select only few of them or even have some operations done on them. Dataframe Columns can be displayed as a form of String or as an instance of Column class. We will discuss later in the blog what this means and why we have 2 ways to refer columns.

SPARK Dataframe Column

First lets understand the syntax as to how to refer a Column. There are several ways in which it can be done as shown below

  • “” (using double quotes) -> “col_nm” This refers to column as string type.
  • column/col – column(“col_nm”)/col(“col_nm”) This refers to column as an instance of Column class
  • $/ ‘(tick) – $(“col_nm”) / ‘col_nm This is syntatic sugar to show column class in scala.

As you notice there are basically 2 ways in which you can refer a column, either through String or a Column Class. If you want to have some expressions on your columns then you have to refer the column as an instance of Column Class. Lets check this out with examples.

//Columns as string
val dfTN = Seq(("Smith",45),("Andy",77)).toDF("Name","Marks");
dfTN.select("Name","Marks").show;
+-----+-----+
| Name|Marks|
+-----+-----+
|Smith|   45|
| Andy|   77|
+-----+-----+

//Lets add 5 marks to each student String Column Type
dfTN.select("Name","Marks" + 5).show
//This throws an error
org.apache.spark.sql.AnalysisException: cannot resolve '`Marks5`' given input columns: [Name, Marks];

//To resolve this we need to refer column as 
//an instance of Column Class
import org.apache.spark.sql.functions._
dfTN.select('Name,'Marks + 5).show
dfTN.select(col("Name"),col("Marks") + 5).show
dfTN.select(column("Name"),column("Marks") + 5).show
dfTN.select($"Name",$"Marks" + 5).show

+-----+-----------+
| Name|(Marks + 5)|
+-----+-----------+
|Smith|         50|
| Andy|         82|
+-----+-----------+

Few things to notice here

  • When Column type is string we cannot perform any expressions.
  • In Line 18 we have to import org.apache.spark.sql.functions._ before we use column functions.
  • When we refer column as an instance of Column Class we are able to use expressions.
  • Next, lets learn about Select

SPARK Dataframe select

As seen before we use SELECT to fetch all are selected columns from a dataframe. The syntax is pretty straight forward df.select() . Also, we have understood till now that the columns are of String or Column Type. One thing to notice here is while using Select we should not mix both types. Lets check it with an example.

val dfTN = Seq(("Smith",45),("Andy",77)).toDF("Name","Marks");
dfTN.select("Name",'Marks).show;
error: overloaded method value select with alternatives:

As you see above when we tried to mix the column types “String” and ‘Column ,spark immediately threw an error. So the correct way is

//either
dfTN.select("Name","Marks").show;
//or
dfTN.select('Name,'Marks + 5).show

+-----+-----------+
| Name|(Marks + 5)|
+-----+-----------+
|Smith|         50|
| Andy|         82|
+-----+-----------+

If you notice, in the output Spark shows the column name as “Marks + 5” which is not the ideal way of displaying a column. To provide a new name we need to use AS or ALIAS.

dfTN.select('Name,('Marks + 5).as("newMarks")).show
dfTN.select('Name,('Marks + 5).alias("newMarks")).show

+-----+--------+
| Name|newMarks|
+-----+--------+
|Smith|      50|
| Andy|      82|
+-----+--------+

SPARK Dataframe selectExpr

If you are comfortable with SQL and want to write the columns and expressions similar to SQL then selectExpr is for you. Lets check it wit an example

val dfTN = Seq(("Smith",45),("Andy",77)).toDF("Name","Marks");
dfTN.selectExpr("Name" , "Marks + 5 as newMarks" ).show

+-----+--------+
| Name|newMarks|
+-----+--------+
|Smith|      50|
| Andy|      82|
+-----+--------+

Conclusion

So today we learnt

  • How to refer columns in Spark Dataframe.
  • Various ways in which columns can be referred in Spark Dataframe.
  • What are columns of String type and Instance of Class Type.
  • How to use select and selectExpr.
  • How to provide a new name to column using AS /ALIAS.

🙂 kudos for learning something new 🙂

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Discover more from UnderstandingBigData

Subscribe now to keep reading and get access to the full archive.

Continue reading