When we call an Action on a Spark DataFrame, all the transformations get executed one by one. This happens because of Spark's lazy evaluation, which does not execute the transformations until an Action is called. In this article we will check commonly used Actions on a Spark DataFrame.
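
As a quick illustration of lazy evaluation, here is a minimal sketch (assuming the spark-shell, where spark and its implicits are already in scope; the names people and adults are only for illustration). The filter() transformation on its own does nothing; the work happens only when the show() action is called.

val people = Seq(("Sam", 23), ("Henry", 17)).toDF("Name", "Age")
val adults = people.filter($"Age" >= 18)   // transformation only: nothing is executed yet
adults.show()                              // action: the filter actually runs here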

Spark Dataframe show()

The show() operator is used to display records of a DataFrame in the output. By default it displays 20 records and truncates long column values. To change this behaviour we need to pass parameters:

show(number of records, truncate)
number of records : the number of records you need to display. Default is 20.
truncate : a boolean value. false shows the complete column value, true truncates it. Default is true.

There are 4 different versions of show()

  • show() – displays rows. The default number of output rows is 20.
  • show(num of rows) – displays the number of rows mentioned.
  • show(truncate) – takes a boolean value. true truncates a string column if its length is more than 20 characters, false displays everything. Default is true.
  • show(num of rows, truncate) – a combination of the above two.

Let's check an example.

val df1 = Seq(("Sam Mendissssssssssssssssssssssssssssssss","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.show()
+--------------------+----+------+
|                Name| Age|Height|
+--------------------+----+------+
|Sam Mendissssssss...|  23|   5.3|
|          Henry Ford|null|   5.6|
|                Ford|  33|  null|
+--------------------+----+------+

df1.show(2)
+--------------------+----+------+
|                Name| Age|Height|
+--------------------+----+------+
|Sam Mendissssssss...|  23|   5.3|
|          Henry Ford|null|   5.6|
+--------------------+----+------+
only showing top 2 rows

df1.show(false)
+-----------------------------------------+----+------+
|Name                                     |Age |Height|
+-----------------------------------------+----+------+
|Sam Mendissssssssssssssssssssssssssssssss|23  |5.3   |
|Henry Ford                               |null|5.6   |
|Ford                                     |33  |null  |
+-----------------------------------------+----+------+

df1.show(2,false)
+-----------------------------------------+----+------+
|Name                                     |Age |Height|
+-----------------------------------------+----+------+
|Sam Mendissssssssssssssssssssssssssssssss|23  |5.3   |
|Henry Ford                               |null|5.6   |
+-----------------------------------------+----+------+
only showing top 2 rows
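
In more recent Spark versions, show() also accepts an integer truncate argument that sets the maximum column width instead of a simple true/false. This is a minimal sketch; the width of 10 is just an illustration, and you should check the API of your Spark version.

df1.show(2, 10)   // truncate each column value to at most 10 characters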

head() and first() operators

The head() operator returns the first row of the Spark DataFrame. If you need the first n records you can use head(n). Let's look at the various versions:

  • head() – returns the first row.
  • head(n) – returns the first n rows.
  • first() – is an alias for head().
  • take(n) – is an alias for head(n).
  • takeAsList(n) – returns the first n records as a java.util.List.

val df1 = Seq(("Sam Mendis","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.head()
org.apache.spark.sql.Row = [Sam Mendis,23,5.3]

df1.head(2)
Array[org.apache.spark.sql.Row] = Array([Sam Mendis,23,5.3], [Henry Ford,null,5.6])

df1.take(2)
Array[org.apache.spark.sql.Row] = Array([Sam Mendis,23,5.3], [Henry Ford,null,5.6])

df1.first()
org.apache.spark.sql.Row = [Sam Mendis,23,5.3]

df1.takeAsList(2)
java.util.List[org.apache.spark.sql.Row] = [[Sam Mendis,23,5.3], [Henry Ford,null,5.6]]
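
Since head() and first() return Row objects, you can pull typed values out of the result with the Row accessor methods. A small sketch based on the df1 defined above:

df1.head().getString(0)              // String = Sam Mendis
df1.first().getAs[String]("Height")  // String = 5.3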

count() operator

The count() operator returns the number of records in a DataFrame as a Long.

val df1 = Seq(("Sam Mendis","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.count()
Long = 3
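
Because count() is an action, it is commonly used to trigger a chain of transformations. For example, counting only the rows of df1 where Age is not null (a minimal sketch, assuming the spark-shell implicits for the $ column syntax):

df1.filter($"Age".isNotNull).count()   // Long = 2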

collect() & collectAsList() operators

The collect() and collectAsList() operators return all the rows of the DataFrame as an Array or a java.util.List respectively. If the data is large this could cause an out-of-memory error on the driver, so these operators should be used carefully.

val df1 = Seq(("Sam Mendis","23","5.3"),("Henry Ford",null,"5.6"),("Ford","33",null)).toDF("Name","Age","Height")

df1.collect()
Array[org.apache.spark.sql.Row] = Array([Sam Mendis,23,5.3], [Henry Ford,null,5.6], [Ford,33,null])

df1.collectAsList()
java.util.List[org.apache.spark.sql.Row] = [[Sam Mendis,23,5.3], [Henry Ford,null,5.6], [Ford,33,null]]
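
If you only need a sample of a large DataFrame on the driver, a common pattern is to limit the rows before collecting, as in this minimal sketch:

df1.limit(2).collect()   // brings at most 2 rows back to the driver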

reduce(func) operator

The reduce operator takes a function as input and uses it to reduce the entire dataset into a single value. The function should take two values as input and produce a single output of the same datatype as the inputs.

val rdd1 = sc.parallelize(1 to 10)
rdd1.reduce(_ + _)

res1: Int = 55
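
The example above uses an RDD, but the typed Dataset API provides a reduce action as well. A minimal sketch (assuming the spark-shell, where the implicits needed for toDS() are already imported):

val ds1 = Seq(1, 2, 3, 4, 5).toDS()
ds1.reduce(_ + _)   // Int = 15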

🙂 kudos for learning something new 🙂
