Spark SQL Count Function - UnderstandingBigData

Spark SQL has count function which is used to count the number of rows of a Dataframe or table. We can also count for specific rows. People who having exposure to SQL should already be familiar with this as the implementation is same.

Let’s see the syntax and example. But before that lets create a Dataframe which we will use through this blog.

  val list = List(("1","2"),("3",null),("4","5"))
  val df1: DataFrame = spark.sparkContext.parallelize(list).toDF("col1","col2")
  df1.show()

+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   3|null|
|   4|   5|
+----+----+

Syntax of Count Function

The syntax if pretty straight forward.

To check count of Dataframe : df.count()

To check count of specific column in Dataframe : df.select(count(col_nm))

Let us count the rows of the Dataframe we created above.

println(df1.count())

3

Now let us see how we can find the row count for particular column. Now you must be wondering wont the Dataframe count be same as column count? Well lets find out ourselves.

  df1.select(count(col("col1")).as("COl1_CNT"),
  count(col("col2")).as("COl2_CNT")).show()

+--------+--------+
|COl1_CNT|COl2_CNT|
+--------+--------+
|       3|       2|
+--------+--------+

Interestingly as you can see first column returned 3 but second column returned 2. The reason being count() function does not count nulls.

So let us come up with a workaround on how we can get the true count of rows of a column even if it has nulls. For this we can check if it is null and assign “TRUE” or “FALSE” . Then we can cast these values to 1 or 0 and then apply count. The implementation is as below.

df1.select(count(isnull(col("col2")).cast("int")).as("COl2_CNT")).show

+--------+
|COl2_CNT|
+--------+
|       3|
+--------+

🙂 kudos for learning something new 🙂

If you want to know how you can drop rows with null values you can check my other blog.

Syntax of Count Function

Leave a ReplyCancel reply

Discover more from UnderstandingBigData