Choosing Between COALESCE with NULLIF and CASE Statements

A common scenario in data engineering is replacing NULLs and empty strings with default values. In Spark SQL, if you are unsure whether to use COALESCE with NULLIF or a CASE statement, this article explains what to use when. Let's dive into how each method works with code examples. Using COALESCE and NULLIF…
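As a rough sketch of the two approaches (assuming a SparkSession named `spark` and a hypothetical `customers` table whose `city` column may be NULL or empty):

```scala
// COALESCE + NULLIF: NULLIF(city, '') turns '' into NULL,
// then COALESCE substitutes the default for any NULL.
spark.sql("""
  SELECT COALESCE(NULLIF(city, ''), 'Unknown') AS city FROM customers
""")

// Equivalent CASE expression, spelling the same logic out explicitly.
spark.sql("""
  SELECT CASE WHEN city IS NULL OR city = '' THEN 'Unknown'
              ELSE city
         END AS city
  FROM customers
""")
```

Both produce the same result; COALESCE with NULLIF is simply the more compact spelling.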

Hive Database Management Cheat Sheet

This is a comprehensive Hive database management cheat sheet. It covers creating a new database (basic syntax, options, usage notes, and examples, including creating a database with default locations), the ALTER DATABASE statement, which is used to change the properties of a database, with usage examples, and the DROP DATABASE…
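As a quick taste of the statements the cheat sheet covers, here is a hedged sketch run through `spark.sql` (assumes a Hive-enabled SparkSession; the database name and property are made up):

```scala
// Create a database, optionally with a comment.
spark.sql("CREATE DATABASE IF NOT EXISTS sales_db COMMENT 'demo database'")

// ALTER DATABASE changes database properties.
spark.sql("ALTER DATABASE sales_db SET DBPROPERTIES ('owner' = 'data_team')")

// DROP DATABASE; CASCADE also drops any tables inside it.
spark.sql("DROP DATABASE IF EXISTS sales_db CASCADE")
```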

Deep Dive into Apache Spark’s reduceByKey vs groupByKey

Two of the transformations most frequently used by data engineers in Apache Spark are reduceByKey and groupByKey. While they might seem similar at first glance, they serve different purposes and have distinct performance implications. But before we deep dive into Apache Spark's reduceByKey vs groupByKey, we need to quickly understand what role shuffling plays…
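A minimal word-count style sketch of the two transformations, assuming a SparkContext named `sc`:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// reduceByKey combines values on each partition before shuffling,
// so less data crosses the network.
val reduced = pairs.reduceByKey(_ + _)

// groupByKey shuffles every (key, value) pair first and then groups,
// moving more data and risking memory pressure on skewed keys.
val grouped = pairs.groupByKey().mapValues(_.sum)
```

Both yield `("a", 2)` and `("b", 1)`, but reduceByKey gets there with a smaller shuffle.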

Different ways of creating delta table in Databricks

Today we will learn different ways of creating a Delta table in Databricks. We will see how tables can be created using existing Apache Spark code and also Spark SQL. Creating a managed Delta table using SQL: in a managed table, Databricks maintains both the data and the metadata of the table, which means if you…
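Two of the approaches can be sketched as follows (table names are hypothetical; assumes a SparkSession `spark` and an existing DataFrame `df`):

```scala
// Managed Delta table via SQL: Databricks owns both data and metadata.
spark.sql("""
  CREATE TABLE IF NOT EXISTS demo_managed (id INT, name STRING) USING DELTA
""")

// The same idea from existing Spark code, via the DataFrame writer API.
df.write.format("delta").saveAsTable("demo_from_df")
```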

Spark SQL Count Function

Spark SQL has a count function which is used to count the number of rows of a DataFrame or table. We can also count specific rows. People who have exposure to SQL should already be familiar with this, as the implementation is the same. Let's see the syntax and an example. But before that, let's create a…
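A short sketch of both forms, assuming a DataFrame `df` and a hypothetical `orders` table with a `status` column:

```scala
// Total number of rows in the DataFrame.
df.count()

// The same via SQL, plus a count restricted to specific rows.
spark.sql("SELECT COUNT(*) FROM orders")
spark.sql("SELECT COUNT(*) FROM orders WHERE status = 'SHIPPED'")
```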

Spark Escape Double Quotes in Input File

Here we will see how Spark escapes double quotes in an input file. Ideally, having double quotes in a column in a file is not an issue. But we face an issue when the content inside the double quotes also has double quotes along with the file separator. Let's see an example of this. Below is the data we…
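The usual fix is the reader's `quote` and `escape` options; a sketch with a made-up file path:

```scala
// Read a CSV where quoted fields may themselves contain
// double quotes and the field separator.
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")   // fields are wrapped in double quotes
  .option("escape", "\"")  // an embedded quote is escaped by doubling it
  .csv("/path/to/input.csv")
```

Setting `escape` to the quote character itself handles the common `""` doubling convention.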

Spark UDF to Check Count of Nulls in each column

In this blog we will create a Spark UDF to check the count of nulls in each column. There could be a scenario where we need to find the number of values such as null, 'NA', or "" in each column. This could help in analysing the quality of the data. Let us see…
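The article's own function may differ, but one way to sketch the idea is a helper that builds a conditional count per column:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, when}

// For every column, count values that are null, "NA", or "".
def nullCounts(df: DataFrame): DataFrame =
  df.select(df.columns.map { c =>
    count(when(col(c).isNull || col(c) === "NA" || col(c) === "", c)).alias(c)
  }: _*)
```

The result is a one-row DataFrame whose columns mirror the input, each holding that column's "bad value" count.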

Spark Function to check Duplicates in Dataframe

Here we will create a function to check if a dataframe has duplicates. We will not only create one method but will try to create multiple methods. Example 1: here the function dupChk takes a dataframe as input. Now df.count returns the count of the dataframe, df.dropDuplicates removes the duplicates from the dataframe, and…
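Based on that description, the first method can be sketched as:

```scala
import org.apache.spark.sql.DataFrame

// True when the dataframe contains duplicate rows: dropping
// duplicates shrinks the row count only if duplicates exist.
def dupChk(df: DataFrame): Boolean =
  df.count() != df.dropDuplicates().count()
```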

How to Create Empty Dataframe in Spark Scala

Today we will learn how to create an empty dataframe in Spark Scala. We will cover various methods: creating an empty dataframe with no schema and also creating one with a schema. Empty dataframe with no schema: here we will create an empty dataframe which does not have any schema/columns. For this we will use emptyDataFrame…
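A sketch of both variants (column names are illustrative; assumes a SparkSession `spark`):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Empty dataframe with no schema at all.
val noSchema = spark.emptyDataFrame

// Empty dataframe with an explicit schema.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))
val withSchema = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
```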

Repartition in SPARK

Repartition in Spark does a full shuffle of the data and splits it into chunks based on user input. Using this we can increase or decrease the number of partitions. There are three ways in which Spark can repartition the data; we will see examples of all. But first, let us understand why we…
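The three forms can be sketched as follows (column name is hypothetical; assumes a DataFrame `df`):

```scala
import org.apache.spark.sql.functions.col

// Three ways to repartition; each triggers a full shuffle.
val byNumber = df.repartition(8)                 // target partition count
val byColumn = df.repartition(col("country"))    // hash-partition by column
val byBoth   = df.repartition(4, col("country")) // count and column together
```

By contrast, `df.coalesce(2)` only decreases the partition count and avoids a full shuffle.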


