A common scenario in data engineering is replacing NULL and empty strings with default values. In Spark SQL, if you are confused choosing […]
Blogs
Hive Database Management Cheat Sheet
This is a comprehensive Hive database management cheat sheet. Creating a New Database: Basic Syntax: Options: Usage Notes: Examples: Creating a New Database: Creating a […]
Deep Dive into Apache Spark’s reduceByKey vs groupByKey
Two of the transformations data engineers use most frequently in Apache Spark are reduceByKey and groupByKey. While they might seem similar at first glance, […]
Different ways of creating delta table in Databricks
Today we will learn different ways of creating a Delta table in Databricks. We will check how the tables can be created using the existing Apache […]
Spark SQL Count Function
Spark SQL has a count function, which counts the number of rows in a DataFrame or table. We can also count for specific […]
Spark Escape Double Quotes in Input File
Here we will see how to escape double quotes in an input file in Spark. Ideally, having double quotes in a column of a file is not an issue. […]
Spark UDF to Check Count of Nulls in each column
In this blog we will create a Spark UDF to check the count of nulls in each column. There could be a scenario where we would […]
Spark Function to check Duplicates in Dataframe
Here we will create a function to check whether a DataFrame has duplicates. We will not only create one method but will try to create […]
How to Create Empty Dataframe in Spark Scala
Today we will learn how to create an empty DataFrame in Spark Scala. We will cover various methods for creating an empty DataFrame with no […]
Repartition in SPARK
Repartition in Spark performs a full shuffle of the data and splits it into chunks based on user input. Using this, we can increase or […]