Home - UnderstandingBigData

Latest from the Blog

Choosing Between COALESCE with NULLIF and CASE Statements

A common scenario in data engineering is replacing NULLs and empty strings with default values. If you are unsure whether to use COALESCE with NULLIF or a CASE statement in Spark SQL, this article explains which to use and when. Let’s dive into how each method works, with code examples. Using COALESCE and NULLIF…

November 7, 2023
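
As a rough illustration of the two approaches named in this post, the sketch below replaces NULLs and empty strings with a default value both ways. It is not taken from the article: the table `customers`, the column `city`, and the default `'Unknown'` are hypothetical, and it runs against SQLite through Python's standard library rather than Spark. COALESCE, NULLIF, and CASE are standard SQL and are also available in Spark SQL, so the query is expected to carry over, but that is an assumption to check against your Spark version.

```python
import sqlite3

# Hypothetical sample data: a NULL, an empty string, and a real value.
rows = [(1, None), (2, ""), (3, "Berlin")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

query = """
SELECT
    id,
    -- NULLIF turns '' into NULL, then COALESCE supplies the default.
    COALESCE(NULLIF(city, ''), 'Unknown') AS city_coalesce,
    -- The CASE form spells out both conditions explicitly.
    CASE
        WHEN city IS NULL OR city = '' THEN 'Unknown'
        ELSE city
    END AS city_case
FROM customers
ORDER BY id
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

For this data both columns agree on every row. The COALESCE with NULLIF form is shorter, while CASE tends to be easier to extend when more conditions are needed.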

Hive Database Management Cheat Sheet

This is a comprehensive Hive database management cheat sheet. It covers creating a new database (basic syntax, options, usage notes, and examples, including a database with default locations), the ALTER DATABASE statement, which changes the properties of a database (syntax and usage examples), and the DROP DATABASE syntax. The DROP DATABASE…

November 5, 2023

Deep Dive into Apache Spark’s reduceByKey vs groupByKey

Two transformations frequently used by data engineers in Apache Spark are reduceByKey and groupByKey. While they might seem similar at first glance, they serve different purposes and have distinct performance implications. But before we dive deep into Apache Spark’s reduceByKey vs groupByKey, we need to quickly understand what role shuffling plays…

October 6, 2023
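
To make the shuffling point in this post concrete, here is a minimal plain-Python simulation. It is not Spark code and not taken from the article. It assumes the commonly documented behavior that groupByKey sends every record across the shuffle, while reduceByKey first combines values within each partition. The two-partition input and the helper names `group_by_key_style` and `reduce_by_key_style` are hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (word, 1) pairs already split across two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def group_by_key_style(parts):
    """Shuffle every record, then sum the values per key afterwards."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def reduce_by_key_style(parts):
    """Pre-combine inside each partition, shuffle partial sums, then merge."""
    shuffled = []
    for part in parts:
        partial = defaultdict(int)
        for key, value in part:
            partial[key] += value
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_shuffled = group_by_key_style(partitions)
reduced_totals, reduced_shuffled = reduce_by_key_style(partitions)

print(grouped_totals, "records shuffled:", grouped_shuffled)
print(reduced_totals, "records shuffled:", reduced_shuffled)
```

Both strategies produce the same totals, but in this toy input the pre-combining variant moves 4 records instead of 7. The gap grows with the number of repeated keys per partition.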
