Different ways of creating delta table in Databricks

Today we will learn different ways of creating a Delta table in Databricks. We will see how tables can be created using existing Apache Spark code and also Spark SQL. Create managed Delta table using SQL: In a managed table, Databricks maintains both the data and the metadata of the table, which means if you […]

Spark SQL Count Function

Spark SQL has a count function, which is used to count the number of rows of a Dataframe or table. We can also count only specific rows. People who have exposure to SQL should already be familiar with this, as the implementation is the same. Let’s see the syntax and an example. But before that, let’s create a […]

Spark Escape Double Quotes in Input File

Here we will see how to escape double quotes in an input file in Spark. Ideally, having double quotes in a column in a file is not an issue. But we face an issue when the content inside the double quotes also has double quotes along with the file separator. Let’s see an example of this. Below is the data we […]

Spark UDF to Check Count of Nulls in each column

In this blog we will create a Spark UDF to check the count of nulls in each column. There could be a scenario where we need to find the number of nulls, ‘NA’, “”, etc. in each column. This can help in analyzing the quality of the data. Let us see […]

Spark Function to check Duplicates in Dataframe

Here we will create a function to check whether a dataframe has duplicates. We will not create just one method but will try to create multiple methods. Example 1: Here the function dupChk takes a dataframe as input. The df.count function returns the row count of the dataframe, and df.dropDuplicates removes the duplicates from the dataframe and […]

How to Create Empty Dataframe in Spark Scala

Today we will learn how to create an empty dataframe in Spark Scala. We will cover various methods to create an empty dataframe with no schema and also with a schema. Empty Dataframe with no schema: Here we will create an empty dataframe which does not have any schema/columns. For this we will use emptyDataFrame […]

Repartition in SPARK

Repartition in Spark does a full shuffle of the data and splits it into chunks based on user input. Using this we can increase or decrease the number of partitions. There are three ways in which Spark can repartition the data; we will see examples of all of them. But first, let us understand why we […]

Spark Sql Inner Join

In this blog, we will understand how to join two or more Dataframes in Spark. Inner join in Spark works exactly like an inner join in SQL. If you are unfamiliar with what a join is, it is used to combine rows from two or more dataframes based on a related column between them. Inner join returns records […]

ArrayType Column in Spark SQL

Apart from basic datatypes such as Numeric, String, and Datetime, Spark SQL also has an ArrayType column. This type is not limited to Array; it also covers other collections like Seq and List. Note that all the code written below is in Scala. Spark Array Type Column: Array is a collection […]

Correct Column Order during Insert into Spark Dataframe

We need to maintain the correct column order during an insert into a Spark Dataframe. If we don’t maintain the order, data can get inserted into the wrong columns. In this blog we will see an example of the issue and also the solution. The example code is written in Scala. Issue: Let’s first check […]
