Spark Lazy Evaluation

Today we will learn about Spark Lazy Evaluation. We will learn what it is, why it is required, how Spark implements it, and what its advantages are. We know that Spark is written in Scala, and Scala has an option to run lazily [You can check the lesson here], but for Spark, the execution […]
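
As a rough sketch of the idea (not from the post itself), here is a minimal Scala example showing that transformations are only recorded until an action is called; the local session and sample data below are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("LazyEvalDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3, 4, 5).toDF("value")

    // Transformations are only recorded in the query plan; nothing executes yet.
    val doubled = df.filter($"value" > 2).withColumn("doubled", $"value" * 2)

    // Only when an action such as show() is called does Spark actually run the job.
    doubled.show()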

HDFS Data Blocks and Block Size

When a file is stored in HDFS, Hadoop breaks the file into BLOCKS before storing it. What this means is, when you store a large file, Hadoop breaks it into smaller chunks based on a predefined block size and then stores those chunks in DataNodes across the cluster. The default block size is 128 MB […]
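
As a quick illustration of how blocks can be inspected programmatically, here is a small Scala sketch using the Hadoop FileSystem API; the path /user/data/sample.txt is hypothetical and it assumes a reachable HDFS with its configuration on the classpath:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    val fs = FileSystem.get(conf)

    // Hypothetical file; replace with a real HDFS path.
    val status = fs.getFileStatus(new Path("/user/data/sample.txt"))

    // Block size the file was written with (e.g. 134217728 bytes = 128 MB).
    println(s"Block size: ${status.getBlockSize} bytes")

    // The block locations tell you how many chunks the file occupies and where they live.
    val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
    println(s"Number of blocks: ${blocks.length}")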

HIVE SHOW PARTITIONS

If you want to display all the Partitions of a HIVE table, you can do that using the SHOW PARTITIONS command. If you want to learn more about Hive Table Partitions you can check it here. So today we are going to understand the below topics: show partitions syntax, show partitions using where / order by […]
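
For a rough idea of the command, here is a minimal Scala sketch run through Spark SQL, assuming a Hive-enabled SparkSession available as spark (as in spark-shell); the table sales and its country partition column are made up:

    // List every partition of the table.
    spark.sql("SHOW PARTITIONS sales").show(truncate = false)

    // Narrow the listing to one partition value.
    spark.sql("SHOW PARTITIONS sales PARTITION (country='US')").show(truncate = false)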

Hive Split a row into multiple rows

You can split a row in a Hive table into multiple rows using the lateral view explode function. The one thing that needs to be present is a delimiter on which we can split the values. Let's dive straight into how to implement it: split row on a single delimiter, split row on multiple delimiters, conclusion. split […]
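
As a minimal sketch of the pattern (run here through Spark SQL, assuming a Hive-enabled SparkSession spark; the table orders and its items column holding comma-separated values are hypothetical):

    spark.sql("""
      SELECT id, item
      FROM orders
      LATERAL VIEW explode(split(items, ',')) t AS item
    """).show()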

Hive Table Partition

Using Hive Partition you can divide a table horizontally into multiple sections. This division happens based on a partition key, which is just a column in your Hive table. Throughout this lesson we will understand various aspects of Hive Partition: why use Partition in Hive, how to create a partition in a Hive table, create […]
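
As a rough sketch of the DDL involved (again via Spark SQL, assuming a Hive-enabled SparkSession spark; the table sales and its country partition key are made up):

    // Create a table partitioned on the country column.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)
      PARTITIONED BY (country STRING)
      STORED AS PARQUET
    """)

    // Add a partition explicitly (static partitioning).
    spark.sql("ALTER TABLE sales ADD IF NOT EXISTS PARTITION (country='US')")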

Spark Dataframe Actions

When we call an Action on a Spark dataframe, all the Transformations get executed one by one. This happens because of Spark Lazy Evaluation, which does not execute the transformations until an Action is called. In this article we will check commonly used Actions on a Spark dataframe: Spark Dataframe show(), head() and first() […]
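
Here is a minimal Scala sketch of a few common actions (assuming a SparkSession spark; the sample data is made up):

    import spark.implicits._

    val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    people.show()               // prints the rows as a table
    println(people.head())      // returns the first Row
    println(people.count())     // returns the number of rows
    val rows = people.collect() // brings all rows to the driver as Array[Row]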

Spark Dataframe drop rows with NULL values

The data we normally deal with may not be clean. In such cases we may need to clean the data by applying some logic. One such case is the presence of null values in rows. We can handle it by dropping those Spark dataframe rows using the drop() function: drop rows […]
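
For a quick sketch of the API (assuming a SparkSession spark; the sample data with nulls is made up), drop() is reached through the na accessor:

    import spark.implicits._

    val df = Seq(
      ("alice", Some(30)),
      ("bob", None),
      (null, Some(40))
    ).toDF("name", "age")

    df.na.drop().show()             // drop rows containing any null
    df.na.drop(Seq("name")).show()  // drop rows where the name column is null
    df.na.drop("all").show()        // drop rows only when every column is null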

Spark Dataframe withColumn

Using the Spark withColumn() function we can add, rename, derive, or split a Dataframe column. There are many other things that can be achieved using withColumn(), which we will check one by one with suitable examples. But first, let's create a dataframe which we will modify throughout this tutorial. Throughout this […]
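
As a rough sketch (assuming a SparkSession spark; the columns and data are made up), withColumn() adds or replaces a column, while renaming goes through withColumnRenamed():

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    val updated = df
      .withColumn("age_plus_one", col("age") + 1)    // derive a new column
      .withColumn("name_upper", upper(col("name")))  // add a transformed column
      .withColumnRenamed("age", "age_years")         // renaming uses a separate method

    updated.show()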

SPARK DATAFRAME Union AND UnionAll

Using Spark Union and UnionAll you can merge the data of 2 Dataframes and create a new Dataframe. Remember, you can merge 2 Spark Dataframes only when they have the same Schema. UnionAll is deprecated since Spark 2.0 and it is not advised to use it any longer. Let's check with a few examples. Note: Union […]
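
Here is a minimal Scala sketch (assuming a SparkSession spark; both dataframes below are made up and share the same schema). Note that union() keeps duplicates, i.e. it behaves like SQL UNION ALL:

    import spark.implicits._

    val df1 = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    val df2 = Seq((2, "bob"), (3, "carol")).toDF("id", "name")

    // union() appends the rows of df2 to df1, duplicates included.
    val merged = df1.union(df2)
    merged.show()

    // Apply distinct() afterwards if deduplicated UNION behaviour is needed.
    merged.distinct().show()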

SPARK distinct and dropDuplicates

Both the Spark distinct and dropDuplicates functions help in removing duplicate records. One additional advantage of dropDuplicates() is that you can specify the columns to be used in the deduplication logic. We will see the use of both with a couple of examples: SPARK Distinct Function, Spark dropDuplicates() Function, dropDuplicates() error: type mismatch, Conclusion. SPARK Distinct Function: The […]
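
As a quick sketch of the difference (assuming a SparkSession spark; the sample data is made up):

    import spark.implicits._

    val df = Seq(
      ("alice", "US"),
      ("alice", "US"),
      ("alice", "UK")
    ).toDF("name", "country")

    df.distinct().show()              // removes rows identical across all columns
    df.dropDuplicates("name").show()  // deduplicates on the name column only
    df.dropDuplicates(Seq("name", "country")).show() // same result as distinct() here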


