Let's say you need to find the list of external tables among all the tables in a Hive database using Spark.
One way is to query the Hive metastore directly, but this is not always possible as we may not have permission to access it.

We can try the approach below instead:

Step 1: Create one internal table and two external tables.

scala> spark.sql("Create table TT_Test1(col1 int)")
scala> spark.sql("Create external table TT_Test2(col1 int) location 'hdfs:path'")
scala> spark.sql("Create external table TT_Test3(col1 int) location 'hdfs:path'")

Step 2: Check the tables we just created. As you can see, show tables returns three columns: database, tableName, and isTemporary.

scala> spark.sql("show tables like 'tt*'").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|DB1     | tt_test1|      false|
|DB1     | tt_test2|      false|
|DB1     | tt_test3|      false|
+--------+---------+-----------+

Step 3: Create a function that takes a table name as input and returns a flag indicating whether the table is external (1) or internal (0).
The logic we will use: show create table returns the CREATE TABLE statement as a string. If we scan that string for the EXTERNAL keyword, we can identify the type of the table. The function would look something like this:

import org.apache.spark.sql.functions.udf

val func1 = udf((tableName: String) => {
  // the createtab_stmt column holds the full CREATE TABLE statement;
  // the count is 1 when it contains EXTERNAL, 0 otherwise
  val df1 = spark.sql(s"show create table ${tableName}")
  df1.filter($"createtab_stmt".contains("EXTERNAL")).count().toLong
})
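
You can check the assumption behind this function by running show create table yourself; the result is a single-row DataFrame whose createtab_stmt column holds the full statement:

scala> spark.sql("show create table tt_test2").show(false)

For tt_test2 the statement should contain the EXTERNAL keyword (along with the LOCATION clause), while for tt_test1 it should not.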

Step 4: Create a DataFrame containing the list of table names.

scala> val dftable = spark.sql("show tables like 'tt*'")
dftable: org.apache.spark.sql.DataFrame = [database: string, tableName: string ... 1 more field]
scala> dftable.show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
|DB1     | tt_test1|      false|
|DB1     | tt_test2|      false|
|DB1     | tt_test3|      false|
+--------+---------+-----------+
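
Note that show tables also lists temporary views (the isTemporary column above), and show create table may fail on those. If temporary views could be present, it is safer to filter them out first, for example:

scala> val dfhive = dftable.filter(!$"isTemporary")

Here all three tables are permanent, so the examples below keep using dftable.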

Step 5: Pass the tableName column through the function we created.

scala> dftable.select($"tableName",func1($"tableName")).show
+---------+--------------+
|tableName|UDF(tableName)|
+---------+--------------+
| tt_test1|             0|
| tt_test2|             1|
| tt_test3|             1|
+---------+--------------+

As you can see above, the first table is internal, so the function returned 0; for the rest it returned 1.
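
Since the goal was the list of external tables, we can finish by filtering on the flag the function returns, using the same dftable and func1 from above:

scala> dftable.filter(func1($"tableName") === 1).select($"tableName").show

Given the output above, this should return only tt_test2 and tt_test3. One caveat: the UDF calls spark.sql internally, which works in a local spark-shell but can fail when the UDF runs on remote executors, where no SparkSession is available. In that case the same scan can be done on the driver; a sketch (externalTables is just an illustrative name):

scala> val externalTables = dftable.select("tableName").collect.map(_.getString(0))
     |   .filter(t => spark.sql(s"show create table $t")
     |     .filter($"createtab_stmt".contains("EXTERNAL")).count > 0)

This leaves externalTables as an Array[String] of the external table names.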

Note: The same logic can be used to answer other questions, such as whether a Hive table is partitioned; we only have to change the keyword from "EXTERNAL" to "PARTITIONED BY".
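
A sketch of that variant, following the same pattern as func1 (isPartitioned is just an illustrative name):

val isPartitioned = udf((tableName: String) => {
  // same scan as before, but looking for the PARTITIONED BY clause
  val df1 = spark.sql(s"show create table ${tableName}")
  df1.filter($"createtab_stmt".contains("PARTITIONED BY")).count().toLong
})

scala> dftable.select($"tableName", isPartitioned($"tableName")).show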
