How to access a Hive table using Pyspark?

Pyspark

Apache Spark is an in-memory data processing framework written in Scala. It can process data up to 100 times faster than Hadoop MapReduce jobs, and it provides APIs for programming languages such as Scala, Java and Python.

Pyspark is the Python API for Apache Spark. It allows us to write a Spark application as a Python script, and it also provides the PySpark shell for interactively analyzing our data in a distributed environment. Using the spark-submit command, we can submit a Spark or Pyspark application to the cluster.

Hive Table

In this tutorial, we are going to read a Hive table using a Pyspark program. In Hive, we have a table called electric_cars in the car_master database. It contains two columns: car_model and price_in_usd.
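For reference, the table described above could be created in Hive with DDL along these lines. This is only an illustrative sketch: the tutorial does not show the real DDL, so the column types here are assumptions.

```sql
-- Hypothetical DDL for the example table (column types are assumed)
CREATE DATABASE IF NOT EXISTS car_master;

CREATE TABLE IF NOT EXISTS car_master.electric_cars (
    car_model     STRING,
    price_in_usd  DECIMAL(10,2)
);
```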

Write Pyspark program to read the Hive Table

Step 1: Set the Spark environment variables

Before running the program, we need to set the location where the Spark files are installed and add it to the PATH variable. If multiple Spark versions are installed on the system, we also need to point to the specific version we want to use.
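A minimal sketch of these environment variables, assuming Spark is installed under /opt/spark (adjust the path for your system):

```shell
# Assumed Spark installation path -- adjust to where Spark is installed
export SPARK_HOME=/opt/spark

# Prepend Spark's bin directory so spark-submit is found on the PATH
export PATH="$SPARK_HOME/bin:$PATH"
```

If several Spark versions are installed side by side (for example /opt/spark-3.3.2 and /opt/spark-2.4.8), pointing SPARK_HOME at the desired version's directory selects which one is used.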

Step 2: spark-submit command

We write a Spark program in a Python script with the file extension .py. To submit that Spark application to the cluster, we use the spark-submit command as below.
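A minimal spark-submit invocation might look like the following. The master and deploy-mode values here are assumptions; adjust them for your cluster setup.

```shell
# Submit the Pyspark application to a YARN cluster (values are assumed)
spark-submit \
  --master yarn \
  --deploy-mode client \
  read_hive_table.py
```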

Step 3: Write a Pyspark program to read hive table

In the Pyspark program, we need to create a Spark session. For that, we import the pyspark library as a dependency.

Next, we create the Spark session in the main block, which is used to read the Hive table. enableHiveSupport() enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.
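A sketch of the import and session creation, with enableHiveSupport() applied to the session builder (the application name is an assumption):

```python
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Build a Spark session with Hive support so spark.sql can reach
    # the persistent Hive metastore and read Hive tables
    spark = SparkSession.builder \
        .appName("read_hive_table") \
        .enableHiveSupport() \
        .getOrCreate()
```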

To read the Hive table, we write a custom function named FetchHiveTable. This function runs a select query on the electric_cars table using the spark.sql method, then stores the result in a data frame.

Next, we use the collect() function to retrieve the elements from the data frame; it returns the rows as an array. Please note that collect() is recommended only for small datasets, as it brings all the data to the driver node and can otherwise cause an out-of-memory error.
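The custom function might be sketched as follows. The exact select query and the printing of rows are assumptions based on the description above:

```python
def FetchHiveTable(spark):
    # Run a select query on the Hive table and store the result in a data frame
    cars_df = spark.sql(
        "SELECT car_model, price_in_usd FROM car_master.electric_cars"
    )

    # collect() returns the rows as an array on the driver node;
    # recommended only for small datasets to avoid out-of-memory errors
    for row in cars_df.collect():
        print(row["car_model"], row["price_in_usd"])
```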

Next, we call this function from the main block and then stop the Spark application.

Pyspark program to read Hive table => read_hive_table.py

Let's code the Pyspark program in a single file and name it read_hive_table.py.
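A sketch of what read_hive_table.py might look like, based on the steps described: the database, table and column names match the tutorial, while the query and printing logic are assumptions. Running it requires a Spark installation with access to a Hive metastore.

```python
# read_hive_table.py -- sketch of the complete Pyspark program
from pyspark.sql import SparkSession


def FetchHiveTable(spark):
    """Read the Hive table and print its rows on the driver."""
    cars_df = spark.sql(
        "SELECT car_model, price_in_usd FROM car_master.electric_cars"
    )
    # collect() brings all rows to the driver -- fine for a small table
    for row in cars_df.collect():
        print(row["car_model"], row["price_in_usd"])


if __name__ == "__main__":
    # Enable Hive support to connect to the persistent Hive metastore
    spark = SparkSession.builder \
        .appName("read_hive_table") \
        .enableHiveSupport() \
        .getOrCreate()

    FetchHiveTable(spark)

    # Stop the Spark application
    spark.stop()
```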

Shell script to call the Pyspark program => test_script.sh

To set the Spark environment variables and execute our Pyspark program, we create a shell script file named test_script.sh. Inside this script, we execute the Pyspark program using the spark-submit command.
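test_script.sh might be sketched as below; the SPARK_HOME path is an assumption and should point at your Spark installation.

```shell
#!/bin/bash
# test_script.sh -- set Spark environment variables and run the Pyspark program

# Assumed Spark installation path; adjust for your system
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

# Execute the Pyspark program using spark-submit
spark-submit read_hive_table.py
```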

Execute shell script to run the Pyspark program

Finally, we run the shell script file test_script.sh as below. It executes our Pyspark program read_hive_table.py.
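Assuming the script is in the current directory, it can be run with:

```shell
sh test_script.sh
```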

Output

[Screenshot: output of the Pyspark program reading the Hive table]