How to write a Spark dataframe to Hive table in Pyspark?

Apache spark to write a Hive Table

Apache Spark provides an option to read from Hive table as well as write into Hive table. In this tutorial, we are going to write a Spark dataframe into a Hive table. Since Spark has an in-memory computation, it can process and write a huge number of records in much faster way. Lets write a Pyspark program to perform the below steps.

  • Read the data from the csv file and load it into dataframe using Spark
  • Write a Spark dataframe into a Hive table.
Apache spark to write a Hive table
Apache spark to write a Hive table

Create a Spark dataframe from the source data (csv file)

We have a sample data in a csv file which contains seller details of E-commerce website. Using the Spark Dataframe Reader API, we can read the csv file and load the data into dataframe. The csv file seller_details.csv has the below data and it resides in the following path /x/home/user_alex. We are going to load this data into Spark dataframe.

Lets create a spark session variable with the name of “spark”. This variable is used to access the read method of Dataframe Reader API

Next we need to create the schema of the dataframe with the data type. Based on the source data which is in the csv file, we are creating the below schema.

Now we need to call the read method of Dataframe Reader API using the spark session variable.Lets write a code to load the csv data into Dataframe.

  • path of the source data – /x/home/user_alex/seller_details.csv
  • mode for dealing with corrupt records – FAILFAST (throws an exception when it meets corrupted records)
  • schema of the dataframe – sellerSchemaStruct (schema variable is created in previous step)
  • input option – Since our csv file has header record, we set it as true.
  • data source format – specified the input data source format as csv.

The dataframe sellerDetailsCsvDF will be created with the above statement.

Write a Spark dataframe to a Hive table

Similar to Dataframe Reader API, Spark has a Dataframe Writer API to handle the write operation. It has a write method to perform those operation. Using any of our dataframe variable, we can access the write method of the API.

We have two different ways to write the spark dataframe into Hive table.

Method 1 : write method of Dataframe Writer API

Lets specify the target table format and mode of the write operation.

  • Output data format – We mentioned the output data format as ORC. The default format is parquet.
  • mode – It specify the behaviour when the table is already exist. We mentioned as overwrite to overwrite the existing data.

Method 2 : create a temporary view

The createOrReplaceTempView method is used to create a temporary view from the dataframe. We created the view with the name of temp_table. This can be used as Hive table. The lifetime of this temporary table is tied to the SparkSession that was used to create this DataFrame.

Next we can execute the “create table as” statement to create the Hive table using the view name. Here the target table is seller_details and the database name is Sales_Db.

Complete code to create a dataframe and write it into a Hive Table

The Pyspark program is saved with the name as write_df_to_hive.py. The program execution is starts from main method (if name == “main“:). In this program, we are using the Dataframe write function to save the dataframe as Hive table. The another option to write a hive table is that creating a temporary view which is commented in the program.

Shell script to run the Pyspark program => test_script.sh

In this shell script, we are setting the Spark environment variable and running the spark-submit command to execute our Pyspark program write_df_to_hive.py.

Execute the shell script to run the Pyspark program

Finally we can run the shell script test_script.sh. It will execute Pyspark program to write the dataframe to Hive table.

Output

Spark dataframe to Hive Table
Spark dataframe to Hive Table

Lets check the Hive table seller_details in database Sales_Db. The Spark dataframe is saved as Hive table as below. The data is looks good. Also we checked the data type of the columns and format of the table using show create table statement. Appropriate data type is mapped for each columns as below. The table is created with the format of ORC as we given in the Pyspark program.

Recommended Articles