How to send Spark dataframe values as HTML table to Email?

Spark dataframe

A Spark dataframe is a distributed collection of rows that share the same schema. The schema defines the structure of the dataframe, such as each column name and its associated type. A Spark dataframe is equivalent to a table in a relational database or a dataframe in R/Python. Dataframes are available in general-purpose programming languages such as Java, Python, and Scala.

The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. A dataframe can be constructed from different sources such as Hive tables, external databases (for example, Oracle and MySQL), structured data files, or existing RDDs.

Write a Pyspark program to send a Spark dataframe as an HTML table in an email

Step 1 : Create dataframe from Hive table

In this example, we are going to create a Spark dataframe by reading a Hive table. Let's consider that we have a patient_report table in Hive which holds the blood sugar details of each patient.

From this table, we are going to extract the patients whose blood sugar level is more than 200 mg/dL in the month of March 2022.

Patient report in Hive

Let's write a Pyspark program to read those patient details from the Hive table and create a Spark dataframe.
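The snippet below is a minimal sketch of this step. The table name hospital_db.patient_report and the dataframe name diagnosis_res_df come from this article; the application name is just an example.

from pyspark.sql import SparkSession

# Create a Spark session with Hive support enabled so Hive tables are accessible
spark = SparkSession.builder \
    .appName("BloodSugarReport") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the patient details from the Hive table into a Spark dataframe
diagnosis_res_df = spark.sql("select * from hospital_db.patient_report")
diagnosis_res_df.show(5)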

As shown in the program above, we first read the patient details from the Hive table hospital_db.patient_report and create a Spark dataframe named diagnosis_res_df.

Step 2 : Loop through dataframe using for loop

Next, we register the dataframe as a temporary table named diagnosis_temp_table so that we can run spark.sql() on it to extract the patients whose blood sugar level is more than 200 mg/dL in March 2022.

We then use the collect() function to retrieve the rows from the dataframe, iterate over each row, and assign each column value to a separate variable.
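A sketch of this step is shown below. The temporary table name diagnosis_temp_table comes from the article; the column names patient_id, patient_name, blood_sugar_level and test_date are assumptions made for illustration, so adjust them to match the actual patient_report schema.

# Register the dataframe as a temporary table so we can query it with SQL
diagnosis_res_df.createOrReplaceTempView("diagnosis_temp_table")

# Extract the patients whose blood sugar level is more than 200 mg/dL in March 2022
# (column names are assumed for this example)
high_sugar_df = spark.sql("""
    select patient_id, patient_name, blood_sugar_level, test_date
    from diagnosis_temp_table
    where blood_sugar_level > 200
      and test_date between '2022-03-01' and '2022-03-31'
""")

# collect() brings the filtered rows back to the driver as a list of Row objects
for row in high_sugar_df.collect():
    patient_id = row["patient_id"]
    patient_name = row["patient_name"]
    blood_sugar_level = row["blood_sugar_level"]
    test_date = row["test_date"]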

Step 3 : Prepare HTML table

In the previous step, we assigned the required values to separate variables. Now we need to create an HTML table with those variables. In HTML table syntax, we have to define the table header, table rows and table data.

Let's define this in Pyspark as below. The variable body holds the header section of the HTML table along with the style properties. The variable strTable holds the table header row. Then, in the for loop, we prepare the table data cells that make up each table row.
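The following is a minimal sketch of the HTML preparation. The variable names body and strTable come from the article; the style properties, column headings and row fields are illustrative assumptions.

# Header section of the HTML document with style properties
body = """
<html>
  <head>
    <style>
      table, th, td { border: 1px solid black; border-collapse: collapse; padding: 5px; }
      th { background-color: #4CAF50; color: white; }
    </style>
  </head>
  <body>
    <p>Patients with blood sugar level above 200 mg/dL - March 2022</p>
"""

# Table header row
strTable = ("<table><tr><th>Patient ID</th><th>Patient Name</th>"
            "<th>Blood Sugar (mg/dL)</th><th>Test Date</th></tr>")

# Prepare the table data cells for each row of the dataframe
for row in high_sugar_df.collect():
    strRW = ("<tr><td>" + str(row["patient_id"]) + "</td><td>"
             + str(row["patient_name"]) + "</td><td>"
             + str(row["blood_sugar_level"]) + "</td><td>"
             + str(row["test_date"]) + "</td></tr>")
    strTable = strTable + strRW

# Close the table and the HTML document
html_table = body + strTable + "</table></body></html>"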

Step 4 : Send email using Pyspark

The next step is to send the HTML table in an email. For the email, we need to define the inputs below:

  • From address – the user who is sending the email
  • Recipients – the email addresses of the recipients
  • CC – the CC addresses of the recipients
  • Subject – the subject of the email
  • MIMEText – the body text of the email message; here the body of the message is an HTML table

Once we set all these values, we need to create an SMTP object. In the SMTP object, we mention the mail server that will be used to send the email.
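Below is a minimal sketch of the email step using Python's smtplib and email modules. The addresses, subject line and mail server name are placeholders, not values from the article.

import smtplib
from email.mime.text import MIMEText

# Build the email message with the HTML table as the body
msg = MIMEText(html_table, "html")
msg["Subject"] = "Blood sugar report - March 2022"
msg["From"] = "noreply@hospital.com"           # From address (placeholder)
recipients = ["doctor@hospital.com"]           # Recipients (placeholder)
cc = ["nurse@hospital.com"]                    # CC addresses (placeholder)
msg["To"] = ", ".join(recipients)
msg["Cc"] = ", ".join(cc)

# Create the SMTP object with the mail server (server name is an assumption)
server = smtplib.SMTP("smtp.hospital.com", 25)
server.sendmail(msg["From"], recipients + cc, msg.as_string())
server.quit()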

Complete Pyspark program to send dataframe via email

Let's write the complete Pyspark program by combining all the steps. The Pyspark program is saved as send_blood_sugar_report.py.
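A condensed sketch of send_blood_sugar_report.py, combining the snippets from the previous steps, could look like this (the column names, email addresses and mail server remain illustrative assumptions):

from pyspark.sql import SparkSession
import smtplib
from email.mime.text import MIMEText

def main():
    # Step 1: create a Spark session and read the Hive table
    spark = SparkSession.builder.appName("BloodSugarReport").enableHiveSupport().getOrCreate()
    diagnosis_res_df = spark.sql("select * from hospital_db.patient_report")

    # Step 2: filter the patients with blood sugar level above 200 mg/dL in March 2022
    diagnosis_res_df.createOrReplaceTempView("diagnosis_temp_table")
    high_sugar_df = spark.sql("""
        select patient_id, patient_name, blood_sugar_level, test_date
        from diagnosis_temp_table
        where blood_sugar_level > 200
          and test_date between '2022-03-01' and '2022-03-31'
    """)

    # Step 3: build the HTML table
    body = ("<html><head><style>table, th, td {border: 1px solid black; "
            "border-collapse: collapse; padding: 5px;}</style></head><body>")
    strTable = ("<table><tr><th>Patient ID</th><th>Patient Name</th>"
                "<th>Blood Sugar (mg/dL)</th><th>Test Date</th></tr>")
    for row in high_sugar_df.collect():
        strTable += ("<tr><td>" + str(row["patient_id"]) + "</td><td>"
                     + str(row["patient_name"]) + "</td><td>"
                     + str(row["blood_sugar_level"]) + "</td><td>"
                     + str(row["test_date"]) + "</td></tr>")
    html_table = body + strTable + "</table></body></html>"

    # Step 4: send the report by email
    msg = MIMEText(html_table, "html")
    msg["Subject"] = "Blood sugar report - March 2022"
    msg["From"] = "noreply@hospital.com"
    msg["To"] = "doctor@hospital.com"
    server = smtplib.SMTP("smtp.hospital.com", 25)
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()

    spark.stop()

if __name__ == "__main__":
    main()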

Shell script to run the Pyspark program => test_script.sh

Let's write a shell script to run this Pyspark program. In that script, we set the Spark environment variables and then use the spark-submit command to execute our Pyspark program send_blood_sugar_report.py.
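A minimal sketch of test_script.sh is shown below; the SPARK_HOME path and the spark-submit options are assumptions and should be adapted to your cluster.

#!/bin/bash
# test_script.sh - set the Spark environment variables (path is an assumption)
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH

# Submit the Pyspark program
spark-submit --master yarn --deploy-mode client send_blood_sugar_report.py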

Execute the shell script to run the Pyspark program

Finally, we can run the shell script test_script.sh. It executes the Pyspark program and sends the blood sugar report to the doctor.
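For example, assuming the script is saved in the current directory:

sh test_script.sh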

Output

Pyspark program execution log

Report in email

As shown below, the report contains the details of the patients whose blood sugar level was more than 200 mg/dL in March 2022.

Report attached in mail
