How to send Spark dataframe values as an HTML table in an email?

Spark dataframe

A Spark dataframe is a distributed collection of rows with the same schema. The schema defines the structure of the dataframe, such as the column names and their associated types. A Spark dataframe is equivalent to a table in a relational database or a dataframe in R/Python. The DataFrame API is available in general-purpose programming languages such as Java, Python, and Scala.

The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API. A dataframe can be constructed from different sources such as Hive tables, external databases (for example, Oracle or MySQL), structured data files, or existing RDDs.
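
As a quick illustration (not part of the article's main program), here is a minimal PySpark sketch of two of those construction paths; the app name, column names, and file path are made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe_sources_demo").getOrCreate()

# Build a dataframe from an in-memory list of tuples (similar to starting from an existing RDD)
people_df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
people_df.show()

# Build a dataframe from a structured data file such as CSV (the path is hypothetical)
# csv_df = spark.read.csv("/tmp/people.csv", header=True, inferSchema=True)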

Write a PySpark program to send a Spark dataframe as an HTML table in an email

Step 1: Create a dataframe from a Hive table

In this example, we are going to create a Spark dataframe by reading a Hive table. Let's assume we have a patient_report table in Hive that holds the blood sugar details of each patient.

From this table, we are going to extract the patients whose blood sugar level was more than 200 mg/dL in March 2022.

Patient report in Hive
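
Since the screenshot is not reproduced here, an assumed definition of the patient_report table, using the columns referenced later in the program (patient_id, patient_name, blood_sugar_level_in_mg_dl, diagnosis_dt), might look roughly like this; the column types are guesses, not taken from the original table.

# Hypothetical DDL for the assumed patient_report table
spark.sql("""
    CREATE TABLE IF NOT EXISTS hospital_db.patient_report (
        patient_id                 INT,
        patient_name               STRING,
        blood_sugar_level_in_mg_dl INT,
        diagnosis_dt               DATE
    )
""")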

Let's write a PySpark program to read those patient details from the Hive table and create a Spark dataframe.

fetch_sql = "select * from hospital_db.patient_report"
diagnosis_res_df = spark.sql(fetch_sql)
diagnosis_res_df.show()

As shown in the program above, we first read the patient details from the Hive table hospital_db.patient_report and create a Spark dataframe named diagnosis_res_df.
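
Note that this snippet assumes an existing SparkSession named spark with Hive support enabled; the complete program at the end of the article creates it the same way. A minimal sketch for trying the snippet on its own:

from pyspark.sql import SparkSession

# Hive support is required so that spark.sql() can see hospital_db.patient_report
spark = SparkSession.builder \
    .appName("Blood_Sugar_Diagnosis_Results") \
    .enableHiveSupport() \
    .getOrCreate()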

Step 2: Loop through the dataframe using a for loop

Next, we register the dataframe as a temporary view named diagnosis_temp_table, so that we can run spark.sql() against it to extract the patients whose blood sugar level was more than 200 mg/dL in March 2022.

We also use the collect() function to retrieve the data from the dataframe. Then we iterate over each row and assign each column value to a separate variable.

diagnosis_res_df.createOrReplaceTempView("diagnosis_temp_table")
monthly_rpt = spark.sql("select * from diagnosis_temp_table where Diagnosis_dt between '2022-03-01' and '2022-03-31' and Blood_Sugar_level_in_mg_dL >=200").collect()
print(monthly_rpt)

for row in monthly_rpt:
    patient_id = str(row["patient_id"])
    patient_name = str(row["patient_name"])
    blood_sugar = str(row["blood_sugar_level_in_mg_dl"])
    diagnosis_dt = str(row["diagnosis_dt"])
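
A side note on collect(): it brings every matching row to the driver, which is fine for a small monthly report but can exhaust driver memory if the filter ever matches a huge number of rows. One defensive variation (the 10000 cap is an arbitrary choice, not from the original program) is:

monthly_rpt = spark.sql("select * from diagnosis_temp_table where Diagnosis_dt between '2022-03-01' and '2022-03-31' and Blood_Sugar_level_in_mg_dL >=200").limit(10000).collect()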

Step 3: Prepare the HTML table

In the previous step, we got the required values into separate variables. Now we need to create an HTML table from those variables. In HTML table syntax, we have to define the table header, table rows, and table data.

Let's write that code in PySpark as below. The variable body holds the header section of the HTML document with the style properties. The variable strTable holds the table header row. Then, in the for loop, we prepare the table data that goes into each table row.

body = "<html><head><style>table {font-family: arial, sans-serif;border-collapse: collapse;width: 100%;}td, th {border: 1px solid #dddddd;text-align: left;padding: 8px;}tr:nth-child(even) {background-color: #dddddd;}</style></head>"
strTable = "<table><tr><th>Patient_id</th><th>Patient_Name</th><th>Blood_Sugar_level_in_mg_dL</th><th>Diagnosis_dt</th></tr>"
strTable = body+strTable
for row in monthly_rpt:
    patient_id = str(row["patient_id"])
    patient_name = str(row["patient_name"])
    blood_sugar = str(row["blood_sugar_level_in_mg_dl"])
    diagnosis_dt = str(row["diagnosis_dt"])
    strRW = "<tr><td>" + patient_id + "</td><td>" + patient_name + "</td><td>" + blood_sugar + "</td><td>" + diagnosis_dt + "</td></tr>"
    strTable = strTable + strRW

strTable = strTable+"</table></html>"
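
As an aside, if the filtered result is small and pandas is installed on the driver, a similar HTML table can be produced in a single call; this is only an alternative sketch with pandas' default styling, not the approach the rest of the article uses.

monthly_rpt_df = spark.sql("select * from diagnosis_temp_table where Diagnosis_dt between '2022-03-01' and '2022-03-31' and Blood_Sugar_level_in_mg_dL >=200")
strTable = monthly_rpt_df.toPandas().to_html(index=False)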

Step 4: Send the email using PySpark

The next step is to send the HTML table in an email. For the email, we need to define the inputs below:

  • From address – the user who is sending the email
  • Recipients – the email addresses of the recipients
  • CC – the email addresses of the CC recipients
  • Subject – the subject line of the email
  • MIMEText – the body text of the email message; here the body is an HTML table

Once we set all these values, we need to create an SMTP object. When connecting the SMTP object, we specify the mail server that will be used to send the email.

fromAddr = "hospital_admin@mxc.com"
recipients = "ryan_dr@mxc.com"
cc = "chief_nurse@mxc.com"
subject = "Diabetes Report in March,2022"

# Create the root message and fill in the From, To, Cc, and Subject headers
msgRoot = MIMEMultipart('related')
msgRoot['Subject'] = subject
msgRoot['From'] = fromAddr
msgRoot['To'] = recipients
msgRoot['Cc'] = cc

#Set type of MIMEText as HTML 
msgText = MIMEText(strTable, 'html')
#Attach body of the message as HTML table
msgRoot.attach(msgText)

smtp = SMTP()
smtp.connect('mx.bstn.mxc.com', 25)
# Cc addresses must be included in the envelope recipient list for them to receive the mail
smtp.sendmail(fromAddr, recipients.split(',') + cc.split(','), msgRoot.as_string())
smtp.quit()
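
The example mail server accepts unauthenticated mail on port 25. If your mail server requires TLS and a login (an assumption about your environment; the host, port, and credentials below are placeholders), the connection part would look roughly like this, reusing fromAddr, recipients, cc, and msgRoot from the snippet above:

from smtplib import SMTP

smtp = SMTP('smtp.example.com', 587)      # placeholder host and submission port
smtp.starttls()                           # upgrade the connection to TLS
smtp.login('mail_user', 'mail_password')  # placeholder credentials
smtp.sendmail(fromAddr, recipients.split(',') + cc.split(','), msgRoot.as_string())
smtp.quit()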

Complete PySpark program to send a dataframe via email

Let's write the complete PySpark program by combining all the functions. The program is saved as send_blood_sugar_report.py.

from pyspark.sql import SparkSession
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from smtplib import SMTP


def NotifyInEmail(diagnosis_res_df):
        diagnosis_res_df.createOrReplaceTempView("diagnosis_temp_table")
        monthly_rpt = spark.sql("select * from diagnosis_temp_table where Diagnosis_dt between '2022-03-01' and '2022-03-31' and Blood_Sugar_level_in_mg_dL >=200").collect()
        print(monthly_rpt)
        fromAddr = "hospital_admin@mxc.com"
        recipients = "ryan_dr@mxc.com"
        cc = "chief_nurse@mxc.com"
        subject = "Diabetes Report in March,2022"

        #Create body of the email message
        body = "<html><head><style>table {font-family: arial, sans-serif;border-collapse: collapse;width: 100%;}td, th {border: 1px solid #dddddd;text-align: left;padding: 8px;}tr:nth-child(even) {background-color: #dddddd;}</style></head>"
        footer = " <br/><br/> **This is an auto generated mail <br/><br/>"
        strTable = "<table><tr><th>Patient_id</th><th>Patient_Name</th><th>Blood_Sugar_level_in_mg_dL</th><th>Diagnosis_dt</th></tr>"
        strTable = body+strTable

        #Create dynamic HTML table with data
        for row in monthly_rpt:
                patient_id = str(row["patient_id"])
                patient_name = str(row["patient_name"])
                blood_sugar = str(row["blood_sugar_level_in_mg_dl"])
                diagnosis_dt = str(row["diagnosis_dt"])
                strRW = "<tr><td>"+patient_id+ "</td><td>"+patient_name+ "</td><td>"+ blood_sugar + "</td><td>" + diagnosis_dt + "</td>" + "</tr>"
                strTable = strTable+strRW

        strTable = strTable + "</table>" + footer + "</html>"
        #print HTML table
        print(strTable)

        # Create the root message and fill in the from,to,cc and subject
        msgRoot = MIMEMultipart('related')
        msgRoot['Subject'] = subject
        msgRoot['From'] = fromAddr
        msgRoot['To'] = recipients
        msgRoot['Cc'] = cc

        #Set type of MIMEText as HTML 
        msgText = MIMEText(strTable, 'html')
        
        #Attach body of the message as HTML table
        msgRoot.attach(msgText)

        #Create SMTP object to send email
        smtp = SMTP()
        #Connect specific mail server with the input of server address and its port
        smtp.connect('mx.bstn.mxc.com', 25)
        #Send email
        smtp.sendmail(fromAddr, recipients.split(',') + cc.split(','), msgRoot.as_string())
        smtp.quit()

        print("Blood Sugar Report has been sent to Doctor successfully")

#custom function to access Hive Table
def FetchHiveTable():
        fetch_sql = "select * from hospital_db.patient_report"
        diagnosis_res_df = spark.sql(fetch_sql)
        diagnosis_res_df.show()
        #Call next custom function NotifyInEmail() to prepare HTML table and send email
        NotifyInEmail(diagnosis_res_df)

#Main program starts here
if __name__ == "__main__":
        appname = "Blood_Sugar_Diagnosis_Results"
        #Creating Spark Session
        spark = SparkSession.builder.appName(appname).enableHiveSupport().getOrCreate()

        print("Spark application name: " + appname)
        #Call custom function FetchHiveTable() to generate and send blood sugar report
        FetchHiveTable()
        spark.stop()
        exit(0)
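
The program hardcodes the March 2022 date range. One way to generalize it (a sketch of my own, not part of the original program) is to pass the start and end dates as arguments to spark-submit and build the query string from them:

import sys

def build_report_sql(start_date, end_date):
    # Build the monthly report query for the given date range
    return ("select * from diagnosis_temp_table "
            "where Diagnosis_dt between '{0}' and '{1}' "
            "and Blood_Sugar_level_in_mg_dL >= 200").format(start_date, end_date)

if __name__ == "__main__":
    # e.g. spark-submit send_blood_sugar_report.py 2022-03-01 2022-03-31
    start_date, end_date = sys.argv[1], sys.argv[2]
    print(build_report_sql(start_date, end_date))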

Shell script to run the PySpark program => test_script.sh

Let's write a shell script to run this PySpark program. In the script, we set the Spark environment variables and then call spark-submit to execute our PySpark program send_blood_sugar_report.py.

#!/bin/bash
 
echo "Info: Setting global variables"
export SPARK_MAJOR_VERSION=2
export SPARK_HOME=/usr/hdp/2.6.5.0-292/spark2
 
export PATH=$SPARK_HOME/bin:$PATH
 
spark-submit /x/home/user_alex/test/send_blood_sugar_report.py

Execute the shell script to run the PySpark program

Finally, we can run the shell script test_script.sh. It executes the PySpark program and sends the blood sugar report to the doctor.

sh test_script.sh

Output

PySpark program execution log

Report in email

As shown below, the report contains the details of the patients whose blood sugar level was more than 200 mg/dL in March 2022.

Report attached in the email
