How to convert Pandas dataframe to Spark dataframe?

Convert Pandas to Spark dataframe

In some cases, we use Pandas library in Pyspark to perform the data analysis or transformation. But Pandas has few drawbacks as below

  • It cannot make use of multiple machine. In other words, it doesn’t support distributed processing.
  • The whole dataset needs to fits into the RAM of the driver/single machine.

On the other hand, Spark DataFrames are distributed across nodes of the Spark cluster. When the amount of data is large, it is better to convert the Pandas dataframe to Spark dataframe and do the complex transformation. Let’s look at the ways to make this conversion.

Convert Pandas dataframe to Spark dataframe
Convert Pandas dataframe to Spark dataframe

Syntax

  • spark – It is a spark session object
  • data – List of values on which dataframe is created
  • schema – The structure/column names of the data set

Example 1: Convert Pandas to Spark dataframe using spark.createDataFrame() method

Let’s create Pandas dataframe in Pyspark.

Output

Create Pandas dataframe in Pyspark
Create Pandas dataframe in Pyspark

Now we have a Pandas dataframe in the variable pandasDF. We can pass this variable into the CreateDataFrame method. So that it will be converted to Spark dataframe.

Output

Spark dataframe in Pyspark
Spark dataframe in Pyspark

Example 2: Change column name and data type while converting the dataframe

In this example, we use the same Pandas dataframe values. But we want to make the following changes in the Spark dataframe while converting.

  • Change column name from “Code to Id” and “Product_Name to Insurance_Name“.
  • Currently the column “Code” has the values in long data type. We want to change the data type to integer.

Output

As shown below, the column name and data type of the columns are changed in the Spark dataframe.

Change column name and data type while converting pandas to spark dataframe
Change column name and data type while converting pandas to spark dataframe

Example 3: Use Apache Arrow for converting pandas to spark dataframe

Apache Arrow is a cross-language development platform for in-memory analytics. It is an in-memory columnar data format that is used in Spark to efficiently transfer data between JVM and Python processes.

Let’s set this configuration in our Pyspark program. It can be used for dataframe conversion with larger dataset.

Output

As shown below, the Pandas dataframe is converted to Spark dataframe using Apache arrow.

Convert pandas to spark dataframe using Apache arrow
Convert pandas to spark dataframe using Apache arrow

Example 4: Read from CSV file using Pandas on Spark dataframe

In Spark 3.2, Pandas API is introduced with a feature of “Scalability beyond a single machine“. It is enabling users to work with large datasets by leveraging Spark.

Since it scales well to large clusters of nodes, we can work with pandas on spark without converting it to spark dataframe. To do so, we need to “import pyspark.pandas as ps” in our Pyspark program.

Output

Example for Pandas on Spark dataframe
Example for Pandas on Spark dataframe

Recommend Articles

References