How to concatenate columns in Spark dataframe?

Concatenate functions in Spark SQL

The module pyspark.sql.functions provides two concatenation functions:

  • concat – It concatenates multiple input columns together into a single column.
  • concat_ws – It concatenates multiple input string columns together into a single string column, with the given separator.

In this tutorial, we use a PySpark program to explain the concatenation functions.

Syntax

  • string_1, string_2, … string_n – The input columns (or strings) to concatenate.
  • separator – The separator string/character (used by concat_ws only).

Example 1: Using concat() function to join columns in Spark dataframe

First, we create a Spark dataframe with the customer's street and city details.

Output

Create dataframe for concat function in PySpark

Let’s join the street and city columns to get the customer’s mailing address. As shown below, we use the concat() function to concatenate the “Street” and “City” columns of the Spark dataframe. The alias() function gives a name to the derived column. The new dataframe is saved as address_df.

Output

As shown below, the derived column “Address” has the concatenated values in the dataframe “address_df”.

Output of Concat function in Spark dataframe

Let’s add a comma (,) separator between the street and city values while concatenating, so the address is easier to read. We also select the existing Street and City columns when creating the dataframe “address_df”.

  • lit() – adds a literal or constant value. Here we pass a comma (,) as the literal value.
  • col() – returns the column for the given column name.

Output

Alias function in concatenate operation of Spark dataframe

Example 2: Concatenate columns using concat() and withColumn() function

Spark’s withColumn() function is mainly used for transformations, and we can use it to create a new column. Let’s use withColumn() in the concatenation. Make sure the new column name is not already present in the dataframe; otherwise, withColumn() replaces that column’s values.

Output

As shown below, withColumn() created the new column “Address” in the dataframe address_df. Since we placed the concat operation inside withColumn(), it returned the concatenated values.

Concatenate values using concat() and withColumn() functions in PySpark

Example 3: Using concat_ws() function to concatenate with delimiter

In the previous example, we used the lit() function to add the comma (,) separator between the Street and City values. This kind of separator can also be added with the concat_ws() function. Let’s modify the earlier code to use concat_ws() in PySpark.

Output

As shown below, the comma (,) separator is added between the Street and City values, and the new column “Address” is created in the dataframe “address_df”.

Please note that any number of columns can be passed to concat_ws(); the function inserts the given separator between every pair of adjacent columns.

concat_ws() function to add the separator in Spark dataframe

Example 4: Using concat() on the temporary view of Spark dataframe

The concatenation can also be done as a regular SQL statement in Spark. For that, we need to create a temporary view from the Spark dataframe.

  • First, use createOrReplaceTempView() to create a temporary view named “CustomerDetails”.
  • Then write a SELECT query with the concat function against the view.

Output

Since we used the spark.sql() function to run the SQL, it returned the expected values from the temporary view.

Concatenate from the temporary view in PySpark

Complete Example
