Regexp_replace function in BigQuery with examples

Regexp_replace function in BigQuery

The Regexp_replace function is search the given regular expression pattern in the input string. Then it replace all occurrences of a substring that match a regular expression with another string.

Syntax of Regexp_replace function

  • value – Input string or column name of the table.
  • regexp – Regular expression pattern to match with the input string
  • replacement – It is a replacement string.

Grouping in Regular expression

In the replacement argument, we can use backslashed-escaped digits (\1 to \9). It allows us to get the part of the matched string if the pattern is grouped using parenthesis ( ). In general, Regular expression captured the matched string for each group. Using this digit, we can get the matched string of corresponding parenthesized group. Use \0 to refer to the entire matching text

While using these digits in the replacement, we need to add another backslash(\) to escape it. If the regexp argument is not a valid regular expression, this function returns an error.

Example for Regexp_replace function

Assume that we have a table Cloud_Account_Admin as below in BigQuery. It contains the following columns Admin_Id, Email_id, Address & Memory_limit.

Cloud_Account_Admin table in BigQuery
Cloud_Account_Admin table in BigQuery

The column Memory_limit defines the maximum amount of memory that admin can use in the cloud. But the values of the column does’t have the proper memory units. For example, units are mentioned as tera byte, terabytes, Tera Bytes and T.

Our goal is to replace all those units to a common unit as TB. Let’s write a query using Regexp_replace function in BigQuery

If we look at the pattern of the memory unit, all of them are starting with either T or t. So we can write a regular a expression as below

  • (?i) – It makes the regex case insensitive which means both upper and lower case character will be matched.
  • T – It denotes the starting character of the pattern.
  • . – The dot character match any character (except newline character)
  • * – It defines the zero or more character occurrences

Now we can use this regular expression pattern in the BigQuery. Let’s follow the syntax of REGEXP_REPLACE function to write the query.

As mentioned in the query, we have given the source column, regex pattern and replacement string in the Regexp_Replace function.

Regexp_replace function in BigQuery
Regexp_replace function in BigQuery

Since the regex pattern matched with the source column values, only the memory units are replaced with TB. The numeric values are not changed by Regexp_Replace function.

Example for Regular expression grouping

Regular expression allows us to define the pattern with multiple groups. If we need a matched string for a specific group, we can get it using backslashed-escaped digits (\1 to \9).

For this example, we will consider the column Email_id in the same table Cloud_Account_Admin. Since the email address follows the pattern as user name + symbol (@) and domain name( or, we can use regular expression grouping to extract the part of matched string.

Regex pattern for Email address

In the below regex pattern, we grouped the pattern using parenthesis ( ). This will be used to fetch the email address as well as part of the email address (user or domain name)

  • ([0-9a-zA-Z]+) – It is a first regex group to match the user name of the email_id
  • @ – It is defined to match the @ character in email.
  • ([a-z]+) – It is a second regex group to match the domain name of the email address.

In this example, we will generate the mail server name using Regexp_replace function. Consider that mail server name starts with smtp. + domain name of the email address.

As mentioned in the query, we are extracting the matched string from the second regex group(\\2). So it will return the domain name of the address. Then using the Regexp_replace function, we are adding a string smtp. to the domain name. This will return a mail_server name as output.

Regular expression grouping using Regexp_replace function in BigQuery
Regular expression grouping using Regexp_replace function in BigQuery

As we shown in the above image, we got the mail server as Similarly we can get the user name of the email address using first regex group (\\1). To do so, we modified the regex pattern as below.

Extract user name from email address
Extract user name from email address

Also the regex group (\\0) can be used to return the entire matching string. In this case, the Regexp_replace function returns the complete email id.

Recommended Articles