Regexp_replace function in BigQuery with examples
Contents
Regexp_replace function in BigQuery
The Regexp_replace function is search the given regular expression pattern in the input string. Then it replace all occurrences of a substring that match a regular expression with another string.
Syntax of Regexp_replace function
1 |
REGEXP_REPLACE(value, regexp, replacement) |
- value – Input string or column name of the table.
- regexp – Regular expression pattern to match with the input string
- replacement – It is a replacement string.
Grouping in Regular expression
In the replacement argument, we can use backslashed-escaped digits (\1 to \9). It allows us to get the part of the matched string if the pattern is grouped using parenthesis ( ). In general, Regular expression captured the matched string for each group. Using this digit, we can get the matched string of corresponding parenthesized group. Use \0 to refer to the entire matching text.
While using these digits in the replacement, we need to add another backslash(\) to escape it. If the regexp argument is not a valid regular expression, this function returns an error.
Example for Regexp_replace function
Assume that we have a table Cloud_Account_Admin as below in BigQuery. It contains the following columns Admin_Id, Email_id, Address & Memory_limit.
The column Memory_limit defines the maximum amount of memory that admin can use in the cloud. But the values of the column does’t have the proper memory units. For example, units are mentioned as tera byte, terabytes, Tera Bytes and T.
Our goal is to replace all those units to a common unit as TB. Let’s write a query using Regexp_replace function in BigQuery
If we look at the pattern of the memory unit, all of them are starting with either T or t. So we can write a regular a expression as below
1 |
(?i)T.* |
- (?i) – It makes the regex case insensitive which means both upper and lower case character will be matched.
- T – It denotes the starting character of the pattern.
- . – The dot character match any character (except newline character)
- * – It defines the zero or more character occurrences
Now we can use this regular expression pattern in the BigQuery. Let’s follow the syntax of REGEXP_REPLACE function to write the query.
1 2 3 4 5 |
SELECT Admin_Id, Email_id, Address, REGEXP_REPLACE(memory_limit,'(?i)T.*','TB') as Memory_limit FROM rc_fin_test_tables.Cloud_Account_Admin; |
As mentioned in the query, we have given the source column, regex pattern and replacement string in the Regexp_Replace function.
Since the regex pattern matched with the source column values, only the memory units are replaced with TB. The numeric values are not changed by Regexp_Replace function.
Example for Regular expression grouping
Regular expression allows us to define the pattern with multiple groups. If we need a matched string for a specific group, we can get it using backslashed-escaped digits (\1 to \9).
For this example, we will consider the column Email_id in the same table Cloud_Account_Admin. Since the email address follows the pattern as user name + symbol (@) and domain name(gmail.com or yahoo.com), we can use regular expression grouping to extract the part of matched string.
Regex pattern for Email address
In the below regex pattern, we grouped the pattern using parenthesis ( ). This will be used to fetch the email address as well as part of the email address (user or domain name)
1 |
([0-9a-zA-Z]+)@([a-z]+) |
- ([0-9a-zA-Z]+) – It is a first regex group to match the user name of the email_id
- @ – It is defined to match the @ character in email.
- ([a-z]+) – It is a second regex group to match the domain name of the email address.
In this example, we will generate the mail server name using Regexp_replace function. Consider that mail server name starts with smtp. + domain name of the email address.
1 2 3 4 5 |
SELECT Admin_Id, Email_id, REGEXP_REPLACE(Email_id,'([0-9a-zA-Z]+)@([a-z]+)','smtp.\\2') as mail_server FROM rc_fin_test_tables.Cloud_Account_Admin; |
As mentioned in the query, we are extracting the matched string from the second regex group(\\2). So it will return the domain name of the address. Then using the Regexp_replace function, we are adding a string smtp. to the domain name. This will return a mail_server name as output.
As we shown in the above image, we got the mail server as smtp.gmail.com. Similarly we can get the user name of the email address using first regex group (\\1). To do so, we modified the regex pattern as below.
1 2 3 4 5 |
SELECT Admin_Id, Email_id, REGEXP_REPLACE(Email_id,'([0-9a-zA-Z]+)@([a-z]+)\\.([a-z]+)','\\1') as user_name_in_email FROM rc_fin_test_tables.Cloud_Account_Admin; |
Also the regex group (\\0) can be used to return the entire matching string. In this case, the Regexp_replace function returns the complete email id.
Recommended Articles
- String REPLACE function in BigQuery with examples
- Substring function in BigQuery with examples
- Regexp_extract function in BigQuery with examples
- How to flatten an array using UNNEST function in BigQuery?