Regexp_substr function in Teradata
Teradata supports substring function to extract a portion of the string. But It doesn’t allow regular expression to get the string with specific pattern. On the other hand, Regexp_substr function extracts a substring based on regular expression.
In this tutorial, we will explain about the Regexp_substr function with examples.
Syntax of Regexp_substr function
regexp_substr(source_string, regexp_string, position_arg, count_arg, match_arg)
source_string – It is a input string, If source_string is NULL, NULL is returned.
regexp_string – It is a regular expression string that we want to match with input string. If regexp_string is NULL, NULL is returned.
position_arg -> Optional. It is a numeric argument. It specifies the position in the source string from which to start searching. (default value is 1).
count_arg -> Optional. It is a numeric argument. It specifies the number of occurrence to be returned. (default value is 1).
match_arg -> Optional. It is a character argument. The valid values are listed below
‘i’ = case-insensitive matching.
‘c’ = case sensitive matching.
‘n’ = To treat the period character (match any character) as new line character
‘m’ = It is used to treat the source_string as multiple lines instead of as a single line.
‘l’ = if source_string exceeds the current maximum allowed source_string size (currently 16 MB), a NULL is returned instead of an error.
‘x’ = It ignores the whitespace.
The match_arg argument allows to specify more than one character. If a character in the argument is not valid, it ignore that character. If match_arg is not specified, is NULL, or is empty:
- The match on the input string is case sensitive.
- A period does not match the newline character.
- The source_string is treated as a single line.
Returns: REGEXP_SUBSTR returns a string value if the given pattern matches with the input string. Otherwise it will return NULL.
Example to extract the substring based on regular expression
In the ecommerce website, customer provide the review on the product which they bought from the website. Lets analyze those customer reviews using REGEXP_SUBSTR function. It is kind of customer sentiment analysis.
Consider that we have a table customer_feed_back as below in Teradata. In that table, we have the customer comment details for the product Graphic Tablet.
The column CUST_COMMENTS have the customer comment details. Lets find the negative comments using the REGEXP_SUBSTR function. To find that, we are using set of negative words with regular expression. These words reflect the quality of the product.
- not good
- not happy
- not user-friendly
- low quality
- not working
User may give the comments in smaller or upper case letters. To handle that, we are ignoring the case using regular expression pattern. The below regular expression extract the negative comments.
- ?i is search the string in case-insensitive mode.
- \s* is defines the space between the word
- | is denotes OR condition in regex.
Now we are ready to write the select query with the REG_SUBSTR function as below
REGEXP_SUBSTR (CUST_COMMENTS ,
'(?i)issue|not\s*good|not\s*happy|not\s*user-friendly|low\s*quality|not\s*working|bad') as Negative_Comments
The select query returns the output as below.
The comments are containing negative words as not working and BAD. So the function REGEXP_SUBSTR returns those words for customer id 8048 and 5894. For the not matching comments, it returned the NULL value.