Regexp_extract function in Hive with examples

Regexp_extract function in hive

In the programming languages, Regular expression is used to search the certain pattern on the string values. The pattern is defined using the sequence of characters/single character. Similarly Hive supports regular expression using the following string functions.

  • Regexp_replace function replaces the string that matches a regular expression pattern with another string.
  • Regexp_extract function extracts a portion of the string that matches a regular expression pattern with another string.
  • RLIKE function to evaluate the LIKE operation using regular expression pattern.

In this tutorial, we will explain about the Regexp_extract function with examples.

Syntax of regexp_extract function

The subject is the input string.
The second argument is the pattern that we want to match on the input string.
The third argument is the index of the group to be returned.

index – 0 -> the entire match of the string
index – 1 -> the first group of pattern matching string
index – 2 -> the second group of pattern matching string
index – n -> the nth group. If the n is bigger number than the actual group, the hive query will fail.

Returns : The regexp_extract function returns a string value if the given pattern matches with the input string. Otherwise it will return a empty string.

Example to extract the domain name

Lets look at the regexp_extract function with an example. Here we have a Hive table named as emp_info which contains the employee details such as emp_id, name and email_id. The column email_id contains the email address of the employees.

Hive Table example
Hive Table example

Pattern of the email address

The pattern of the email address is username + symbol ‘@’ and then the domain name(gmail.com & hotmail.com). We will use this pattern in regexp_extract function to get the domain name from the email_id column.

Regular expression in regexp_extract function

In regular expression, dotstar (.*) means 0 or more of any character. So we are using the pattern as symbol ‘@’ followed by dotstar within the bracket (.*).

As we mentioned the index as 0, it returns the entire matching string such as symbol ‘@’ followed by domain name.

regexp_extract example in Hive
regexp_extract example in Hive

But we don’t want symbol ‘@’ in the domain name. In this example, the first group of pattern is (.*) which returns the matching string without symbol ‘@’. So we need to change the index value to 1 in the function.

Extract the domain name from the email address in Hive
Extract the domain name from the email address in Hive

Finally we got the domain name from the email address using the regexp_extract function of Hive.