Regular Expression Matching and Capture Groups in Rel

Murali Pusala

19 January 2022

The Rel library has included basic regular expression support for a while. For example, regex_match tests whether a string matches a regular expression.

regex_match("^.*@.*$", "some@example.com")

The relation string_replace also supports regular expressions. For example, string_trim is defined using a regular expression:

def string_trim[s] = string_replace[
    s, regex_compile["^\\s+|\\s+$"], ""]
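
For readers who want to try the trimming pattern outside Rel, here is a minimal sketch using Python's standard re module. It is only an analogue, not the Rel library's implementation; the function name string_trim is reused purely to mirror the definition above.

```python
import re

def string_trim(s: str) -> str:
    # Remove leading whitespace (^\s+) or trailing whitespace (\s+$);
    # whitespace inside the string is left untouched.
    return re.sub(r"^\s+|\s+$", "", s)

trimmed = string_trim("  hello world \n")
print(repr(trimmed))  # 'hello world'
```

The alternation does the work here: each side of the | is anchored to one end of the string, so a single replacement call handles both ends.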

Until now, it was not possible to extract matching substrings using regular expressions. We are excited to announce that this is now supported.

The relation regex_match_all finds all substrings in a string that match the regular expression. The relation includes the matched substring as well as corresponding offsets.

regex_match_all["(cat|dog)s?", "cats are not dogs"]

Relation:

1  "cats"
2  "dogs"

We also introduce the capture_group_by_index relation, which captures the substrings matched by the groups in a regular expression. This relation searches for a match in an input string starting from a given offset.

Each group in the regular expression is automatically given a unique number starting with 1.

def email = "john.doe@example.com"
def pattern = "^(.*)@(.*)\\.com$"

def output = email, capture_group_by_index[pattern, email, 1]

Relation:

"john.doe@example.com"  1  "john.doe"
"john.doe@example.com"  2  "example"

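The numbering of groups, and the reason the dot before com is best written as an escaped \. in the pattern, can be illustrated with a short Python sketch. It uses Python's re module as a stand-in, so it shows general regular expression behavior rather than Rel's implementation; the names strict, loose, and odd_input are made up for this illustration.

```python
import re

email = "john.doe@example.com"
strict = re.compile(r"^(.*)@(.*)\.com$")  # \. matches only a literal dot
loose = re.compile(r"^(.*)@(.*).com$")    # .  matches any single character

# Groups are numbered from 1, in the order of their opening parentheses.
match = strict.search(email)
by_index = dict(enumerate(match.groups(), start=1))
print(by_index)  # {1: 'john.doe', 2: 'example'}

# The unescaped dot also accepts input that is not a ".com" address.
odd_input = "john.doe@example-com"
print(loose.search(odd_input) is not None)   # True
print(strict.search(odd_input) is not None)  # False
```

Both patterns give the same captures for a well-formed address; they differ only on inputs where the character before com is not a dot.
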
In addition to numeric indexes, Rel supports regular expressions with named capture groups. The capture_group_by_name relation pairs each group name with the substring that group captured.

def my_string = "Meeting is at 11:45 AM"
def pattern = "(?<hour>\\d+):(?<minute>\\d+)"

def output = capture_group_by_name[pattern, my_string, 1]

Relation:

"hour"    "11"
"minute"  "45"

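To see how the starting offset interacts with named groups, here is a rough Python analogue. The helper capture_by_name is hypothetical, and treating the offset as a 1-based character position is an assumption made to mirror the call above, not a statement about Rel's internals. Note that Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

my_string = "Meeting is at 11:45 AM"
pattern = re.compile(r"(?P<hour>\d+):(?P<minute>\d+)")

def capture_by_name(compiled, text, offset):
    """Hypothetical helper: map each group name to its captured text for the
    first match at or after a 1-based character offset (assumed convention)."""
    match = compiled.search(text, offset - 1)  # Python positions are 0-based
    return match.groupdict() if match else {}

print(capture_by_name(pattern, my_string, 1))   # {'hour': '11', 'minute': '45'}
print(capture_by_name(pattern, my_string, 16))  # {'hour': '1', 'minute': '45'}
print(capture_by_name(pattern, my_string, 18))  # {}
```

Starting the search in the middle of "11" changes what the hour group captures, and starting past the colon leaves nothing to match, which is why the offset is worth choosing deliberately.
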
The regular expression capabilities are implemented using foreign function interfaces, but these relations are designed to be used like any other relation. For example, when only a specific capture group is needed, it can be specified up front, as illustrated in this example:

def my_group = capture_group_by_name[
    "^.*@(?<domain>.*)\\.com$", "foo@example.com", 1]

def output = my_group["domain"]

Relation: "example"

With the new regular expression features, we expect to cover more of the common data engineering use cases. We’re excited to learn how you are using Rel. Please let us know about any features you’d like to see in the future.
