THIS POST IS ARCHIVED

Regular Expression Matching and Capture Groups in Rel

The Rel library has included basic regular expression support for a while. For example, regex_match tests whether a string matches a regular expression.

regex_match("^.*@.*$", "some@example.com")

The relation string_replace also supports regular expressions. For example, string_trim is defined using a regular expression that matches leading or trailing whitespace:

def string_trim[s] = string_replace[
    s, regex_compile["^\\s+|\\s+$"], ""]
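
To see what that pattern does, here is a minimal sketch in Python rather than Rel. It uses Python's standard re module, and the function name string_trim is reused only for illustration. Rel's own regex engine may differ in details, so treat this as an analogy and not as the library's implementation.

```python
import re

def string_trim(s: str) -> str:
    # Same pattern as the Rel definition: leading whitespace OR trailing whitespace.
    # In a Python raw string the backslash is written once; the Rel string literal
    # doubles it ("\\s") because the backslash itself must be escaped there.
    return re.sub(r"^\s+|\s+$", "", s)

trimmed = string_trim("  hello world \n")
print(repr(trimmed))  # 'hello world'
```

Whitespace inside the string is left alone because neither alternative can match there: the first is anchored to the start and the second to the end.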

Until now, it was not possible to extract matching substrings using regular expressions. We are excited to announce that we have added support for this.

The relation regex_match_all finds all substrings in a string that match a regular expression. Its result contains each matched substring together with the offset at which it was found.

def output = regex_match_all["(cat|dog)s?", "cats are not dogs"]
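
As a rough illustration of the idea, the following Python sketch collects every non-overlapping match with its position. The helper match_all is hypothetical, and reporting offsets as 1-based character positions is an assumption made to mirror Rel's conventions; it is not taken from the Rel implementation.

```python
import re

def match_all(pattern: str, text: str) -> list[tuple[int, str]]:
    """Return (offset, substring) for each non-overlapping match.

    Offsets are reported 1-based here; that convention is an assumption.
    """
    return [(m.start() + 1, m.group(0)) for m in re.finditer(pattern, text)]

matches = match_all(r"(cat|dog)s?", "cats are not dogs")
print(matches)  # [(1, 'cats'), (14, 'dogs')]
```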

We also introduce the capture_group_by_index relation, which captures the substrings matched by the groups in a regular expression. This relation searches for matches in an input string starting from a given offset, passed as the last argument.

Each group in the regular expression is automatically given a unique number starting with 1.

def email = "john.doe@example.com"
def pattern = "^(.*)@(.*)\\.com$"

def output = email, capture_group_by_index[pattern, email, 1]
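
The numbering of groups can be sketched in Python as follows. The helper capture_groups_by_index is hypothetical and only imitates the shape of the Rel relation; treating the offset as a 1-based character position is an assumption. The dot before com is escaped in this sketch so that it matches a literal period, and the last two calls show what happens when it is not.

```python
import re

def capture_groups_by_index(pattern: str, text: str, offset: int = 1) -> dict[int, str]:
    """Map each group number (starting at 1) to its captured substring.

    `offset` is treated as a 1-based character position, which is an
    assumption made to mirror the Rel example, not documented behavior.
    """
    m = re.compile(pattern).search(text, offset - 1)
    if m is None:
        return {}
    return {i: g for i, g in enumerate(m.groups(), start=1) if g is not None}

email = "john.doe@example.com"
groups = capture_groups_by_index(r"^(.*)@(.*)\.com$", email)
print(groups)  # {1: 'john.doe', 2: 'example'}

# With an unescaped dot, "." matches any character, so this string is accepted too.
loose = capture_groups_by_index(r"^(.*)@(.*).com$", "john@examplexcom")
strict = capture_groups_by_index(r"^(.*)@(.*)\.com$", "john@examplexcom")
print(loose, strict)  # {1: 'john', 2: 'example'} {}
```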

In addition to numeric indexes, Rel supports regular expressions with named capture groups. The capture_group_by_name relation maps each group name to the substring captured by that group.

def my_string = "Meeting is at 11:45 AM"
def pattern = "(?<hour>\\d+):(?<minute>\\d+)"

def output = capture_group_by_name[pattern, my_string, 1]
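
For comparison, here is the same idea in Python. Note that Python's re module writes named groups as (?P<name>...) and does not accept the (?<name>...) form used in the Rel pattern above. The helper capture_groups_by_name is hypothetical and is only meant to show which substrings end up under which names.

```python
import re

def capture_groups_by_name(pattern: str, text: str) -> dict[str, str]:
    """Map each group name to its captured substring for the first match."""
    m = re.search(pattern, text)
    if m is None:
        return {}
    # Drop groups that did not take part in the match.
    return {name: value for name, value in m.groupdict().items() if value is not None}

named = capture_groups_by_name(
    r"(?P<hour>\d+):(?P<minute>\d+)", "Meeting is at 11:45 AM"
)
print(named)          # {'hour': '11', 'minute': '45'}
print(named["hour"])  # 11
```

Looking up a single name in the result is the Python counterpart of asking for one specific capture group.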

The regular expression capabilities are implemented using the foreign function interface, but these relations are designed to be used like any other relation. For example, when only a specific capture group is needed, it can be selected by name, as illustrated in this example:

def my_group = capture_group_by_name[
    "^.*@(?<domain>.*)\\.com$", "foo@example.com", 1]

def output = my_group["domain"]

With the new regular expression features, we expect to cover more of the common data engineering use cases. We’re excited to learn how you are using Rel. Please let us know about any future features you’d like to see.