Murali Pusala
19 January 2022
The Rel library has included basic regular expression support for a while. For example, regex_match tests whether a string matches a regular expression.
regex_match("^.*@.*$", "some@example.com")
The relation string_replace also supports regular expressions. For example, string_trim is defined using a regular expression:
def string_trim[s] = string_replace[
s, regex_compile["^\\s+|\\s+$"], ""]
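For readers more familiar with Python, the same trimming pattern can be tried out with the standard re module. This is only an analogy to illustrate what the pattern matches, not how Rel implements string_trim; the Python function name simply mirrors the Rel one.

```python
import re

# Same pattern as in the Rel definition: whitespace at the start or at the end.
TRIM_PATTERN = re.compile(r"^\s+|\s+$")


def string_trim(s: str) -> str:
    """Replace leading and trailing whitespace runs with the empty string."""
    return TRIM_PATTERN.sub("", s)


print(repr(string_trim("  hello world \n")))
```

Whitespace inside the string is left alone because neither alternative of the pattern can match there.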
Until now, it was not possible to extract matching substrings using regular expressions. We are excited to announce that we have added support for this.
The relation regex_match_all finds all substrings in a string that match the regular expression. The relation includes the matched substring as well as corresponding offsets.
regex_match_all["(cat|dog)s?", "cats are not dogs"]
Relation:
1 | "cats" |
2 | "dogs" |
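As a rough comparison, the idea of collecting every match can be sketched in Python with re.finditer. The helper below is hypothetical and is not Rel's implementation. Because it is an assumption whether the first column in Rel is a match counter or a character position, the sketch reports both, with the character position counted from 1.

```python
import re


def regex_match_all(pattern: str, text: str) -> list[tuple[int, int, str]]:
    """Return (match number, 1-based start position, substring) for each match.

    Matches are non-overlapping and reported left to right.
    """
    return [
        (number, match.start() + 1, match.group(0))
        for number, match in enumerate(re.finditer(pattern, text), start=1)
    ]


for row in regex_match_all(r"(cat|dog)s?", "cats are not dogs"):
    print(row)
```

The optional s in the pattern is why the plural forms are returned whole rather than as "cat" and "dog".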
We also introduce the capture_group_by_index relation, which captures the substrings that match groups in a regular expression. This relation searches for matches in an input string starting from a given offset.
Each group in the regular expression is automatically given a unique number starting with 1.
def email = "john.doe@example.com"
def pattern = "^(.*)@(.*)\\.com$"
def output = email, capture_group_by_index[pattern, email, 1]
Relation:
"john.doe@example.com" | 1 | "john.doe" |
"john.doe@example.com" | 2 | "example" |
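The numbering of groups can be illustrated with a small Python sketch. The helper is hypothetical and only mirrors the Rel relation's name; treating the offset as a 1-based start position is an assumption, and the dot before com is escaped here so that it matches a literal dot.

```python
import re


def capture_group_by_index(pattern: str, text: str, offset: int) -> dict[int, str]:
    """Map each group number (starting at 1) to the substring it captured.

    `offset` is assumed to be a 1-based position at which the search starts.
    Groups that did not take part in the match are omitted.
    """
    match = re.compile(pattern).search(text, offset - 1)
    if match is None:
        return {}
    return {
        index: value
        for index, value in enumerate(match.groups(), start=1)
        if value is not None
    }


email = "john.doe@example.com"
print(capture_group_by_index(r"^(.*)@(.*)\.com$", email, 1))
```

Groups are numbered by the position of their opening parenthesis, so the part before the @ is group 1 and the domain name is group 2.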
In addition to numeric indexes, Rel supports regular expressions with named capture groups. The capture_group_by_name relation includes the captured substring for each corresponding group name.
def my_string = "Meeting is at 11:45 AM"
def pattern = "(?<hour>\\d+):(?<minute>\\d+)"
def output = capture_group_by_name[pattern, my_string, 1]
Relation:
"hour" | "11" |
"minute" | "45" |
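Named groups behave similarly in Python, which makes for a quick way to experiment with a pattern. Note that Python's re module spells named groups as (?P<name>...) rather than (?<name>...), so the pattern below is adapted accordingly. The helper is a hypothetical sketch, and the 1-based offset is again an assumption.

```python
import re


def capture_group_by_name(pattern: str, text: str, offset: int) -> dict[str, str]:
    """Map each group name to the substring it captured in the first match.

    `offset` is assumed to be a 1-based position at which the search starts.
    """
    match = re.compile(pattern).search(text, offset - 1)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}


groups = capture_group_by_name(
    r"(?P<hour>\d+):(?P<minute>\d+)", "Meeting is at 11:45 AM", 1
)
print(groups)
print(groups["hour"])
```

Looking up a single name in the returned mapping plays the same role as asking for one specific group.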
The regular expression capabilities are implemented using foreign function interfaces, but these relations are designed to be used like any other relation. For example, when only a specific capture group is needed, it can be specified upfront, as illustrated in this example:
def my_group = capture_group_by_name[
"^.*@(?<domain>.*)\\.com$", "foo@example.com", 1]
def output = my_group["domain"]
Relation:
"example"
With the new regular expression features, we expect to cover more of the common data engineering use cases. We’re excited to learn how you are using Rel. Please let us know about any features you’d like to see in the future.