Using Rel for Machine Learning Data Preprocessing
Rel, RelationalAI’s expressive relational language, has many advantages for modeling domain knowledge. In this post, we use Rel to extract knowledge from raw data to feed into machine learning models.
You will see that Rel code is shorter and easier to read compared with the SQL queries that perform the same data transformation.
In this post, we will perform a fairly common recommendation task, where we want to use previous purchases to predict the next purchase.
First, let’s take a look at some raw data. For every customer, we have a customer ID. Every purchase that a customer makes has a date and the ID(s) of the purchased item(s). Below is an illustration of sample data for a customer with ID 100.
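customer_ID, date, item_ID
100, 2023-01-23, 12507
100, 2023-01-23, 34089
100, 2023-02-13, 87526
100, 2023-02-13, 14589
100, 2023-03-15, 52896
100, 2023-03-15, 74123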
In a SQL database, the raw data can be stored in a table with three fields:
customer_ID, date, item_ID
In Rel, this same data is an arity-3 relation:
raw_purchases(customer_ID, date, item_ID)
Next, let’s define the information we need to feed into the machine learning model. For every purchased item, we want to associate items purchased in all previous transactions by the same user. The final output file (usually in CSV format) will have entries for a customer with ID 100 in the following form:
customer_id, date, item_id, previous_purchase_item_id
...
100, 2023-02-13, 87526, "12507, 34089"
100, 2023-02-13, 14589, "12507, 34089"
100, 2023-03-15, 52896, "12507, 34089, 14589, 87526"
100, 2023-03-15, 74123, "12507, 34089, 14589, 87526"
Now we know the format of the raw data upstream, as well as the expected data format for the downstream machine learning task.
Let’s look at how to implement this data pipeline in SQL. To make the query more efficient, we want to minimize the total number of join operations in the data pipeline.
Note that we use a regular expression together with a window function to collect all previously purchased items apart from the ones in the current purchase. While this kind of trick is not uncommon in SQL when trying to cut down on join operations, it makes the query difficult to understand at first glance.
with purchases as (
  select
    customer_id,
    date,
    item_id
  from raw.transaction_purchases
), purchases_index as (
  select
    customer_id,
    date,
    dense_rank() over (partition by customer_id order by date) purchase_index
  from purchases
  group by customer_id, date
  order by customer_id, date
), collect_items_in_purchase as (
  select
    pc.customer_id,
    pc.date,
    pt.purchase_index,
    pc.item_id,
    row_number() over (partition by pc.customer_id, pt.purchase_index order by pc.date) purchase_item_index,
    string_agg(cast(pc.item_id as string)) over (partition by pc.customer_id, pt.purchase_index) items_in_purchase
  from purchases pc
  inner join purchases_index pt
    on (pc.date = pt.date and pc.customer_id = pt.customer_id)
), collect_items_in_previous_purchases as (
  select
    customer_id,
    date,
    purchase_index,
    item_id,
    purchase_item_index,
    split(
      replace(
        regexp_extract(
          string_agg(
            case when purchase_item_index = 1
              then items_in_purchase
              else null
            end,
            ';'
          ) over (partition by customer_id order by date),
          r'(.*?);[^;]+$'
        ),
        ';',
        ','
      ),
      ','
    ) previous_purchases_items
  from collect_items_in_purchase
), final as (
  select
    customer_id,
    date,
    item_id,
    string_agg(distinct previous_item_id) previous_purchase_items
  from collect_items_in_previous_purchases
  left join unnest(previous_purchases_items) as previous_item_id
  group by customer_id, date, item_id
)
select * from final
Next, let’s look at how Rel handles this data transformation. In the code snippet below, we see that the raw sample data can be defined quite easily in the Rel relation raw_purchases.
This is a very handy feature, especially at the beginning of a project. It allows you to focus on implementing data logic without worrying about handling the actual raw data, which can be large in size and have data issues here and there.
It is worth noting that the Rel relation raw_purchases, like all the other relations defined in this code snippet, is in Graph Normal Form (GNF). GNF is RAI’s implementation of Sixth Normal Form: each relation has one or more key columns and just one value column. This property enables us to express data transformations concisely and with great readability, as we shall see in the rest of the code snippet. You can also learn more about GNF in this post.
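As a quick, hypothetical illustration of GNF (not part of the pipeline in this post), imagine a wide orders table with a date column and an amount column. In GNF it would be stored as one relation per value column, each keyed by the order ID; the relation names and values below are purely illustrative:

// Hypothetical GNF relations: one per value column, keyed by order ID
def order_date = {(1, "2023-01-23"); (2, "2023-02-13")}
def order_amount = {(1, 25.0); (2, 40.0)}

Each relation holds exactly one kind of fact, which is what keeps the definitions in the rest of this post short and composable.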
Back to the Rel code. The logic to generate the required data output for the machine learning model, equivalent to the SQL query shown above, can be expressed in one Rel relation called previous_purchases.
// schema: customer_id, date, item_id
def raw_purchases = {
    (100, 2023-01-23, 12507);
    (100, 2023-01-23, 34089);
    (100, 2023-02-13, 87526);
    (100, 2023-02-13, 14589);
    (100, 2023-03-15, 52896);
    (100, 2023-03-15, 74123);
}
// for every purchased item, find all items from previous purchases
def previous_purchases(user, date1, item1, date0, item0) =
    raw_purchases(user, date1, item1) and
    raw_purchases(user, date0, item0) and
    date0 < date1
So what exactly is accomplished in this seemingly brief Rel relation? Let’s break it down a bit.
On the right-hand side of the equal sign, we first find the date and item combination of a user purchase from the raw data table using the relation raw_purchases(user, date1, item1).
Then, we look for the date and item combination of another purchase by the same user from the raw data table using the relation raw_purchases(user, date0, item0).
Note that we refer to the raw data relation raw_purchases twice and use the same variable user for the user dimension. The first time, for the date and item tuple, we use the variables (date1, item1), and the second time we use different variables, (date0, item0). The condition date0 < date1 ensures that the second purchase occurred before the first.
Finally, all tuples (user, date1, item1, date0, item0) whose values satisfy the logic on the right-hand side are collected and stored in the relation previous_purchases on the left-hand side.
And there we have it: we have associated all items purchased before date1, i.e., item0 purchased on date0, with a given purchased item, i.e., item1 purchased on date1, by a user.
The data stored in the Rel relation previous_purchases looks like the following. The first column is the user ID, the second and third columns are the current purchase date and item, and the fourth and fifth columns are the previous purchase date and item. Although in a different format, it contains everything we need for the downstream machine learning task.
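user, date1, item1, date0, item0
100, 2023-02-13, 14589, 2023-01-23, 12507
100, 2023-02-13, 14589, 2023-01-23, 34089
100, 2023-02-13, 87526, 2023-01-23, 12507
100, 2023-02-13, 87526, 2023-01-23, 34089
100, 2023-03-15, 52896, 2023-01-23, 12507
100, 2023-03-15, 52896, 2023-01-23, 34089
100, 2023-03-15, 52896, 2023-02-13, 14589
100, 2023-03-15, 52896, 2023-02-13, 87526
100, 2023-03-15, 74123, 2023-01-23, 12507
100, 2023-03-15, 74123, 2023-01-23, 34089
100, 2023-03-15, 74123, 2023-02-13, 14589
100, 2023-03-15, 74123, 2023-02-13, 87526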
To get the desired output format, we can use two helper relations that are defined as follows:
- The Rel relation previous_purchases_string, which sorts the previous date and item combinations, i.e., (date0, item0), in ascending order by time and converts the integer item number to a string.
- The Rel relation output_data, which concatenates all previously purchased items in ascending order by time, separated by a comma, for each (user, date1, item1) tuple using a function called string_join.
// prepare output format
def previous_purchases_string(user, date1, item1, i, item_string) =
    sort[date0, item0: previous_purchases(user, date1, item1, date0, item0)](i, _, item) and
    item_string = string[item]
    from item
def output_data[user, date1, item1] = string_join[", ", previous_purchases_string[user, date1, item1]]
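As a side note, previous_purchases is itself an ordinary GNF relation, so it can be reused for other features beyond this output format. For example, a hypothetical helper (not part of the pipeline above) that counts how many previously purchased items are associated with each purchase could be sketched as:

// Hypothetical helper: number of previously purchased items per (user, date1, item1)
def num_previous_items[user, date1, item1] = count[previous_purchases[user, date1, item1]]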
Here is the tabular view of the data in the Rel relation output_data, which exactly matches the expected CSV output we outlined earlier.
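customer_id, date, item_id, previous_purchase_item_id
100, 2023-02-13, 87526, "12507, 34089"
100, 2023-02-13, 14589, "12507, 34089"
100, 2023-03-15, 52896, "12507, 34089, 14589, 87526"
100, 2023-03-15, 74123, "12507, 34089, 14589, 87526"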
In summary, Rel can be a powerful tool for your machine learning data preprocessing for many reasons, including:
- Because of the use of Graph Normal Form (GNF) relations, Rel code is concise and very readable.
- Easily defined data relations facilitate testing and debugging in development.
- Rel can significantly simplify your machine learning data pipeline.

If you have some data transformation logic that seems too complex to express in SQL, give Rel a try.