Using Rel for Machine Learning Data Preprocessing
Rel, RelationalAI’s expressive relational language, has many advantages for modeling domain knowledge. In this post, we use Rel to extract knowledge from raw data to feed into machine learning models.
You will see that Rel code is shorter and easier to read compared with the SQL queries that perform the same data transformation.
In this post, we will perform a fairly common recommendation task, where we want to use previous purchases to predict the next purchase.
First, let’s take a look at some raw data. For every customer, we have a customer ID. Every purchase that a customer makes has a date and the ID(s) of the purchased item(s). Below is an illustration of sample data for a customer with ID 100.
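customer_ID, date, item_ID
100, 2023-01-23, 12507
100, 2023-01-23, 34089
100, 2023-02-13, 87526
100, 2023-02-13, 14589
100, 2023-03-15, 52896
100, 2023-03-15, 74123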
In a SQL database, the raw data can be stored in a table with three fields:
customer_ID, date, item_ID
In Rel, this same data is an arity-3 relation:
raw_purchases(customer_ID, date, item_ID)
Next, let’s define the information we need to feed into the machine learning model. For every purchased item, we want to associate items purchased in all previous transactions by the same user. The final output file (usually in CSV format) will have entries for a customer with ID 100 in the following form:
customer_id, date, item_id, previous_purchase_item_id
...
100, 2023-02-13, 87526, "12507, 34089"
100, 2023-02-13, 14589, "12507, 34089"
100, 2023-03-15, 52896, "12507, 34089, 14589, 87526"
100, 2023-03-15, 74123, "12507, 34089, 14589, 87526"
Now we know the format of the raw data upstream, as well as the expected data format for the downstream machine learning task.
Let’s look at how to implement this data pipeline in SQL. To make the query more efficient, we want to minimize the total number of join operations in the data pipeline.
Note that we use a regular expression together with a window function to collect all previously purchased items apart from the ones in the current purchase. While this kind of trick is not uncommon in SQL when trying to cut down on join operations, it makes the query difficult to understand at first glance.
with purchases as (
  select
    customer_id,
    date,
    item_id
  from raw.transaction_purchases
), purchases_index as (
  select
    customer_id,
    date,
    dense_rank() over (partition by customer_id order by date) purchase_index
  from purchases
  group by customer_id, date
  order by customer_id, date
), collect_items_in_purchase as (
  select
    pc.customer_id,
    pc.date,
    pt.purchase_index,
    pc.item_id,
    row_number() over (partition by pc.customer_id, pt.purchase_index order by pc.date) purchase_item_index,
    string_agg(cast(pc.item_id as string)) over (partition by pc.customer_id, pt.purchase_index) items_in_purchase
  from purchases pc
  inner join purchases_index pt
    on (pc.date = pt.date and pc.customer_id = pt.customer_id)
), collect_items_in_previous_purchases as (
  select
    customer_id,
    date,
    purchase_index,
    item_id,
    purchase_item_index,
    split(
      replace(
        regexp_extract(
          string_agg(
            case when purchase_item_index = 1
              then items_in_purchase
              else null
            end,
            ';'
          ) over (partition by customer_id order by date),
          r'(.*?);[^;]+$'
        ),
        ';',
        ','
      ),
      ','
    ) previous_purchases_items
  from collect_items_in_purchase
), final as (
  select
    customer_id,
    date,
    item_id,
    string_agg(distinct previous_item_id) previous_purchase_items
  from collect_items_in_previous_purchases
  left join unnest(previous_purchases_items) as previous_item_id
  group by customer_id, date, item_id
)
select * from final
Next, let’s look at how Rel handles this data transformation. In the code snippet below, we see that the raw sample data can be defined quite easily in the Rel relation raw_purchases.
This is a very handy feature, especially at the beginning of a project. It allows you to focus on implementing data logic without worrying about handling the actual raw data, which can be large in size and have data issues here and there.
It is worth noting that the Rel relation raw_purchases, like all the other relations defined in this code snippet, is in Graph Normal Form (GNF). GNF is RAI’s implementation of Sixth Normal Form: each relation has one or more key columns and just one value column. This property enables us to express data transformations concisely and with great readability, as we shall see in the rest of the code snippet. You can also learn more about GNF in this post.
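As a quick, hypothetical illustration of GNF (not part of the pipeline in this post), imagine a wide orders table with a date column and an amount column. In GNF it would be stored as one relation per value column, each keyed by the order ID; the relation names and values below are purely illustrative:

// Hypothetical GNF relations: one per value column, keyed by order ID
def order_date = {(1, "2023-01-23"); (2, "2023-02-13")}
def order_amount = {(1, 25.0); (2, 40.0)}

Each relation holds exactly one kind of fact, which is what keeps the definitions in the rest of this post short and composable.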
Back to the Rel code. The logic to generate the required data output for the machine learning model, equivalent to the SQL query shown above, can be expressed in one Rel relation called previous_purchases.
// schema: customer_id, date, item_id
def raw_purchases = {
    (100, 2023-01-23, 12507);
    (100, 2023-01-23, 34089);
    (100, 2023-02-13, 87526);
    (100, 2023-02-13, 14589);
    (100, 2023-03-15, 52896);
    (100, 2023-03-15, 74123);
}
// for every purchased item, find all items from previous purchases
def previous_purchases(user, date1, item1, date0, item0) =
    raw_purchases(user, date1, item1) and
    raw_purchases(user, date0, item0) and
    date0 < date1
So what exactly is accomplished in this seemingly brief Rel relation? Let’s break it down a bit.
On the right-hand side of the equal sign, we first find the date and item combination of a user purchase from the raw data table using the relation raw_purchases(user, date1, item1).
Then, we look for the date and item combination of another purchase by the same user from the raw data table using the relation raw_purchases(user, date0, item0).
Note that we refer to the raw data relation raw_purchases twice and use the same variable user for the user dimension. The first time, for the date and item tuple, we use the variables (date1, item1), and the second time we use different variables, (date0, item0). The condition date0 < date1 ensures that the second purchase occurred before the first.
Finally, all tuples (user, date1, item1, date0, item0) whose values satisfy the logic on the right-hand side are collected and stored in the relation previous_purchases on the left-hand side.
And there we have it: we have associated all items purchased before date1, i.e., item0 purchased on date0, with a given purchased item, i.e., item1 purchased on date1, by a user.
The data stored in the Rel relation previous_purchases looks like the following. The first column is the user ID, the second and third columns are the current purchase date and item, and the fourth and fifth columns are the previous purchase date and item. Although in a different format, it contains everything we need for the downstream machine learning task.
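user, date1, item1, date0, item0
100, 2023-02-13, 14589, 2023-01-23, 12507
100, 2023-02-13, 14589, 2023-01-23, 34089
100, 2023-02-13, 87526, 2023-01-23, 12507
100, 2023-02-13, 87526, 2023-01-23, 34089
100, 2023-03-15, 52896, 2023-01-23, 12507
100, 2023-03-15, 52896, 2023-01-23, 34089
100, 2023-03-15, 52896, 2023-02-13, 14589
100, 2023-03-15, 52896, 2023-02-13, 87526
100, 2023-03-15, 74123, 2023-01-23, 12507
100, 2023-03-15, 74123, 2023-01-23, 34089
100, 2023-03-15, 74123, 2023-02-13, 14589
100, 2023-03-15, 74123, 2023-02-13, 87526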
To get the desired output format, we can use two helper relations that are defined as follows:
- The Rel relation previous_purchases_string, which sorts the previous date and item combinations, i.e., (date0, item0), in ascending order by time and converts the integer item number to a string.
- The Rel relation output_data, which concatenates all previously purchased items in ascending order by time, separated by a comma, for each (user, date1, item1) tuple using a function called string_join.
// prepare output format
def previous_purchases_string(user, date1, item1, i, item_string) =
    sort[date0, item0: previous_purchases(user, date1, item1, date0, item0)](i, _, item) and
    item_string = string[item]
    from item
def output_data[user, date1, item1] = string_join[", ", previous_purchases_string[user, date1, item1]]
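As a side note, previous_purchases is itself an ordinary GNF relation, so it can be reused for other features beyond this output format. For example, a hypothetical helper (not part of the pipeline above) that counts how many previously purchased items are associated with each purchase could be sketched as:

// Hypothetical helper: number of previously purchased items per (user, date1, item1)
def num_previous_items[user, date1, item1] = count[previous_purchases[user, date1, item1]]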
Here is the tabular view of the data in the Rel relation output_data, which exactly matches the expected CSV output we outlined earlier.
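customer_id, date, item_id, previous_purchase_item_id
100, 2023-02-13, 87526, "12507, 34089"
100, 2023-02-13, 14589, "12507, 34089"
100, 2023-03-15, 52896, "12507, 34089, 14589, 87526"
100, 2023-03-15, 74123, "12507, 34089, 14589, 87526"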
In summary, Rel can be a powerful tool for your machine learning data preprocessing for many reasons, including:
- Because of the use of Graph Normal Form (GNF) relations, Rel code is concise and very readable.
- Easily defined data relations facilitate testing and debugging in development.
- Rel can significantly simplify your machine learning data pipeline.

If you have some data transformation logic that seems too complex to express in SQL, give Rel a try.