Hossein Keshavarz, Yasmeen Hany, Pigi Kouki
27 March 2023
5 min read
Recommender systems are one of the most successful and widely used applications of machine learning. Their use cases span a range of industry sectors such as e-commerce, entertainment, and social media.
Recommender systems drive multiple aspects of our everyday life. For example, when a user views a book on Amazon, Amazon offers a list of recommendations offered as “Customers Who Bought This Item Also Bought…”.
An example of a recommender system from Amazon.
In this post, we focus on a fundamental and effective classical approach to recommender systems, which is neighborhood-based.
Neighborhood-based recommender systems fall under the collaborative filtering umbrella and focus on using behavioral patterns, such as movies that users have watched in the past, to identify similar users (i.e., users who demonstrate similar preferences), or similar items (i.e., items that receive similar interest from the same users).
We refer to the technique that computes similar users as user-based and to the technique that focuses on computing similar items as item-based.
An example of the item-based technique is Netflix’s “Because you watched…” feature, which recommends movies or shows based on examples that users previously showed interest in.
An example of a user-based recommender system is booking.com, which recommends destinations based on the historical behavior of other users with similar travel history.
Examples of user-based and item-based approaches.
The image below summarizes the pipeline for our implementation of item-based and user-based recommender systems in our declarative language, Rel. Without loss of generality, we focus on a movie recommendation use case, where we are given interactions between users and movies.
The pipeline of the neighborhood-based recommender system.
For this implementation, we use a version of the popular MovieLens100K dataset containing user-movie interactions, i.e., the presence of a (useri, moviei) pair means that useri has watched moviei.
In what follows, we cover every component of the architecture and explain how to implement it in Rel.
The first step is to convert the input interactions data to a bipartite graph that contains two types of nodes:
Movies, as shown in the image below.
The bipartite graph representation of the MovieLens dataset.
The two node types are connected by an edge that we call
watched. In Rel,
Movies are represented by entity types, and their attributes, such as
name, are represented by value types.
// install entity type Movie = Int entity type User = Int value type Id = Int value type Name = String
Once we define the entity and value types, the next step is to populate the entities with data from the original MovieLens dataset.
Assuming we have a relation called
watched_train(user, movie) that represents the train subset of the MovieLens data and contains the watch history for the users, and a relation called
movie_info(movie, movie_name)that contains the movie names, we create the
Movie entity as follows:
// update module movie_info // entity node: Movie def Movie = ^Movie[m : watched_train(_, m)] // edge: has_id def has_id(movie, id) = movie = ^Movie[m] and id = ^Id[m] and watched_train(_, m) from m // edge: has_name def has_name(movie, name) = movie = ^Movie[m] and name = ^Name[n] and watched_train(_, m) and movie_info(m, n) from m,n end // store the data in the `MovieGraph` base relation def insert:MovieGraph = movie_info
User entity is created similarly. Finally, we add an additional edge called
watched that connects the
movie entity to the
module watched_info //edge: watched def watched(user, movie) = user = ^User[u] and movie = ^Movie[m] and watched_train[u, m] from u,m // a transposed version of `watched` to be used in the item-based approach def watched_t(movie, user) = watched(user, movie) end // storing the data in the `MovieGraph` base relation def insert:MovieGraph = watched_info
Now that we have modeled our data as a graph, we can compute item-item and user-user similarities using the user-item interactions: movies that have been watched by the same users will have a high similarity value, while movies that have been watched by different users will have a low similarity value.
Here, we focus on the item-based method. The approach for the user-based method is very similar. There are several similarity metrics that can be used for this task. Currently, the Rel graph library provides the
jaccard_similarity relations, which can be used as follows:
with MovieGraph use rated, User, Movie def G = undirected_graph[rated] @ondemand def L = rel:graphlib[G] def similarities(m1, score, m2) = score = L:jaccard_similarity[m1, m2] and // compute only movie-movie similarities Movie[m1] and Movie[m2] and // filter out self-similarities and 0 score similarities m1 != m2 and score != 0
The relation similarities now contains movie-movie Jaccard similarity scores.
In addition to the cosine and jaccard similarities, we can define any similarity metric. In Rel, expressing mathematical formulas is straightforward.
Here is an example defining a custom version of the dice similarity metric:
// install @outline module dice_similarity[shrink] def similarity_score[:binary, R1, R2](score)= n = count[x : R1(x) and R2(x)] and d_1 = count[x : R1(x)] and d_2 = count[x : R2(x)] and score = decimal[64, 4, n / (d_1 + d_2 + shrink + power[10, -6.0])] from n, d_1, d_2 def similarity_score[:values, R1, R2](score)= n = sum[x : R1[x] * R2[x]] and d_1 = sum[x : R1[x]] and d_2 = sum[x : R2[x]] and score = decimal[64, 4, n / (d_1 + d_2 + shrink + power[10, -6.0])] from n, d_1, d_2 end
Using the similarities calculated in the previous step, we then compute the (user, movie) scores for all pairs. We predict that a user will watch movies that are similar to the movies they have watched in the past (item-based approach).
The score for a pair (user, movie) indicates how likely it is for a user to watch a movie and is calculated as follows:
In other words, the score is the sum of the similarity scores of the target movie’s nearest neighbors that have been watched by the target user.
The Rel code for the equation above is:
// install // relation that predicts the score of a (user,movie) pair @inline def pred[:item_based, :binary, neighborhood_size, M, S, T](user, score, movie) = score = sum[T[neighborhood_size, S][movie,n_movie] for n_movie // the neighbor should be watched by the target user where M(n_movie, user) and // no score should be calculated for (user,movie) that appears in the train data not M(movie, user)]
pred relation takes the following inputs:
neighborhood_size: The number of similar movies (neighbors) we use to predict the score
M: The relation containing (movie, user) pairs
S: The similarity metric, e.g., jaccard, cosine
T: The relation that selects the top
neighborhood_sizemost similar movies to the target movie (i.e., the nearest neighbors)
To generate a list of k recommendations for a user, the last step is to rank all the scores for the particular user from highest to lowest, and then select the top k scoring movies.
Assuming we have a relation called
test_items_scores(user, score, movie), we generate the top-k (where k=10 for this example) recommendations as follows:
// query // define the size of the recommendations list def k_recommendations = 10 // select the top k scores def top_k_recommendations[user, rank, movie] = bottom[k_recommendations, test_items_scores[user]](rank, score, movie) from score
To evaluate performance, we define a set of evaluation metrics that are widely used, i.e.,: precision, recall, NDCG, and hit rate.
These metrics compute how much our recommended items overlap with the true watched items of a given user that exists in a hidden test set.
Below is the Rel code that implements precision and recall.
// install @inline def precision_at_k[k, PREDICTED_ITEMS, TARGET_ITEMS](score) = score = count[movie : PREDICTED_ITEMS(i,movie) and TARGET_ITEMS(movie) from i where i <= k] / k @inline def recall_at_k[k, PREDICTED_ITEMS, TARGET_ITEMS](score) = score = count[movie : PREDICTED_ITEMS(i,movie) and TARGET_ITEMS(movie) from i where i <= k] / count[movie : TARGET_ITEMS(movie)]
Now that we have defined the necessary components for the neighborhood-based recommender system, let’s review a sample run that generates recommendations and evaluates them using the item-based method:
// query with MovieGraph use watched, watched_t with MovieGraph_test use User def neighborhood_size = 20 def rating_type = :binary def k_recommendations = 10 def ratings = watched_t def approach = :item_based def G = undirected_graph[watched] with rel:graphlib[G] use jaccard_similarity def sim(m1, score, m2) = score = jaccard_similarity[m1, m2] and Movie[m1] and Movie[m2] and m1 != m2 and score != 0 def item_scores = pred[approach, rating_type, neighborhood_size, ratings, sim, top_k] def top_k_recommendations(user, rank, movie) = bottom[k_recommendations, item_scores[user]](rank, score, movie) from score def predicted_ranking = top_k_recommendations def target_ranking = MovieGraph_test:watched def output = average[precision_at_k[k_recommendations, predicted_ranking[u], target_ranking[u]] <++ 0.0 for u where User[u]]
In a nutshell:
predrelation, which computes the predicted score for all (user, movie) pairs.
For the MovieLens100K dataset, the user-based method slightly outperforms the item-based method. For k =10, we achieve a best precision score of 33.6% when using the asymmetric cosine similarity metric.
The recall score for this similarity metric is 21.2%. This result is in accordance with the literature for the neighborhood-based class of algorithms.
The whole pipeline is implemented in our Relational Knowledge Graph System (RKGS), meaning that the data does not need to leave the database at any point. Once the pipeline is implemented, generating recommendations for users is as simple as querying the database to retrieve the top
K recommendations for a particular user
Our next post will explain how to use Rel to implement two graph-based algorithms that slightly outperform the results of the neighborhood based methods explained here.
Part 3 will describe how to use Rel to provide content-based and hybrid recommender systems that leverage additional information signals, such as the genre of a movie.
RelationalAI's declarative modeling language Rel can be a powerful tool for machine learning data preprocessing. It is concise, readable, and facilitates testing and debugging in development. Rel can significantly simplify your machine learning data pipeline.Read More
In this series of blog posts, we show how to implement neighborhood-based, graph-based, content-based, and hybrid recommender systems using RelationalAI’s declarative modeling language, Rel. The implementations of these algorithms demonstrate the efficiency of our Relational Knowledge Graph System (RKGS) for aggregating over paths on large and sparse graphs, computing similarities using our graph analytics library, and building compact, easy to read models, without needing to transfer data outside of our system.Read More
In our previous blog post, we explained how to model traditional neighborhood-based recommender systems in Rel. In what follows, we focus on modeling graph-based recommender systems.Read More