Community Detection and Group Recommendations using RelationalAI
Overview
In this notebook, we will review how to use RelationalAI to run community detection algorithms to identify communities of customers that eat together.
What you will learn
- How to run a variety of graph algorithms on our data, right where it already lives, to discover new insights.
- How to visualize graphs.
- How, thanks to native applications built on Snowpark Container Services, we can do all of this within the Data Cloud!
Let's get started!
We start by importing all the Python packages and modules that we will need.
import random
import colorsys
import relationalai as rai
from relationalai.std import aggregates, rel
from relationalai.std.graphs import Graph
random.seed(123)
provider = rai.Provider()
provider.sql("""
begin
create schema if not exists RAI_DEMO.TASTYBYTES;
create or replace stage RAI_DEMO.TASTYBYTES.TMP
URL = 'azure://raidocs.blob.core.windows.net/demos';
create or replace file format parquet type='PARQUET';
create or replace table RAI_DEMO.TASTYBYTES.ORDERS
as
select
$1:ORDER_ID::int as ORDER_ID,
$1:TRUCK_ID::int as TRUCK_ID,
$1:ORDER_TS::string as ORDER_TS,
$1:ORDER_TS_SECONDS::int as ORDER_TS_SECONDS,
$1:ORDER_DETAIL_ID::int as ORDER_DETAIL_ID,
$1:LINE_NUMBER::int as LINE_NUMBER,
$1:TRUCK_BRAND_NAME::string as TRUCK_BRAND_NAME,
$1:MENU_TYPE::string as MENU_TYPE,
$1:PRIMARY_CITY::string as PRIMARY_CITY,
$1:REGION::string as REGION,
$1:COUNTRY::string as COUNTRY,
$1:FRANCHISE_FLAG::int as FRANCHISE_FLAG,
$1:FRANCHISE_ID::int as FRANCHISE_ID,
$1:FRANCHISEE_FIRST_NAME::string as FRANCHISEE_FIRST_NAME,
$1:FRANCHISEE_LAST_NAME::string as FRANCHISEE_LAST_NAME,
$1:LOCATION_ID::int as LOCATION_ID,
$1:CUSTOMER_ID::double as CUSTOMER_ID,
$1:FIRST_NAME::string as FIRST_NAME,
$1:LAST_NAME::string as LAST_NAME,
$1:E_MAIL::string as E_MAIL,
$1:PHONE_NUMBER::string as PHONE_NUMBER,
$1:CHILDREN_COUNT::string as CHILDREN_COUNT,
$1:GENDER::string as GENDER,
$1:MARITAL_STATUS::string as MARITAL_STATUS,
$1:MENU_ITEM_ID::int as MENU_ITEM_ID,
$1:MENU_ITEM_NAME::string as MENU_ITEM_NAME,
$1:QUANTITY::int as QUANTITY,
$1:UNIT_PRICE::double as UNIT_PRICE,
$1:PRICE::double as PRICE,
$1:ORDER_AMOUNT::double as ORDER_AMOUNT,
$1:ORDER_TAX_AMOUNT::double as ORDER_TAX_AMOUNT,
$1:ORDER_DISCOUNT_AMOUNT::double as ORDER_DISCOUNT_AMOUNT,
$1:ORDER_TOTAL::double as ORDER_TOTAL
from
@rai_demo.tastybytes.tmp/loyalty_orders.parquet
(FILE_FORMAT => parquet);
drop stage if exists rai_demo.tastybytes.tmp;
end;
""")
Define Model in RelationalAI
Let's define our model object. Models represent collections of objects. Objects, like Python objects, have types and properties, which we will define in a bit.
model = rai.Model("TastyBytes")
Referencing Snowflake Data
Due to RelationalAI's tight integration with Snowflake, we can refer to data that we are streaming to our RelationalAI database by simply referring to the source:
<my_database>.<my_schema>.<my_table>.
Record = model.Type("Record", source="RAI_DEMO.TASTYBYTES.ORDERS")
Let's run a small query to understand the shape of our data.
# Count number of rows in the data
with model.query() as select:
record = Record()
num_records = aggregates.count(record)
result = select(num_records)
result
result |
---|
1603564 |
Type Declarations
Now that we know how to refer to our data within RAI, let's define our Types. Types are collections of objects. You create them using the model.Type() method, which returns an instance of the Type class.
Our data contains a set of records, each associated with a Customer who buys food at a specific Truck. We'll describe the other types, namely Transaction, Community, and RelevantConnection, in a little while, when we actually derive this information.
Customer = model.Type("Customer")
Truck = model.Type("Truck")
Transaction = model.Type("Transaction")
Community = model.Type("Community")
RelevantConnection = model.Type("RelevantConnection")
Weaving data into our Model
It's time to start weaving our data into our model. We do this by writing some rules. Rules describe objects in a model and the relationships between them. They are created using the model.rule() context manager.
Defining Customers
This rule adds an object to the Customer type for each unique customer_id. Let's run a query to see how many customers we have in our dataset.
# Define Customer Type
with model.rule():
r = Record()
Customer.add(customer_id=r.customer_id)
# Check total number of customers
with model.query() as select:
customer = Customer()
num_records = aggregates.count(customer)
result = select(num_records)
result
result |
---|
10613 |
Defining Trucks
We do the same as we just did with customers, but now for trucks.
# Define Truck Type
with model.rule():
r = Record()
Truck.add(truck_id=r.truck_id)
# Check total number of trucks
with model.query() as select:
truck = Truck()
num_records = aggregates.count(truck)
result = select(num_records)
result
result |
---|
15 |
Defining Transactions
We pull the relevant data about each Transaction from the Record objects in the table.
with model.rule():
r = Record()
Transaction.add(
customer_id=r.customer_id,
order_id=r.order_id,
truck_id=r.truck_id,
order_ts=r.order_ts,
order_ts_seconds=r.order_ts_seconds,
location_id=r.location_id,
)
Finding Connections
Next, we identify pairs of transactions placed by different customers at the same truck within a 20-minute interval (1,200 seconds).
with model.rule():
t1 = Transaction()
t2 = Transaction()
t1.truck_id == t2.truck_id
t1.customer_id != t2.customer_id
rel.abs(t1.order_ts_seconds - t2.order_ts_seconds) <= 1200
t1.connected.add(t2)
with model.query() as select:
t = Transaction()
num_records = aggregates.count(t.customer_id, t.order_ts, t.connected, t.connected.customer_id)
result = select(num_records)
result
result |
---|
1214311 |
Reducing connections to 'relevant' ones
To assess the strength of a connection, we first need to know how often a specific connection occurred. We define a RelevantConnection as a connection where two customers have repeatedly placed orders together (here, more than four times). Let's see how many relevant connections we have in our dataset.
If you do not see any output from the result query below, you most likely have the original TastyBytes dataset. Try lowering the filter line (total_connections > 4) to a smaller number like 1 or 2. Reach out to RAI if you would like us to share the custom TastyBytes dataset that shows a richer customer social graph. We can use Snowflake's Data Sharing to quickly give you access.
with model.rule():
t = Transaction()
total_connections = aggregates.count(
t, per=[t.customer_id, t.connected.customer_id]
)
total_connections > 4
RelevantConnection.add(
customer_1=Customer(customer_id=t.customer_id),
customer_2=Customer(customer_id=t.connected.customer_id),
total_connections=total_connections,
)
# Count the relevant connections, i.e., pairs of customers that repeatedly ordered together
with model.query() as select:
relevant_con = RelevantConnection()
num_records = aggregates.count(relevant_con)
result = select(num_records)
result
result |
---|
9704 |
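If you want to peek at a few of these connections rather than just counting them, you can select their properties directly. This is a minimal sketch that reuses only query patterns already shown above:
# Sketch: inspect relevant connections and their strength
with model.query() as select:
    rc = RelevantConnection()
    result = select(
        rc.customer_1.customer_id,
        rc.customer_2.customer_id,
        rc.total_connections,
    )
result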
Community Detection
Let's build out our community graph. Our nodes are our customers. We use the relevant connections between customers, which we derived earlier, as the edges of our graph.
community_graph = Graph(model, undirected=True)
# Add edges to the graph between customers / Nodes will be added automatically
with model.rule():
relevant_con = RelevantConnection()
community_graph.Edge.add(
relevant_con.customer_1,
relevant_con.customer_2,
weight=relevant_con.total_connections,
)
Detecting communities using Louvain
Let's use the Louvain algorithm to detect communities within our graph.
with model.rule():
customer = Customer()
community_id = community_graph.compute.louvain(customer)
customer.set(community_id=community_id)
community_graph.Node(customer).set(
community_id=community_id,
customer_id=customer.customer_id
)
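Before visualizing, we can sanity-check the result by counting how many customers land in each community. A minimal sketch reusing the aggregates.count pattern from earlier (grouping with per=[...]):
# Sketch: number of customers per detected community
with model.query() as select:
    customer = Customer()
    community_size = aggregates.count(customer, per=[customer.community_id])
    result = select(customer.community_id, community_size)
result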
Visualize Graph
def generate_random_colors(num_colors):
    """Generate a random hex color for each community."""
    colors = []
    for _ in range(num_colors):
        hue = random.random()
        # Constrain saturation and lightness for a more readable palette
        saturation = random.uniform(0.5, 1.0)
        lightness = random.uniform(0.4, 0.8)
        rgb_color = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append('#%02x%02x%02x' % tuple(int(255 * x) for x in rgb_color))
    return colors
data = community_graph.fetch()
num_nodes = len(data["nodes"])
num_edges = len(data["edges"])
print(f"Number of nodes: {num_nodes}, Number of edges: {num_edges}")
community_set = {node['community_id'] for node in data['nodes'].values() if 'community_id' in node}
num_communities = len(community_set)
print(f"Number of communities: {num_communities}")
random_colors = generate_random_colors(num_communities)
community_colors = {}
for i, community in enumerate(community_set):
community_colors[community] = random_colors[i]
Number of nodes: 3493, Number of edges: 4939
Number of communities: 555
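Out of curiosity, we can also check which communities are the largest directly from the fetched data, using a standard Counter. A small sketch:
from collections import Counter

# Sketch: find the largest detected communities in the fetched graph data
community_sizes = Counter(
    node["community_id"] for node in data["nodes"].values() if "community_id" in node
)
print("Five largest communities:", community_sizes.most_common(5))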
community_graph.visualize(
three=True,
node_label_size_factor=1.9,
use_links_force=True,
node_hover_neighborhood=True,
style={
"node": {
"color": lambda x : community_colors[x['community_id']],
"hover": lambda x : f"{x['customer_id']}"
},
"edge": {
"opacity": 0.8,
"color": "#ccc",
"hover": lambda x : x['weight']
}
})
Let's focus on a particular community.
def get_community_id_of_customer(customer_id):
    """Return the community_id of the node with the given customer_id, or None."""
    for node in data["nodes"].values():
        if node.get("customer_id") == customer_id:
            return node["community_id"]
    return None
community_id_of_interest = get_community_id_of_customer(84087)
community_id_of_interest
117
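We can also list the customers that belong to this community directly from the fetched graph data, since each node carries the customer_id and community_id we set in the Louvain rule. A small sketch:
# Sketch: list the members of the community of interest
members = [
    node["customer_id"]
    for node in data["nodes"].values()
    if node.get("community_id") == community_id_of_interest
]
print(f"Community {community_id_of_interest} has {len(members)} members")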
# Grey out every community except the one we want to highlight
community_colors_focus = {community: '#808080' for community in community_colors}
community_colors_focus[community_id_of_interest] = '#FF0000'
community_graph.visualize(
three=True,
node_label_size_factor=1.9,
use_links_force=True,
node_hover_neighborhood=True,
style={
"node": {
"color": lambda x : community_colors_focus[x['community_id']],
"size" : lambda x : 20 if x['community_id'] == community_id_of_interest else 2,
"hover": lambda x : f"{x['customer_id']}"
},
"edge": {
"opacity": 0.8,
"color": "#ccc",
"hover": lambda x : x['weight']
}
})
Brought to you by RelationalAI & Snowflake Native Applications!