rai_banner.png

Community Detection and Group Recommendations using RelationalAI

Overview

In this notebook, we will review how to use RelationalAI to run community detection algorithms to identify communities of customers that eat together.

What you will learn

  • How to run all sorts of graph algorithms on our data, where it already lives, to discover new insights.
  • How to visualize graphs.
  • How thanks to native applications with Snowpark Container Services, we can do all of this within the Data Cloud!

Let's get started!

We start by importing all the Python packages and modules that we will need.

In [1]:
import random
import colorsys

import relationalai as rai
from relationalai.std import aggregates, rel
from relationalai.std.graphs import Graph

random.seed(123)

provider = rai.Provider()
In [ ]:
provider.sql("""
begin
    create schema if not exists RAI_DEMO.TASTYBYTES;

    create or replace stage RAI_DEMO.TASTYBYTES.TMP
    URL = 'azure://raidocs.blob.core.windows.net/demos';

    create or replace file format parquet type='PARQUET';

    create or replace table RAI_DEMO.TASTYBYTES.ORDERS
    as
    select
        $1:ORDER_ID::int as ORDER_ID,
        $1:TRUCK_ID::int as TRUCK_ID,
        $1:ORDER_TS::string as ORDER_TS,
        $1:ORDER_TS_SECONDS::int as ORDER_TS_SECONDS,
        $1:ORDER_DETAIL_ID::int as ORDER_DETAIL_ID,
        $1:LINE_NUMBER::int as LINE_NUMBER,
        $1:TRUCK_BRAND_NAME::string as TRUCK_BRAND_NAME,
        $1:MENU_TYPE::string as MENU_TYPE,
        $1:PRIMARY_CITY::string as PRIMARY_CITY,
        $1:REGION::string as REGION,
        $1:COUNTRY::string as COUNTRY,
        $1:FRANCHISE_FLAG::int as FRANCHISE_FLAG,
        $1:FRANCHISE_ID::int as FRANCHISE_ID,
        $1:FRANCHISEE_FIRST_NAME::string as FRANCHISEE_FIRST_NAME,
        $1:FRANCHISEE_LAST_NAME::string as FRANCHISEE_LAST_NAME,
        $1:LOCATION_ID::int as LOCATION_ID,
        $1:CUSTOMER_ID::double as CUSTOMER_ID,
        $1:FIRST_NAME::string as FIRST_NAME,
        $1:LAST_NAME::string as LAST_NAME,
        $1:E_MAIL::string as E_MAIL,
        $1:PHONE_NUMBER::string as PHONE_NUMBER,
        $1:CHILDREN_COUNT::string as CHILDREN_COUNT,
        $1:GENDER::string as GENDER,
        $1:MARITAL_STATUS::string as MARITAL_STATUS,
        $1:MENU_ITEM_ID::int as MENU_ITEM_ID,
        $1:MENU_ITEM_NAME::string as MENU_ITEM_NAME,
        $1:QUANTITY::int as QUANTITY,
        $1:UNIT_PRICE::double as UNIT_PRICE,
        $1:PRICE::double as PRICE,
        $1:ORDER_AMOUNT::double as ORDER_AMOUNT,
        $1:ORDER_TAX_AMOUNT::double as ORDER_TAX_AMOUNT,
        $1:ORDER_DISCOUNT_AMOUNT::double as ORDER_DISCOUNT_AMOUNT,
        $1:ORDER_TOTAL::double as ORDER_TOTAL
    from
        @rai_demo.tastybytes.tmp/loyalty_orders.parquet
    (FILE_FORMAT => parquet);

    drop stage if exists rai_demo.tastybytes.tmp;
end;
""")

Define Model in RelationalAI

Let's define our model object. Models represent collections of objects. Objects, like Python objects, have types and properties, which we will define in a bit.

In [ ]:
model = rai.Model("TastyBytes", ensure_change_tracking=True)

Referencing Snowflake Data

Due to RelationalAI's tight integration with Snowflake, we can refer to data that we are streaming to our RelationalAI database by simply referring to the source:
<my_database>.<my_schema>.<my_table>.

In [3]:
Record = model.Type("Record", source="RAI_DEMO.TASTYBYTES.ORDERS")

Let's run a small query to understand the shape of our data.

In [4]:
# Count number of rows in the data
with model.query() as select:
    record = Record()
    num_records = aggregates.count(record)
    result = select(num_records)

result
Out[4]:
result
1603564

Type Declarations

Now that we know how to refer to our data within RAI, let's define our Types. Types are collections of objects. You create then using the model.Type() method, which returns an instance of the Type class.

Our data contains a set of records which are associated with a Customer, who buys food at a specific Truck. We'll describe the other types, namely Transaction , Community and RelevantConnection in a little while, when we will actually be deriving this information.

In [5]:
Customer = model.Type("Customer")
Truck = model.Type("Trucks")
Transaction = model.Type("Transaction")
Community = model.Type("Community")
RelevantConnection = model.Type("RelevantConnection")

Weaving data into our Model

It's time to start weaving our data into our model. We do this by writing some rules. Rules describe objects in a model and the relationships between them. They are created using the model.rule() context manager.

Defining Customers

This rule adds an object to the Customers type for each occurrence of a unique customer_id. Let's run a query to see how many customers we have in our dataset.

In [6]:
# Define Customer Type
with model.rule():
    r = Record()
    Customer.add(customer_id=r.customer_id)

# Check total number of customers
with model.query() as select:
    customer = Customer()
    num_records = aggregates.count(customer)
    result = select(num_records)

result
Out[6]:
result
10613

Defining Trucks

We do the same as we just did with customers - but now for Trucks.

In [7]:
# Define Truck Type
with model.rule():
    r = Record()
    Truck.add(truck_id=r.truck_id)

# Check total number of trucks
with model.query() as select:
    truck = Truck()
    num_records = aggregates.count(truck)
    result = select(num_records)

result
Out[7]:
result
15

Defining Transactions

We pull the relevant data about a Transaction from the Records in the table.

In [8]:
with model.rule():
    r = Record()
    Transaction.add(
        customer_id=r.customer_id,
        order_id=r.order_id,
        truck_id=r.truck_id,
        order_ts=r.order_ts,
        order_ts_seconds=r.order_ts_seconds,
        location_id=r.location_id,
    )

Finding Connections

Identify pairs of transactions occurring within a 20-minute interval in the same truck.

In [9]:
with model.rule():
    t1 = Transaction()
    t2 = Transaction()

    t1.truck_id == t2.truck_id
    t1.customer_id != t2.customer_id
    rel.abs(t1.order_ts_seconds - t2.order_ts_seconds) <= 1200

    t1.connected.add(t2)
In [10]:
with model.query() as select:
    t = Transaction()
    num_records = aggregates.count(t.customer_id, t.order_ts, t.connected, t.connected.customer_id)
    result = select(num_records)

result
Out[10]:
result
1214311

Reducing connections to 'relevant' ones

To be able to assess the strength of a connection, we first need to know how often a specific connection occurred. We define a RelevantConnection as a connection where two customers have placed orders together more than once. Let's see how many relevant connections we have in our dataset.

If you do not see any output from the result query below, you most likely have the original TastyBytes dataset. Try lowering the filter line ( total_connection > 4 ) to a smaller number like 1 or 2. Reach out to RAI if you would like is to share the custom TastyBytes dataset that shows a richer customer social graph. We can use Snowflake's Data Sharing to quickly give you access.

In [11]:
with model.rule():
    t = Transaction()
    total_connections = aggregates.count(
        t, per=[t.customer_id, t.connected.customer_id]
    )
    total_connections > 4
    RelevantConnection.add(
        customer_1=Customer(customer_id=t.customer_id),
        customer_2=Customer(customer_id=t.connected.customer_id),
        total_connections=total_connections,
    )
In [12]:
# Get the total occurrences where pairs of customers coexisted together more than once
with model.query() as select:
    relevant_con = RelevantConnection()
    num_records = aggregates.count(relevant_con)
    result = select(num_records)

result
Out[12]:
result
9704

Community Detection

Let's build out our community graph. Our nodes are our customers. We use the relevant connections between customers, which we derived earlier, as the edges of our graph.

In [13]:
community_graph = Graph(model, undirected=True)

# Add edges to the graph between customers / Nodes will be added automatically
with model.rule():
    relevant_con = RelevantConnection()
    community_graph.Edge.add(
        relevant_con.customer_1,
        relevant_con.customer_2,
        weight=relevant_con.total_connections,
    )

Detecting communities using Louvain

Let's use the Louvain algorithm to detect communities within our graph.

In [14]:
with model.rule():
    customer = Customer()
    community_id = community_graph.compute.louvain(customer)
    customer.set(community_id=community_id)

    community_graph.Node(customer).set(
        community_id=community_id,
        customer_id=customer.customer_id
    )

Visualize Graph

In [15]:
def generate_random_colors(num_colors):
    """generate random color for each group in large graph."""
    colors = []
    for i in range(num_colors):
        hue = random.random()
        # adjust for better palette
        saturation = random.uniform(0.5, 1.0)
        lightness = random.uniform(0.4, 0.8)

        rgb_color = colorsys.hls_to_rgb(hue, lightness, saturation)
        colors.append('#%02x%02x%02x' % tuple(int(255 * x) for x in rgb_color))
    return colors
In [16]:
data = community_graph.fetch()

num_nodes = len(data["nodes"])
num_edges = len(data["edges"])
print(f"Number of nodes: {num_nodes}, Number of edges: {num_edges}")

community_set = {node['community_id'] for node in data['nodes'].values() if 'community_id' in node}
num_communities = len(community_set)
print(f"Number of communities: {num_communities}")

random_colors = generate_random_colors(num_communities)

community_colors = {}
for i, community in enumerate(community_set):
    community_colors[community] = random_colors[i]
Number of nodes: 3493, Number of edges: 4939
Number of communities: 555
In [17]:
community_graph.visualize(
    three=True,
    node_label_size_factor=1.9,
    use_links_force=True,
    node_hover_neighborhood=True,
    style={
    "node": {
        "color": lambda x : community_colors[x['community_id']],
        "hover": lambda x : f"{x['customer_id']}"
    },
    "edge": {
        "opacity": 0.8,
        "color": "#ccc",
        "hover": lambda x : x['weight']
    }
})
Out[17]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
x-positioning force
Strength
y-positioning force
Strength
z-positioning force
Strength
Centering force

Let's focus on a particular community.

In [18]:
def get_community_id_of_customer( customer_id ):
    mycommunity = None
    for i in data.values():
        for p in i.values():
            try:
                if p['customer_id'] == customer_id:
                    mycommunity = p['community_id']
            except:
                pass
    return mycommunity

community_id_of_interest = get_community_id_of_customer(84087)
community_id_of_interest
Out[18]:
117
In [19]:
community_colors_focus = {}
for i in range(1,len(community_colors)+1):
    community_colors_focus[i] = '#808080'

community_colors_focus[community_id_of_interest] = '#FF0000'

#
community_graph.visualize(
    three=True,
    node_label_size_factor=1.9,
    use_links_force=True,
    node_hover_neighborhood=True,
    style={
    "node": {
        "color": lambda x : community_colors_focus[x['community_id']],
        "size" : lambda x : 20 if x['community_id'] == community_id_of_interest else 2,
        "hover": lambda x : f"{x['customer_id']}"
    },
    "edge": {
        "opacity": 0.8,
        "color": "#ccc",

        "hover": lambda x : x['weight']
    }
})
Out[19]:
Details for selected element
General
App state
Display mode
Export
Data selection
Graph
Node label text
Node size
Minimum
Maximum
Edge size
Minimum
Maximum
Nodes
Visibility
Size
Scaling factor
Position
Drag behavior
Hover behavior
Node images
Visibility
Size
Scaling factor
Node labels
Visibility
Size
Scaling factor
Edges
Visibility
Size
Scaling factor
Form
Curvature
Hover behavior
Layout algorithm
Simulation
Many-body force
Strength
Theta
Min
Max
Links force
Distance
Strength
x-positioning force
Strength
y-positioning force
Strength
z-positioning force
Strength
Centering force

Brought to you by RelationalAI & Snowflake Native Applications!