Metadata Management Series: Compliance Reasoning with Knowledge Graphs

Data governance policies impose requirements over the assets – e.g., database tables, documents, and services – that make up an enterprise data landscape. Here are two examples that illustrate the nature of these requirements:

  • To support financial audits, attributes X, Y, and Z of any item that is bought or sold must be traceable from each transaction involving that item.
  • Sensitive data from multiple clients shall never commingle in the same asset.

Checking such requirements involves reasoning over multiple assets and flows of data among those assets. For enterprises with hundreds or thousands of assets, these checks are expensive and complicated.

Even worse, any change to any asset or flow in the landscape could potentially violate a requirement. Therefore, to remain compliant with policy, every update to the landscape should be examined to decide whether requirements that were satisfied before the change continue to be satisfied after or if some action must be taken. This brittleness under change is one of the main reasons why building and maintaining a data fabric is a highly manual and tedious process.

Data architects and governance officers try to manage this problem by building models of their data landscape. The models contain descriptions of the data in the assets, i.e., metadata, including information about data flows. Our first post in this series explained why managing and enforcing policy requirements at scale requires weaving two different kinds of metadata into a knowledge graph.

In this article we explain how such a knowledge graph automates:

  • checking requirements over the entire data landscape; and
  • maintaining these checks as data assets are changed or deleted.

To focus our explanation on the “nuts and bolts” of checks and maintenance, we drill down into a traceability example to show how the requirement is checked and how it is maintained as data assets are updated or deleted.

Example: Reasoning About a Traceability Requirement

Consider a financial services enterprise that records securities in an authoritative table called Securities, and stores trades in a table called Trades that is managed by a different SQL database:

CREATE TABLE Securities(
  ID uniqueidentifier PRIMARY KEY,
  Cusip varchar(16) UNIQUE NOT NULL,
  Xchg varchar(20) NOT NULL
);

CREATE TABLE Trades(
  TradeId varchar(32) PRIMARY KEY,
  TxnType varchar(8) NOT NULL,
  Qty decimal(16,4) NOT NULL,
  SecId varchar(32) NOT NULL
);

Figure 1

Consider also a governance policy that mandates that the exchange on which a security is traded must be traceable from every trade.

Looking at the names and types of the columns declared in Trades, none appear to store anything about the exchange a security is traded on, but Securities declares a column called Xchg, and we know from the NOT NULL constraint on this column that every security named in that table records this information. If we knew that column actually records an exchange identifier, and if every security traded in Trades is represented in Securities, we could reason that every trade in Trades satisfies the traceability requirement.

Notice how our informal reasoning combines information about schema details with knowledge that enriches those details with semantics, e.g., “Securities.Xchg records identifiers for the exchange a security is traded on” and “every security traded in Trades is represented in Securities”. As we explained in our first article, schema details are discovered (or discoverable) metadata, while the semantics that enrich them are asserted metadata. By weaving these two forms of metadata into a knowledge graph, we can automate and continuously maintain the reasoning that a data architect performs informally and that is so critical to maintaining compliance with policy.

The Structure of Knowledge Graphs and How to Reason About Them

Figure 2 depicts a snippet of the knowledge graph constructed from the schema in Figure 1. The graph comprises nodes and connections. Each node is either:

  • a value of some primitive data type, e.g., “Securities,” “ID,” etc., depicted as a dashed ellipse that encloses the value; or
  • some opaque[^footnote1] entity, depicted as a grey circle.

Each connection, depicted by a solid line segment, connects two or more nodes. In this example, each connection has exactly two endpoints, but that need not be the case in general.

Nodes in the graph classify under concepts, such as Table and Column, depicted by green rectangles. Dotted lines link concepts to the nodes they classify.

Figure 2

One way to reason over a graph is to query whether some node classifies under some concept. For instance, if n is the grey node at the top left of Figure 2, the graph will answer true when posed with the query “Table(n).” Were n either of the bottom-left nodes (i.e., the value “Xchg” or the grey circle connected to that value) the graph would answer false because neither of those nodes classifies under Table.

Just as concepts classify nodes, relationships classify connections. Figure 2 displays relationships as purple rectangles with dotted lines linking to the connections they classify. Suppose that t and c are nodes that represent the Trades table and SecId column respectively from Figure 1. The graph depicted above will answer the query “contains(t, c)” with true. Notice that when querying the classification of a connection, the nodes at each endpoint of a connection must be listed in some order. We refer to each endpoint (position) as a role and say the node connected at that endpoint plays that role. In this example, nodes t and c play the Table and Column roles respectively in a connection of the contains relationship. To clarify roles of connections in the diagram, we decorate each line segment with a “box” to denote the role that comes first in queries involving that connection.
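The two kinds of classification query described above can be emulated outside a knowledge graph system. The following is a minimal Python sketch, not the graph's actual implementation: it assumes concepts can be modeled as sets of nodes and relationships as sets of tuples whose positions encode roles. The node identifiers and the helper `holds` are hypothetical names introduced for illustration.

```python
# Hypothetical identifiers stand in for the opaque (grey) entity nodes.
securities, trades = "tbl:Securities", "tbl:Trades"
xchg, sec_id = "col:Securities.Xchg", "col:Trades.SecId"

# A concept is modeled as the set of nodes it classifies.
Table = {securities, trades}
Column = {xchg, sec_id}

# A relationship is modeled as a set of tuples; tuple position encodes the role.
# contains(table, column): the Table role comes first, the Column role second.
contains = {(securities, xchg), (trades, sec_id)}


def holds(classifier, *nodes):
    """Answer a classification query such as Table(n) or contains(t, c)."""
    member = nodes[0] if len(nodes) == 1 else tuple(nodes)
    return member in classifier


print(holds(Table, securities))          # Table(n) for the Securities node
print(holds(Table, xchg))                # a Column node queried against Table
print(holds(contains, trades, sec_id))   # contains(t, c)
print(holds(contains, sec_id, trades))   # same nodes with the roles swapped
```

The four queries print True, False, True, and False. The last one shows why the order of roles matters when querying a connection.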

Reasoning by Deriving Concepts and Relationships with Rules

Graphs can also answer with information other than true or false. The query:

c1, c2 : exists(t: contains(t, c1) and contains(t, c2) and c1 != c2 )

requests the maximal set of pairs (c1, c2) of nodes where each pair satisfies the formula that follows the ‘:’ separator. Judging from how c1 and c2 are used in that formula, both c1 and c2 will be bound to Column nodes. So the graph will answer with a set of all pairs of distinct Column nodes that are contained in the same table.

To answer this query, the graph reasons over every node t that plays the Table role[^footnote2] in some contains connection. Then for each pair of such connections, the graph extracts the nodes that play the Column role and adds them to the query’s answer. Suppose nodes i and x correspond to the ID and Xchg columns of the Securities table – i.e., the two grey nodes at the lower left of Figure 2. Then the answer to this query would include the pairs (i, x) and (x, i)[^footnote3].
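As a rough illustration of how this answer comes about, the query can be mimicked with a Python set comprehension over the contains connections of Figure 1. This is only a sketch under the assumption that connections are stored as (table, column) tuples; the string identifiers are illustrative.

```python
# Discovered metadata from Figure 1 as (table, column) pairs of the
# contains relationship. The string identifiers are illustrative only.
contains = {
    ("Securities", "Securities.ID"),
    ("Securities", "Securities.Cusip"),
    ("Securities", "Securities.Xchg"),
    ("Trades", "Trades.TradeId"),
    ("Trades", "Trades.TxnType"),
    ("Trades", "Trades.Qty"),
    ("Trades", "Trades.SecId"),
}

# c1, c2 : exists(t: contains(t, c1) and contains(t, c2) and c1 != c2)
pairs = {
    (c1, c2)
    for (t1, c1) in contains
    for (t2, c2) in contains
    if t1 == t2 and c1 != c2
}

i, x = "Securities.ID", "Securities.Xchg"
print((i, x) in pairs, (x, i) in pairs)  # both orderings are in the answer
print((i, i) in pairs)                   # excluded by c1 != c2
```

The answer holds 18 ordered pairs: 6 from the three columns of Securities and 12 from the four columns of Trades. No pair crosses table boundaries.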

By default, graphs do not retain the answers to queries, but sometimes we want to use a query to define some new concept or relationship in the graph, similar to how we use a query to define a derived view in a database. We instruct the graph to derive and maintain new concepts and relationships by installing rules.

Suppose we want to reason about the traceability of data stored in one column from that stored in another. Because required columns have non-null values in every row of a table, every column of a table is traceable to any required column of that table. The following rule derives the traceability of columns to required columns[^footnote4] in the same table, storing each pair as a new connection in a relationship called traces_to:

def traces_to(c1, c2) { // R1
  exists(t: contains(t, c1) and contains(t, c2) and
    RequiredColumn(c2) and c1 != c2)
}

Each connection verbalizes as “Column c1 traces to Column c2.” Notice that this rule excludes self connections.

Figure 3 depicts a snippet of the knowledge graph following the installation of this derived relationship (in blue outline) and some of the connections that it classifies[^footnote5] (also in blue).

Figure 3

This new relationship can be used in queries or other rules, and as connections are added or removed from the contains relationship, connections will automatically accrue or be retracted from traces_to according to the rule. In this way, we have added to the graph new reasoned knowledge that will be automatically maintained as metadata about the Securities or Trades tables change. Knowledge graph support for this “always on and continuously-maintained” reasoning is one of the main reasons they are useful for checking and maintaining policy requirements. Not only does the graph do the work to derive information using complex rules, it also knows when to repair the derived information following some change that might affect it.
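To make the accrual of derived connections concrete, the sketch below emulates rule [R1] in Python and re-derives traces_to after a change to contains. It is a simplification: the function recomputes the relationship from scratch, which stands in for the incremental maintenance a knowledge graph would perform. The function name `derive_traces_to` and the added `Notes` column are hypothetical.

```python
def derive_traces_to(contains, required):
    """Emulate rule [R1]: c1 traces to c2 when both columns belong to the
    same table, c2 is a required column, and c1 != c2."""
    return {
        (c1, c2)
        for (t1, c1) in contains
        for (t2, c2) in contains
        if t1 == t2 and c2 in required and c1 != c2
    }


contains = {
    ("Securities", "Securities.ID"),
    ("Securities", "Securities.Cusip"),
    ("Securities", "Securities.Xchg"),
}
# Every column of Securities in Figure 1 is PRIMARY KEY or NOT NULL.
required = {"Securities.ID", "Securities.Cusip", "Securities.Xchg"}

before = derive_traces_to(contains, required)

# Hypothetical change: a nullable column named Notes is added to Securities.
contains_after = contains | {("Securities", "Securities.Notes")}
after = derive_traces_to(contains_after, required)

# The new column traces to each required column, but nothing traces to it
# because it is not itself required.
print(sorted(after - before))
```

Adding the column yields three new connections, one from Notes to each required column, and leaves the six original connections untouched.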

As we discuss next, a derived concept or relationship may be defined by more than one rule, which is useful when a different rule expresses a different condition for finding connections that classify under the same relationship. Rule [R1] above defines traces_to connections among columns from the same table, but we might also define more complex connections that cross table boundaries. Such a rule would reason over asserted metadata, as we now explain.

Weaving in Asserted Metadata and Checking Traceability Policy

Recall that in our informal reasoning about the traceability requirement we said, “every security traded by some trade in Trades is represented in Securities.” To use that assertion when reasoning, the graph must represent it explicitly, and it must be woven with any discovered metadata that it refers to. Figure 4 shows one way to record such an assertion and weave it in with the discovered metadata.

Figure 4

In this diagram, we use red outlines to differentiate those nodes and connections that represent asserted metadata and the concepts and relationships that are used to classify them as assertions. Notice a new concept called Flow, a new node that classifies under that concept, and two new relationships – called maps_source and maps_to_target – that connect Flow nodes to Column nodes. Connections of these new relationships weave asserted with discovered nodes.

Now let’s extend traces_to with a new rule that derives connections from flows. The rule:

def traces_to(t, s) { // R2
  exists(f: maps_source(f, s) and maps_to_target(f, t) and
    RequiredColumn(s))
}

reasons that column t traces to s if there exists some flow f whose source is some required column s, and whose target is t. Once we install this rule, the knowledge graph will update to derive additional connections:

Figure 5

Notice the new traces_to connection labeled c1 in Figure 5.
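Rule [R2] can be emulated in the same spirit. The sketch below assumes that the asserted flow has Securities.ID as its source and Trades.SecId as its target; the figures are not reproduced in text, so this choice of columns, the flow identifier `flow1`, and the function name `derive_r2` are all assumptions made for illustration.

```python
# Asserted metadata: one Flow node with a source column and a target column.
# Which columns the flow connects is an assumption of this sketch.
maps_source = {("flow1", "Securities.ID")}     # maps_source(f, s)
maps_to_target = {("flow1", "Trades.SecId")}   # maps_to_target(f, t)

# Columns adorned with PRIMARY KEY or NOT NULL in Figure 1 (subset shown).
required = {"Securities.ID", "Securities.Xchg", "Trades.TradeId", "Trades.SecId"}


def derive_r2(maps_source, maps_to_target, required):
    """Emulate rule [R2]: t traces to s when some flow has the required
    column s as its source and t as its target."""
    return {
        (t, s)
        for (f1, s) in maps_source
        for (f2, t) in maps_to_target
        if f1 == f2 and s in required
    }


flow_connections = derive_r2(maps_source, maps_to_target, required)
print(flow_connections)
```

Under these assumptions the rule derives a single connection from Trades.SecId to Securities.ID, which crosses the table boundary that rule [R1] cannot cross.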

Having defined [R2], we want any connections that it derives to combine with those derived by [R1] to form composite or transitive connections. The following rule derives these transitive connections by combining existing traces_to connections:

def traces_to(t, s) { // R3
  exists(i: traces_to(t, i) and traces_to(i, s) and t != s)
}

After adding this rule, the graph looks like this:

Figure 6

Notice how each of the new connections c2-c4 is derived using rule [R3] from connections derived by rules [R1] and [R2]. The connection labeled c4 establishes the traceability of the primary key of the Trades table to a column that (by its name at least) purports to hold exchange information. While we have yet to impose any business semantics on these two columns, a traces_to connection between them already gets us closer to being able to automatically check the traceability requirement.
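Because rule [R3] refers to traces_to in its own body, evaluating it amounts to computing a fixpoint: keep composing connections until no new one appears. The Python sketch below emulates rules [R1] through [R3] together on the schema of Figure 1. It assumes the asserted flow runs from Securities.ID to Trades.SecId, and the function and flow names are hypothetical.

```python
def derive_traces_to(contains, required, maps_source, maps_to_target):
    # [R1]: columns of one table trace to that table's required columns.
    traces = {
        (c1, c2)
        for (t1, c1) in contains
        for (t2, c2) in contains
        if t1 == t2 and c2 in required and c1 != c2
    }
    # [R2]: a flow's target column traces to its required source column.
    traces |= {
        (t, s)
        for (f1, s) in maps_source
        for (f2, t) in maps_to_target
        if f1 == f2 and s in required
    }
    # [R3]: compose connections until no new connection appears (a fixpoint).
    while True:
        new = {
            (t, s)
            for (t, i) in traces
            for (j, s) in traces
            if i == j and t != s
        } - traces
        if not new:
            return traces
        traces |= new


columns = {
    "Securities": ["ID", "Cusip", "Xchg"],
    "Trades": ["TradeId", "TxnType", "Qty", "SecId"],
}
contains = {(t, f"{t}.{c}") for t, cs in columns.items() for c in cs}
required = {c for (_, c) in contains}  # all are PRIMARY KEY or NOT NULL
maps_source = {("flow1", "Securities.ID")}      # assumed flow source
maps_to_target = {("flow1", "Trades.SecId")}    # assumed flow target

traces_to = derive_traces_to(contains, required, maps_source, maps_to_target)
print(("Trades.TradeId", "Securities.Xchg") in traces_to)  # the c4 connection
```

Under the stated flow assumption, the chain TradeId to SecId (by [R1]), SecId to ID (by [R2]), and ID to Xchg (by [R1]) composes into the connection from Trades.TradeId to Securities.Xchg. No connection runs in the opposite direction, because no flow targets a column of Securities.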

Also, because the knowledge graph continuously maintains derived relationships, any update to the graph that impacts a connection like c4 will be immediately reflected. Suppose, for instance, the Securities table is modified to drop the Xchg column or remove its NOT NULL constraint. Dropping the column would remove the node from the graph, which would then remove connections c3 and c4. Likewise, modifying the schema to remove the NOT NULL constraint on that column would remove the traces_to connection from ID to Xchg, which would again result in the loss of connections c3 and c4. Either way, the loss of c4 disconnects Trades.TradeId from Securities.Xchg, thereby breaking the chain of reasoning that was used to demonstrate compliance with the traceability requirement.
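Both schema changes can be replayed in a small Python emulation of rules [R1] through [R3]. As before, this is a sketch under assumptions: the flow is taken to run from Securities.ID to Trades.SecId, only the four columns on the chain are modeled, and full recomputation stands in for the graph's own maintenance.

```python
def derive_traces_to(contains, required, maps_source, maps_to_target):
    """Emulate [R1], [R2], and the fixpoint of [R3] over plain Python sets."""
    traces = {
        (c1, c2)
        for (t1, c1) in contains
        for (t2, c2) in contains
        if t1 == t2 and c2 in required and c1 != c2
    } | {
        (t, s)
        for (f1, s) in maps_source
        for (f2, t) in maps_to_target
        if f1 == f2 and s in required
    }
    while True:
        new = {
            (t, s)
            for (t, i) in traces
            for (j, s) in traces
            if i == j and t != s
        } - traces
        if not new:
            return traces
        traces |= new


contains = {
    ("Securities", "Securities.ID"),
    ("Securities", "Securities.Xchg"),
    ("Trades", "Trades.TradeId"),
    ("Trades", "Trades.SecId"),
}
required = {c for (_, c) in contains}
flow = ({("flow1", "Securities.ID")}, {("flow1", "Trades.SecId")})  # assumed

xchg = "Securities.Xchg"
c2 = ("Trades.TradeId", "Securities.ID")
c4 = ("Trades.TradeId", xchg)

baseline = derive_traces_to(contains, required, *flow)

# Change 1: drop the Xchg column, removing its node from the graph.
dropped = derive_traces_to(
    contains - {("Securities", xchg)}, required - {xchg}, *flow
)

# Change 2: keep the column but remove its NOT NULL constraint.
nullable = derive_traces_to(contains, required - {xchg}, *flow)

print(c4 in baseline, c4 in dropped, c4 in nullable)
```

The connection c4 is present only in the baseline. The connection c2, which does not depend on Xchg, survives both changes.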

Checking and Maintaining Requirements

Recall again that our informal reasoning about the traceability requirement relied on two additional assertions:

  • “Each row of Trades stores some trade”, and
  • “Securities.Xchg stores the exchange on which a security is traded”

These assertions refer to both discovered metadata and business concepts – trade and exchange – at which level the traceability requirement is specified. If we can weave these kinds of assertions into the graph, then we can write a very natural query that checks the traceability requirement exactly and maintains it as assets in the data landscape are changed or deleted.

Figure 7 shows how to do this. To reason with business concepts, we added each as a single node that represents that business concept distinct from its (possibly many) implementations in different data assets. We weave these business-concept nodes in with the discovered metadata using a new relationship called refines, whose connections map some Column to the business-concept node it refines.

Figure 7

For each business-concept node, we define a new concept to classify only that node[^footnote6].

Let’s now add support for checking and maintaining the requirement that every trade in a security is traceable to the exchange that trades in that security. The idea is to verify that every refinement of Trade traces to some refinement of Exchange. We can check this condition by posing the following query to the graph:

forall(t, r1: (Trade(t) and refines(r1, t)) implies
  exists(e, r2: Exchange(e) and
    refines(r2, e) and
    traces_to(r1, r2)))

which, if true, assures the data architect that her data landscape complies with the traceability requirement, assuming she has identified every refinement of Trade and connected it into the graph.

In this example, only one data asset refines Trade and only one asset refines Exchange, but the framework scales to more interesting data landscapes. The query above accommodates landscape models in which more than one refines connection targets the Exchange node or the Trade node. In a large investment bank, for instance, there could be 100 different tables that store trades, one per fund, each with a different schema. Each would have some means to identify trades, which the architect would assert refines the lone Trade node. If even one of these refinements does not trace to some column that refines the Exchange node, then the graph will answer this query with false.

She could also install a rule that looks for refinements of Trade that fail to trace to some refinement of Exchange:

def ExchangeTraceabilityViolator(r) {
  exists(t: Trade(t) and refines(r, t)) and
  not exists(e, r2: Exchange(e) and
    refines(r2, e) and
    traces_to(r, r2))
}

These violators demonstrate instances of policy non-compliance that should be investigated.
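The compliance check and the violator rule are two views of the same condition: the landscape complies exactly when the set of violators is empty. The Python sketch below emulates both. It assumes that Trades.TradeId refines Trade and Securities.Xchg refines Exchange, takes the derived traces_to connection between them as given, and adds a hypothetical second table `FundBTrades` whose trade identifier has no trace to any exchange column.

```python
# Each business concept classifies exactly one business-concept node.
Trade = {"Trade"}
Exchange = {"Exchange"}

# refines(column, business_concept_node). Which columns refine which concept
# is an assumption here; FundBTrades is a hypothetical additional table.
refines = {
    ("Trades.TradeId", "Trade"),
    ("FundBTrades.TradeId", "Trade"),
    ("Securities.Xchg", "Exchange"),
}

# Derived connections taken as given for this sketch.
traces_to = {("Trades.TradeId", "Securities.Xchg")}


def violators(trade, exchange, refines, traces_to):
    """Refinements of Trade that trace to no refinement of Exchange."""
    exchange_refinements = {r2 for (r2, e) in refines if e in exchange}
    return {
        r
        for (r, t) in refines
        if t in trade
        and not any((r, r2) in traces_to for r2 in exchange_refinements)
    }


def compliant(trade, exchange, refines, traces_to):
    """The forall check: true exactly when no refinement of Trade violates."""
    return not violators(trade, exchange, refines, traces_to)


print(violators(Trade, Exchange, refines, traces_to))
print(compliant(Trade, Exchange, refines, traces_to))
```

With the hypothetical FundBTrades table in place, the check fails and names FundBTrades.TradeId as the refinement to investigate. Removing that refinement, or deriving a trace from it to Securities.Xchg, makes the check pass again.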

Adding New Assets

This article demonstrated the details of weaving metadata into a knowledge graph to automate checking and maintenance of complex policy requirements. Because our focus was on the mechanics, we glossed over some important concerns in our example. For instance, not every refinement of a business concept will be a single column of a single table. What happens when what you want to weave in spans multiple columns, multiple tables, or manifests in different columns in different tables?

We also did not discuss how to use the knowledge graph to notify the data architect of the need to add new asserted metadata when some new asset like a totally new table that manages security trades gets added to the landscape. Nor did we discuss how a knowledge graph that weaves asserted with discovered metadata can be used to automate the suggestion of new assertions when new assets are added to the landscape.

The next articles in this series will tackle these issues in more detail and explain how to design asserted concepts and relationships to prevent these problems from happening.

[^footnote1]: Entity nodes have no values or additional structure “inside” them.

[^footnote2]: The first role in each contains connection must be played by a Table node.

[^footnote3]: Without the condition “c1 != c2” the result set would include connections like (i, i) and (x, x). More complex queries can be strung together using additional nesting of logical quantifiers (forall and exists) and logical connectives (and, or, and not).

[^footnote4]: For brevity, we elide the rule that defines RequiredColumn, but it classifies columns that are adorned by a NOT NULL or PRIMARY KEY constraint in their schema.

[^footnote5]: For brevity, we depict symmetric pairs of connections – e.g., (i, x) and also (x, i) – as one line segment rather than a pair of line segments with boxes at alternating ends.

[^footnote6]: This makes it possible to easily reference the business concept node in a query by “Exchange(e)” or “Trade(t)” as appropriate.