THIS POST IS ARCHIVED

Dovetail Join

At RelationalAI, we have built the first Relational Knowledge Graph System (RKGS).

One crucial part of our system is our dovetail join algorithm. This algorithm allows us to do a lot of complicated joins…. Extremely fast.

Why do you care about the Dovetail Join Algorithm? The dovetail algorithm is all about speed and it allows us to do more work, faster.

Dovetail Join requires fewer steps than traditional DBMS for complex queries. So it gets faster the harder or more complex the join is.

It also enables key technologies that allow us to speed up queries with high levels of optimization, memory reduction, and minimized code duplication.

Dovetail Join is a WCOJ (Worst Case Optimal Join) algorithm…. Which means we can mathematically prove that the more complicated the problem is, the faster we will go. Imagine working quality control on an assembly line. Instead of checking every single package or every other package, we skip most of the work and check every 100th package, without sacrificing any quality. To learn more about WCOJ, read the research paper here.

Our Dovetail Join is the latest in a lineage of radical new join algorithms pioneered by RAI researchers & published at top tier conferences such as SIGMOD, PODS, and ICDT.

Dovetail Join’s performance is what allows us to use a fully normalized data representation, Graph Normal Form, or GNF.

At RelationalAI, we believe that fully normalized relational data is the key to Knowledge Graph scalability and performance. By breaking down a wide table into several narrow tables, we save on space, and we enable semantic query optimization. For more details, see our shot on Graph Normal Form.

Semantic query optimization allows us to perform a number of transformations on our original query. Each transformation tries to refactor the query and reorder the way in which we calculate the answer. We find places to reuse work, which minimizes computation and speeds things up. For more details, see our RAI shot on semantic query optimization.

GNF enables semantic optimization, but having many narrow tables increases the number of joins. It may seem counterintuitive as to how we can go faster if we have to do more joins.

In order to understand how Dovetail Join works, let’s first consider what happens on a legacy database system.

Imagine we are joining 3 tables ( A, B and C ). Traditional relational DB systems join these tables in a sequential pairwise fashion. They join table A with table B, store the result in a temporary table, and then join that temp table with table C to get the results.

Notice that the temp table is often larger than the two input tables and it can be much larger. Not only that, but the temp table is larger than the results. You also have to store the intermediate table somewhere. This wastes memory, and for large joins the temporary table can be of non-trivial size.

Now imagine what this process looks like when you join 4 tables. Or 5 tables?! Typical analytics queries may require anywhere between 5 and 10 sequential table joins.

As a concrete example, let’s consider a triangle query and what it looks like to execute this query using a traditional pairwise join algorithm. Here we have 3 tables, R, S and T and would like to find which data is in all 3.

A traditional database begins processing this data by first joining tables R & S to create a temp table. Then it joins the temp table with T to find the answer.

It first creates the temp table by finding values in R that match values in S and emits the matching data.

It continues to do this for all of the data in the tables.

With the temp table created, it then performs a similar join of the temp table to the last table T to get our result. Notice that the temp table is larger than both the input & the output. Creating this table can require significant memory & time.

Now let’s consider our Dovetail Join. GNF tables each have a single value column. Instead of processing two tables at a time, we process all of the tables at once. If we are joining 3 tables or 30 this is true. Imagine how many temporary tables we avoid creating!

Because we are evaluating all of the tables at once, we avoid the need to create temp tables, which results in memory and time savings.

With Dovetail Join, we first iterate through all of the tables. We find where variables match, store a pointer to this location and iterate through the rest of the tables.

When there is a result, we write it to our output table.

Let’s walk through the rest of these tables. Instead of stopping after finding two tuples that join between the first two tables R & S, we just keep looking for matches to T.

This continues until we have processed all of the data.

You’ll notice that we never create a temporary table and we only write out the results. Most of the time a temporary table includes tuples that will never be in the resulting table. So that would be wasted work to generate the temp table in the first place.

This trick of rethinking how we join data allows us to process graph and analytic queries much faster than traditional systems. The Dovetail join algorithm makes GNF and large multi-way joins practical.