Stop Wasting Good Data: Reclaim Predictive Information With Knowledge Graphs
Have you ever said, "If I just had more data, I could make a better decision"? I can't count the number of times I've heard a data scientist say a variation of that when talking about improving predictions. The curious thing about that sentiment is that we usually have more data than we realize; the real problem is whether we can access that information and put it to use.
A few weeks ago, I gave a lightning talk on this topic at the Future of Data-Centric AI event hosted by Snorkel. My goal was to encourage people to stop wasting perfectly good data they already have in their machine learning workflows. The talk quickly covered why this data isn't used and how a knowledge graph can help people reclaim more predictive information.
There are two fundamental ways to improve ML results: we either improve the data or encode domain knowledge in the process. In data-centric AI, we're often trying to increase the quality and richness of our data to get better results, as opposed to just increasing the size of our dataset, which doesn't address garbage-in, garbage-out problems. Encoding knowledge ranges from simple data cleaning and sampling to more advanced feature engineering and weak supervision; there are many ways to capture it.
Steamrolled Data
So what's the first thing most data scientists do when we're getting ready to use all that data and knowledge to make a predictive model? We steamroll the data and force-fit it into ML systems that only handle flat structures. Think about all the organizational data we have where teams have put time and effort into figuring out dependencies, hierarchies, and weighted importance. Now we're flattening it all to fit into a feature matrix.
And not only do we remove that structural information, but most teams also go out of their way to toss out the relationships between entities themselves. When we translate data into a feature matrix and pluck out predictive elements (feature engineering), we usually assume each row is independent of the others. But in the real world, the opposite is true: things are highly related. For example, we can improve predictions about buying behavior when we understand relationships between things like previous purchases, browsing paths, social clubs, and reward programs. You don't want to lose this information.
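To make that concrete, here's a minimal sketch (all tables, column names, and values made up) of what typically survives the trip to a feature matrix and what gets left behind:

```python
import pandas as pd

# The flat feature matrix most pipelines train on: one row per customer,
# with each row treated as independent of every other row.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 52, 29],
    "lifetime_spend": [1200.0, 300.0, 950.0],
})

# The relationships that usually get dropped on the way to that matrix:
# referrals, co-purchases, shared reward programs, and so on.
relationships = pd.DataFrame({
    "source": [1, 1, 2],
    "target": [2, 3, 3],
    "relation": ["referred", "co_purchased", "same_rewards_program"],
})
```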
With a machine learning model, we are fundamentally trying to approximate something in the real world so we can make predictions about it. So if we start by tossing out data and flattening its meaning, we end up with a shallow model that isn't capable of rich or broad predictions.
Low-Hanging Fruit
Knowledge graphs are built by, and for the express purpose of, connecting data. That means knowledge graphs are pretty good at preserving relationships. When our data is connected via a knowledge graph, we can use graph feature engineering to translate complex relationships into a format ML systems can understand without losing significant meaning. Graph feature engineering encodes structural information, and you can use it right away alongside what you’re already doing in your ML pipeline.
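Here's a minimal sketch of what that can look like in practice, using networkx and a made-up relationship table (the column names and relations are invented for the example):

```python
import networkx as nx
import pandas as pd

# Hypothetical relationship data, e.g. edges exported from a knowledge graph.
edges = pd.DataFrame({
    "source": [1, 1, 2, 3],
    "target": [2, 3, 3, 1],
    "relation": ["referred", "co_purchased", "same_rewards_program", "co_purchased"],
})

# Build a graph and turn its structure into ordinary numeric columns.
G = nx.from_pandas_edgelist(edges, source="source", target="target")
pagerank = nx.pagerank(G)

graph_features = pd.DataFrame({
    "customer_id": sorted(G.nodes),
    "degree": [G.degree(n) for n in sorted(G.nodes)],
    "pagerank": [pagerank[n] for n in sorted(G.nodes)],
})

# Join these columns onto the flat feature matrix (keyed on customer_id) and
# the existing ML pipeline picks up the structural signal with no other changes.
print(graph_features)
```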
Now let's consider all of our business knowledge and how we might encode more of that logic as part of our models. A lot of people have been talking about semantic layers lately. It's an old term for mapping data to its business meaning. More recently, though, the term has been used in an extended way (sometimes as a 'semantic model') to focus equally on the broader logic surrounding the data and its meaning. (For example, not just that a number represents a discount, but also how it's calculated, when it can be applied, and perhaps which regulatory rules govern it.) People are using knowledge graphs today to bring this logic together with the data as a semantic layer or model to streamline the development of data apps. But it also lets us encode this logic for use in ML workflows.
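As a toy illustration of that discount example (every name and rule here is invented), this is what it means for "discount" to carry its own logic rather than just a value:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Order:
    subtotal: float
    customer_tier: str
    order_date: date

def loyalty_discount(order: Order) -> float:
    """How the discount is calculated: 5% for gold-tier customers."""
    return 0.05 * order.subtotal if order.customer_tier == "gold" else 0.0

def discount_applies(order: Order) -> bool:
    """When it can be applied: not during a (made-up) year-end freeze window."""
    return not (order.order_date.month == 12 and order.order_date.day > 15)

order = Order(subtotal=200.0, customer_tier="gold", order_date=date(2023, 11, 2))
effective_discount = loyalty_discount(order) if discount_applies(order) else 0.0
print(effective_discount)  # 10.0
```

The point isn't the Python; it's that the calculation, the applicability conditions, and the data all live together, so both data apps and ML features can draw on the same definition.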
The Logical Leap Forward
So now, if we capture relationships, semantics, and logic together with our data as a knowledge graph, what next? The logical leap forward is to incorporate this knowledge into a system that can seamlessly perform computation and reasoning over it, increasing its utility. That's where a relational knowledge graph system comes in: it brings concepts, their relationships, and the associated logic together in a single model that can execute that logic. Or, as my colleagues like to say, "the model is the program."
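To give a feel for that idea, here's a toy sketch in plain Python (not RelationalAI's actual language, and all the facts are made up): facts and a declarative rule sit side by side, and new facts are derived by executing the rule rather than being stored and maintained by hand.

```python
# Facts: direct "reports_to" relationships between (made-up) employees.
reports_to = {("ana", "bo"), ("bo", "cy"), ("cy", "dee")}

# Rule: manages(x, z) if reports_to(x, z), or reports_to(x, y) and manages(y, z).
# Evaluate it to a fixpoint, i.e. compute the transitive closure.
manages = set(reports_to)
changed = True
while changed:
    derived = {(x, z) for (x, y1) in manages for (y2, z) in reports_to if y1 == y2}
    changed = not derived <= manages
    manages |= derived

print(sorted(manages))
# Pairs like ("ana", "dee") were never stored; they come from executing the rule,
# and they stay current as the underlying facts change.
```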
So for those working on predictive models, this means RelationalAI not only enables you to uncover and use predictive data, but also to dynamically compute new or updated information, and to do so in a way that integrates with modern workstreams.
A lot of corporate data lives on the Modern Data Stack, in cloud data warehouses and data lakes. As a cloud-native relational knowledge graph, RelationalAI can ingest that corporate data and bring it together with semantics and executable knowledge. With RelationalAI, data engineers and data scientists can streamline complex data pipelines, encode predictive relationships, and conduct reasoning workloads.
In my talk, I also briefly discussed the Snorkel Drybell project, where Google fed organizational resources, including a knowledge graph, into Snorkel to improve predictions. They also used semantic categorization and business rules to capture domain expertise. A relational knowledge graph can bring together this kind of highly predictive data and logic and feed it into models. There's no reason to waste the valuable data you already have.