
Five AI Trends from the latest NeurIPS Conference

The Neural Information Processing Systems (NeurIPS) 2023 conference, one of the largest AI gatherings, showcases the current trends in AI. This year’s conference was no exception, with the dominance of Large Language Models (LLMs) on full display. While NeurIPS focused heavily on LLMs, it also indirectly highlighted significant developments from other AI conferences. To validate the trends, we spent over 100 hours researching keynotes, tutorials, workshops, and oral presentations to produce a summary that we feel represents the latest on the most dominant technologies of GenAI. Here’s an overview of the top five AI trends:

1. Open-source models make LLMs accessible to all

Researchers aim to reduce the cost and the massive resource requirements of training LLMs, making them more accessible to academia and smaller organizations. Open-source LLM projects like LLAMA-2 and LLM-360 aim to democratize access to these models. LLAMA-2 is “free” in the sense that its weights are publicly available, but the training data and the training process are not released. This is equivalent to distributing the binary of a program without its source code. The training (or, more precisely, the pretraining) of an LLM is like an experiment: it has to be documented thoroughly enough that others can reproduce it. LLM-360 is a project in this direction, and it is considered a true open-source project.

2. Data challenges are being solved with smaller, deeper datasets and knowledge graphs

The conference extensively discussed the data issue, highlighting a consensus that high-quality data such as books and curated publications is running scarce. The potential of alternative data sources like code and simulators, which operate as knowledge graphs (KGs), was also explored. These can significantly enhance performance, particularly in tasks involving reasoning, planning, and social/emotional understanding. KGs offer deductive reasoning: a web of logical rules, reactive and recursive, that can model any environment. Think of a huge Excel spreadsheet. A KG models the dependencies between different entities and values as a graph, and it can represent anything from differential equations to business rules.
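
As a toy illustration of this kind of rule-based deduction (a minimal sketch, not tied to any particular KG engine, with made-up entities and relations), a KG can be stored as a set of triples, and a recursive rule can then derive facts that were never stated explicitly:

```python
# A tiny knowledge graph as (subject, relation, object) triples.
# Entities and relations are made up purely for illustration.
facts = {
    ("alice", "reports_to", "bob"),
    ("bob",   "reports_to", "carol"),
    ("carol", "reports_to", "dana"),
}

def derive_managed_by(facts):
    """Apply two rules until nothing new can be derived:
       managed_by(x, y) :- reports_to(x, y)
       managed_by(x, z) :- reports_to(x, y), managed_by(y, z)
    """
    derived = {(x, "managed_by", y) for (x, r, y) in facts if r == "reports_to"}
    changed = True
    while changed:
        changed = False
        for (x, r, y) in facts:
            if r != "reports_to":
                continue
            for (a, _, z) in list(derived):
                if a == y and (x, "managed_by", z) not in derived:
                    derived.add((x, "managed_by", z))
                    changed = True
    return derived

print(sorted(derive_managed_by(facts)))
# alice is managed, directly or indirectly, by bob, carol, and dana.
```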

Researchers recently showed that tiny LLMs (a few million parameters) can perform very well if they are trained on the right dataset. Because they are so small, they can be trained in a few hours with modest hardware. Experienced data scientists can now experiment with crafting the training data and build a better understanding of how to prepare it, much as they did with feature engineering ten years ago. The rise of tiny LLMs offers a promising way to distill datasets.

An emerging trade-off exists between using vast amounts of data to train huge models automatically and reducing scale, which requires more investment in engineers and dataset curation but less in training. The smaller-scale route also reduces the carbon footprint, which is of great ethical value to researchers: in terms of carbon emissions, a good data scientist has a much smaller footprint than a cluster of GPUs that can consume a city’s worth of electricity for a month.

3. Software 3.0 paradigm: program smaller LLMs instead of training a huge LLM

The concept of Software 3.0 proposes a departure from training one huge LLM towards building larger models by combining smaller, specialized LLMs. The idea draws parallels to traditional software development, where complex projects are built from smaller components. Karpathy’s Software 2.0 paradigm, introduced in 2017, likened training deep learning models to compiling software. The approach suggests training numerous specialized LLMs and combining them using a task vector, similar to linking separately compiled source files in software development. A notable inspiration comes from Word2Vec (winner of the NeurIPS 2023 Test of Time award), which demonstrated that words can be represented as numerical vectors and that algebraic operations on those vectors capture semantic relationships. For example, take the vector for “King”, subtract the vector for “man”, and add the vector for “woman”: the result lands close to the vector for “Queen”.
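
To make the vector arithmetic concrete, here is a minimal self-contained sketch with hand-crafted toy vectors; real word2vec embeddings are learned from text and have hundreds of dimensions, so this is only an illustration of the idea:

```python
import numpy as np

# Toy 3-dimensional "word vectors", crafted only to illustrate the idea.
vectors = {
    #            [royalty, gender(+fem), commonness]
    "king":  np.array([0.9, -0.7, 0.1]),
    "queen": np.array([0.9,  0.7, 0.1]),
    "man":   np.array([0.1, -0.7, 0.8]),
    "woman": np.array([0.1,  0.7, 0.8]),
}

def nearest(v, vocab):
    """Return the word whose vector has the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cos(v, vocab[w]))

result = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(result, vectors))  # -> "queen"
```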

Similarly, recent findings show that LLMs can be combined by adding or subtracting their weights (their task vectors) to perform tasks like translation or filtering toxic language. Say we trained LLM A to translate from English to Greek and LLM B to translate from Greek to Italian. By adding A and B we get LLM C, which can now translate from English to Italian. One more example: we trained LLM A on customer service data and LLM B on a dataset with a lot of toxic language. By subtracting B from A we get LLM C, which handles customer service with less toxic language.
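
Here is a minimal sketch of that task-vector arithmetic, assuming all models share one base architecture so their parameters line up key by key; the weights below are placeholders, not real checkpoints:

```python
import numpy as np

# Placeholder weights standing in for real checkpoints of the same architecture.
base   = {"layer1": np.zeros((2, 2)), "bias": np.zeros(2)}
spec_a = {"layer1": np.full((2, 2), 0.3), "bias": np.full(2, 0.1)}  # e.g. English -> Greek
spec_b = {"layer1": np.full((2, 2), 0.5), "bias": np.full(2, 0.2)}  # e.g. Greek -> Italian

def task_vector(finetuned, base):
    """A task vector is the per-parameter difference: fine-tuned minus base."""
    return {k: finetuned[k] - base[k] for k in base}

def merge_with_task_vectors(base, *task_vectors, signs=None):
    """Add (or, with sign -1, subtract) task vectors from the base weights."""
    signs = signs or [1.0] * len(task_vectors)
    merged = {k: v.copy() for k, v in base.items()}
    for tv, s in zip(task_vectors, signs):
        for k in merged:
            merged[k] += s * tv[k]
    return merged

# "A plus B": compose two skills into one model.
combined = merge_with_task_vectors(base, task_vector(spec_a, base), task_vector(spec_b, base))
# "A minus B": keep one skill while removing another (e.g. toxic language).
filtered = merge_with_task_vectors(base, task_vector(spec_a, base), task_vector(spec_b, base),
                                   signs=[+1.0, -1.0])
```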

By training and combining hundreds of thousands of smaller LLMs, each specialized in a task and indexed with a task vector for faster retrieval, Software 3.0 aims to achieve more versatile and efficient models, mirroring the modular approach of traditional software development.

4. LLMs plus KGs equals both the creative and the rule follower

Despite recent advancements (ChatGPT scores in the 89th percentile on the SAT) and specialized models like MINERVA and LLEMA, it is impossible to move forward without theorem provers (TPs). TPs are systems for programmatically encoding proof tactics, axioms, and premises, and they can be used in law or any other domain that embodies reasoning. In that sense, modern KGs are general forms of theorem provers. What we saw at the conference is that TPs are the ultimate companion of LLMs when it comes to synthesizing new knowledge.

Here is how it works: extracting reasoning paths from unstructured text involves converting mathematical proofs expressed in free text and symbols into a structured “computer program” format. LLMs excel at this task, automating a process that was previously manual. This process essentially creates a KG for mathematics, capturing sequences of reasoning rather than just facts. Formal proofs extracted from this KG can be used to retrain LLMs, enabling them to reason and potentially prove new theorems.

By nature, an LLM is probabilistic, making it a creative system that can dream up new reasoning paths. But to produce valid outputs, we need the TPs (a.k.a. KGs) to determine whether the “dream” is valid or not.
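
A minimal sketch of this generate-then-verify loop is shown below; `propose_steps` stands in for an LLM call (its output is hard-coded here), and the checker is a toy stand-in for a real theorem prover or KG rule engine:

```python
# The probabilistic model "dreams" candidate reasoning steps and a symbolic
# checker keeps only the valid ones.
axioms = {("a", "implies", "b"), ("b", "implies", "c")}

def propose_steps():
    """Hypothetical LLM output: candidate inferences, some of them invalid."""
    return [("a", "implies", "c"),   # valid: follows by chaining the two axioms
            ("c", "implies", "a")]   # invalid: not derivable from the axioms

def is_valid(step, axioms):
    """Toy verifier: accept a step only if it is an axiom or follows by one
    application of the transitivity rule implies(x, y) & implies(y, z)."""
    if step in axioms:
        return True
    x, _, z = step
    return any((x, "implies", y) in axioms and (y, "implies", z) in axioms
               for (_, _, y) in axioms)

verified = [s for s in propose_steps() if is_valid(s, axioms)]
print(verified)  # only ("a", "implies", "c") survives
```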

5. Finally, LLMs can be pretrained on relational tables - meeting enterprises where their data sits

LLMs are trained on big text collections, but the majority of enterprise data sits in relational tables. A reasonable question is whether we can pretrain language models on relational tables.

When it comes to tables with numerical and categorical data, LLMs can be trained to do predictive tasks. An LLM trained on millions of tables can classify a feature vector instantly if we give it up to 10,000 training examples as context. This behavior generalizes to multiple tables that contain facts: LLMs can ingest the tables and predict facts that do not exist in the original tables. This has traditionally been done with Graph Neural Networks (GNNs), but transformers, originally designed for sequential data such as natural language, seem to be catching up.
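
One simple way to hand a table to an LLM is to serialize labeled rows into the prompt and ask the model to label a new row in context. The sketch below only builds such a prompt (the actual model call is left out), and the column names and values are made up:

```python
# In-context classification over a table: labeled rows are serialized into
# text and placed in the prompt, then the model is asked to label a new row.
columns = ["age", "plan", "monthly_spend"]
labeled_rows = [
    ((34, "basic",   20.0), "churned"),
    ((52, "premium", 89.0), "stayed"),
    ((23, "basic",   15.0), "churned"),
]
new_row = (41, "premium", 75.0)

def row_to_text(row):
    return ", ".join(f"{c}={v}" for c, v in zip(columns, row))

prompt_lines = ["Predict whether the customer churned or stayed.\n"]
for row, label in labeled_rows:                      # in-context "training" examples
    prompt_lines.append(f"{row_to_text(row)} -> {label}")
prompt_lines.append(f"{row_to_text(new_row)} -> ")   # the row to classify

prompt = "\n".join(prompt_lines)
# `prompt` would now be sent to an LLM; with enough such examples in context,
# the model acts like a classifier without any gradient updates.
print(prompt)
```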

The choice here is either to modify the architecture of general-purpose LLMs, which are trained mainly on text, so that they accommodate tables, or to find a different way to train general-purpose LLMs with tables represented in text form.

View the full presentation: “NeurIPS 2023 Trends in AI”.

About the Author

Nikolaos Vasiloglou

Nikolaos Vasiloglou is the VP of Research-ML at RelationalAI. He has spent his career building ML software and leading data science projects in retail, online advertising, and security. He is a member of the ICLR/ICML/NeurIPS/UAI/MLconf/KGC/IEEE S&P community, having served as an author, reviewer, and organizer of workshops and the main conferences. Nikolaos leads research and strategic initiatives at the intersection of Large Language Models and Knowledge Graphs for RelationalAI.

About RelationalAI

RelationalAI is the industry’s first AI coprocessor for data clouds and language models. Its groundbreaking relational knowledge graph system expands data clouds with integrated support for graph analytics, business rules, optimization, and other composite AI workloads, powering better business decisions. RelationalAI is cloud-native and built with the proven and trusted relational paradigm. These characteristics enable RelationalAI to seamlessly extend data clouds and empower you to implement intelligent applications with semantic layers on a data-centric foundation.
