Trends in Machine Learning: ICLR 2022

This was the 10th ICLR conference, marking the golden decade of deep learning and AI. Despite early predictions that the deep learning hype would be ephemeral, we are happy to see the field still growing while its algorithms and architectures mature. ICLR 2022 was full of exciting papers. Here at RelationalAI we spent ~100 hours going through the content, as we believe it will drive the commercialization of AI in the coming years. Below we present what we found to be the most noteworthy ideas.

Surprisingly, we didn’t see a test of time paper award, which is usually given to an influential paper published 10 years earlier. This is probably because deep learning grows so rapidly that architectures and algorithms become obsolete within short periods of time. Deep learning relies heavily on empirical results, where many assumptions need to be made. For that reason, papers usually include ablation studies to support the effectiveness of every component used. Despite these efforts, it is very common for different researchers to arrive at conflicting conclusions, and we often see the claims of one group contested with counter-experiments from other groups. As the area matures and we understand more of the fundamentals, we see fewer and fewer disagreements. Still, this year we spotted two papers that arrived at opposing results on Graph Neural Networks (GNNs) for Question Answering [1, 2].

The convolutional network has been one of the most stable architectures, and has remained the fundamental block of vision networks since the 1998 LeNet paper. Very recently, we have seen papers presenting alternative blocks. Also, ResNet, one of the most dominant architectures, seems to be slowly giving way to vision transformers. Despite their success in NLP, transformers haven’t yet matched the performance of older architectures in vision, and they are much more costly to train. This year we see papers demonstrating that vision transformers can eventually beat other architectures at a low cost if they are trained with the right self-supervision techniques and the right loss function.

Every couple of years we see papers that touch on the fundamentals of AI, such as gradient descent and sparsity, and that open new avenues for making AI more efficient. Sharpness Aware Minimization regularization seems to make training more stable and efficient, and we see its first win in vision transformers. The idea of mixing and shuffling the input is something we saw last year in the MLP-Mixer paper and in the ICLR 2021 DeLighT paper, and it seems to be here to stay. It is very impressive, though, how die-hard the traditional fully connected layer is: we see papers in which it outperforms newer architectures, such as convolutional layers and even transformers.
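
To make the Sharpness Aware Minimization idea concrete, here is a minimal sketch of a single SAM update in plain Python. It first moves the weights a distance rho along the normalized gradient (toward higher loss), then applies the gradient measured at that perturbed point to the original weights. The toy quadratic loss, the CURVATURE values, and the names sam_step, lr, and rho are our own illustrative assumptions, not taken from any of the papers.

```python
import math

# Toy problem (assumption): a quadratic bowl with different curvature per axis.
CURVATURE = [10.0, 1.0]


def loss(w):
    """Stand-in for a real training loss."""
    return 0.5 * sum(c * x * x for c, x in zip(CURVATURE, w))


def grad(w):
    """Gradient of the toy loss with respect to the weights."""
    return [c * x for c, x in zip(CURVATURE, w)]


def sam_step(w, lr=0.05, rho=0.05):
    """One SAM-style update: perturb toward higher loss, then descend."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point, nothing to do
    # Step 1: approximate the worst-case point inside a ball of radius rho.
    w_adv = [x + rho * gx / norm for x, gx in zip(w, g)]
    # Step 2: take the gradient there and apply it to the ORIGINAL weights.
    g_adv = grad(w_adv)
    return [x - lr * gx for x, gx in zip(w, g_adv)]


w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w)
final_loss = loss(w)
```

On this toy problem the weights settle in a small neighborhood of the minimum rather than exactly on it, because the perturbation radius rho stays fixed throughout training.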

The lottery ticket hypothesis seems to be gaining maturity. The associated technique sparsifies networks after training and has been very fragile and unstable, but we are starting to see the first signs of stabilization. While post-training sparsification saves computation during inference, training remains an energy-hungry process. The focus has now shifted towards sparsifying the network before training.
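
As a rough illustration of what post-training sparsification means, here is a minimal sketch of magnitude pruning in plain Python: the weights with the smallest absolute values are zeroed out, and a binary mask records which connections survive. We assume magnitude pruning only because it is a common baseline; the function name magnitude_prune, the sparsity parameter, and the example weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.

    Returns the pruned weights and a 0/1 mask marking the surviving entries.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_drop = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:num_to_drop])
    mask = [0 if i in dropped else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical trained weights of a single layer, flattened into a list.
weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4]
pruned, mask = magnitude_prune(weights, sparsity=0.5)
```

The mask is what a lottery-ticket style experiment would carry forward: the same sparse pattern can be reapplied to a freshly reset network to check whether the subnetwork trains well on its own.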

Something really notable is the increase in the number of papers on drug design, protein folding, and other life sciences problems. This research has been sparked by the COVID-19 pandemic and by the recent advances from DeepMind. The papers focus on interesting problems that are simpler than the protein folding problem and require fewer computational resources.

Self-supervision dominates general interest, and we see a lot of papers and interesting results. We pretty much know how to apply it to any type of data and any type of task. New self-supervision results in tough domains such as theorem proving show a 50% lift in performance [3]. Although self-supervision became popular through transformer-based language models, it is now an independent field, growing on its own.

Finally, language models remain the queen of the conference, with several papers that try to improve the efficiency of transformers and thereby reduce their carbon footprint.

Question Answering stands as the best application of transformers, and here too we found some interesting papers. QA works on text, on knowledge graphs, or on both. When it comes to knowledge graphs, research is focusing on reasoning. If we had to pick our favorite paper, we would pick the R5 paper in the area of QA over knowledge graphs. It is a beautiful soup of relational, reinforcement, recurrent, rules, and reasoning ideas (R5)!

At this year’s ICLR, we really enjoyed the blog track. It was very hard for us to pick a subset of them; we believe that all 20 of them are brilliant and very much worth reading. Compared to PDFs, blogs offer a much richer platform to express ideas, and most importantly they surface aspects of research in a horizontal way. It is a great opportunity to revisit groups of papers and delve into the details.

A few words about the methodology behind this presentation. We scanned the titles of all 1095 accepted papers and created a list of 155 papers that we thought were interesting. Our interests are biased toward language models, self-supervision, sparsity, applications in the sciences, and fundamentals. We didn’t go in-depth in areas such as diffusion networks, reinforcement learning, and vision. We did, however, pick papers that combine vision with transformers and reinforcement learning.

In the second round, we grouped them by area. In the third round, we went through the abstracts and shortlisted papers. In the last round, we picked the key points from the papers that we found most groundbreaking and, most importantly, that had good textual or visual summaries of the findings and the mechanics. We are happy to see that more authors put effort into communicating their message in a clear and concise manner, respecting the information overload of their readers.

If we felt that a paper was building on prior work that needed to be presented, we included an analysis of the cited paper. Preparing this presentation took about 100 hours of work. Every conference generates so much content that we believe in the future it will be impossible to create a summary like this manually; it will have to be done algorithmically, based on papers published at the conference!

We hope you enjoy it. You can find a link to our slide presentation here, along with more detailed video analysis in part 1 and part 2.