Identifying the Building Blocks of Attention in Deep Learning
With ChatGPT all over the news right now, many are focused on the applications of artificial intelligence (AI) and deep learning — including Pierre Baldi, long an expert in the field. Yet the Distinguished Professor of computer science in UC Irvine’s Donald Bren School of Information and Computer Sciences (ICS) is also closely examining the theoretical side of AI. His recently co-authored paper in the journal Artificial Intelligence, “The Quarks of Attention: Structure and Capacity of Neural Attention Building Blocks,” is one of the first theoretical papers on the topic of attention, which plays a big role in today’s AI applications.
“Attention mechanisms are being used in something called ‘transformer architectures,’ which are at the heart of all current natural language processing (NLP) systems and large language models (LLMs), including GPT-3,” explains Baldi. These mechanisms, however, which are loosely inspired by the brain and its ability to shift our attention to different items, aren’t well understood. “We have all these systems, but we don’t exactly understand how they work, so there are a lot of interesting theoretical questions.”
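For readers unfamiliar with the mechanism Baldi describes, the attention used in transformer architectures is typically the scaled dot-product variant from the original transformer work. The sketch below is a minimal NumPy illustration of that standard formulation only; it is not the specific building blocks analyzed in the Baldi-Vershynin paper, and the array sizes are arbitrary toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Standard transformer attention: each query attends to all keys,
    # and the resulting weights mix the corresponding values.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity between queries and keys
    weights = softmax(scores, axis=-1)  # attention weights sum to 1 per query
    return weights @ V                  # weighted combination of values

# Toy example (hypothetical sizes): 4 query positions, 6 key/value positions, dimension 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

In this standard setup, the softmax weights let each position selectively "attend" to the most relevant other positions, which is the shifting-of-focus behavior Baldi alludes to.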
One such question is, “What are the building blocks of attention mechanisms within deep learning?” To answer this, Baldi partnered with Mathematics Professor Roman Vershynin of the UCI School of Physical Sciences.
Developing a Theory of Attention
“Our starting point was, what are these attention mechanisms?” asks Baldi. “We wanted to understand them from first principles, and then start building a mathematical theory, moving toward transformers.”
To better understand attention at the computational level, Baldi and Vershynin focused on artificial neural networks. As outlined in the paper, they “study [attention] within the simplified framework of artificial neural networks and deep learning by first identifying its most fundamental building blocks or quarks, using a physics-inspired terminology, and then rigorously analyzing some of their computational properties.” Using artificial neural networks helped the authors “avoid getting bogged down by the complexity of biological systems” and “provide a systematic treatment of attention mechanisms in deep learning.”
Among other results, the paper proves that attention mechanisms are computationally efficient.
The Mathematical Side of AI
This is not the first time that Baldi and Vershynin have collaborated on research. “I’ve actually worked with Roman for several years,” says Baldi, recalling how the two had gone to lunch when Vershynin was first considering coming to UCI. “I was telling him about deep learning and AI, and how exciting they are, and I suggested we do some mathematics research together.” Vershynin asked how that would work.
“If you come to UCI,” responded Baldi, “I’ll show you what the mathematical problems are!” That was more than five years ago.
“I’m very excited that we have this collaboration with the math department here,” says Baldi. “There are many areas of computer science where mathematics plays an essential role, but not many places have close collaboration between computer science and mathematics in the area of AI.”
— Shani Murray