Like many people, I’m excited by foundation models, which feel different from the last generation’s machine learning systems, so when I say AI I just mean this new breed of foundation models with in-context learning and the ability to generate fluent text. One question I think about is:
Is AI rare or everywhere?
I’m saying this in an intentionally annoying way. What I mean is: foundation models are remarkable, and we’ve found a particular recipe to create them. Is this recipe necessary? Which portions are necessary? Is this recipe robust to perturbations?
Robustness across Groups: An Anti-replication crisis?
One could worry that only a few groups would be able to build foundation models due to the sheer number of details you have to get right. Instead, almost all groups reach very similar performance on an appropriate timeline. To be clear, there are leaders and laggards, but the gaps between them close relatively quickly! The open-source community replicates and improves these models. This suggests that our current recipe is more robust than we had any reason to hope! It could have been worse: we’re pretty far from the kind of replication crisis that has dogged other areas of AI and science more broadly. If anything, we’re worried these models are too easy to replicate, which is kind of amazing in this light!
Rare or many paths to AI?
Another direction is interesting to me: how unique is the recipe? Maybe we got the cookbook right as a community, but there was really only one critical ingredient in the long list. Seems fun to figure out, and to me, it’s almost a philosophical question: is AI rare? What are the most basic conditions for this generation of AI to work? These are questions that feel pressing for research but I don’t hear discussion about them. Partly I think it’s hard to make these questions concrete. Our first attempt at this was: Do we need transformers? They have been the workhorse and viewed as critical to the recipe, but how special is this architecture for obtaining state-of-the-art AI performance?
Can you achieve foundation model performance with an entirely new recipe? Practically, transformers are amazing, but they do have limitations (runtime that is quadratic in sequence length, and a heuristic underpinning). But more interesting to me is: does another path exist? And could that other path build on established theoretical areas? The reason isn’t academic snobbery; it’s about relating to previous work to get help! Can we tie in the results of a huge number of previous communities to make faster progress on our problem in AI?
The answer seems to be yes! Hyena shows that an entirely attention-free architecture can succeed on language. This builds on the work of a huge number of people: Albert Gu’s S4 work, David Romero’s CKConv, BlinkDL’s RWKV. We based it on signal processing, and the work is now led by Michael Poli and Stefano Massaroli. Conceptually, this is remarkable to me. It means we could have found this behavior through many means, and that’s in some sense shocking! My sense is that there will be dozens of simplifications of this architecture that will lead to a deeper understanding of AI and more widespread use. This is not a problem that requires chasing leaderboards; it requires depth of understanding. Academia can play a huge role in simplifying here!
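To make the signal-processing connection a bit more concrete, here is a minimal sketch of the kind of primitive that attention-free sequence models can lean on: a causal convolution whose filter is as long as the sequence, computed with the FFT. This is not the Hyena operator itself (which adds more, such as gating and a learned parameterization of the filters), and the function names `fft`, `causal_conv_fft`, and `causal_conv_direct` are made up for this illustration. The point is only that the direct sum costs on the order of n^2 operations for a length-n filter, while the FFT route costs on the order of n log n.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_conv_fft(signal, kernel):
    """y[t] = sum_{s <= t} kernel[s] * signal[t - s], computed via FFT.

    Assumes len(kernel) <= len(signal). Zero-padding to at least twice the
    sequence length keeps the circular convolution from wrapping around.
    """
    n = len(signal)
    size = 1
    while size < 2 * n:
        size *= 2
    fs = fft([complex(v) for v in signal] + [0j] * (size - n))
    fk = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    product = [a * b for a, b in zip(fs, fk)]
    y = fft(product, inverse=True)
    # The inverse transform above is unnormalized, so divide by size here.
    return [v.real / size for v in y[:n]]


def causal_conv_direct(signal, kernel):
    """Same convolution by the direct sum: quadratic when the kernel is long."""
    n = len(signal)
    return [
        sum(kernel[s] * signal[t - s] for s in range(min(t, len(kernel) - 1) + 1))
        for t in range(n)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]
    kernel = [0.5, 0.25, 0.125, 0.0625]  # a toy decaying filter
    print(causal_conv_direct(signal, kernel))
    print(causal_conv_fft(signal, kernel))
```

Both functions should agree up to floating-point error; in practice one would reach for an optimized FFT library rather than this recursive toy.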
Consequences?
This feels like it has pretty profound consequences for our research trajectory. I haven’t worked out the consequences, and I wanted to write this to embarrass my grad students into telling me their ideas while they wonder what their advisor does all day...
- New Architectures Abound: If AI is very common, why stick to the current architectures? Transformers are great, but they’re not the only choice. Why not look for architectures that make hardware more energy efficient? Networking more efficient? What makes an architecture performant? Are there pieces of the architecture that enable specific behaviors but not others? If we can understand these questions, we could build new architectures with properties that we like!
- Sample Complexity: What is the effect on sample complexity? Are there architectures that are orders of magnitude more sample efficient? Our scaling laws are pretty bad! Look at Llama, which trains on 1 trillion tokens at 7 billion parameters.
- Test-Time Compute: We’ve moved into an age of limited test-time inference, which is not how machine learning worked a decade ago. Can we do richer inference at test time to obviate the need for such large models? What is the tradeoff? What can we do here?
- Trust and Safety: Crudely, maybe AI is dangerous like gunpowder and not like nuclear weapons. That is, if it can be mass produced all over the globe, a very different regulatory idea may be required. For nuclear weapons, we restrict a key reagent like plutonium; that approach may not work with AI.
- Openness: AI has benefited from openness. Many of the models run on the same underlying infrastructure (e.g., we’re biased in that we see Flash Attention by Tri Dao used in every one of these models), and we all use basically the same tools. If AI is all around us, the big companies’ attempts to tamp it down by “raising the drawbridges” around FMs may be wrongheaded, as it outbreeds them in simpler ways.
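To put the Llama figure from the sample-complexity bullet in perspective, here is a small back-of-the-envelope sketch. The token and parameter counts are the ones quoted above. The compute estimate relies on an assumption that is not from this post: the common rule of thumb of roughly 6 FLOPs per parameter per training token for dense transformers. The function names are hypothetical.

```python
# Figures quoted in the post: Llama at 7 billion parameters, 1 trillion tokens.
LLAMA_PARAMS = 7e9
LLAMA_TOKENS = 1e12


def tokens_per_parameter(tokens, params):
    """How many training tokens the model sees per parameter."""
    return tokens / params


def approx_training_flops(tokens, params):
    """Rough training compute.

    Assumption (not from the post): about 6 FLOPs per parameter per token,
    a widely used approximation for dense transformer training.
    """
    return 6 * params * tokens


if __name__ == "__main__":
    ratio = tokens_per_parameter(LLAMA_TOKENS, LLAMA_PARAMS)
    flops = approx_training_flops(LLAMA_TOKENS, LLAMA_PARAMS)
    print(f"tokens per parameter: {ratio:.1f}")
    print(f"approximate training FLOPs: {flops:.2e}")
```

Under these numbers the model sees on the order of a hundred-plus tokens for every parameter, which is the sense in which one might hope for architectures that are far more sample efficient.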
Again, who knows? I’m trying to understand why I’m so confused, and writing helps me understand my own misconceptions. I appeal to a maxim for lectures: at least one person should enjoy a lecture, and it’s fine (and easiest to ensure) if it’s the lecturer.
Acknowledgements
Thanks to Michael Poli, Avanika Narayan, and Dan Fu for their comments and contributions to this post.