Zoology (Blogpost 0): Overview

Simran Arora*, Michael Zhang*, Sabri Eyuboglu*.

Large language models have taken the world by storm, but their immense training and inference costs have catalyzed interest in improving their latency, throughput, and memory usage. These efforts fall into two broad categories: (1) making the de facto language modeling architecture, the Transformer, more efficient (e.g., FlashAttention, PagedAttention, and speculative decoding) and (2) developing new architectures with better asymptotic scaling in sequence length than Transformers (e.g., approximate attention methods, S4, Liquid S4, MEGA, GSS, H3, BiGS, Hyena, S5, RWKV, RetNet, Monarch Mixer, Mamba, and many more). 

Recent work suggests that some of these new architectures can match attention in quality, so we set out to better understand the innovations behind them. We were initially motivated to apply efficient architectures to long-sequence reasoning and high-throughput data management applications, which the three of us have cared about for a while, and we asked whether switching away from our familiar friend, the Transformer, would lead to any unexpected changes in modeling behavior. During this exploration, we learned a lot about the strengths and weaknesses of the different classes of recent architectures. These lessons motivated the development of Based, a new sub-quadratic architecture that addresses the limitations we observed in existing efficient models. Language modeling is an expensive and widespread investment across industries, so it’s important to have a solid understanding when deciding which architecture to use. We hope these blogposts help you better navigate the new wave of efficient models!

Overview of What We’re Sharing Today

We’re splitting the story into a two-part blogpost. The first part covers our findings on the strengths and weaknesses of promising recent efficient architectures. The second part introduces our new architecture and evaluations. We’ve included training results for all models in this study in this WandB report if you’d like to follow along! 

Here we’ll give a brief overview of each blogpost, found at the following links:

Blogpost Part 1: Zoology Analysis. 

In our work, we benchmark a broad suite of popular sub-quadratic Transformer alternatives, collectively referred to as “gated convolutions” (e.g., Hyena, H3, RWKV). Despite performing competitively on overall perplexity, these sub-quadratic architectures perform much worse than Transformers at a specific task called "associative recall" (AR). In fact, we find that AR accounts for >82% of the overall perplexity gap to attention.
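To make the task concrete, here’s a minimal sketch of what a synthetic associative recall prompt looks like. This is our own toy illustration (the helper name and token format are hypothetical, not the exact Zoology benchmark): the input interleaves key-value pairs, then presents a query key, and the model must emit the value that was bound to that key earlier in the sequence.

```python
import random

def make_ar_example(num_pairs=4, seed=0):
    """Build one toy associative recall example: key-value pairs, then a query key."""
    rng = random.Random(seed)
    keys = rng.sample(list("abcdefghij"), num_pairs)   # distinct key tokens
    values = [rng.choice("0123456789") for _ in keys]  # one value token per key
    query = rng.choice(keys)                           # key the model must look up
    answer = values[keys.index(query)]                 # value it should recall
    # Input sequence: interleaved key/value tokens, followed by the query key.
    tokens = [tok for kv in zip(keys, values) for tok in kv] + [query]
    return tokens, answer

tokens, answer = make_ar_example()
print(" ".join(tokens), "->", answer)
# e.g. "d 7 a 2 f 9 c 1 a -> 2": the model must recall the value paired with "a".
```

Solving this requires the model to retrieve information from an arbitrary earlier position in its context, which attention does natively via token-to-token comparisons.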

Although this might sound like an esoteric task, AR has a long history in machine learning, and prior work shows that the ability to solve AR tasks is highly correlated with enticing capabilities like in-context learning. In our Zoology work, we further show where AR shows up in natural language distributions: for example, correctly predicting a person’s surname the second time their name appears in a document requires recalling the earlier mention. This AR quality gap casts doubt on how well these models may actually perform as Transformer replacements once they are scaled up and evaluated beyond next-token prediction. 

We explain why the gap occurs, showing that gated convolution models actually need more dimensionality than attention to solve associative recall. This is clearly undesirable; so is attention really what we need? Perhaps not! We shed light on what’s needed to close the gap to attention. 

Blogpost Part 2: Efficiently Closing the Associative Recall Gap with New Based Architectures. We develop Based, an architecture that efficiently solves associative recall while remaining attention-free and sub-quadratic in sequence length. Based outperforms strong attention baselines (Llama-style Transformers) in throughput and quality on the Pile! It is designed with one goal in mind: what’s the simplest model that gets us the quality we need? It’s built from familiar architectural primitives. We’re sharing a preview today and are currently scaling up the Based architectures. As always, we’d love your feedback as we build on this work! You can also follow along with the preliminary quality results for our model, compared to prior gated convolutions and the concurrent Mamba architecture, in this WandB report.