Jun 8, 2025 · 12 min read

Cartridges: Storing long contexts in tiny caches with self-study

Sabri Eyuboglu, Ryan Ehrlich, Simran Arora

TL;DR When we put lots of text (e.g. a whole code repo) into a language model’s context, generation cost soars because of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by $26\times$ while maintaining quality. These smaller KV caches, which we call cartridges, can be trained once and reused for different user requests.

🐦Tweet
📄Paper
💻GitHub

Full team: Sabri Eyuboglu*, Ryan Ehrlich*, Simran Arora*, Neel Guha, Dylan Zinsley, Emily Liu, Will Tennien, Atri Rudra, James Zou, Azalia Mirhoseini, Chris Ré

Serving distinct long contexts for many different users is slow and expensive. This is due in large part to enormous per-user KV caches. In the literature, there are lots of techniques that reduce this memory consumption by making architectural modifications (e.g. linear attention) or applying KV cache compression methods (e.g. Duo Attention). But, most of them face tough quality-memory tradeoffs.

We were motivated by recent results in test time scaling and training to explore whether we can reduce memory consumption by training the KV cache on the long context offline. In other words, maybe we can trade off memory for offline compute, while maintaining quality.

Here’s the main idea: instead of creating a KV cache by running a single forward pass on the context, we train a smaller KV cache, which we call a cartridge, with gradient descent by back-propagating loss (e.g. next-token prediction) into the key and value vectors (equivalent to prefix tuning). Users often make many requests referencing the same context, so the cartridge can be trained once and then loaded and reused whenever needed.

To get this to work, we found that we couldn’t just train with next token prediction on the context. Instead, in a process we call self-study, we generate synthetic training data by having the model quiz itself about the context.

Cartridges trained with self-study deliver the quality of a regular KV cache while using $38.6\times$ less memory and enabling $26.4\times$ higher peak throughput when serving many users with different long contexts. We show they can also be used to extend the effective context length of a model 4x and can be combined without retraining.

Background

Frontier LLMs support extremely long context lengths. Gemini 2.5 ships with a 1 million token context window (link) and the Llama 4 supports an eincredible 10 million token context length (link). To put this in perspective, the entire NumPy Github repo is only 9.3 million tokens.

However, actually serving long-context language models for billions of users is extremely costly. Let’s assume, for example, that we want to deploy a chatbot using Llama 8B. If each of the users has 1024 tokens in context, with a single H100, we can output approximately 10,000 tokens per second across the users. But, if the users increase their context length to 128k tokens (e.g. by dumping a full code repository into the context), we can now only serve ~130 tokens per second across the users — a 77x reduction in throughput!

KV cache size vs. peak throughput when serving many users on an single H100.

It turns out that the slowdown can be traced almost entirely to a single factor: the size of the KV cache. When we process a long prompt with an autoregressive Transformer, we keep around the key and value vectors for the prompt tokens so we don’t have to recompute them for each new token we generate (here’s a blogpost from HuggingFace on the subject that explains the idea). The resulting object is enormous: the KV-cache representing a 128k token context for a single user consumes 84 GB of memory with Llama 70B! To put this in perspective, the entirety of the model’s parameters only consume about ~140GB.

Memory-quality tradeoffs

There are a number of different ways that folks try to limit the size of the KV-cache. Practitioners rely heavily on ad hoc techniques that sidestep putting the entire context into the model with clever use of chunking, summarization, and retrieval (think: tool use in an IDE like Cursor). On the research side, there’s a lot of great work exploring ways to compress the KV-cache (e.g. by dropping tokens or identifying structure in the keys and values). In our research, we’ve spent a lot of time trying to understand architectures (e.g. state space models and linear attention) which replace the KV-cache with something smaller.

We have found it hard to reduce the size of the KV-cache without trading something off. Most existing approaches trade off response quality. For example, in prior work, we showed that the reduced memory consumption of linear attention and state-space models comes with reduced performance on challenging tasks that require recalling details from the context.

The trouble is that for the long-context applications we discussed at the outset (e.g. asking questions about a patient’s full medical record), this isn’t a tradeoff we really want to make.

Trading memory for offline compute

Instead of trading quality for decreased memory consumption, maybe we can spend more FLOPs offline to reduce memory consumption instead. The approaches discussed above (cache compression and linear attention) spend the same or only slightly more FLOPs in constructing the KV-cache than we would in standard prefill. What if we constructed a KV-cache using orders-of-magnitude more FLOPs?

There’s a simple reason this tradeoff would be more desirable: the FLOP cost of generating a KV cache for one context (e.g. a codebase) can be amortized across the many user queries that reference it. In practice, users often issue many queries against the same context. Consider a team of engineers working on the same codebase or medical professionals referencing the same patient record. Moreover, because contexts are often static or change slowly, the KV-caches could be constructed offline when compute is cheap (in a form of sleep time compute).

Training a smaller KV-cache

Gradient descent is one FLOP-intensive algorithm that is tried and true: during pre-training we use it to stuff much of the internet into the parameters of Llama 70B (approx. 140 GB). Maybe we can use it to store a single code repository into a KV-cache smaller than 84GB. [1]

Prefix tuning gives us a natural way to apply gradient descent in this setting. Specifically, we are going to train a fixed-size KV cache by back-propagating a next token prediction loss into the key and value vectors.

After training, the key and value vectors contain information from the context. They can be trained once and then loaded back into a language model’s KV-cache whenever relevant. To us, this was reminiscent of popping games in and out of a retro game console, so we call KV-caches trained on a specific context cartridges. [2]

A cartridge is just a KV cache, so it can be plugged into existing LLM inference servers (e.g. VLLM) out of the box. These servers already handle per-user KV caches efficiently, so cartridges can be served at high throughput like any other prefix. Unlike LoRA, no special infrastructure is needed. [3]

The challenge of maintaining generality

When we first tried training a Cartridge with next token prediction like above, we saw something exciting: we could achieve close to zero loss on a 128k token context (a financial document) with a $125\times$ smaller cache. The cache had completely memorized the document — if we prompted it to complete an arbitrary passage from the document, it could!

Response quality, measured with log(perplexity), on diverse queries related to a single long financial document.

Too good to be true? You bet. We quickly realized that our cartridge lacked the generality we’ve come to expect from language models. A single KV cache —produced with standard prefill — can typically support diverse user interactions: from factual questions to writing poems about the context. This is one of the most mind-blowing capabilities of autoregressive Transformers. In contrast, our Cartridge could regurgitate the document, but do basically nothing else.

Perhaps this isn’t so surprising. LLM folk knowledge tells us that fine-tuning is best for specialization….

Improving generality with self-study

When we trained the cartridge with next-token prediction on the context, we ran many epochs over the context. This means we saw the same exact text over and over again, encouraging rote memorization.

Here is a simple idea: what if we instead train the cartridge on synthetic conversations about the context. This way, we never have to see the same exact text twice during training. To generate a synthetic conversation, we simply grab a chunk of the context and prompt the language model to start a conversation about it (e.g. by asking a question). We then prompt the same language model, also with the chunk in context, to respond.

Once the conversations are generated, we train on them using context-distillation (see Kujanpaa *et al.* and Snell *et al.).*

We call this training recipe — the combination of synthetic conversation generation and context-distillation — self-study. We got really excited when we trained a cartridge with self-study and saw strong performance on a very broad range of user interactions!

Pushing the memory-quality frontier!

Accuracy on the LongHealth benchmark for varying KV cache sizes.

Cartridges expand the memory-quality frontier dramatically. On LongHealth — a benchmark consisting of challenging questions that require reasoning about the whole patient record — a 0.96 GB cartridge achieved an accuracy of 55.1% outperforming the full ICL baseline. This represents a $13.8\times$ reduction in cache size.

Even more surprising to us was that a 52 MB cartridge ( $256\times$ compression and $121\times$ throughput increase!!!) achieved an accuracy of 47.7%. To put this in perspective, Duo Attention , a state of the art cache compression algorithm gets 43.0% accuracy at 6.77 GB (a 2x compression).

Context length extension!

Unlike long context models, self-study has no hard cap on context length! We demonstrated this on Machine-translation from one book (MTOB) a really fascinating benchmark in which the model must learn to translate from Kalamang — a low-resource language with almost no web presence — using only a grammar textbook on the language. Here, the full context is the 484k token textbook and an accompanying vocab list.

Performance on the MTOB benchmark for varying KV cache sizes.

We train a cartridge for Llama 3.1 8B using self-study on the textbook. Note that Llama 3.1 only has a context length of 128k tokens, but we can apply self-study to the full 484k token context. We achieve a 36.1 ChRF (a metric related to BLEU score) when using a 0.54 GB cache size.

While working on this project, Llama 4 came out, which supports context lengths of 10 million tokens. So, we also compare against Llama 4 Scout — a 109 billion parameter model with $13 \times$ more parameters than the model for which we trained the cartridge. Llama 4 Scout on the full 400k token textbook only achieves a ChRF of 36.3 (only 0.2 points better than our Cartridge on 8B)!

Looking forward

We’re really excited about this direction and there’s a lot left to do and understand.

Theoretical underpinnings. It isn’t obvious why Cartridges are so much more memory efficient than standard KV caches. In our paper, we provide a preliminary theoretical analysis on a simple variant of associative recall where keys and values are repeated in the sequence. We provide a separation result showing that gradient descent is able to solve the task with less memory consumption than attention or linear attention can. There is much more to unpack here, but this result is a first step towards understanding the expressive

Relationship to new architectures. Our theoretical analysis also points to connections between Cartridges and new architectures that incorporate gradient descent-like memory updates (e.g. Test-time training, Titans, Atlas, DeltaNet). In our work, we apply synthetic data generation to get around the data inefficiency of gradient descent. A different approach inspired by these architectures would be to meta-learn update rules that are more data efficient.

Speeding up training. In this paper, we did very little to optimize the speed with which cartridges are trained, so there’s lots left to do: techniques for improving data efficiency, better kernels for training Cartridges (e.g. FlashInfer-like kernels with a backward), and better initialization strategies.

Cumulative topics and online corpora. In our work, the conversations were generated independently of the cartridge. What if we generated conversations with the cartridge during training. Could this help on cumulative topics? More to come here 🙂

Reach out if you’d like to get involved!

Acknowledgements

There are tons of people and organizations who have supported this project. Below we shout out a few, but check out the the paper for a full list.

The compute for this project was provided by Modal — who made it super easy to scale out horizontally when running the synthetic data generation for self-study — and Together — who provided the compute for training the Cartridges on the synthetic data. Prime Intellect, Voltage Park, and Azure through the HAI Grants program also contributed compute towards this project.

Thank you to Ben Spector, Geoff Angus, Jordan Juravsky for helpful feedback on this blogpost.

Footnotes

[1] There are probably other ways to do this too!! For example, we could develop looped architectures that spend variable amounts of compute on prefill. But unlike architectural modifications, the techniques we explore can be used instantly with pretrained Transformers.

[2] The approach we described is essentially equivalent to prefix tuning. The original prefix-tuning paper was not focused on inference efficiency and does not mention the KV-cache. They study prefix-tuning as an alternative to full fine-tuning on task specific datasets. These differences were significant enough we felt they warranted a new name.

[3] In our paper, we explore other advantages of prefix tuning over LoRA — it turns out performance on queries unrelated to the context (e.g. MMLU) are far worse with LORA.