Announcing LoCoV1 and the Latest M2-BERT Models

Jon Saad-Falcon, Dan Fu, Simran Arora

We are thrilled to announce a new long-context retrieval benchmark, LoCoV1, as well as our newest long-context M2-BERT models! After hearing from the community about where long-context retrieval could be most useful, we are building on our earlier LoCoV0 benchmark and M2-BERT models with the release of:

  • LoCoV1: An expanded long-context benchmark consisting of 12 tasks drawn from law, medicine, science, finance, corporate governance, government reports, and more. LoCoV1 tasks use real-world datasets spanning diverse domains, including Tau Scrolls, QASPER, LongBench, and the Australian Legal Case Reports corpus.
  • M2-BERT-V2 Models: Using LoCoV1 and our original pretrained M2-BERT checkpoints, we fine-tuned a new set of M2-BERT models for 128, 2k, 8k, and 32k input tokens. We include these inference checkpoints on HuggingFace and Together. We also provide our code base for developing your own long-context retrieval encoders with M2-BERT!
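
As a quick example, here is a minimal sketch of loading one of the new checkpoints through HuggingFace AutoModel. The repository ID below uses the 32k retrieval checkpoint as an example; check the HuggingFace hub for the full list of released names:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; see the HuggingFace hub for all released model IDs.
model_id = "togethercomputer/m2-bert-80M-32k-retrieval"
max_seq_len = 32768

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
)
# M2-BERT reuses the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", model_max_length=max_seq_len
)

inputs = tokenizer(
    "A long government report to embed...",
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=max_seq_len,
)
with torch.no_grad():
    outputs = model(**inputs)

# The checkpoint's remote code returns a pooled embedding for retrieval.
embedding = outputs["sentence_embedding"]
```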

LoCoV1: Gauging Long-Context Handling in Modern Retrieval Systems

In our earlier blog post, we asked about the use cases where long-context retrieval would be most impactful. After many great conversations with researchers and practitioners, we are excited to present LoCoV1, which includes several new application domains such as law and programming.

We aimed to assemble a new set of naturalistic, domain-specific retrieval tasks that reflect real-world use cases for long-context queries and documents. LoCoV1 draws from several existing long-context benchmarks, including Tau Scrolls, LongBench, and QASPER, as well as several domain-specific datasets not originally intended for retrieval, such as CourtListener, the Australian Legal Case Reports dataset, and the StackOverflow forum.

Each dataset was selected for two main properties:

  • Long, complex queries and documents.
  • Relevant information spread throughout the queries and documents, so that the task genuinely tests long-context handling.

We make the queries and documents for LoCoV1 available on HuggingFace!
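
As a quick sketch of loading the benchmark with the datasets library (the repository IDs and column names here reflect our HuggingFace release; see the hub for the exact names):

```python
from datasets import load_dataset

# Repository IDs as released on the HuggingFace hub.
queries = load_dataset("hazyresearch/LoCoV1-Queries", split="test")
documents = load_dataset("hazyresearch/LoCoV1-Documents", split="test")

# Both splits cover all 12 tasks; filter on the task identifier column
# (assumed here to be called "dataset") to pull out a single task.
qasper_queries = queries.filter(lambda row: "qasper" in row["dataset"].lower())
print(len(qasper_queries))
```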

For an overview of the dataset, please see the table below. We also provide additional dataset information in our ArXiv release, including:

  • Query and document examples (Table 11)
  • Distribution plots of the document lengths (Figure 5)

| Dataset | Source | Domain | # of Train Queries | # of Train Documents | # of Test Queries | # of Test Documents | Avg. Query Length | Avg. Doc Length |
|---|---|---|---|---|---|---|---|---|
| SummScreenFD | Tau Scrolls | Screenwriting | 3,673 | 3,673 | 338 | 338 | 590 | 30,792 |
| Gov. Report | Tau Scrolls | Government | 17,457 | 17,457 | 972 | 972 | 3,871 | 55,280 |
| QMSUM | Tau Scrolls | Corporate Management | 1,257 | 162 | 272 | 35 | 430 | 58,129 |
| QASPER - Title to Full Text | QASPER | Science | 888 | 888 | 416 | 416 | 71 | 22,315 |
| QASPER - Abstract to Full Text | QASPER | Science | 888 | 888 | 416 | 416 | 931 | 22,315 |
| MultiFieldQA | LongBench | General Domain | 120 | 120 | 30 | 30 | 62 | 29,465 |
| 2WikimQA | LongBench | General Domain | 240 | 240 | 60 | 60 | 69 | 37,867 |
| Passage Retrieval | LongBench | General Domain | 240 | 240 | 60 | 60 | 840 | 35,814 |
| CourtListener - Plain Text | CourtListener | Law | 10,000 | 10,000 | 2,000 | 2,000 | 146 | 48,190 |
| CourtListener - HTML | CourtListener | Law | 10,000 | 10,000 | 2,000 | 2,000 | 146 | 47,028 |
| Australian Legal Case Reports | Australian Legal Case Reports | Law | 3,094 | 3,094 | 770 | 770 | 14,986 | 47,536 |
| StackOverflow | StackOverflow | Programming | 1,599 | 1,800 | 5,400 | 774 | 175 | 84,544 |

The Latest Set of M2-BERT Embedding Encoders

With our new LoCoV1 benchmark, we fine-tune a new set of M2-BERT encoders covering a broader diversity of long-context domains, starting from our pretrained checkpoints capable of handling 128, 2048, 8192, and 32768 input tokens. We include our latest code here and our model releases on HuggingFace.
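
Since the checkpoints are also hosted on Together, here is a minimal sketch of calling them through the Together embeddings endpoint (the model string is assumed to match the HuggingFace checkpoint name; check Together's model list):

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Model string assumed to match the HuggingFace checkpoint name.
response = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-32k-retrieval",
    input="What remedies did the appellate court consider?",
)
embedding = response.data[0].embedding  # list of floats
```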

Results of New M2-BERT Encoders on LoCoV1

We compare our latest long-context M2-BERT models to the same state-of-the-art retrieval encoders from our original blog post: E5-Mistral, BGE, OpenAI Ada, Cohere, and VoyageAI. We also test additional retrieval baselines, including BM25, ColBERT, and LongColBERT.

For our long-context retrieval models, we follow the same fine-tuning procedure as in our previous blog post: we fine-tune M2-BERT-80M-2K, -8K, and -32K on the training sets of these tasks using the orthogonal loss. These become our M2-BERT-80M-2K-retrieval, -8K-retrieval, and -32K-retrieval models.
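
For intuition, here is a rough sketch of an orthogonality-style retrieval loss, under the assumption that it pulls query-positive cosine similarity toward 1 and pushes query-negative similarity toward 0; see our ArXiv release for the exact formulation:

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(query_emb, pos_emb, neg_emb):
    """Sketch: pull query-positive cosine similarity toward 1 and
    push |query-negative| cosine similarity toward 0 (orthogonality)."""
    q = F.normalize(query_emb, dim=-1)   # (batch, dim)
    p = F.normalize(pos_emb, dim=-1)     # (batch, dim)
    n = F.normalize(neg_emb, dim=-1)     # (num_negs, dim)
    pos_sim = (q * p).sum(dim=-1)        # cosine similarity per query-positive pair
    neg_sim = (q @ n.T).abs()            # similarity against all negatives
    return (1.0 - pos_sim).mean() + neg_sim.mean()

# Example usage with random embeddings:
q, p, n = torch.randn(4, 768), torch.randn(4, 768), torch.randn(16, 768)
print(orthogonal_loss(q, p, n))
```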

| Model | # Params | Max. Seq. Len. | LoCo Score |
|---|---|---|---|
| BGE-Large (zero-shot) | 335M | 512 | 56.9 |
| BGE-Large (fine-tuned) | 335M | 512 | 65.0 |
| E5-Mistral-7B-Instruct | 7.11B | 4096 | 73.0 |
| Jina Embeddings | 137M | 8192 | 67.2 |
| OpenAI Ada | N/A | 8192 | 63.9 |
| Voyage-001 | N/A | 4096 | 56.9 |
| Cohere English v3.0 | N/A | 512 | 59.0 |
| BM25 | N/A | N/A | 81.5 |
| ColBERT | 110M | 512 | 54.3 |
| LongColBERT | 110M | N/A | 68.0 |
| M2-BERT-128 | 80M | 128 | 69.7 |
| M2-BERT-2048 | 80M | 2048 | 81.4 |
| M2-BERT-8192 | 80M | 8192 | 88.9 |
| M2-BERT-32768 | 80M | 32768 | 95.2 |

We find that lexical approaches such as BM25 perform remarkably well on the LoCoV1 tasks, which is especially impressive considering how computationally cheap BM25 is! We also find that our M2-BERT retrieval models can outperform much larger models on long-context tasks, surpassing fine-tuned models 4x their size and zero-shot neural models up to 85x their size!
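
To make the lexical baseline concrete, here is a minimal BM25 sketch using the rank_bm25 package (our choice of library for illustration, not necessarily the implementation behind the benchmark numbers):

```python
from rank_bm25 import BM25Okapi

documents = [
    "The government report covers fiscal policy over the last decade.",
    "This legal opinion addresses a contract dispute between two parties.",
    "The screenplay transcript follows three characters through one night.",
]
# Whitespace tokenization keeps the sketch simple; real pipelines
# typically lowercase and strip punctuation as well.
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

query = "fiscal policy report".split()
scores = bm25.get_scores(query)  # one lexical score per document
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])
```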

As we continue to make progress in understanding where long-context benchmarks and models are useful, we are excited to see how neural, lexical, and hybrid retrieval encoders continue to improve long-context retrieval performance.

For our complete LoCoV1 results with task subscores, please see Table 13 in our ArXiv release.

Limitations

We also explored the efficacy of M2-BERT in a reranking setting on the MLDR retrieval dataset, which contains 800 queries and 200,000 long-context documents. Using exact search, we found that M2-BERT's performance suffered, as did that of other embedding approaches like E5-Mistral, OpenAI Ada, and M3. However, when we augmented M2-BERT by using it as a reranker over BM25 retrieval results, we improved long-context retrieval performance to 80.9 nDCG@10, compared to the current SOTA of 77.5 from LongColBERT. In future work, we would like to further explore how M2-BERT embeddings can be augmented or combined with lexical approaches to maximize both short- and long-document handling.
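
Here is a sketch of this hybrid pipeline, with `embed` standing in as a hypothetical callable (e.g., an M2-BERT forward pass as in the snippet above) that maps text to an embedding vector:

```python
import numpy as np

def rerank_with_m2_bert(query, candidates, embed, top_k=10):
    """Rerank a BM25 candidate pool by embedding cosine similarity.

    `embed` is a hypothetical callable: text -> 1-D numpy embedding.
    """
    q = embed(query)
    scores = []
    for doc in candidates:
        d = embed(doc)
        scores.append(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    order = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in order]

# Pipeline: BM25 cheaply retrieves a small candidate set from the
# 200,000-document corpus, then M2-BERT reranks only those candidates
# (cheap lexical first stage, expressive neural second stage).
```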

Acknowledgments

We would like to thank the Stanford Center for Research on Foundation Models (CRFM) and the Stanford AI Laboratory for supporting our research! Thank you to Alycia Lee for helping set up AutoModel for our models on HuggingFace.

We would also like to thank our collaborators at Together AI for their support in developing these models and their help in hosting them in their new embedding API. Thanks to Together, we were able to test out early versions of these models at a Hackathon with MongoDB, and we’re looking forward to all the new use cases for long-context retrieval models, including RAG integrations with LangChain and LlamaIndex.

Check out their blog posts on new uses of these models!