Announcing LoCoV1 and the Latest M2-BERT Models

Jon Saad-Falcon, Dan Fu, Simran Arora

We are thrilled to announce a new long-context retrieval benchmark, LoCoV1, as well as our newest long-context M2-BERT models! After hearing from the community about where long-context retrieval could be most useful, we are building on our earlier LoCoV0 benchmark and M2-BERT models with the release of:

  • LoCoV1: An expanded long-context benchmark consisting of 12 tasks drawn from law, medicine, science, finance, corporate governance, government reports, and more. LoCoV1 tasks use real-world datasets spanning diverse domains, including Tau Scrolls, QASPER, LongBench, and the Australian Legal Case Reports corpus.
  • M2-BERT-V2 Models: Using LoCoV1 and our original pretrained M2-BERT checkpoints, we fine-tuned a new set of M2-BERT models for 128, 2k, 8k, and 32k input tokens. We include these inference checkpoints on HuggingFace and Together. We also provide our code base for developing your own long-context retrieval encoders with M2-BERT!
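
As a quick example, here is a minimal sketch of loading one of the new checkpoints through HuggingFace AutoModel. The repository ID below uses the 32k retrieval checkpoint as an example; check the HuggingFace hub for the full list of released names:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Example checkpoint; see the HuggingFace hub for all released model IDs.
model_id = "togethercomputer/m2-bert-80M-32k-retrieval"
max_seq_len = 32768

model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
)
# M2-BERT reuses the bert-base-uncased tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased", model_max_length=max_seq_len
)

inputs = tokenizer(
    "A long government report to embed...",
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=max_seq_len,
)
with torch.no_grad():
    outputs = model(**inputs)

# The checkpoint's remote code returns a pooled embedding for retrieval.
embedding = outputs["sentence_embedding"]
```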

LoCoV1: Gauging Long-Context Handling in Modern Retrieval Systems

In our earlier blog post, we asked about the use cases where long-context retrieval would be most impactful. After many great conversations with researchers and practitioners, we are excited to present LoCoV1, which includes several new application domains such as law and programming.

We aimed to assemble a new set of naturalistic, domain-specific retrieval tasks that reflect real-world use cases for long-context queries and documents. LoCoV1 draws from several existing long-context benchmarks, including Tau Scrolls, LongBench, and QASPER, as well as several domain-specific datasets not originally intended for retrieval, such as CourtListener, the Australian Legal Case Reports dataset, and the StackOverflow forum.

Each dataset was selected for two main properties:

  • Long, complex queries and documents.
  • Relevant information spread throughout the queries and documents, so that the task genuinely tests long-context handling.

We make the queries and documents for LoCoV1 available on HuggingFace!
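
As a quick sketch of loading the benchmark with the datasets library (the repository IDs and column names here reflect our HuggingFace release; see the hub for the exact names):

```python
from datasets import load_dataset

# Repository IDs as released on the HuggingFace hub.
queries = load_dataset("hazyresearch/LoCoV1-Queries", split="test")
documents = load_dataset("hazyresearch/LoCoV1-Documents", split="test")

# Both splits cover all 12 tasks; filter on the task identifier column
# (assumed here to be called "dataset") to pull out a single task.
qasper_queries = queries.filter(lambda row: "qasper" in row["dataset"].lower())
print(len(qasper_queries))
```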

For an overview of the dataset, please see the table below. We also provide additional dataset information in our ArXiv release, including:

  • Query and document examples (Table 11)
  • Distribution plots of the document lengths (Figure 5)

| Dataset | Source | Domain | # of Train Queries | # of Train Documents | # of Test Queries | # of Test Documents | Avg. Query Length | Avg. Doc Length |
|---|---|---|---|---|---|---|---|---|
| SummScreenFD | Tau Scrolls | Screenwriting | 3,673 | 3,673 | 338 | 338 | 590 | 30,792 |
| Gov. Report | Tau Scrolls | Government | 17,457 | 17,457 | 972 | 972 | 3,871 | 55,280 |
| QMSUM | Tau Scrolls | Corporate Management | 1,257 | 162 | 272 | 35 | 430 | 58,129 |
| QASPER - Title to Full Text | QASPER | Science | 888 | 888 | 416 | 416 | 71 | 22,315 |
| QASPER - Abstract to Full Text | QASPER | Science | 888 | 888 | 416 | 416 | 931 | 22,315 |
| MultiFieldQA | LongBench | General Domain | 120 | 120 | 30 | 30 | 62 | 29,465 |
| 2WikimQA | LongBench | General Domain | 240 | 240 | 60 | 60 | 69 | 37,867 |
| Passage Retrieval | LongBench | General Domain | 240 | 240 | 60 | 60 | 840 | 35,814 |
| CourtListener - Plain Text | CourtListener | Law | 10,000 | 10,000 | 2,000 | 2,000 | 146 | 48,190 |
| CourtListener - HTML | CourtListener | Law | 10,000 | 10,000 | 2,000 | 2,000 | 146 | 47,028 |
| Australian Legal Case Reports | Australian Legal Case Reports | Law | 3,094 | 3,094 | 770 | 770 | 14,986 | 47,536 |
| StackOverflow | StackOverflow | Programming | 1,599 | 1,800 | 5,400 | 774 | 175 | 84,544 |

The Latest Set of M2-BERT Embedding Encoders

With our new LoCoV1 benchmark, we fine-tune a new set of M2-BERT encoders covering a broader diversity of long-context domains, starting from our pretrained checkpoints capable of handling 128, 2048, 8192, and 32768 input tokens. We include our latest code here and our model releases on HuggingFace.
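
Since the checkpoints are also hosted on Together, here is a minimal sketch of calling them through the Together embeddings endpoint (the model string is assumed to match the HuggingFace checkpoint name; check Together's model list):

```python
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Model string assumed to match the HuggingFace checkpoint name.
response = client.embeddings.create(
    model="togethercomputer/m2-bert-80M-32k-retrieval",
    input="What remedies did the appellate court consider?",
)
embedding = response.data[0].embedding  # list of floats
```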

Results of New M2-BERT Encoders on LoCoV1

We compare our latest long-context M2-BERT models to the same state-of-the-art retrieval encoders from our original blog post: E5-Mistral, BGE, OpenAI Ada, Cohere, and VoyageAI. We also test additional retrieval baselines, including BM25, ColBERT, and LongColBERT.

For our long-context retrieval models, we follow the same fine-tuning procedure as in our previous blog post: we fine-tune M2-BERT-80M-2K, -8K, and -32K on the training sets of these tasks using the orthogonal loss. These become our M2-BERT-80M-2K-retrieval, -8K-retrieval, and -32K-retrieval models.
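
For intuition, here is a rough sketch of an orthogonality-style retrieval loss, under the assumption that it pulls query-positive cosine similarity toward 1 and pushes query-negative similarity toward 0; see our ArXiv release for the exact formulation:

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(query_emb, pos_emb, neg_emb):
    """Sketch: pull query-positive cosine similarity toward 1 and
    push |query-negative| cosine similarity toward 0 (orthogonality)."""
    q = F.normalize(query_emb, dim=-1)   # (batch, dim)
    p = F.normalize(pos_emb, dim=-1)     # (batch, dim)
    n = F.normalize(neg_emb, dim=-1)     # (num_negs, dim)
    pos_sim = (q * p).sum(dim=-1)        # cosine similarity per query-positive pair
    neg_sim = (q @ n.T).abs()            # similarity against all negatives
    return (1.0 - pos_sim).mean() + neg_sim.mean()

# Example usage with random embeddings:
q, p, n = torch.randn(4, 768), torch.randn(4, 768), torch.randn(16, 768)
print(orthogonal_loss(q, p, n))
```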

| Model | # Params | Max. Seq. Len. | LoCo Score |
|---|---|---|---|
| BGE-Large (zero-shot) | 335M | 512 | 56.9 |
| BGE-Large (fine-tuned) | 335M | 512 | 65.0 |
| E5-Mistral-7B-Instruct | 7.11B | 4096 | 73.0 |
| Jina Embeddings | 137M | 8192 | 67.2 |
| OpenAI Ada | N/A | 8192 | 63.9 |
| Voyage-001 | N/A | 4096 | 56.9 |
| Cohere English v3.0 | N/A | 512 | 59.0 |
| BM25 | N/A | N/A | 81.5 |
| ColBERT | 110M | 512 | 54.3 |
| LongColBERT | 110M | N/A | 68.0 |
| M2-BERT-128 | 80M | 128 | 69.7 |
| M2-BERT-2048 | 80M | 2048 | 81.4 |
| M2-BERT-8192 | 80M | 8192 | 88.9 |
| M2-BERT-32768 | 80M | 32768 | 95.2 |

We find that lexical approaches such as BM25 perform remarkably well on the LoCoV1 tasks, which is especially impressive considering how computationally cheap BM25 is! We also find that our M2-BERT retrieval models can outperform much larger models on long-context tasks, surpassing fine-tuned models 4x their size and zero-shot neural models up to 85x their size!
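
To make the lexical baseline concrete, here is a minimal BM25 sketch using the rank_bm25 package (our choice of library for illustration, not necessarily the implementation behind the benchmark numbers):

```python
from rank_bm25 import BM25Okapi

documents = [
    "The government report covers fiscal policy over the last decade.",
    "This legal opinion addresses a contract dispute between two parties.",
    "The screenplay transcript follows three characters through one night.",
]
# Whitespace tokenization keeps the sketch simple; real pipelines
# typically lowercase and strip punctuation as well.
tokenized = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized)

query = "fiscal policy report".split()
scores = bm25.get_scores(query)  # one lexical score per document
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])
```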

As we continue to make progress in understanding where long-context benchmarks and models are useful, we are excited to see how neural, lexical, and hybrid retrieval encoders continue to improve long-context retrieval performance.

For our complete LoCoV1 results with task subscores, please see Table 13 in our ArXiv release.

Limitations

We also explored the efficacy of M2-BERT in a reranking setting on the MLDR retrieval dataset, which contains 800 queries and 200,000 long-context documents. Using exact search, we found that M2-BERT's performance suffered, as did that of other embedding approaches like E5-Mistral, OpenAI Ada, and M3. However, when we augmented M2-BERT by using it as a reranker over BM25 retrieval results, we improved long-context retrieval performance to 80.9 nDCG@10, compared to the current SOTA of 77.5 from LongColBERT. In future work, we would like to further explore how M2-BERT embeddings can be augmented or combined with lexical approaches to maximize both short- and long-document handling.
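
Here is a sketch of this hybrid pipeline, with `embed` standing in as a hypothetical callable (e.g., an M2-BERT forward pass as in the snippet above) that maps text to an embedding vector:

```python
import numpy as np

def rerank_with_m2_bert(query, candidates, embed, top_k=10):
    """Rerank a BM25 candidate pool by embedding cosine similarity.

    `embed` is a hypothetical callable: text -> 1-D numpy embedding.
    """
    q = embed(query)
    scores = []
    for doc in candidates:
        d = embed(doc)
        scores.append(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))
    order = np.argsort(scores)[::-1][:top_k]
    return [candidates[i] for i in order]

# Pipeline: BM25 cheaply retrieves a small candidate set from the
# 200,000-document corpus, then M2-BERT reranks only those candidates
# (cheap lexical first stage, expressive neural second stage).
```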

Acknowledgments

We would like to thank the Stanford Center for Research on Foundation Models (CRFM) and the Stanford AI Laboratory for supporting our research! Thank you to Alycia Lee for helping set up AutoModel for our models on HuggingFace.

We would also like to thank our collaborators at Together AI for their support in developing these models and their help in hosting them in their new embedding API. Thanks to Together, we were able to test out early versions of these models at a Hackathon with MongoDB, and we’re looking forward to all the new use cases for long-context retrieval models, including RAG integrations with LangChain and LlamaIndex.

Check out their blog posts on new uses of these models!