A collaboratively built large language model benchmark for legal reasoning

The LegalBench project is an ongoing open science effort to collaboratively curate tasks for evaluating legal reasoning in English large language models (LLMs). The benchmark currently consists of 162 tasks gathered from 40 contributors.

What is LegalBench?

LegalBench is a benchmark consisting of different legal reasoning tasks. Each task has an associated dataset, consisting of input-output pairs. Examples of tasks include:

  • The hearsay task, for which the input is a description of some evidence and the output is whether or not that evidence would be considered hearsay (i.e., “Yes” or “No”).
  • The proa task, for which the input is a statute, and the output is whether or not that statute contains a private right of action.
  • The Rule QA task, for which the input is a question about the substance of a law, and the output is the correct answer to the question.

Task datasets can be used to evaluate LLMs by providing the LLM with the input, and evaluating how frequently it generates the desired output. LegalBench tasks cover a wide range of textual types, task structures, legal domains, and difficulty levels. Descriptions of each task are available here.

Where do tasks come from?

Notably, LegalBench tasks have been assembled through a unique crowd-sourcing effort within the legal community. Individuals and organizations from a broad range of legal backgrounds—lawyers, computational legal practitioners, law professors, and legal impact labs—have contributed tasks they see as “interesting” or “useful.” Interesting tasks are those that require a type of reasoning that the contributor deemed to be worth measuring. For instance, the task might correspond to one that law students are frequently expected to perform as part of assessments. Useful tasks correspond to realistic potential applications of LLMs (e.g., contractual analysis). LegalBench has also been augmented with tasks derived from existing legal benchmarks (e.g., CUAD), thus providing a single platform for evaluating a wide range of legal tasks.

LegalBench is ongoing and we are always looking to incorporate more tasks. See here for more information on how to get involved!

How can LegalBench be used?

For AI researchers: LegalBench provides a collection of interesting and hard tasks. Researchers have long recognized the distinct technical challenges posed by legal tasks: complex jargon, longer contexts, sophisticated multi-step reasoning, intricate structure, and minimal labeled data. LegalBench offers AI researchers a new and relevant setting in which to evaluate existing LLMs.

For the legal community: LegalBench provides a means of assessing the performance of different LLMs for law-relevant tasks. As the legal community increasingly explores the potential applications and uses for LLMs, empirical assessments of LLM performance on domain-relevant tasks grow in importance. LegalBench provides legal professionals with guidance on the types of tasks for which current LLMs/prompting methods are proficient, and the types where significant technical improvements are necessary.

Citation information

Please include all citations below, which credit all sources LegalBench draws on.

      title={LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models}, 
      author={Neel Guha and Julian Nyarko and Daniel E. Ho and Christopher Ré and Adam Chilton and Aditya Narayana and Alex Chohlas-Wood and Austin Peters and Brandon Waldon and Daniel N. Rockmore and Diego Zambrano and Dmitry Talisman and Enam Hoque and Faiz Surani and Frank Fagan and Galit Sarfaty and Gregory M. Dickinson and Haggai Porat and Jason Hegland and Jessica Wu and Joe Nudell and Joel Niklaus and John Nay and Jonathan H. Choi and Kevin Tobia and Margaret Hagan and Megan Ma and Michael Livermore and Nikon Rasumov-Rahe and Nils Holzenberger and Noam Kolt and Peter Henderson and Sean Rehaag and Sharad Goel and Shang Gao and Spencer Williams and Sunny Gandhi and Tom Zur and Varun Iyer and Zehua Li},
  title={ContractNLI: A dataset for document-level natural language inference for contracts},
  author={Koreeda, Yuta and Manning, Christopher D},
  journal={arXiv preprint arXiv:2110.01799},
  title={Cuad: An expert-annotated nlp dataset for legal contract review},
  author={Hendrycks, Dan and Burns, Collin and Chen, Anya and Ball, Spencer},
  journal={arXiv preprint arXiv:2103.06268},
  title={MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding},
  author={Wang, Steven H and Scardigli, Antoine and Tang, Leonard and Chen, Wei and Levkin, Dimitry and Chen, Anya and Ball, Spencer and Woodside, Thomas and Zhang, Oliver and Hendrycks, Dan},
  journal={arXiv preprint arXiv:2301.00876},
  title={The creation and analysis of a website privacy policy corpus},
  author={Wilson, Shomir and Schaub, Florian and Dara, Aswarth Abhilash and Liu, Frederick and Cherivirala, Sushain and Leon, Pedro Giovanni and Andersen, Mads Schaarup and Zimmeck, Sebastian and Sathyendra, Kanthashree Mysore and Russell, N Cameron and others},
  booktitle={Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  title={When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings},
  author={Zheng, Lucia and Guha, Neel and Anderson, Brandon R and Henderson, Peter and Ho, Daniel E},
  booktitle={Proceedings of the eighteenth international conference on artificial intelligence and law},
  title={Maps: Scaling privacy compliance analysis to a million apps},
  author={Zimmeck, Sebastian and Story, Peter and Smullen, Daniel and Ravichander, Abhilasha and Wang, Ziqi and Reidenberg, Joel R and Russell, N Cameron and Sadeh, Norman},
  journal={Proc. Priv. Enhancing Tech.},
  title={Question answering for privacy policies: Combining computational and legal perspectives},
  author={Ravichander, Abhilasha and Black, Alan W and Wilson, Shomir and Norton, Thomas and Sadeh, Norman},
  journal={arXiv preprint arXiv:1911.00841},
  title={Factoring statutory reasoning as language understanding challenges},
  author={Holzenberger, Nils and Van Durme, Benjamin},
  journal={arXiv preprint arXiv:2105.07903},
  title={CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service},
  author={Lippi, Marco and Pa{\l}ka, Przemys{\l}aw and Contissa, Giuseppe and Lagioia, Francesca and Micklitz, Hans-Wolfgang and Sartor, Giovanni and Torroni, Paolo},
  journal={Artificial Intelligence and Law},


The following individuals/groups have contributed to LegalBench: