Evaluation Harness

To run the WONDERBREAD evaluations, please follow the instructions below.

1. Install WONDERBREAD from GitHub
git clone https://github.com/HazyResearch/wonderbread.git
cd wonderbread/
2. Create a conda environment and install dependencies
conda create -n wonderbread_env python=3.10 -y
conda activate wonderbread_env
pip3 install -r requirements.txt
pip3 install -e .

brew install ffmpeg # macOS (Homebrew); on Debian/Ubuntu use: sudo apt-get install ffmpeg
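
Before moving on, it can help to confirm that the tools used in this guide are actually on your PATH. The sketch below is not part of WONDERBREAD: `check_tools` is a hypothetical helper name, and which tools to pass it is up to you (the commands in this guide use `python3`, `pip3`, `ffmpeg`, and `gdown`).

```shell
# Hypothetical helper (not part of WONDERBREAD): report which commands are missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    # `command -v` succeeds only if the name resolves to something runnable.
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  # Non-zero exit status if at least one tool was not found.
  return "$missing"
}

# Example call: check_tools python3 pip3 ffmpeg gdown
```

The function returns a non-zero status when anything is missing, so it can gate a setup script.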
3. Download the dataset.

Note: the full dataset is large and may take a while to download, so the commands below fetch a small debug subsample for quick testing.

gdown 12iJoRZXyBV4pvEsWeAKv2n61LwVbUpqo # debug version of dataset
unzip debug_demos.zip && rm debug_demos.zip
mkdir -p data/demos && mv debug_demos/* data/demos && rm -r debug_demos/
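
A quick way to confirm the unpack worked is to count what ended up in `data/demos`. The helper below is a hypothetical sketch, not part of WONDERBREAD. It counts top-level entries only, and this guide does not state how many entries the debug subsample should contain, so treat the number as a sanity check rather than a pass/fail test.

```shell
# Hypothetical helper (not part of WONDERBREAD): count top-level entries in a directory.
count_entries() {
  local dir="${1:-data/demos}"   # defaults to the directory used in this guide
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  # -mindepth 1 skips the directory itself; -maxdepth 1 avoids recursing.
  find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Example call: count_entries data/demos
```

A result of `0`, or the "not a directory" error, suggests the `mv` step did not run from the repository root.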

To download the Gold subset or the full dataset instead, run the corresponding command:

gdown 193Mz_aMuVCXovT3fIwwZc9aH6if9PNjQ # gold version of dataset
gdown 1k-T-q1SI7rDu7pvqUPQ2w87OLf_IQrSv # full version of dataset
4. Run evaluations.

First, set the API keys for the models you want to test. You can get the API keys from the respective model providers.

# Set the API keys of the models you want to test
export OPENAI_API_KEY="<Your API Key>"
export ANTHROPIC_API_KEY="<Your API Key>"
export GOOGLE_API_KEY="<Your API Key>"
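
Since a missing key typically only surfaces once a run reaches the provider, it may be worth checking up front which keys are exported. The sketch below is a hypothetical helper, not part of WONDERBREAD. It reports whether each of the three variables above is set and never prints the key values themselves.

```shell
# Hypothetical helper (not part of WONDERBREAD): show which API keys are exported.
report_api_keys() {
  local name
  for name in OPENAI_API_KEY ANTHROPIC_API_KEY GOOGLE_API_KEY; do
    # printenv only sees exported variables, which matches the `export` lines above.
    if [ -n "$(printenv "$name")" ]; then
      echo "$name: set"
    else
      echo "$name: not set"
    fi
  done
}

# Example call: report_api_keys
```

Only the key for the model you plan to test needs to be set; the other lines can read "not set".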

To run the evals in debug mode (i.e. only 3 examples per task):

cd wonderbread/benchmark/tasks
export MODEL=GPT4 # currently supported: GPT4, Claude3, GeminiPro

# Documentation tasks
python3 documentation/sop_generation/run_experiments.py --model $MODEL --is_debug
python3 documentation/demo_segmentation/run_experiments.py --model $MODEL --is_debug

# Knowledge transfer tasks
python3 knowledge_transfer/demo_validation/run_experiments.py --model $MODEL --is_debug
python3 knowledge_transfer/question_answering/run_experiments.py --model $MODEL --is_debug

# Improvement tasks
python3 improvement/sop_improvement/run_experiments.py --model $MODEL --is_debug
python3 improvement/sop_ranking/run_experiments.py --model $MODEL --is_debug

To run the evals in gold mode (i.e. all examples in the Gold subset):

# Documentation tasks
python3 documentation/sop_generation/run_experiments.py --model $MODEL
python3 documentation/demo_segmentation/run_experiments.py --model $MODEL

# Knowledge transfer tasks
python3 knowledge_transfer/demo_validation/run_experiments.py --model $MODEL
python3 knowledge_transfer/question_answering/run_experiments.py --model $MODEL

# Improvement tasks
python3 improvement/sop_improvement/run_experiments.py --model $MODEL
python3 improvement/sop_ranking/run_experiments.py --model $MODEL
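
The six commands in each mode differ only in the task path and the `--is_debug` flag, so they can be wrapped in a loop. The function below is a hypothetical convenience wrapper, not something WONDERBREAD ships. `run_all_tasks` and the `DRY_RUN` variable are names invented for this sketch, and it assumes you are in `wonderbread/benchmark/tasks` as in the commands above.

```shell
# Hypothetical wrapper (not part of WONDERBREAD): run all six task scripts in sequence.
# Usage: run_all_tasks <model> [debug]
run_all_tasks() {
  local model="$1" mode="${2:-gold}" task script
  for task in \
    documentation/sop_generation \
    documentation/demo_segmentation \
    knowledge_transfer/demo_validation \
    knowledge_transfer/question_answering \
    improvement/sop_improvement \
    improvement/sop_ranking
  do
    script="$task/run_experiments.py"
    # If DRY_RUN is set and non-empty, the command is echoed instead of executed.
    if [ "$mode" = "debug" ]; then
      ${DRY_RUN:+echo} python3 "$script" --model "$model" --is_debug || return 1
    else
      ${DRY_RUN:+echo} python3 "$script" --model "$model" || return 1
    fi
  done
}
```

For example, `DRY_RUN=1 run_all_tasks "$MODEL" debug` prints the six debug commands without running them, and `run_all_tasks "$MODEL"` runs the gold-mode set. The loop stops at the first script that exits with an error.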