Collection Process
The dataset collection process is visualized in the figure below.

1) Workflow Sourcing
The workflows in WONDERBREAD are sourced from WebArena, a benchmark of 812 workflows that mimic real-world tasks on websites such as e-commerce platforms, content management systems, forums, and developer tools. Of the original 812 workflows, we retained 598 after excluding those that were either impossible to complete or vaguely defined.
2) Recruitment
We recruited and trained 13 volunteers to record demonstrations for the 598 workflows. The annotators attended a training session where they were introduced to the annotation pipeline and given detailed instructions on how to record demonstrations.
3) Data Collection
The annotators recorded a total of 3,202 demonstrations across the 598 workflows. Each demonstration consists of a full screen recording, an action trace, and a standard operating procedure (SOP) outlining all of the steps taken in the demonstration.
4) Review
Using a combination of automated and manual checks, we reviewed the demonstrations and flagged those of insufficient quality.
5) Quality Assurance
We had annotators re-record any demonstrations that were of insufficient quality. Overall, we conducted 3 cycles of re-recording to ensure clear and high-quality demonstrations.
6) Ranking
For a subset of 162 workflows, annotators reviewed all of the available demonstrations and provided a relative ranking of the demonstrations based on their quality.
7) Final Dataset
The final dataset contains a total of 2,928 demonstrations across 598 workflows.
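
The counts reported in the steps above can be cross-checked with a short sketch. The `Demonstration` class and its field names are hypothetical and only mirror the three components listed under Data Collection; they are not the dataset's actual schema. The `summarize` helper is likewise illustrative and uses only the numbers stated in this section.

```python
from dataclasses import dataclass

# Counts taken directly from the collection steps above.
WEBARENA_WORKFLOWS = 812
RETAINED_WORKFLOWS = 598
RECORDED_DEMOS = 3202
FINAL_DEMOS = 2928
RANKED_WORKFLOWS = 162


@dataclass
class Demonstration:
    """Hypothetical record layout; field names are assumptions."""
    workflow_id: int
    screen_recording_path: str  # full screen recording
    action_trace: list[str]     # logged actions, in order
    sop: str                    # standard operating procedure text


def summarize() -> dict[str, float]:
    """Derive simple summary figures from the stated counts."""
    return {
        "excluded_workflows": WEBARENA_WORKFLOWS - RETAINED_WORKFLOWS,
        "recorded_minus_final_demos": RECORDED_DEMOS - FINAL_DEMOS,
        "mean_final_demos_per_workflow": round(FINAL_DEMOS / RETAINED_WORKFLOWS, 2),
        "ranked_workflow_share": round(RANKED_WORKFLOWS / RETAINED_WORKFLOWS, 3),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these figures, 214 WebArena workflows were excluded, the final dataset holds 274 fewer demonstrations than were originally recorded, and each retained workflow has roughly 4.9 demonstrations on average.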