Collection Process

The dataset collection process is visualized in the figure below.

Descriptive Text

1) Workflow Sourcing

The workflows in WONDERBREAD are sourced from WebArena, a benchmark of 812 workflows that mimic real-world tasks on websites such as e-commerce platforms, content management systems, forums, and developer tools. Of the original 812 workflows, we kept 598 after excluding those that were either impossible to complete or vaguely defined.

2) Recruitment

Thirteen volunteers were recruited and trained to record demonstrations for the 598 workflows. The annotators attended a training session that introduced the annotation pipeline and gave detailed instructions on how to record demonstrations.

3) Data Collection

The annotators recorded a total of 3,202 demonstrations across the 598 workflows. Each demonstration consists of a full screen recording, an action trace, and a standard operating procedure (SOP) outlining all of the steps taken in the demonstration.
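
To make the three components of a demonstration concrete, here is a minimal Python sketch of how one record might be represented. The class names, field names, and file name are hypothetical and chosen only for illustration; the dataset's actual schema is not specified in this section.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    # Hypothetical fields: the real action-trace schema is not given here.
    kind: str    # e.g. "click" or "type"
    target: str  # description of the UI element acted on


@dataclass
class Demonstration:
    # One recorded attempt at a workflow (field names are assumptions).
    workflow_id: int
    recording_path: str                                       # full screen recording
    action_trace: List[Action] = field(default_factory=list)  # logged user actions
    sop: List[str] = field(default_factory=list)              # SOP, one step per entry

    def is_complete(self) -> bool:
        # A demonstration needs all three components to be usable.
        return bool(self.recording_path) and bool(self.action_trace) and bool(self.sop)


demo = Demonstration(
    workflow_id=0,
    recording_path="demo_0.mp4",  # placeholder file name
    action_trace=[Action("click", "search box"), Action("type", "search box")],
    sop=["Click the search box.", "Type the query."],
)
print(demo.is_complete())
```

A completeness check like `is_complete` is one kind of automated test that could run before manual review, although the checks actually used are not described here.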

4) Review

Using a combination of automated and manual checks, we reviewed the demonstrations and flagged those of insufficient quality.

5) Quality Assurance

We had annotators re-record any demonstrations that were of insufficient quality. Overall, we conducted 3 cycles of re-recording to ensure clear and high-quality demonstrations.

6) Ranking

For a subset of 162 workflows, annotators reviewed all of the available demonstrations and provided a relative ranking of the demonstrations based on their quality.
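
One way to work with a relative ranking is to store it as an ordered list of demonstration identifiers, best first, and expand it into pairwise preferences. The sketch below assumes that representation; the identifiers and the function name `ranking_to_pairs` are hypothetical, and the storage format used in the dataset may differ.

```python
from itertools import combinations
from typing import List, Tuple


def ranking_to_pairs(ranking: List[str]) -> List[Tuple[str, str]]:
    """Expand a best-to-worst ranking into (better, worse) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair always appears earlier (ranks higher) than the second.
    return list(combinations(ranking, 2))


# Hypothetical demonstration IDs for a single workflow, best first.
ranking = ["demo_b", "demo_a", "demo_c"]
pairs = ranking_to_pairs(ranking)
print(pairs)
```

A ranking of n demonstrations yields n(n-1)/2 pairs, which is a convenient form for checking whether a model's quality judgments agree with the annotators' ordering.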

7) Final Dataset

The final dataset contains a total of 2,928 demonstrations across 598 workflows.
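
The counts quoted in the steps above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the variable names are illustrative, and it makes no claim about why the recorded and final demonstration counts differ.

```python
# Counts quoted in the text above.
WEBARENA_WORKFLOWS = 812
KEPT_WORKFLOWS = 598
RECORDED_DEMOS = 3202
FINAL_DEMOS = 2928
RANKED_WORKFLOWS = 162

# Workflows excluded during sourcing.
excluded_workflows = WEBARENA_WORKFLOWS - KEPT_WORKFLOWS

# Gap between demonstrations recorded and demonstrations in the final dataset.
demo_count_difference = RECORDED_DEMOS - FINAL_DEMOS

# Average number of final demonstrations per workflow.
avg_demos_per_workflow = FINAL_DEMOS / KEPT_WORKFLOWS

# Share of workflows that received a quality ranking.
ranked_fraction = RANKED_WORKFLOWS / KEPT_WORKFLOWS

print(f"excluded workflows: {excluded_workflows}")
print(f"demo count difference: {demo_count_difference}")
print(f"average demos per workflow: {avg_demos_per_workflow:.2f}")
print(f"ranked fraction of workflows: {ranked_fraction:.2%}")
```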