Coding Datasets
Coding datasets evaluate code generation, bug fixing, and performance optimization. They are designed for agent harnesses that can interact with code repositories.
SWE-bench
Real GitHub issues from popular Python repositories, requiring the model to produce a patch that fixes the issue and passes the test suite.
| Property | Value |
|---|---|
| Dataset ID | swebench |
| Source | princeton-nlp/SWE-bench_Verified |
| Evaluation | Test execution |
| Requirements | Agent with code editing capabilities |
Variants
SWE-bench ships with two variants:
| Variant | Source | Size | Description |
|---|---|---|---|
| verified | princeton-nlp/SWE-bench_Verified | 500 | Full verified dataset |
| verified_mini | MariusHobbhahn/swe-bench-verified-mini | 50 | Smaller subset for quick testing |
The default variant is verified_mini.
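The variant table can be read as a small lookup with a default. The following is a minimal sketch, not the tool's actual implementation: the `Variant` type, the `VARIANTS` dict, and the `resolve_variant` helper are hypothetical names introduced for illustration, and only the variant names, sources, and sizes come from the table above.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    source: str  # dataset source listed in the variants table
    size: int    # number of tasks in the variant


# Values copied from the variants table; the dict itself is illustrative.
VARIANTS = {
    "verified": Variant("princeton-nlp/SWE-bench_Verified", 500),
    "verified_mini": Variant("MariusHobbhahn/swe-bench-verified-mini", 50),
}
DEFAULT_VARIANT = "verified_mini"


def resolve_variant(name: Optional[str] = None) -> Variant:
    """Return the named variant, falling back to the documented default."""
    key = name or DEFAULT_VARIANT
    if key not in VARIANTS:
        raise ValueError(f"unknown SWE-bench variant: {key!r}")
    return VARIANTS[key]
```

Calling `resolve_variant()` with no argument yields the 50-task mini subset, which matches the documented default.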
Usage
# Quick run: cap the default variant (verified_mini) at 10 queries
ipw run --agent openhands --model gpt-4o --dataset swebench --max-queries 10
# No query cap: runs every task in the variant (slow, especially on the full verified set)
ipw run --agent openhands --model gpt-4o --dataset swebench
How It Works
- Each record contains a problem_statement (the GitHub issue text) and metadata about the repository.
- The agent reads the issue, explores the codebase, and generates a fix.
- The answer field contains the ground-truth patch for reference.
- Correctness is determined by running the repository's test suite against the agent's patch.
Record Metadata
Each DatasetRecord.dataset_metadata includes:
| Field | Description |
|---|---|
| instance_id | Unique SWE-bench instance identifier |
| repo | GitHub repository (e.g., django/django) |
| base_commit | Commit hash the repository is checked out at before the fix is applied |
| hints_text | Optional hints from the issue |
| version | Repository version |
| test_patch | Patch adding the tests that validate the fix |
| fail_to_pass | Tests that should go from failing to passing |
| pass_to_pass | Tests that should continue passing |
| difficulty | Estimated difficulty level |
| variant | Which SWE-bench variant this came from |
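To make the fields concrete, here is a hypothetical metadata dictionary and a small accessor for the two test lists. Every value in `example_metadata` is invented for illustration, and `get_test_ids` is not part of any documented API. Whether fail_to_pass and pass_to_pass arrive as lists or as JSON-encoded strings is an assumption, so the helper accepts both.

```python
import json
from typing import Any, Dict, List


def get_test_ids(metadata: Dict[str, Any], key: str) -> List[str]:
    """Return the test IDs stored under `key` as a list of strings.

    Accepts a list or a JSON-encoded list (assumption: the storage
    format is not specified in this document). Missing keys give [].
    """
    value = metadata.get(key) or []
    if isinstance(value, str):
        value = json.loads(value)
    return [str(item) for item in value]


# Hypothetical record metadata; every value below is made up.
example_metadata = {
    "instance_id": "example__repo-0001",
    "repo": "django/django",
    "base_commit": "0123abc",
    "hints_text": "",
    "version": "1.0",
    "test_patch": "diff --git a/tests/test_a.py b/tests/test_a.py\n",
    "fail_to_pass": '["tests/test_a.py::test_bug"]',   # JSON-encoded list
    "pass_to_pass": ["tests/test_a.py::test_existing"],  # plain list
    "difficulty": "unknown",
    "variant": "verified_mini",
}

failing = get_test_ids(example_metadata, "fail_to_pass")
passing = get_test_ids(example_metadata, "pass_to_pass")
```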
Evaluation
SWE-bench uses test execution rather than LLM judging: the evaluation checks whether the agent's patch makes the fail_to_pass tests pass while keeping the pass_to_pass tests passing. Because no judge model is involved, eval_client, eval_base_url, and eval_model are set to None.
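The pass/fail rule described here can be sketched as a single predicate. This is an illustration under stated assumptions rather than the harness's actual code: `is_resolved` and the shape of `test_results` (a mapping from test ID to a passed flag, collected after the agent's patch is applied) are hypothetical.

```python
from typing import Dict, Iterable


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """True when every listed test passed after applying the agent's patch.

    `test_results` maps a test ID to whether it passed. A test that is
    missing from the results is treated as a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(test_id, False) for test_id in required)
```

Under this sketch a patch that fixes the target tests but breaks a previously passing one is not counted as resolved, which is the point of tracking pass_to_pass separately.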
SWEfficiency
Code performance optimization benchmark. Given a repository and a workload, the goal is to produce optimizations that achieve a target speedup.
| Property | Value |
|---|---|
| Dataset ID | swefficiency |
| Source | HuggingFace |
| Evaluation | Speedup measurement |
| Requirements | Agent with code editing capabilities |
Usage
# Assumption: same flags as the SWE-bench runs above, with the dataset ID swapped
ipw run --agent openhands --model gpt-4o --dataset swefficiency --max-queries 10
How It Works
- Each record describes a repository, a performance problem, a workload to benchmark, and an expected speedup target.
- The agent analyzes the codebase, identifies bottlenecks, and implements optimizations.
- The agent submits the optimization as a unified diff patch.
- Success is measured by whether the patched code reaches the target speedup while the existing tests keep passing.
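The success criterion above can be sketched as a timing comparison. This is a minimal illustration, not the benchmark's actual measurement code: `time_workload`, `speedup`, and `meets_target` are hypothetical helpers, and treating the target as a strict lower bound (rather than allowing some tolerance) is an assumption.

```python
import time
from typing import Callable


def time_workload(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """Speedup multiplier: how many times faster the optimized run is."""
    if baseline_seconds <= 0 or optimized_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / optimized_seconds


def meets_target(
    baseline_seconds: float,
    optimized_seconds: float,
    expected_speedup: float,
    tests_pass: bool,
) -> bool:
    """True when tests still pass and the measured speedup reaches the target.

    Assumption: the target is a lower bound with no tolerance band.
    """
    if not tests_pass:
        return False
    return speedup(baseline_seconds, optimized_seconds) >= expected_speedup
```

For example, a workload that drops from 10 seconds to 4 seconds is a 2.5x speedup, so `meets_target(10.0, 4.0, 2.5, tests_pass=True)` holds, while the same timings with failing tests do not count.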
Prompt Format
You are a software performance engineer. Your task is to optimize the code
in the repository to improve performance.
Repository: django/django
## Problem Statement
[Description of the performance issue]
## Workload Description
[How to benchmark the change]
## Expected Speedup
The optimization should achieve approximately 2.5x speedup.
## Instructions
1. Analyze the codebase to identify performance bottlenecks
2. Implement optimizations that improve performance
3. Ensure all existing tests still pass
4. Generate a git patch with your changes
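A prompt in this format could be rendered from a record's fields roughly as follows. This is a sketch under assumptions: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, the blank lines between sections are a guess at the original layout, and the argument names mirror the metadata fields rather than any documented function signature.

```python
# Template text follows the prompt shown above; blank lines are assumed.
PROMPT_TEMPLATE = """\
You are a software performance engineer. Your task is to optimize the code
in the repository to improve performance.

Repository: {repo}

## Problem Statement
{problem_statement}

## Workload Description
{workload}

## Expected Speedup
The optimization should achieve approximately {expected_speedup}x speedup.

## Instructions
1. Analyze the codebase to identify performance bottlenecks
2. Implement optimizations that improve performance
3. Ensure all existing tests still pass
4. Generate a git patch with your changes
"""


def build_prompt(
    repo: str,
    problem_statement: str,
    workload: str,
    expected_speedup: float,
) -> str:
    """Fill the template with one record's fields."""
    return PROMPT_TEMPLATE.format(
        repo=repo,
        problem_statement=problem_statement,
        workload=workload,
        expected_speedup=expected_speedup,
    )
```

Keeping the template as a single constant makes it easy to compare against the documented prompt line by line.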
Record Metadata
| Field | Description |
|---|---|
| instance_id | Unique identifier |
| repo | Repository name |
| workload | Workload description for benchmarking |
| expected_speedup | Target speedup multiplier |
| base_commit | Starting commit hash |
| test_patch | Test validation patch |
Choosing Between Coding Datasets
| Need | Dataset | Why |
|---|---|---|
| Bug fixing | swebench (verified_mini) | Real bugs, test-validated, fast subset |
| Full evaluation | swebench (verified) | Comprehensive 500-task benchmark |
| Performance optimization | swefficiency | Tests optimization skill, not just correctness |