Datasets¶
IPW includes 15 benchmark datasets spanning single-turn knowledge tests, multi-turn agentic tasks, and coding challenges.
Available Datasets¶
| ID | Type | Size | Evaluation | Description |
|---|---|---|---|---|
| `ipw` | Single-turn | 1,000 | LLM judge | Mixed knowledge workload (bundled, no download) |
| `mmlu-pro` | Single-turn | ~12,000 | MCQ exact match | Professional-level multiple-choice |
| `supergpqa` | Single-turn | varies | MCQ exact match | Graduate-level academic questions |
| `gpqa` | Single-turn | varies | MCQ exact match | Graduate-level science questions |
| `math500` | Single-turn | 500 | Exact match | Mathematical problem solving |
| `natural-reasoning` | Single-turn | varies | LLM judge | Natural language reasoning |
| `wildchat` | Single-turn | varies | LLM judge | Real user conversations |
| `gaia` | Agentic | varies | LLM judge | Multi-step factual questions with file attachments |
| `simpleqa` | Agentic | varies | LLM judge | Short-form factual QA |
| `frames` | Agentic | varies | LLM judge | Multi-hop reasoning across Wikipedia articles |
| `hle` | Agentic | varies | LLM judge | Expert-level academic questions (Humanity's Last Exam) |
| `terminalbench` | Agentic | varies | Task-specific | Terminal/CLI task completion |
| `terminalbench-native` | Agentic | varies | Task-specific | TerminalBench with per-task Docker lifecycle |
| `swebench` | Coding | 50/500 | Test execution | Real GitHub issues requiring patches |
| `swefficiency` | Coding | varies | Speedup measurement | Code performance optimization |
Single-Turn Datasets¶
IPW Mixed 1K (ipw) -- The default dataset. 1,000 curated prompts across diverse topics, bundled with the package (no download required). Used when --dataset is not specified.
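Because `ipw` is the default, a profiling run needs no `--dataset` flag at all. A minimal sketch, assuming a local Ollama endpoint (the client, model, and URL below are illustrative; substitute your own setup):

```shell
# Profile against the bundled IPW Mixed 1K dataset -- no --dataset flag,
# and no download step, since the prompts ship with the package.
ipw profile --client ollama --model llama3.2:1b \
    --client-base-url http://localhost:11434
```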
MMLU-Pro (mmlu-pro) -- Enhanced MMLU with harder professional-level multiple-choice questions across 14+ academic subjects. Scored by letter-answer extraction (no LLM judge needed).
SuperGPQA (supergpqa) -- Graduate-level academic questions testing deep subject knowledge. MCQ format with exact-match scoring.
GPQA (gpqa) -- Graduate-level science questions (physics, chemistry, biology) curated by domain experts.
MATH-500 (math500) -- 500 math problems covering algebra, geometry, number theory, and calculus. Scored by exact match after normalization.
Natural Reasoning (natural-reasoning) -- Tests logical deduction, common sense, and causal reasoning with open-ended responses.
WildChat (wildchat) -- Real user conversations covering a wide range of topics and interaction styles.
Agentic Datasets¶
GAIA (gaia) -- General AI assistant benchmark with real-world questions spanning three difficulty levels. Many questions include file attachments (PDFs, images, spreadsheets) that agents must read. Files are cached at ~/.cache/gaia_benchmark/.
SimpleQA (simpleqa) -- Short factual questions testing parametric knowledge (dates, names, numbers). Can also be used single-turn with ipw profile.
FRAMES (frames) -- Multi-hop factual retrieval requiring synthesis across 2-15 Wikipedia articles. Particularly suited for agents with web search or retrieval tools.
HLE (hle) -- Humanity's Last Exam from CAIS. Expert-level questions where state-of-the-art models typically score below 10%. By default loads text-only samples.
TerminalBench (terminalbench) -- Terminal/CLI tasks paired with the Terminus agent. Requires Docker for container management.
TerminalBench Native (terminalbench-native) -- Decoupled TerminalBench integration with per-task Docker lifecycle. Works with any agent (not just Terminus). Supports concurrent execution with --concurrency.
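A sketch of a concurrent run, combining the `--concurrency` flag mentioned above with the `react` agent shown under Usage (agent and model choices here are illustrative, not required):

```shell
# Hypothetical invocation: any agent works with terminalbench-native.
# Each task gets its own Docker container; four tasks run at once.
ipw run --agent react --model gpt-4o \
    --dataset terminalbench-native --concurrency 4
```

Docker must be running, since each task provisions and tears down its own container.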
Coding Datasets¶
SWE-bench (swebench) -- Real GitHub issues from popular Python repositories. The agent must produce a patch that fixes the issue and passes the test suite. Two variants: verified (500 tasks) and verified_mini (50 tasks, default).
SWEfficiency (swefficiency) -- Code performance optimization benchmark. Given a repository and workload, the agent must implement optimizations achieving a target speedup while keeping tests passing.
Evaluation Methods¶
| Method | Datasets | Requires API Key |
|---|---|---|
| LLM judge | ipw, natural-reasoning, wildchat, gaia, simpleqa, frames, hle | Yes (gpt-5-nano default) |
| MCQ exact match | mmlu-pro, supergpqa, gpqa | No |
| Exact match | math500 | No |
| Task-specific | terminalbench, terminalbench-native, swebench, swefficiency | No |
LLM judge scoring requires IPW_EVAL_API_KEY or OPENAI_API_KEY in your environment.
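A minimal setup sketch for an LLM-judged run (the key value is a placeholder; either environment variable works, and the client/model flags are illustrative):

```shell
# Provide a judge API key, then run any LLM-judged dataset.
export IPW_EVAL_API_KEY="sk-..."   # or: export OPENAI_API_KEY="sk-..."
ipw profile --dataset wildchat --client ollama --model llama3.2:1b \
    --client-base-url http://localhost:11434
```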
Usage¶
```shell
# Single-turn profiling
ipw profile --dataset mmlu-pro --client ollama --model llama3.2:1b \
    --client-base-url http://localhost:11434

# Agentic benchmarking
ipw run --agent react --model gpt-4o --dataset gaia --max-queries 10

# Limit sample count
ipw profile --dataset supergpqa --max-queries 50 ...

# List available datasets
ipw list datasets
```