Datasets

IPW includes 15 benchmark datasets spanning single-turn knowledge tests, multi-turn agentic tasks, and coding challenges.

Available Datasets

| ID | Type | Size | Evaluation | Description |
|----|------|------|------------|-------------|
| ipw | Single-turn | 1,000 | LLM judge | Mixed knowledge workload (bundled, no download) |
| mmlu-pro | Single-turn | ~12,000 | MCQ exact match | Professional-level multiple-choice |
| supergpqa | Single-turn | varies | MCQ exact match | Graduate-level academic questions |
| gpqa | Single-turn | varies | MCQ exact match | Graduate-level science questions |
| math500 | Single-turn | 500 | Exact match | Mathematical problem solving |
| natural-reasoning | Single-turn | varies | LLM judge | Natural language reasoning |
| wildchat | Single-turn | varies | LLM judge | Real user conversations |
| gaia | Agentic | varies | LLM judge | Multi-step factual questions with file attachments |
| simpleqa | Agentic | varies | LLM judge | Short-form factual QA |
| frames | Agentic | varies | LLM judge | Multi-hop reasoning across Wikipedia articles |
| hle | Agentic | varies | LLM judge | Expert-level academic questions (Humanity's Last Exam) |
| terminalbench | Agentic | varies | Task-specific | Terminal/CLI task completion |
| terminalbench-native | Agentic | varies | Task-specific | TerminalBench with per-task Docker lifecycle |
| swebench | Coding | 50/500 | Test execution | Real GitHub issues requiring patches |
| swefficiency | Coding | varies | Speedup measurement | Code performance optimization |

Single-Turn Datasets

IPW Mixed 1K (ipw) -- The default dataset. 1,000 curated prompts across diverse topics, bundled with the package (no download required). Used when --dataset is not specified.

MMLU-Pro (mmlu-pro) -- Enhanced MMLU with harder professional-level multiple-choice questions across 14 academic disciplines. Scored by letter-answer extraction (no LLM judge needed).
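Letter-answer extraction can be sketched as follows. This is an illustrative example only, not IPW's actual scorer; the A-J option range and the parenthesized-letter convention are assumptions about the response format.

```shell
# Hypothetical MCQ scoring sketch: take the last parenthesized option
# letter in the model's response and compare it to the gold answer.
response="Weighing the options, the correct choice is (C)."
gold="C"
pred=$(printf '%s' "$response" | grep -oE '\([A-J]\)' | tail -n 1 | tr -d '()')
[ "$pred" = "$gold" ] && score=1 || score=0
echo "score=$score"
```

Taking the last match rather than the first favors the model's final answer when it discusses several options before committing.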

SuperGPQA (supergpqa) -- Graduate-level academic questions testing deep subject knowledge. MCQ format with exact-match scoring.

GPQA (gpqa) -- Graduate-level science questions (physics, chemistry, biology) curated by domain experts.

MATH-500 (math500) -- 500 math problems covering algebra, geometry, number theory, and calculus. Scored by exact match after normalization.

Natural Reasoning (natural-reasoning) -- Tests logical deduction, common sense, and causal reasoning with open-ended responses.

WildChat (wildchat) -- Real user conversations covering a wide range of topics and interaction styles.

Agentic Datasets

GAIA (gaia) -- General AI assistant benchmark with real-world questions spanning three difficulty levels. Many questions include file attachments (PDFs, images, spreadsheets) that agents must read. Files are cached at ~/.cache/gaia_benchmark/.

SimpleQA (simpleqa) -- Short factual questions testing parametric knowledge (dates, names, numbers). Although listed as agentic, it can also be run single-turn via ipw profile.

FRAMES (frames) -- Multi-hop factual retrieval requiring synthesis across 2-15 Wikipedia articles. Particularly suited for agents with web search or retrieval tools.

HLE (hle) -- Humanity's Last Exam from CAIS. Expert-level questions where state-of-the-art models typically score below 10%. By default loads text-only samples.

TerminalBench (terminalbench) -- Terminal/CLI tasks paired with the Terminus agent. Requires Docker for container management.

TerminalBench Native (terminalbench-native) -- Decoupled TerminalBench integration with per-task Docker lifecycle. Works with any agent (not just Terminus). Supports concurrent execution with --concurrency.

Coding Datasets

SWE-bench (swebench) -- Real GitHub issues from popular Python repositories. The agent must produce a patch that fixes the issue and passes the test suite. Two variants: verified (500 tasks) and verified_mini (50 tasks, default).

SWEfficiency (swefficiency) -- Code performance optimization benchmark. Given a repository and workload, the agent must implement optimizations achieving a target speedup while keeping tests passing.

Evaluation Methods

| Method | Datasets | Requires API Key |
|--------|----------|------------------|
| LLM judge | ipw, natural-reasoning, wildchat, gaia, simpleqa, frames, hle | Yes (gpt-5-nano default) |
| MCQ exact match | mmlu-pro, supergpqa, gpqa | No |
| Exact match | math500 | No |
| Task-specific | terminalbench, terminalbench-native, swebench, swefficiency | No |

LLM judge scoring requires IPW_EVAL_API_KEY or OPENAI_API_KEY in your environment.
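A minimal setup sketch, using a placeholder key value:

```shell
# Judge-scored datasets read the key from either variable;
# "sk-your-key" below is a placeholder, not a real credential.
export IPW_EVAL_API_KEY="sk-your-key"
# or, equivalently:
# export OPENAI_API_KEY="sk-your-key"
```

Set the variable in the same shell session (or your shell profile) before invoking ipw.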

Usage

# Single-turn profiling
ipw profile --dataset mmlu-pro --client ollama --model llama3.2:1b \
  --client-base-url http://localhost:11434

# Agentic benchmarking
ipw run --agent react --model gpt-4o --dataset gaia --max-queries 10

# Limit sample count
ipw profile --dataset supergpqa --max-queries 50 ...

# List available datasets
ipw list datasets
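For terminalbench-native, tasks can run in parallel. A sketch reusing flags shown above; the agent and model choices here are illustrative, not defaults:

# Concurrent agentic benchmarking (terminalbench-native)
ipw run --agent react --model gpt-4o --dataset terminalbench-native --concurrency 4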