Single-Turn Datasets¶
Single-turn datasets provide one prompt per query. The model generates a single response, which is scored either by an LLM judge or by exact match. These datasets are used with the `ipw profile` command.
IPW Mixed 1K¶
The default built-in dataset. A curated mix of 1,000 prompts across diverse topics, bundled with the package (no download required).
| Property | Value |
|---|---|
| Dataset ID | ipw |
| Source | Bundled in `ipw/datasets/ipw/data/` |
| Size | 1,000 |
| Evaluation | LLM judge |
| Download | Not required |
```bash
ipw profile --dataset ipw --client ollama --model llama3.2:1b \
    --client-base-url http://localhost:11434
```
This is the default dataset when `--dataset` is not specified.
MMLU-Pro¶
Professional-level multiple-choice questions spanning academic subjects. An enhanced version of MMLU with harder questions and more answer choices.
| Property | Value |
|---|---|
| Dataset ID | mmlu-pro |
| Source | TIGER-Lab/MMLU-Pro on HuggingFace |
| Size | ~12,000 |
| Evaluation | MCQ exact match |
| Split | test (default) |
```bash
ipw profile --dataset mmlu-pro --client ollama --model llama3.2:1b \
    --client-base-url http://localhost:11434 --max-queries 100
```
Prompt format: Questions are presented with lettered options (A, B, C, ...) and the model is asked to respond with the correct letter.
Scoring: The evaluation handler (`ipw/evaluation/mmlu_pro.py`) extracts the letter answer from the model's response and compares it to the reference answer. No LLM judge is needed.
Subjects: biology, business, chemistry, computer science, economics, engineering, health, history, law, math, philosophy, physics, psychology, and more.
SuperGPQA¶
Graduate-level academic questions designed to test deep subject knowledge.
| Property | Value |
|---|---|
| Dataset ID | supergpqa |
| Source | HuggingFace |
| Evaluation | MCQ exact match |
| Split | test (default) |
```bash
ipw profile --dataset supergpqa --client vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --client-base-url http://localhost:8000
```
GPQA¶
Graduate-level science questions (physics, chemistry, biology) curated by domain experts.
| Property | Value |
|---|---|
| Dataset ID | gpqa |
| Source | HuggingFace |
| Evaluation | MCQ exact match |
MATH-500¶
500 mathematical problems covering algebra, geometry, number theory, and calculus.
| Property | Value |
|---|---|
| Dataset ID | math500 |
| Source | HuggingFace |
| Size | 500 |
| Evaluation | Exact match |
Natural Reasoning¶
Natural language reasoning problems that test logical deduction, common sense, and causal reasoning.
| Property | Value |
|---|---|
| Dataset ID | natural-reasoning |
| Source | HuggingFace |
| Evaluation | LLM judge |
WildChat¶
Real user conversations from the WildChat dataset, covering a wide range of topics and interaction styles.
| Property | Value |
|---|---|
| Dataset ID | wildchat |
| Source | HuggingFace |
| Evaluation | LLM judge |
Evaluation Methods¶
LLM Judge¶
For open-ended datasets (IPW, WildChat, Natural Reasoning), an LLM judge compares the model's response to the reference answer:
- Default judge: `gpt-5-nano-2025-08-07` via the OpenAI API
- Requires `IPW_EVAL_API_KEY` or `OPENAI_API_KEY` in your environment
- The judge produces a binary correct/incorrect determination with metadata
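The judge flow can be illustrated with a minimal sketch. The prompt wording, verdict format, and helper names below are assumptions, not the actual judge prompt shipped with ipw:

```python
def build_judge_prompt(question: str, reference: str, response: str) -> str:
    """Assemble a grading prompt for the judge model (illustrative wording)."""
    return (
        "You are grading a model's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {response}\n"
        "Reply with exactly CORRECT or INCORRECT, then a brief reason."
    )


def parse_verdict(judge_output: str) -> dict:
    """Reduce the judge's reply to a binary verdict plus metadata."""
    text = judge_output.strip()
    return {
        "correct": text.upper().startswith("CORRECT"),
        "raw_verdict": text,
    }
```

Keeping the raw judge output alongside the binary verdict makes it possible to audit borderline gradings later.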
MCQ Exact Match¶
For multiple-choice datasets (MMLU-Pro, SuperGPQA, GPQA), the evaluation handler extracts the selected letter from the response and compares it to the correct answer. No API key is needed for scoring.
Exact Match¶
For datasets with precise answers (MATH-500), the response is compared directly to the reference answer after normalization.
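Normalization might look like the following sketch; the exact rules applied by ipw's handler are assumptions here:

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string before comparison (illustrative rules)."""
    s = answer.strip()
    # Strip common math wrappers such as $...$ or \boxed{...}.
    if s.startswith("$") and s.endswith("$") and len(s) > 1:
        s = s[1:-1].strip()
    if s.startswith(r"\boxed{") and s.endswith("}"):
        s = s[len(r"\boxed{"):-1].strip()
    # Collapse internal whitespace and lowercase for comparison.
    return " ".join(s.lower().split())


def exact_match(response: str, reference: str) -> bool:
    """Compare a response to the reference after normalizing both sides."""
    return normalize(response) == normalize(reference)
```

Normalizing both sides means a model answering `\boxed{42}` still matches a reference stored as plain `42`.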
API Key Requirements¶
Datasets that use LLM judge scoring (IPW Mixed 1K, WildChat, Natural Reasoning) require an evaluation API key, set via `IPW_EVAL_API_KEY` or `OPENAI_API_KEY`.
MCQ datasets do not require an API key for scoring but still need one to run inference (unless using a local model).