Agentic Datasets

Agentic datasets require multi-turn reasoning, tool use, or retrieval. They are designed for use with ipw run and an agent harness.

GAIA

The GAIA benchmark tests general AI assistant capabilities with real-world questions that often require file reading, web search, or multi-step reasoning.

Property Value
Dataset ID gaia
Source gaia-benchmark/GAIA on HuggingFace
Split validation (default), subset 2023_all
Evaluation LLM judge (gaia handler)
Requirements IPW_EVAL_API_KEY or OPENAI_API_KEY

Features

  • Questions span three difficulty levels (Level 1, 2, 3)
  • Many questions include attached files (PDFs, images, spreadsheets) that the agent must read
  • File artifacts are downloaded and cached at ~/.cache/gaia_benchmark/
  • Prompts include file paths so agents with file-reading tools can access them

Usage

ipw run --agent react --model gpt-4o --dataset gaia --max-queries 10

Prompt Format

Questions are formatted with file context when applicable:

Please answer the question below. You should:
- Return only your answer...

The following file is referenced in the question below...
File name: document.pdf
File path: /home/user/.cache/gaia_benchmark/GAIA/2023/validation/document.pdf
Use the file reading tools to access this file.

Here is the question:
What is the total revenue mentioned in the attached report?
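The prompt layout above can be sketched as a small formatting helper. This is an illustrative reconstruction, not the ipw API: the function name and signature are assumptions, and the file block is only included when a file is attached.

```python
def format_gaia_prompt(question, file_name=None, file_path=None):
    """Assemble a GAIA-style prompt, adding file context when present."""
    parts = [
        "Please answer the question below. You should:",
        "- Return only your answer...",
    ]
    if file_name and file_path:
        parts += [
            "",
            "The following file is referenced in the question below...",
            f"File name: {file_name}",
            f"File path: {file_path}",
            "Use the file reading tools to access this file.",
        ]
    parts += ["", "Here is the question:", question]
    return "\n".join(parts)
```

Questions without an attached file simply omit the file-context block, matching the "when applicable" behavior described above.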

Metadata

Each record's dataset_metadata includes:

  • task_id -- unique GAIA task identifier
  • level -- difficulty level (1, 2, or 3)
  • file_name -- name of the attached file (if any)
  • file_path -- local path to the downloaded file
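These fields make it easy to slice a run by difficulty or by file attachment. A minimal sketch, assuming records are plain dicts carrying a `dataset_metadata` key with the fields listed above (the record shape is an assumption for illustration):

```python
# Hypothetical records shaped like the dataset_metadata fields above.
records = [
    {"question": "Q1",
     "dataset_metadata": {"task_id": "t1", "level": 1,
                          "file_name": None, "file_path": None}},
    {"question": "Q2",
     "dataset_metadata": {"task_id": "t2", "level": 3,
                          "file_name": "report.pdf",
                          "file_path": "/tmp/report.pdf"}},
]

# Select Level 1 tasks, and tasks that come with an attached file.
level_one = [r for r in records if r["dataset_metadata"]["level"] == 1]
with_files = [r for r in records if r["dataset_metadata"]["file_name"]]
```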

SimpleQA

Short-form factual QA from OpenAI's SimpleQA benchmark, testing a model's parametric knowledge (facts recalled without retrieval).

Property Value
Dataset ID simpleqa
Source basicv8vc/SimpleQA on HuggingFace
Split test (default)
Evaluation LLM judge (simpleqa handler)
Requirements IPW_EVAL_API_KEY or OPENAI_API_KEY

Features

  • Short factual questions with concise reference answers
  • Tests whether models can recall specific facts (dates, names, numbers)
  • Can be used for both single-turn and agentic profiling

Usage

# Single-turn
ipw profile --dataset simpleqa --client ollama --model llama3.2:1b \
  --client-base-url http://localhost:11434

# Agentic (with retrieval tools)
ipw run --agent react --model gpt-4o --dataset simpleqa --max-queries 50

Prompt Format

Please answer the question below with a short, factual response.
- Return only your answer, which should be a word, phrase, name, number, or date.
- If the answer is a number, return only the number without any units unless specified.

Question: Who was the first person to walk on the Moon?

FRAMES

Multi-hop factual retrieval from Google's FRAMES benchmark. Questions require synthesizing information across 2-15 Wikipedia articles.

Property Value
Dataset ID frames
Source google/frames-benchmark on HuggingFace
Split test (default)
Evaluation LLM judge (frames handler)
Requirements IPW_EVAL_API_KEY or OPENAI_API_KEY

Features

  • Questions require multi-hop reasoning across multiple Wikipedia sources
  • Some records include relevant Wikipedia context in the prompt
  • Tests retrieval-augmented generation (RAG) capabilities
  • Particularly suitable for agents with web search or retrieval tools

Usage

ipw run --agent react --model gpt-4o --dataset frames --max-queries 20

Prompt Format

Please answer the question below. You should:
- Return only your answer...
- This question may require multi-hop reasoning across multiple Wikipedia articles.

[Optional Wikipedia context]

Here is the question:
Which country has a higher population: the birthplace of Einstein or the birthplace of Newton?

HLE (Humanity's Last Exam)

Expert-level academic questions from the Center for AI Safety (CAIS), designed to be the hardest publicly available benchmark.

Property Value
Dataset ID hle
Source cais/hle on HuggingFace
Split test (default)
Evaluation LLM judge (hle handler)
Requirements IPW_EVAL_API_KEY or OPENAI_API_KEY

Features

  • Expert-level questions across many academic disciplines
  • By default, only text-only samples are loaded (set text_only=False for multimodal items)
  • State-of-the-art models typically score below 10% on this benchmark
  • Useful for measuring the ceiling of model capabilities

Usage

ipw run --agent openhands --model gpt-4o --dataset hle --max-queries 10

Configuration

The HLEDataset constructor accepts:

  • split -- HuggingFace split (default: test)
  • max_samples -- limit number of samples
  • text_only -- if True (default), exclude samples with images

TerminalBench

Terminal/CLI task completion benchmark from the terminal-bench project.

Property Value
Dataset ID terminalbench
Source terminal-bench/terminal-bench on HuggingFace
Split test (default)
Evaluation Terminal output check (terminalbench handler)
Requirements terminal-bench package (ipw[terminus])

Features

  • Tasks that must be solved by executing terminal commands
  • Designed to pair with the Terminus agent and Docker containers
  • Evaluation checks the terminal state after task execution

Usage

ipw run --agent terminus --model gpt-4o --dataset terminalbench --max-queries 10

This requires Docker to be available for the Terminus agent's container management.
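A quick preflight check can confirm Docker is usable before launching a run. This is an illustrative helper (the function name is ours, not part of ipw) that verifies both that the docker CLI is on PATH and that the daemon responds:

```python
import shutil
import subprocess

def docker_available(timeout=10):
    """Return True if the docker CLI exists and the daemon responds."""
    if shutil.which("docker") is None:
        return False
    try:
        subprocess.run(["docker", "info"], capture_output=True,
                       check=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
```

`docker info` fails when the daemon is not running even if the CLI is installed, so this catches both missing-binary and stopped-daemon cases.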