
Agentic Profiling

The ipw run command profiles multi-turn agent workloads. Unlike single-turn profiling (ipw profile), agentic profiling tracks the full lifecycle of an agent solving a task: multiple LLM calls, tool invocations, and reasoning steps.

Command Reference

ipw run --agent <agent> --model <model> --dataset <dataset> [options]

Required Options

| Option | Description |
| --- | --- |
| `--agent` | Agent harness ID (`react`, `openhands`, `terminus`) |
| `--model` | Model name for the agent's LLM backbone |
| `--dataset` | Dataset ID for the workload |

Additional Options

| Option | Default | Description |
| --- | --- | --- |
| `--max-queries` | all | Limit number of tasks to run |
| `--output-dir` | `./runs/` | Directory for results |
| `--max-turns` | 20 | Maximum agent turns per task |
| `--eval-client` | `openai` | Client for evaluation judging |
| `--eval-model` | `gpt-5-nano-2025-08-07` | Model for evaluation |

Lifecycle

An agentic profiling run follows this sequence:

  1. Start energy monitor -- Launch the Rust gRPC telemetry service.
  2. Initialize agent -- Create the agent instance with the specified model and any MCP tools. Attach an EventRecorder for per-action telemetry.
  3. Iterate dataset -- For each dataset record:
    • Create a QueryTrace to track per-turn telemetry
    • Call agent.run(input) which internally:
      • Makes LLM inference calls (recorded as lm_inference_start/end events)
      • Calls tools (recorded as tool_call_start/end events)
      • Accumulates token counts, latencies, and tool metadata per turn
    • Capture the telemetry window for energy attribution
    • Build a TurnTrace for each agent turn with tokens, tools, latency, energy, and cost
    • Save the QueryTrace as a JSONL line
  4. Export results -- Write traces as JSONL and HuggingFace Arrow datasets.
  5. Run analysis -- Evaluate responses with the LLM judge and compute accuracy/efficiency metrics.
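The per-task portion of this lifecycle (steps 3–4) can be sketched in Python. Everything here is illustrative, not the actual IPW implementation: the function names (`run_workload`, `agent_run`, `read_energy_j`) are hypothetical, and the output-token-share energy attribution is one simple assumption about how a telemetry window might be split across turns.

```python
import json
import time

def run_workload(agent_run, dataset, read_energy_j, out_path):
    """Hypothetical sketch of the profiling loop.

    agent_run(input) -> list of per-turn dicts with an "output_tokens" key;
    read_energy_j() -> cumulative energy counter in joules.
    """
    with open(out_path, "w") as out:
        for record in dataset:
            e0, t0 = read_energy_j(), time.monotonic()
            turns = agent_run(record["input"])        # multi-turn agent loop
            e1, t1 = read_energy_j(), time.monotonic()
            total_out = sum(t["output_tokens"] for t in turns) or 1
            for i, turn in enumerate(turns):
                turn["turn_index"] = i
                # naive attribution: split task energy by output-token share
                turn["gpu_energy_joules"] = (e1 - e0) * turn["output_tokens"] / total_out
            trace = {"query_id": record["id"], "turns": turns,
                     "wall_clock_s": t1 - t0}
            out.write(json.dumps(trace) + "\n")       # one QueryTrace per line
```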

Agent Setup

ReAct Agent

The ReAct agent uses the Agno framework for tool-augmented reasoning:

uv pip install -e 'intelligence-per-watt[react]'

ipw run \
  --agent react \
  --model gpt-4o \
  --dataset gaia \
  --max-queries 10

The ReAct agent wraps tools with telemetry instrumentation. Each tool call is recorded as a tool_call_start/tool_call_end event pair, enabling per-tool energy attribution.
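The wrapping described above can be sketched as a decorator that brackets each tool call with a start/end event pair. The `instrument_tool` name and the tuple-based event format are assumptions for illustration, not the IPW API.

```python
import functools
import time

def instrument_tool(tool_fn, recorder, tool_name):
    """Hypothetical sketch of per-tool telemetry wrapping."""
    @functools.wraps(tool_fn)
    def wrapped(*args, **kwargs):
        recorder.append(("tool_call_start", tool_name, time.monotonic()))
        try:
            return tool_fn(*args, **kwargs)
        finally:
            # the end event is emitted even if the tool raises,
            # so the start/end pair always brackets the call
            recorder.append(("tool_call_end", tool_name, time.monotonic()))
    return wrapped
```

Pairing the timestamps of each start/end event against the energy monitor's readings is what makes per-tool energy attribution possible.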

OpenHands Agent

The OpenHands agent uses the OpenHands SDK for autonomous task execution:

uv pip install -e 'intelligence-per-watt[openhands]'

ipw run \
  --agent openhands \
  --model gpt-4o \
  --dataset swebench \
  --max-turns 30

OpenHands integrates via an instrumented callback system. The _instrumented_callback method emits telemetry events for every action and observation in the agent loop. If the agent exhausts its turn budget without calling FinishTool, IPW sends a synthesis nudge to extract a final answer.
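The two behaviors described above can be sketched as follows. The event dict shape, `make_instrumented_callback`, `extract_final_answer`, and the nudge wording are all illustrative assumptions, not the OpenHands SDK or IPW API.

```python
SYNTHESIS_NUDGE = "You have run out of turns. Reply with your final answer only."

def make_instrumented_callback(recorder):
    """Hypothetical sketch of an instrumented agent-loop callback."""
    def on_event(event):
        # one telemetry record per action/observation in the agent loop
        recorder.append({"kind": event["kind"], "name": event.get("name")})
    return on_event

def extract_final_answer(finished, send_message):
    """If the agent exhausted its turn budget without calling its finish
    tool, send one last prompt to extract a final answer (sketch)."""
    return None if finished else send_message(SYNTHESIS_NUDGE)
```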

Terminus Agent

The Terminus agent runs tasks inside Docker containers for terminal/CLI benchmarking:

uv pip install -e 'intelligence-per-watt[terminus]'

ipw run \
  --agent terminus \
  --model gpt-4o \
  --dataset terminalbench \
  --max-queries 10

The Terminus agent:

  1. Creates or reuses a Docker container
  2. Installs tmux inside the container
  3. Runs the agent's perform_task() in a tmux session
  4. Captures the terminal output as the response
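The four steps above correspond roughly to a docker/tmux command sequence like the one sketched below. The helper name and the container/session arguments are placeholders; the exact commands Terminus runs may differ.

```python
def terminus_commands(container, session, task_cmd):
    """Hypothetical docker/tmux command sequence for one Terminus task."""
    return [
        ["docker", "start", container],                       # reuse container
        ["docker", "exec", container,
         "apt-get", "install", "-y", "tmux"],                 # install tmux
        ["docker", "exec", container, "tmux", "new-session",
         "-d", "-s", session, task_cmd],                      # run task in tmux
        ["docker", "exec", container, "tmux", "capture-pane",
         "-p", "-t", session],                                # capture output
    ]
```

Each list could be passed to `subprocess.run`, with the stdout of the final `capture-pane` call serving as the agent's response.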

Tool Configuration

Agents can use MCP (Model Context Protocol) tools for accessing inference servers and retrieval systems. IPW ships with several built-in MCP servers:

Inference Server Tools

These connect to LLM inference servers for sub-queries:

  • openai_server -- OpenAI API
  • anthropic_server -- Anthropic API
  • gemini_server -- Google Gemini API
  • ollama_server -- Local Ollama
  • vllm_server -- Local vLLM
  • openrouter_server -- OpenRouter API

Retrieval Tools

These provide document retrieval capabilities:

  • bm25_server -- BM25 sparse retrieval
  • dense_server -- Dense vector retrieval
  • grep_server -- Grep-based text search
  • hybrid_server -- Hybrid BM25 + dense retrieval

All MCP tool servers are in ipw/agents/mcp/ and implement the BaseMCPServer interface.
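The interface might look roughly like the sketch below: advertise tools, then execute calls by name. The class and method names here are illustrative; the real BaseMCPServer in ipw/agents/mcp/ may define a different surface.

```python
from abc import ABC, abstractmethod

class BaseMCPServerSketch(ABC):
    """Illustrative shape of an MCP tool server (not the real interface)."""

    @abstractmethod
    def list_tools(self) -> list:
        """Advertise tool names and input schemas to the agent."""

    @abstractmethod
    def call_tool(self, name: str, arguments: dict) -> str:
        """Execute one tool call and return its text result."""

class GrepServerSketch(BaseMCPServerSketch):
    """Toy grep-style search over an in-memory corpus."""

    def __init__(self, docs):
        self.docs = docs  # mapping of doc_id -> text

    def list_tools(self):
        return [{"name": "grep", "input": {"pattern": "string"}}]

    def call_tool(self, name, arguments):
        pattern = arguments["pattern"]
        hits = [doc_id for doc_id, text in self.docs.items() if pattern in text]
        return "\n".join(hits)
```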

Per-Turn Traces

Each agent turn produces a TurnTrace (defined in ipw/execution/trace.py) containing:

| Field | Type | Description |
| --- | --- | --- |
| `turn_index` | `int` | Sequential turn number |
| `input_tokens` | `int` | Tokens consumed in this turn |
| `output_tokens` | `int` | Tokens generated in this turn |
| `tool_result_tokens` | `int` | Tokens from tool results |
| `tools_called` | `list[str]` | Names of tools invoked |
| `tool_latencies_s` | `dict[str, float]` | Per-tool wall-clock time |
| `wall_clock_s` | `float` | Total wall-clock time for the turn |
| `gpu_energy_joules` | `float` | GPU energy consumed |
| `cpu_energy_joules` | `float` | CPU energy consumed |
| `gpu_power_avg_watts` | `float` | Average GPU power |
| `cpu_power_avg_watts` | `float` | Average CPU power |
| `cost_usd` | `float` | API cost for this turn |
| `error` | `str` | Error message if the turn failed |

Turns are aggregated into a QueryTrace for the full task, with computed properties for total_input_tokens, total_output_tokens, total_tool_calls, total_gpu_energy_joules, and total_cost_usd.
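The turn-to-query aggregation can be sketched with dataclasses. This is a simplified stand-in for the types in ipw/execution/trace.py, keeping only a few of the fields from the table above.

```python
from dataclasses import dataclass, field

@dataclass
class TurnTrace:
    # simplified: field names mirror the table above
    turn_index: int
    input_tokens: int = 0
    output_tokens: int = 0
    tools_called: list = field(default_factory=list)
    gpu_energy_joules: float = 0.0
    cost_usd: float = 0.0

@dataclass
class QueryTrace:
    turns: list = field(default_factory=list)

    # computed properties sum over all turns of the task
    @property
    def total_input_tokens(self):
        return sum(t.input_tokens for t in self.turns)

    @property
    def total_tool_calls(self):
        return sum(len(t.tools_called) for t in self.turns)

    @property
    def total_gpu_energy_joules(self):
        return sum(t.gpu_energy_joules for t in self.turns)

    @property
    def total_cost_usd(self):
        return sum(t.cost_usd for t in self.turns)
```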

Output Structure

Agentic runs produce both JSONL traces and Arrow datasets:

runs/run_<agent>_<model>_<dataset>/
    traces.jsonl               # One QueryTrace per line
    data-*.arrow               # HuggingFace dataset format
    summary.json               # Run metadata
    analysis/
        accuracy.json          # Scoring results

See Export Formats for schema details.