Intelligence Per Watt

Benchmarking Intelligence Efficiency of LM Inference.



  • Profile


    Single-turn and agentic inference profiling with per-query telemetry across any OpenAI-compatible endpoint.

    Profiling guide

  • Measure


    Real-time energy, power, temperature, and memory telemetry via a Rust gRPC service sampling every 50 ms.

    Benchmarking overview

  • Analyze


    Intelligence Per Joule and Intelligence Per Watt metrics with accuracy scoring, regression analysis, and plots.

    Analysis guide

  • Extend


    Plug in custom inference clients, benchmark datasets, agent harnesses, and platform collectors.

    Extending IPW
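Per-query profiling boils down to timing each inference call and attaching the result to a telemetry record. A minimal sketch of that pattern, assuming a hypothetical `QueryRecord` schema and a client that is any callable (a real client would POST to an OpenAI-compatible `/v1/chat/completions` endpoint):

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class QueryRecord:
    """Per-query telemetry for one inference call (hypothetical schema)."""
    prompt: str
    response: str
    latency_s: float

def profile_query(client: Callable[[str], str], prompt: str) -> QueryRecord:
    """Time a single call to any OpenAI-compatible client function."""
    start = time.perf_counter()
    response = client(prompt)
    latency = time.perf_counter() - start
    return QueryRecord(prompt=prompt, response=response, latency_s=latency)

# Usage with a stand-in client; swap in a real HTTP client for live profiling.
record = profile_query(lambda p: p.upper(), "hello")
```

Keeping the client a plain callable is what makes the "Extend" point above cheap: custom inference clients only need to match this one-function interface.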


Key Metrics

  • Intelligence Per Joule (IPJ) = accuracy / average energy per query (joules)
  • Intelligence Per Watt (IPW) = accuracy / average power per query (watts)
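The two metrics above can be computed directly from an accuracy score and per-query energy/latency measurements. A sketch under the assumption that per-query power is each query's energy divided by its latency, averaged across queries (function names are illustrative):

```python
def intelligence_per_joule(accuracy: float, energies_j: list[float]) -> float:
    """IPJ = accuracy divided by mean energy per query, in joules."""
    return accuracy / (sum(energies_j) / len(energies_j))

def intelligence_per_watt(accuracy: float, energies_j: list[float],
                          latencies_s: list[float]) -> float:
    """IPW = accuracy divided by mean per-query power, in watts."""
    powers_w = [e / t for e, t in zip(energies_j, latencies_s)]
    return accuracy / (sum(powers_w) / len(powers_w))

# Example: 80% accuracy; two queries drawing 100 J over 2 s and 300 J over 3 s.
ipj = intelligence_per_joule(0.8, [100.0, 300.0])            # 0.8 / 200 J
ipw = intelligence_per_watt(0.8, [100.0, 300.0], [2.0, 3.0])  # 0.8 / 75 W
```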

What's Included

| Component | Options |
| --- | --- |
| Clients | Ollama, vLLM, OpenAI-compatible (OpenAI, OpenRouter, Gemini, local servers) |
| Agents | ReAct (Agno), OpenHands, Terminus |
| Datasets | MMLU-Pro, GPQA, SuperGPQA, MATH-500, GAIA, SimpleQA, FRAMES, HLE, TerminalBench, SWE-bench, SWEfficiency |
| Telemetry | NVIDIA (NVML), AMD (ROCm), Apple Silicon (powermetrics), Linux (RAPL) |
| Evaluation | LLM judge, MCQ exact match, task-specific scoring |
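Whatever the telemetry backend (NVML, ROCm, powermetrics, RAPL), a 50 ms power-sample stream becomes per-query energy by integrating power over time. A minimal sketch using the trapezoidal rule, with synthetic samples standing in for a real collector:

```python
def energy_from_samples(timestamps_s: list[float], power_w: list[float]) -> float:
    """Integrate sampled power over time (trapezoidal rule) to get joules."""
    return sum(
        0.5 * (power_w[i] + power_w[i + 1]) * (timestamps_s[i + 1] - timestamps_s[i])
        for i in range(len(timestamps_s) - 1)
    )

# One second of samples at a 50 ms period, constant 100 W draw -> ~100 J.
ts = [i * 0.05 for i in range(21)]
ps = [100.0] * 21
energy = energy_from_samples(ts, ps)
```

Trapezoidal integration is a common choice here because it tolerates the slightly uneven sample spacing real collectors produce.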

About

Built by Stanford Hazy Research and the Scaling Intelligence Lab.

Paper: arXiv:2511.07885

Acknowledgements

Stanford HAI, IBM Research, Ollama