Adding Inference Clients

Inference clients are adapters that connect IPW to different LLM inference servers. To add a new client, subclass InferenceClient and register it with ClientRegistry.
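
The decorator-based registry pattern used here is conventional. A minimal standalone sketch of how such a registry typically works (this is illustrative, not IPW's actual implementation; the `get` method and `DemoClient` are hypothetical):

```python
from typing import Callable, Dict


class ClientRegistry:
    """Toy decorator-based registry (illustrative, not IPW's real code)."""

    _clients: Dict[str, type] = {}

    @classmethod
    def register(cls, client_id: str) -> Callable[[type], type]:
        def decorator(client_cls: type) -> type:
            cls._clients[client_id] = client_cls  # map id -> class
            return client_cls  # return the class unchanged so it stays usable

        return decorator

    @classmethod
    def get(cls, client_id: str) -> type:
        return cls._clients[client_id]


@ClientRegistry.register("demo")
class DemoClient:
    pass
```

Because the decorator returns the class unchanged, registration is a side effect of importing the module, which is why Step 2 below matters: a client that is never imported is never registered.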

Step 1: Create the Client File

Create a new file in intelligence-per-watt/src/ipw/clients/:

# ipw/clients/my_service.py
from __future__ import annotations

from typing import Any, Sequence

from ..core.registry import ClientRegistry
from ..core.types import ChatUsage, Response
from .base import InferenceClient


@ClientRegistry.register("my-service")
class MyServiceClient(InferenceClient):
    """Client for MyService inference API."""

    client_id = "my-service"
    client_name = "MyService"

    def __init__(self, base_url: str, **config: Any) -> None:
        super().__init__(base_url, **config)
        # Initialize your client library
        # Use lazy imports for optional dependencies
        try:
            import my_service_sdk
        except ImportError as exc:
            raise ImportError(
                "my-service-sdk is required. Install with: pip install my-service-sdk"
            ) from exc
        self._client = my_service_sdk.Client(base_url=base_url)

    def stream_chat_completion(
        self, model: str, prompt: str, **params: Any
    ) -> Response:
        """Run a streamed chat completion."""
        import time

        request_start = time.time()
        first_token_time = None
        content_parts = []
        usage = None

        # Stream the response, noting when the first token arrives
        for chunk in self._client.stream(model=model, prompt=prompt, **params):
            if first_token_time is None:
                first_token_time = time.time()
            content_parts.append(chunk.text)
            if chunk.usage is not None:
                # Token counts typically arrive on the final chunk; capture
                # them here instead of reading `chunk` after the loop, which
                # fails if the stream is empty.
                usage = chunk.usage

        request_end = time.time()
        content = "".join(content_parts)

        # Build the response
        ttft_ms = (
            (first_token_time - request_start) * 1000
            if first_token_time is not None
            else 0.0
        )

        return Response(
            content=content,
            usage=ChatUsage(
                prompt_tokens=usage.input_tokens if usage else 0,
                completion_tokens=usage.output_tokens if usage else 0,
                total_tokens=usage.total_tokens if usage else 0,
            ),
            time_to_first_token_ms=ttft_ms,
            first_token_time=first_token_time,
            request_start_time=request_start,
            request_end_time=request_end,
        )

    def list_models(self) -> Sequence[str]:
        """Return available models."""
        return [m.id for m in self._client.list_models()]

    def health(self) -> bool:
        """Check if the service is reachable."""
        try:
            self._client.health()
            return True
        except Exception:
            return False

Step 2: Register the Import

Add your module to ipw/clients/__init__.py so it is imported (and therefore registered) when the clients package loads. If your client has optional dependencies, wrap the import so a missing SDK does not break the whole package:

# In ipw/clients/__init__.py
try:
    from . import my_service  # noqa: F401
except ImportError:
    pass

Step 3: Add Optional Dependency

If your client requires an external library, add it as an optional dependency in pyproject.toml:

[project.optional-dependencies]
my-service = ["my-service-sdk>=1.0"]

Step 4: Test

# Install your extra
uv pip install -e 'intelligence-per-watt[my-service]'

# Verify it appears in the registry
ipw list clients

# Run a profile
ipw profile --client my-service --model my-model \
  --client-base-url http://localhost:8080

Key Requirements

stream_chat_completion()

This is the core method. Requirements:

  • Must stream: Consume the streaming API to measure time-to-first-token accurately.
  • Must return Response: Include content, token usage, and timing information.
  • Must record timestamps: request_start_time and request_end_time are used to window telemetry readings.
  • Must report tokens: ChatUsage is needed for FLOPs estimation and cost calculation.
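
The timing fields must be mutually consistent, since telemetry windowing depends on them. A quick standalone sanity check of the invariants (the `check_timing` helper below is a sketch for illustration, not part of IPW):

```python
import time


def check_timing(request_start: float, first_token: float,
                 request_end: float, ttft_ms: float) -> None:
    """Assert the timing invariants a Response should satisfy."""
    # The first token must arrive within the request window
    assert request_start <= first_token <= request_end
    # TTFT must equal the gap between request start and first token
    assert abs(ttft_ms - (first_token - request_start) * 1000) < 1e-6


start = time.time()
first = start + 0.05   # simulated first-token arrival
end = start + 0.20     # simulated end of stream
check_timing(start, first, end, (first - start) * 1000)
```

If any of these invariants fail for your client, time-to-first-token and the telemetry window will be skewed even though the content and token counts look correct.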

list_models()

Return the list of model IDs available on the server. Used by ipw list clients for discovery.

health()

Return True if the server is reachable. Called before profiling starts to fail fast.

chat() (Optional)

The chat() method is used for evaluation (LLM judge). If your client can serve as an evaluation judge, implement it:

def chat(
    self,
    system_prompt: str,
    user_prompt: str,
    *,
    temperature: float | None = None,
    max_output_tokens: int | None = None,
) -> str:
    """Synchronous chat completion for evaluation."""
    response = self._client.chat(
        system=system_prompt,
        user=user_prompt,
        temperature=temperature,
        max_tokens=max_output_tokens,
    )
    return response.text

prepare() (Optional)

Called before the first query. Use it for model warmup:

def prepare(self, model: str) -> None:
    """Warm up the model."""
    self._client.load_model(model)

Existing Clients

Study these implementations for reference:

  • ipw/clients/ollama.py -- Ollama client with streaming
  • ipw/clients/vllm.py -- vLLM offline client
  • ipw/clients/openai.py -- OpenAI-compatible client (used for judge evaluation)