Adding Inference Clients¶
Inference clients are adapters that connect IPW to different LLM inference servers. To add a new client, subclass InferenceClient and register it with ClientRegistry.
Step 1: Create the Client File¶
Create a new file in intelligence-per-watt/src/ipw/clients/:
# ipw/clients/my_service.py
from __future__ import annotations

import time
from typing import Any, Sequence

from ..core.registry import ClientRegistry
from ..core.types import ChatUsage, Response
from .base import InferenceClient


@ClientRegistry.register("my-service")
class MyServiceClient(InferenceClient):
    """Client for MyService inference API."""

    client_id = "my-service"
    client_name = "MyService"

    def __init__(self, base_url: str, **config: Any) -> None:
        super().__init__(base_url, **config)
        # Lazy import keeps the SDK an optional dependency
        try:
            import my_service_sdk
        except ImportError:
            raise ImportError(
                "my-service-sdk is required. Install with: pip install my-service-sdk"
            )
        self._client = my_service_sdk.Client(base_url=base_url)

    def stream_chat_completion(
        self, model: str, prompt: str, **params: Any
    ) -> Response:
        """Run a streamed chat completion."""
        request_start = time.time()
        first_token_time = None
        content_parts = []
        last_chunk = None

        # Consume the stream, recording when the first token arrives
        for chunk in self._client.stream(model=model, prompt=prompt, **params):
            if first_token_time is None:
                first_token_time = time.time()
            content_parts.append(chunk.text)
            last_chunk = chunk

        request_end = time.time()
        content = "".join(content_parts)

        # Build the response; usage is reported on the final chunk
        ttft_ms = (
            (first_token_time - request_start) * 1000
            if first_token_time is not None
            else 0.0
        )
        return Response(
            content=content,
            usage=ChatUsage(
                prompt_tokens=last_chunk.usage.input_tokens,
                completion_tokens=last_chunk.usage.output_tokens,
                total_tokens=last_chunk.usage.total_tokens,
            ),
            time_to_first_token_ms=ttft_ms,
            first_token_time=first_token_time,
            request_start_time=request_start,
            request_end_time=request_end,
        )

    def list_models(self) -> Sequence[str]:
        """Return available models."""
        return [m.id for m in self._client.list_models()]

    def health(self) -> bool:
        """Check if the service is reachable."""
        try:
            self._client.health()
            return True
        except Exception:
            return False
Step 2: Register the Import¶
Add your module to the client initialization code so it gets imported when clients are registered. If your client has optional dependencies, wrap the import:
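A minimal sketch, assuming registration happens in ipw/clients/__init__.py (use whichever module actually imports the client implementations in your checkout):

# ipw/clients/__init__.py (assumed location)
from . import ollama, openai  # existing clients

try:
    from . import my_service  # registers MyServiceClient via its decorator
except ImportError:
    # my-service-sdk is not installed; skip registration
    pass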
Step 3: Add Optional Dependency¶
If your client requires an external library, add it as an optional dependency in pyproject.toml:
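A sketch of the pyproject.toml entry, assuming the project declares extras under [project.optional-dependencies]; the extra name must match what Step 4 installs, and the version pin is illustrative:

[project.optional-dependencies]
my-service = ["my-service-sdk>=1.0"]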
Step 4: Test¶
# Install your extra
uv pip install -e 'intelligence-per-watt[my-service]'
# Verify it appears in the registry
ipw list clients
# Run a profile
ipw profile --client my-service --model my-model \
--client-base-url http://localhost:8080
Key Requirements¶
stream_chat_completion()¶
This is the core method. Requirements:
- Must stream: Consume the streaming API to measure time-to-first-token accurately.
- Must return Response: Include content, token usage, and timing information.
- Must record timestamps: request_start_time and request_end_time are used to window telemetry readings.
- Must report tokens: ChatUsage is needed for FLOPs estimation and cost calculation.
list_models()¶
Return the list of model IDs available on the server. Used by ipw list clients for discovery.
health()¶
Return True if the server is reachable. Called before profiling starts to fail fast.
chat() (Optional)¶
The chat() method is used for evaluation (LLM judge). If your client can serve as an evaluation judge, implement it:
def chat(
    self,
    system_prompt: str,
    user_prompt: str,
    *,
    temperature: float | None = None,
    max_output_tokens: int | None = None,
) -> str:
    """Synchronous chat completion for evaluation."""
    response = self._client.chat(
        system=system_prompt,
        user=user_prompt,
        temperature=temperature,
        # Forward the token cap as well; the parameter name depends on your SDK
        max_tokens=max_output_tokens,
    )
    return response.text
prepare() (Optional)¶
Called before the first query. Use it for model warmup:
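A minimal sketch; the signature here is an assumption, so match the prepare() definition on the InferenceClient base class:

def prepare(self, model: str, **params: Any) -> None:
    """Warm up the model before profiling begins."""
    # Issue a short throwaway request so model load time does not
    # skew the first measured query.
    for _ in self._client.stream(model=model, prompt="warmup"):
        pass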
Existing Clients¶
Study these implementations for reference:
- ipw/clients/ollama.py -- Ollama client with streaming
- ipw/clients/vllm.py -- vLLM offline client
- ipw/clients/openai.py -- OpenAI-compatible client (used for judge evaluation)