
ot-bench

Measure what matters. Tokens, cost, accuracy.

Real LLM + MCP testing. Define tasks in YAML, get objective metrics: token counts, costs, accuracy scores, timing.

Usage

ot-bench [COMMAND] [OPTIONS]

Commands

run

Run benchmark tasks from a YAML file.

ot-bench run FILE [OPTIONS]

Options

Option            Description
--tui             Interactive TUI for selecting benchmark files
--csv             Export results to CSV in tmp/result-YYYYMMDD-HHMM.csv
--scenario NAME   Run only scenarios matching NAME
--task NAME       Run only tasks matching NAME
--tag TAG         Run only tasks with matching tag
--verbose         Show detailed per-call metrics
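
For example, the filter and export flags can be combined. The file path, scenario name, and tag below are illustrative rather than defaults:

# Run one scenario with detailed per-call metrics
ot-bench run demo/bench/features.yaml --scenario "Search Test" --verbose

# Run only tasks tagged "search" and export results to CSV
ot-bench run demo/bench/features.yaml --tag search --csv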

Task Types

Type              Description
type: direct      Direct MCP tool invocation (no LLM)
type: harness     LLM benchmark with MCP servers (default)
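
For illustration, a task can set its type explicitly. The sketch below reuses task fields from the Benchmark File Structure section further down; exactly how a direct task expresses its tool call is an assumption here, not documented behavior:

# Illustrative sketch only: the layout of a direct task is assumed
- name: "version:direct"
  type: direct          # call the MCP tool without an LLM
  server: onetool
  prompt: |
    package.version(registry="npm", packages={"express": "4.0.0"})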

Examples

# Run feature benchmarks
OT_CWD=demo ot-bench run demo/bench/features.yaml

# Run specific tool benchmark
OT_CWD=demo ot-bench run demo/bench/tool_brave_search.yaml

Benchmark File Structure

defaults:
  timeout: 60
  model: openai/gpt-5-mini

evaluators:
  accuracy:
    model: openai/gpt-5-mini
    prompt: |
      Evaluate this response.
      Response: {response}
      Return JSON: {"score": <0-100>, "reason": "<explanation>"}

scenarios:
  - name: Search Test
    tasks:
      - name: "search:base"
        server:              # No server = baseline
        evaluate: accuracy
        prompt: "Search for AI news"

      - name: "search:onetool"
        server: onetool
        evaluate: accuracy
        prompt: |
          __ot `brave.search(query="AI news")`

servers:
  onetool:
    type: stdio
    command: uv
    args: ["run", "ot-serve"]
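
Assuming the file above is saved as bench/search.yaml (the path is illustrative), a single task can be run by name:

ot-bench run bench/search.yaml --task "search:onetool"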

Multi-Prompt Tasks

Use the ---PROMPT--- delimiter to split a task into sequential prompts. Each prompt completes its full agentic loop before the next begins, with conversation history accumulating across prompts.

- name: multi-step-task
  server: onetool
  prompt: |
    __ot
    ```python
    npm = package.version(registry="npm", packages={"express": "4.0.0"})
    ```
    Return the latest version.
    ---PROMPT---
    __ot
    ```python
    pypi = package.version(registry="pypi", packages={"httpx": "0.20.0"})
    ```
    Return the latest version.
    ---PROMPT---
    Summarize both versions as: "express: [version], httpx: [version]"
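
Because a multi-prompt task makes several LLM calls, --verbose is useful for inspecting each segment's metrics (the file path here is illustrative):

ot-bench run bench/multi.yaml --task multi-step-task --verbose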

Configuration

Configuration file: config/ot-bench.yaml or .onetool/ot-bench.yaml

Variable          Description
OT_BENCH_CONFIG   Config file path override
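
The override can be set per invocation, following the same environment-variable pattern as OT_CWD (the paths are illustrative):

OT_BENCH_CONFIG=.onetool/ot-bench.yaml ot-bench run demo/bench/features.yaml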

Output

Benchmarks produce:

- Token counts (input, output, total)
- Cost estimates (USD)
- Timing information
- Evaluation scores
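
The exact report layout is not specified here; as a purely illustrative sketch with assumed field names and made-up numbers, a per-task summary might look like:

task=search:onetool  tokens: in=1234 out=256 total=1490  cost=$0.0042  time=3.2s  score=87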