bench

Real agent + MCP testing. Define tasks in YAML, get objective metrics: token counts, costs, accuracy scores, timing.

Usage

bench [COMMAND] [OPTIONS]

Commands

run

Run benchmark tasks from a YAML file.

bench run FILE [OPTIONS]

Options

| Option | Description |
| --- | --- |
| --tui | Interactive TUI for selecting benchmark files |
| --csv | Export results to CSV in tmp/result-YYYYMMDD-HHMM.csv |
| -o, --output PATH | Write results to YAML file |
| --scenario NAME | Run only scenarios matching NAME |
| --task NAME | Run only tasks matching NAME |
| --tag TAG | Run only tasks with matching tag |
| -v, --verbose | Show detailed per-call metrics |
| --dry-run | Validate config without making API calls |
| --trace | Show timestamped request/response cycle for debugging |
| --no-color | Disable colored output (for CI/CD compatibility) |
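For example, filters and validation can be combined on a single run (the file path comes from the demo examples below; the "search" tag is illustrative):

# Validate the benchmark file without making API calls
OT_CWD=demo bench run demo/bench/features.yaml --dry-run

# Run only tasks tagged "search" and show per-call metrics
OT_CWD=demo bench run demo/bench/features.yaml --tag search -v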

Task Types

| Type | Description |
| --- | --- |
| type: direct | Direct MCP tool invocation (no LLM) |
| type: harness | LLM benchmark with MCP servers (default) |
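Since type: harness is the default, omitting the field is equivalent to setting it. A sketch, assuming type sits alongside the task fields shown in the Benchmark File Structure below:

- name: "search:harness"
  type: harness        # default; an LLM drives the MCP server
  server: onetool
  evaluate: accuracy
  prompt: "Search for AI news"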

Examples

# Run feature benchmarks
OT_CWD=demo bench run demo/bench/features.yaml

# Run specific tool benchmark
OT_CWD=demo bench run demo/bench/tool_brave_search.yaml

Benchmark File Structure

defaults:
  timeout: 60
  model: openai/gpt-5-mini

evaluators:
  accuracy:
    model: openai/gpt-5-mini
    prompt: |
      Evaluate this response.
      Response: {response}
      Return JSON: {"score": <0-100>, "reason": "<explanation>"}

scenarios:
  - name: Search Test
    tasks:
      - name: "search:base"
        server:              # No server = baseline
        evaluate: accuracy
        prompt: "Search for AI news"

      - name: "search:onetool"
        server: onetool
        evaluate: accuracy
        prompt: |
          __ot `brave.search(query="AI news")`

servers:
  onetool:
    type: stdio
    command: uv
    args: ["run", "onetool"]

Multi-Prompt Tasks

Use the ---PROMPT--- delimiter to split a task into sequential prompts. Each prompt completes its agentic loop before the next begins, and conversation history accumulates across prompts.

- name: multi-step-task
  server: onetool
  prompt: |
    __ot
    ```python
    npm = package.version(registry="npm", packages={"express": "4.0.0"})
    ```
    Return the latest version.
    ---PROMPT---
    __ot
    ```python
    pypi = package.version(registry="pypi", packages={"httpx": "0.20.0"})
    ```
    Return the latest version.
    ---PROMPT---
    Summarize both versions as: "express: [version], httpx: [version]"

Configuration

| File | Location | Purpose |
| --- | --- | --- |
| bench.yaml | .onetool/config/bench.yaml or ~/.onetool/config/bench.yaml | Benchmark harness config |
| bench-secrets.yaml | .onetool/config/bench-secrets.yaml or ~/.onetool/config/bench-secrets.yaml | LLM API keys (OPENAI_API_KEY, etc.) |

Resolution: BENCH_CONFIG env var → project → global
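For example, an explicit path takes precedence over both the project and global files (the config path here is illustrative):

BENCH_CONFIG=./configs/bench.yaml OT_CWD=demo bench run demo/bench/features.yaml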

Note: Benchmark API keys (for running LLMs) go in bench-secrets.yaml, not secrets.yaml (which is for tool API keys).

Output

Benchmarks produce:

  • Token counts (input, output, total)
  • Cost estimates (USD)
  • Timing information
  • Evaluation scores
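To persist these metrics, combine the documented output flags (the YAML output path is illustrative):

# Write results to YAML and also export the CSV summary
OT_CWD=demo bench run demo/bench/features.yaml -o tmp/results.yaml --csv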

Demo Project

The demo/ folder provides sample configurations and data for testing.

Structure

demo/
  .onetool/
    onetool.yaml      # MCP server config
    bench.yaml        # Benchmark harness config
    prompts.yaml      # Prompt templates
  bench/              # Benchmark YAML files
  db/                 # Sample databases (northwind.db)

Running with Demo Config

# Run benchmarks
OT_CWD=demo bench run demo/bench/features.yaml

# Or use justfile
just demo::bench       # TUI picker for demo benchmarks

Benchmark Files

| File | Description |
| --- | --- |
| features.yaml | Feature showcase (search, docs, transform) |
| compare.yaml | Compare base vs OneTool responses |
| tool_*.yaml | Per-tool benchmarks |

Prompting Best Practices

Two system-prompt rules prevent most tool-calling problems:

system_prompt: |
  Never retry successful tool calls to get "better" results.
  If a tool call fails, report the error - do not compute the result yourself.

Why:

  • No retries on success: Agents sometimes want to "improve" results by calling the same tool again. This wastes tokens and can cause loops.
  • No manual computation on failure: When a tool fails, agents often try to compute the answer themselves. This defeats the purpose of using tools.

Batch Operations

For multiple related queries, use batch functions:

# Instead of multiple calls
brave.search(query="topic 1")
brave.search(query="topic 2")

# Use batch
brave.search_batch(queries=["topic 1", "topic 2"])
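In a benchmark task, the batch form could look like this sketch (task fields follow the Benchmark File Structure above; the call is the search_batch example from this section):

- name: "search:batch"
  server: onetool
  evaluate: accuracy
  prompt: |
    __ot `brave.search_batch(queries=["topic 1", "topic 2"])`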

Token Efficiency

  • Use short prefixes (__ot) for simple calls
  • Use code fences for multi-step operations
  • Prefer batch operations over multiple calls
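Both forms already appear in the examples above; side by side:

Short prefix, single call:

__ot `brave.search(query="AI news")`

Code fence, multi-step:

__ot
```python
npm = package.version(registry="npm", packages={"express": "4.0.0"})
pypi = package.version(registry="pypi", packages={"httpx": "0.20.0"})
```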