# bench
Real agent + MCP testing. Define tasks in YAML, get objective metrics: token counts, costs, accuracy scores, timing.
## Usage

## Commands

### run

Run benchmark tasks from a YAML file.

#### Options
| Option | Description |
|---|---|
| `--tui` | Interactive TUI for selecting benchmark files |
| `--csv` | Export results to CSV in `tmp/result-YYYYMMDD-HHMM.csv` |
| `-o, --output PATH` | Write results to a YAML file |
| `--scenario NAME` | Run only scenarios matching NAME |
| `--task NAME` | Run only tasks matching NAME |
| `--tag TAG` | Run only tasks with a matching tag |
| `-v, --verbose` | Show detailed per-call metrics |
| `--dry-run` | Validate config without making API calls |
| `--trace` | Show timestamped request/response cycle for debugging |
| `--no-color` | Disable colored output (for CI/CD compatibility) |
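These flags compose. An illustrative invocation (all flags are from the table above; the scenario name is borrowed from the example file later on this page):

```bash
# Validate the file without API calls, then run one scenario with CSV export
OT_CWD=demo bench run demo/bench/features.yaml --dry-run
OT_CWD=demo bench run demo/bench/features.yaml --scenario "Search Test" --csv -v
```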
#### Task Types

| Type | Description |
|---|---|
| `type: direct` | Direct MCP tool invocation (no LLM) |
| `type: harness` | LLM benchmark with MCP servers (default) |
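A `type: direct` task skips the LLM entirely. Its exact schema isn't shown on this page; a hedged sketch, assuming it reuses the task fields from the Benchmark File Structure section below with the tool call as the prompt body:

```yaml
# Hypothetical direct task; field layout assumed, not confirmed by this page
- name: "search:direct"
  type: direct
  server: onetool
  prompt: |
    brave.search(query="AI news")
```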
#### Examples

```bash
# Run feature benchmarks
OT_CWD=demo bench run demo/bench/features.yaml

# Run specific tool benchmark
OT_CWD=demo bench run demo/bench/tool_brave_search.yaml
```
## Benchmark File Structure

```yaml
defaults:
  timeout: 60
  model: openai/gpt-5-mini

evaluators:
  accuracy:
    model: openai/gpt-5-mini
    prompt: |
      Evaluate this response.
      Response: {response}
      Return JSON: {"score": <0-100>, "reason": "<explanation>"}

scenarios:
  - name: Search Test
    tasks:
      - name: "search:base"
        server:             # No server = baseline
        evaluate: accuracy
        prompt: "Search for AI news"
      - name: "search:onetool"
        server: onetool
        evaluate: accuracy
        prompt: |
          __ot `brave.search(query="AI news")`

servers:
  onetool:
    type: stdio
    command: uv
    args: ["run", "onetool"]
```
## Multi-Prompt Tasks

Use the `---PROMPT---` delimiter to split a task into sequential prompts. Each prompt completes its agentic loop before the next begins, with conversation history accumulating across prompts.
````yaml
- name: multi-step-task
  server: onetool
  prompt: |
    __ot
    ```python
    npm = package.version(registry="npm", packages={"express": "4.0.0"})
    ```
    Return the latest version.
    ---PROMPT---
    __ot
    ```python
    pypi = package.version(registry="pypi", packages={"httpx": "0.20.0"})
    ```
    Return the latest version.
    ---PROMPT---
    Summarize both versions as: "express: [version], httpx: [version]"
````
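The `--tag` filter pairs naturally with per-task tags. A minimal sketch, assuming tasks accept a `tags` list (the key name is an assumption; it is not confirmed anywhere on this page):

```yaml
- name: tagged-task
  server: onetool
  # `tags` is an assumed field name, matched by `bench run --tag search`
  tags: [search]
  evaluate: accuracy
  prompt: "Search for AI news"
```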
## Configuration

| File | Location | Purpose |
|---|---|---|
| `bench.yaml` | `.onetool/config/bench.yaml` or `~/.onetool/config/bench.yaml` | Benchmark harness config |
| `bench-secrets.yaml` | `.onetool/config/bench-secrets.yaml` or `~/.onetool/config/bench-secrets.yaml` | LLM API keys (`OPENAI_API_KEY`, etc.) |

Resolution order: `BENCH_CONFIG` env var → project → global.

Note: Benchmark API keys (for running LLMs) go in `bench-secrets.yaml`, not `secrets.yaml` (which is for tool API keys).
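A minimal `bench-secrets.yaml` might look like the sketch below; the flat key layout is an assumption, and only `OPENAI_API_KEY` is named on this page:

```yaml
# Assumed flat key layout; replace the placeholder with a real key
OPENAI_API_KEY: sk-...
```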
## Output

Benchmarks produce:

- Token counts (input, output, total)
- Cost estimates (USD)
- Timing information
- Evaluation scores
## Demo Project

The `demo/` folder provides sample configurations and data for testing.

### Structure

```
demo/
  .onetool/
    onetool.yaml    # MCP server config
    bench.yaml      # Benchmark harness config
    prompts.yaml    # Prompt templates
  bench/            # Benchmark YAML files
  db/               # Sample databases (northwind.db)
```
### Running with Demo Config

```bash
# Run benchmarks
OT_CWD=demo bench run demo/bench/features.yaml

# Or use justfile
just demo::bench   # TUI picker for demo benchmarks
```
### Benchmark Files

| File | Description |
|---|---|
| `features.yaml` | Feature showcase (search, docs, transform) |
| `compare.yaml` | Compare base vs OneTool responses |
| `tool_*.yaml` | Per-tool benchmarks |
## Prompting Best Practices

Two rules prevent 90% of tool-calling problems:

```yaml
system_prompt: |
  Never retry successful tool calls to get "better" results.
  If a tool call fails, report the error - do not compute the result yourself.
```
Why:
- No retries on success: Agents sometimes want to "improve" results by calling the same tool again. This wastes tokens and can cause loops.
- No manual computation on failure: When a tool fails, agents often try to compute the answer themselves. This defeats the purpose of using tools.
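Where this snippet lives isn't shown above; a hedged sketch, assuming `system_prompt` can sit under `defaults` in the benchmark file alongside `timeout` and `model`:

```yaml
defaults:
  model: openai/gpt-5-mini
  # Assumed placement: system_prompt under defaults is not confirmed above
  system_prompt: |
    Never retry successful tool calls to get "better" results.
    If a tool call fails, report the error - do not compute the result yourself.
```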
### Batch Operations

For multiple related queries, use batch functions:

```python
# Instead of multiple calls
brave.search(query="topic 1")
brave.search(query="topic 2")

# Use batch
brave.search_batch(queries=["topic 1", "topic 2"])
```
### Token Efficiency

- Use short prefixes (`__ot`) for simple calls
- Use code fences for multi-step operations (both forms shown below)
- Prefer batch operations over multiple calls
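For instance, reusing calls from the sections above, the inline form suits a single call and the fenced form suits multi-step work:

````text
Simple call (short prefix):

__ot `brave.search(query="AI news")`

Multi-step (code fence):

__ot
```python
npm = package.version(registry="npm", packages={"express": "4.0.0"})
pypi = package.version(registry="pypi", packages={"httpx": "0.20.0"})
```
````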