Vision Comparison: img.ask vs Direct Attachment¶
Date: 2026-03-05
Image: tests/data/products-med.png
Task: Extract prices from a 4-column product grid, left-to-right, top-to-bottom
Method: Subagent-isolated timing (tmr) + session usage delta (cld)
Price Grids¶
Approach A — img.ask¶
| Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|
| $1,397.00 | $1,997.00 | $2,197.00 | $2,049.00 |
| $797.00 | $341.00 | $797.00 | $517.00 |
| $368.00 | $1,397.00 | $377.00 | $453.00 |
| $229.00 | $299.00 | $731.00 | $470.00 |
| $749.00 | $1,018.00 | $575.00 | $541.00 |
Approach B — Direct Attachment (Read tool → subagent)¶
| Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|
| $727 | $341 | $197 | $249 |
| $368 | $1,597 | $377 | $453 |
| $229 | $299 | $231 | $470 |
Direct returned only 3 rows vs. 5 rows from img.ask.
Cell-by-cell Accuracy¶
Using img.ask as the reference. Rows 1–2 from Direct don't correspond to any img.ask row.
Rows 3–5 partially overlap with Direct rows 1–3.
| Row | Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|---|
| 1 | ✗ | ✗ | ✗ | ✗ |
| 2 | ✗ | ✗ | ✗ | ✗ |
| 3 | ✓ | ✗ | ✓ | ✓ |
| 4 | ✓ | ✓ | ✗ | ✓ |
| 5 | ✗ | ✗ | ✗ | ✗ |
5 / 20 cells correct (25%)
Measurements¶
| Metric | img.ask | Direct |
|---|---|---|
| Time (s) | 41.26 | 34.00 |
| Total tokens | 0 (cross-session) | 190,221 |
| Cost (USD) | N/A (cross-session) | $0.078 |
| Output tokens | N/A | 567 |
| Cache read tokens | N/A | 185,832 |
| Cache create tokens | N/A | 3,526 |
img.askruns in its own API session — token usage does not appear in the host session's delta.
Verdict¶
| Aspect | Winner |
|---|---|
| Speed | Direct (34s vs 41s, ~17% faster) |
| Completeness | img.ask (5 rows vs 3 rows) |
| Accuracy | img.ask (25% of cells correct via Direct) |
| Token cost to host session | img.ask (cross-session, not charged) |
img.ask significantly outperformed direct attachment. The direct subagent (Haiku 4.5) returned only 3 of 5 product rows and hallucinated or misread most prices. img.ask correctly identified all 5 rows using a dedicated vision model.
Direct was ~7s faster but at the cost of heavily degraded accuracy — making speed the only advantage, and a poor trade-off for structured extraction tasks.