harness_eng
Same model, sixteen harnesses, two tasks. A controlled experiment that holds one LLM constant and varies only the agent harness around it. Two task types (HTML extraction + Python code generation). Eight harnesses are benchmarked end-to-end (150 graded runs produce the published numbers); eight more are cataloged in Phase 8: every common agent pattern is named, mapped to its real-world framework analog, and implemented and unit-tested against freeze tag 2af30fc, with the matrix re-run gated on hardware (the configured model needs more memory than this host has). The benchmarked half yields one consistent methodology lesson; the cataloged half is documented as a structured catalog so a reader can map the design space without waiting for the rerun.
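A quick sanity check on where the 150 comes from. This decomposition is inferred from the table and reproduction commands below (5 harnesses and 5 tasks per experiment, 3 seeds), not stated explicitly in the repo:

```python
# Inferred breakdown of the "150 graded runs" figure: every harness x task x seed
# combination, graded in each of the two experiments.
experiments, harnesses_per_exp, tasks_per_exp, seeds = 2, 5, 5, 3
assert experiments * harnesses_per_exp * tasks_per_exp * seeds == 150
```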
Read the writeup
→ Full combined article, one page, both experiments, detailed per-harness breakdowns with framework analogs (LangChain / LangGraph / CrewAI / Aider / Cursor), forensics behind <details> toggles, Mermaid diagrams, dollar extrapolation at frontier-model list prices, and a cross-experiment conclusion.
→ Medium-ready HTML version, same content rendered as clean HTML (no Mermaid, no collapsibles), suitable for import into Medium, Substack, or any editor that accepts HTML paste.
The one-line finding
On hard tasks, complex harnesses fail more than simple ones. On easy tasks, complex harnesses cost more than simple ones. single_shot won on wall-clock in both experiments.
| experiment | tasks | harnesses | score ceiling | winner on accuracy | winner on wall-clock |
|---|---|---|---|---|---|
| HTML extraction | 5 | 5 | 9/15 tied | single_shot / plan_execute | single_shot (217 s vs 1,957 s) |
| Code generation | 5 | 5 | 15/15 tied | all 5 harnesses | single_shot (283 s vs 598 s) |
Both experiments converge on the same point: a complex harness pays off only when the base model’s first-shot accuracy is below target AND the failures are recoverable over extra turns. Those two conditions rarely hold at once on weak models. On glm-4.7-flash, neither experiment produced a case where a complex harness justified its extra tokens and time.
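To make the trade-off concrete, here is a minimal sketch of the two ends of the spectrum. This is not the repo's code; `llm`, `grade`, and the prompt wiring are placeholders. A single-shot harness spends exactly one model call, while a reflexion-style loop spends extra calls that only pay off when the first answer is both wrong and fixable from feedback:

```python
def single_shot(llm, task):
    # One call: the cheapest possible harness; accuracy is whatever the
    # model gets right on the first try.
    return llm(task.prompt)

def reflexion_loop(llm, task, grade, max_turns=3):
    # Multi-turn harness: each extra turn costs tokens and wall-clock time,
    # and only helps if the failure is recoverable from the critique.
    answer = llm(task.prompt)
    for _ in range(max_turns - 1):
        feedback = grade(task, answer)   # e.g. failing pytest output
        if feedback.passed:
            break                        # extra turns bought nothing
        answer = llm(task.prompt + "\nPrevious attempt failed:\n" + feedback.report)
    return answer
```

In these two experiments the extra turns were either unnecessary (easy tasks, where all harnesses hit 15/15 on the first shot) or did not recover the failures (hard tasks), which is why the loop never beat single_shot on wall-clock.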
Repository
- Repo: github.com/jaafar-benabderrazak/harness-bench
- 16 harnesses across two families:
  - Benchmarked (8): `single_shot`, `react`, `plan_execute`, `reflexion`, `minimal`, `chain_of_thought`, `test_driven`, `retry_on_fail`
  - Cataloged (8): `multi_agent`, `self_consistency`, `tool_use_with_validation`, `tree_of_thoughts`, `react_with_replan`, `cached_react`, `program_aided`, `streaming_react` (registered in code but excluded from the matrix per the Ollama OOM finding)
- Each harness maps to a real-world agent-engineering pattern (LangChain AgentExecutor, LangGraph plan-and-execute, CrewAI/AutoGen multi-agent, the Reflexion paper, the ToT paper, the PaL paper, Wang et al. self-consistency, Aider/Cursor/Devin TDD loops, …). The article gives every analog explicitly.
- 2 task types: `html_extract` (per-field normalized exact match) and `code_gen` (pytest in a subprocess); a grading sketch follows this list.
- 87 tests pass offline (no API key). CI runs on Ubuntu + Windows.
- Freeze tag `harnesses-frozen` (currently `2af30fc`) pins the comparison; the runner refuses to execute if any gated file has drifted.
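A minimal sketch of the two grading schemes named above. The shapes are assumed for illustration; the field names, timeout, and return types are not the repo's actual API:

```python
import pathlib
import subprocess
import sys
import tempfile

def grade_html_extract(expected: dict, predicted: dict) -> float:
    # Per-field normalized exact match: lowercase + strip, one point per field.
    norm = lambda v: str(v).strip().lower()
    hits = sum(norm(predicted.get(k, "")) == norm(v) for k, v in expected.items())
    return hits / len(expected)

def grade_code_gen(solution_code: str, test_code: str) -> bool:
    # pytest in a subprocess: write the candidate solution and its tests to a
    # temp dir and treat a zero exit code as a pass.
    with tempfile.TemporaryDirectory() as d:
        d = pathlib.Path(d)
        (d / "solution.py").write_text(solution_code)
        (d / "test_solution.py").write_text(test_code)
        result = subprocess.run(
            [sys.executable, "-m", "pytest", "-q", str(d)],
            capture_output=True, timeout=120,
        )
    return result.returncode == 0
```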
Reproduce either experiment
```bash
git clone https://github.com/jaafar-benabderrazak/harness-bench && cd harness-bench
pip install -e ".[dev]"
cp .env.example .env              # ollama + glm-4.7-flash default, no API key
ollama pull glm-4.7-flash:latest
pytest -q                         # 87 tests, all offline

# HTML extraction (~60 min)
python scripts/run_full.py --seeds 3 --yes

# Code generation (~25-35 min)
python scripts/run_code_benchmark.py --seeds 3 --yes

# Post-process: produces CSV, charts, trace viewer, and the article
python scripts/make_chart.py
```
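The run scripts refuse to execute if any gated file has drifted from the freeze tag mentioned above. A sketch of how such a guard can be implemented, using plain git; the gated paths and error handling here are assumptions, not the repo's implementation:

```python
import subprocess

# Hypothetical path(s) pinned by the freeze tag.
GATED = ["src/harnesses/"]

def assert_no_drift(freeze_tag: str = "harnesses-frozen") -> None:
    # Refuse to run if any gated file differs from the frozen commit.
    diff = subprocess.run(
        ["git", "diff", "--name-only", freeze_tag, "--", *GATED],
        capture_output=True, text=True, check=True,
    )
    drifted = [line for line in diff.stdout.splitlines() if line.strip()]
    if drifted:
        raise RuntimeError(f"Gated files drifted since {freeze_tag}: {drifted}")
```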
Fully local. Zero dollars.