Harness Freeze Manifest
Freeze date: 2026-04-25
Freeze commit SHA: resolve with git rev-parse harnesses-frozen (tag pins an exact commit; self-referencing the SHA in this file is a chicken-and-egg problem, so the tag is the source of truth)
Git tag: harnesses-frozen
Why this exists
The entire experiment depends on one invariant: the harnesses, their shared tool dispatcher, and the single-import-site model client cannot be edited after results are seen. Without this discipline, the comparison is invalid — the classic “peek-and-patch” failure mode. The harnesses-frozen git tag plus the runner’s check_freeze_gate() pre-flight make this discipline structural rather than procedural.
Gated paths (runner refuses to execute if any of these have diverged from the tag)
src/harness_eng/harnesses/src/harness_eng/tools.pysrc/harness_eng/model.py
Per-file blob SHAs at freeze commit
| Path | Blob SHA |
|---|---|
src/harness_eng/harnesses/base.py |
7edfe0d83b3721314819a66b03f94ecdb25141c6 |
src/harness_eng/harnesses/single_shot.py |
d1dae663d1f0d39cd77f1dd95e9d8c1ee9e83856 |
src/harness_eng/harnesses/react.py |
3e0f63424ab0dd9e9b76b83833b92ee2631a913a |
src/harness_eng/harnesses/plan_execute.py |
71f07f0af510fe6b9121a8892fa23ff994d99fec |
src/harness_eng/harnesses/reflexion.py |
424c6dc71a79dccca0cb930d1e095f7f4c5b154c |
src/harness_eng/harnesses/minimal.py |
eeaf67d89c0de36f3c1ffe953f4109ccbb45ed0e |
src/harness_eng/harnesses/chain_of_thought.py |
139c18d592e83f269dcfdc432bc8f138c5c2c2ef |
src/harness_eng/harnesses/test_driven.py |
713177e3d59597376f3f29a854e016c21d758a91 |
src/harness_eng/harnesses/retry_on_fail.py |
54785816b04056bafc3edd374a23659b5e8f74e5 |
src/harness_eng/harnesses/tree_of_thoughts.py |
76147b6f4ad9b6ccd83cbb69a4af7dc898a3b172 |
src/harness_eng/harnesses/multi_agent.py |
debeaf3ce27943c1bb041836f5c248b1111b32cc |
src/harness_eng/harnesses/react_with_replan.py |
0a7646896b491471f73d72dffd9e493906a8f24a |
src/harness_eng/harnesses/self_consistency.py |
0c113d30e7e26fcbf0e109828a6189ee0f6405f1 |
src/harness_eng/harnesses/program_aided.py |
4992b8fb1a21b61a307ce21072cd286aeee019fa |
src/harness_eng/harnesses/tool_use_with_validation.py |
6a9db4795dab97eeabe4ffbc20863bc41fe005f8 |
src/harness_eng/harnesses/streaming_react.py |
f1b8a706bc21d0bf78ebdb9fb14befdccc1c9e2c |
src/harness_eng/harnesses/cached_react.py |
599a99d5b5102ca0adf08a42ba12e017f8224a8f |
src/harness_eng/harnesses/__init__.py |
9da8bf9f17edafd1b6ee407d34244267f4ba0f38 |
src/harness_eng/tools.py |
d099ea4ce41dd124df451730108c64202ec517d2 |
src/harness_eng/model.py |
f1ce85e7e4efcab50c0a76b1619cdd53fe9f368a |
Tag moves
| Date | From SHA | To SHA | Reason | Invalidated matrix runs |
|---|---|---|---|---|
| 2026-04-23 | 0a44719 |
4eacafb |
Phase 4 added TOOL_WHITELIST attrs to every harness class + enforcement in _step_model. No matrix had been executed against the prior tag, so the move is a mechanical re-anchor — nothing is invalidated. |
None |
| 2026-04-23 | 4eacafb |
d0fc1f1 |
CI-fix cleanup: removed unused field import from harnesses/base.py. No runtime behavior change, but blob SHA changes, so the tag re-anchors. Still no matrix runs invalidated. |
None |
| 2026-04-23 | d0fc1f1 |
9977e85 |
Backend pivot: added Ollama support in model.py so the experiment can run on a free local model (glm-4.7-flash by default). Phase-6 follow-up also added the code-gen task type and three code-gen baseline harnesses (chain_of_thought, test_driven, retry_on_fail) plus a single_shot branch on task.type so it stops dumping HTML for code-gen tasks. Gated files changed; tag re-anchored onto the new code path. Matrix had been re-run against the new tag during Phase 6 article work; that run was not invalidated by this row (it produced the data the article cited). |
None |
| 2026-04-25 | 9977e85 |
(this commit) | Phase 8 harness expansion: added 8 new harnesses (tree_of_thoughts, multi_agent, react_with_replan, self_consistency, program_aided, tool_use_with_validation, streaming_react, cached_react); added run_python tool to tools.py; threaded per-call temperature kwarg through model.call and Harness._step_model (base.py); added jsonschema runtime dep. streaming_react is registered in the HARNESSES dict but EXCLUDED from HARNESSES_BY_TASK_TYPE per .planning/phases/08-expand-harness-family/08-05-VERIFY.md outcome FAIL: the configured Ollama backend cannot host glm-4.7-flash on this hardware (model declares 23.4 GiB system memory; host has 6.9 GiB available; failure mode differs from the predicted Ollama issue #13840 post-tool-call halt but the registration implication is identical). Matrix membership: html_extract=11 harnesses, code_gen=9 harnesses. Per-harness control-flow tests added; full pytest suite 87/87 GREEN at the new tag. No matrix had been re-run against the prior tag with the expanded registry, so this move invalidates nothing — the matrix re-runs that follow this move (scripts/run_full.py, scripts/run_code_benchmark.py) are user-gated and produce the data the Phase 8 article refresh consumes. |
None |
What can still change post-freeze
src/harness_eng/runner.py— the orchestration layer (adding seeds, manifests, tool-allowlist assertions) is outside the experimental controlsrc/harness_eng/analysis.py,src/harness_eng/trace_viewer.py— read-only artifacts consuming the matrix outputsrc/harness_eng/cost_estimator.py,src/harness_eng/pricing.py— estimation / cost mathsrc/harness_eng/config.py— model id/temperature/max_tokens: not supposed to change, but not technically gated (change here would be visible in git history)src/harness_eng/trace.py,src/harness_eng/grader.py— already hardened in Phase 1; changes post-freeze are visible in git history
If any of the gated files need to change after freeze, the tag must be moved (which visibly invalidates the prior matrix run) and the article must document the reason. No quiet edits.
Verification
At any time:
git diff harnesses-frozen HEAD -- src/harness_eng/harnesses/ src/harness_eng/tools.py src/harness_eng/model.py
An empty diff means the experiment is still valid. A non-empty diff either reflects a legitimate re-tag (with article-level justification) or a methodology violation.