Skip to the content.

Harness Freeze Manifest

Freeze date: 2026-04-25 Freeze commit SHA: resolve with git rev-parse harnesses-frozen (tag pins an exact commit; self-referencing the SHA in this file is a chicken-and-egg problem, so the tag is the source of truth) Git tag: harnesses-frozen

Why this exists

The entire experiment depends on one invariant: the harnesses, their shared tool dispatcher, and the single-import-site model client cannot be edited after results are seen. Without this discipline, the comparison is invalid — the classic “peek-and-patch” failure mode. The harnesses-frozen git tag plus the runner’s check_freeze_gate() pre-flight make this discipline structural rather than procedural.

Gated paths (runner refuses to execute if any of these have diverged from the tag)

Per-file blob SHAs at freeze commit

Path Blob SHA
src/harness_eng/harnesses/base.py 7edfe0d83b3721314819a66b03f94ecdb25141c6
src/harness_eng/harnesses/single_shot.py d1dae663d1f0d39cd77f1dd95e9d8c1ee9e83856
src/harness_eng/harnesses/react.py 3e0f63424ab0dd9e9b76b83833b92ee2631a913a
src/harness_eng/harnesses/plan_execute.py 71f07f0af510fe6b9121a8892fa23ff994d99fec
src/harness_eng/harnesses/reflexion.py 424c6dc71a79dccca0cb930d1e095f7f4c5b154c
src/harness_eng/harnesses/minimal.py eeaf67d89c0de36f3c1ffe953f4109ccbb45ed0e
src/harness_eng/harnesses/chain_of_thought.py 139c18d592e83f269dcfdc432bc8f138c5c2c2ef
src/harness_eng/harnesses/test_driven.py 713177e3d59597376f3f29a854e016c21d758a91
src/harness_eng/harnesses/retry_on_fail.py 54785816b04056bafc3edd374a23659b5e8f74e5
src/harness_eng/harnesses/tree_of_thoughts.py 76147b6f4ad9b6ccd83cbb69a4af7dc898a3b172
src/harness_eng/harnesses/multi_agent.py debeaf3ce27943c1bb041836f5c248b1111b32cc
src/harness_eng/harnesses/react_with_replan.py 0a7646896b491471f73d72dffd9e493906a8f24a
src/harness_eng/harnesses/self_consistency.py 0c113d30e7e26fcbf0e109828a6189ee0f6405f1
src/harness_eng/harnesses/program_aided.py 4992b8fb1a21b61a307ce21072cd286aeee019fa
src/harness_eng/harnesses/tool_use_with_validation.py 6a9db4795dab97eeabe4ffbc20863bc41fe005f8
src/harness_eng/harnesses/streaming_react.py f1b8a706bc21d0bf78ebdb9fb14befdccc1c9e2c
src/harness_eng/harnesses/cached_react.py 599a99d5b5102ca0adf08a42ba12e017f8224a8f
src/harness_eng/harnesses/__init__.py 9da8bf9f17edafd1b6ee407d34244267f4ba0f38
src/harness_eng/tools.py d099ea4ce41dd124df451730108c64202ec517d2
src/harness_eng/model.py f1ce85e7e4efcab50c0a76b1619cdd53fe9f368a

Tag moves

Date From SHA To SHA Reason Invalidated matrix runs
2026-04-23 0a44719 4eacafb Phase 4 added TOOL_WHITELIST attrs to every harness class + enforcement in _step_model. No matrix had been executed against the prior tag, so the move is a mechanical re-anchor — nothing is invalidated. None
2026-04-23 4eacafb d0fc1f1 CI-fix cleanup: removed unused field import from harnesses/base.py. No runtime behavior change, but blob SHA changes, so the tag re-anchors. Still no matrix runs invalidated. None
2026-04-23 d0fc1f1 9977e85 Backend pivot: added Ollama support in model.py so the experiment can run on a free local model (glm-4.7-flash by default). Phase-6 follow-up also added the code-gen task type and three code-gen baseline harnesses (chain_of_thought, test_driven, retry_on_fail) plus a single_shot branch on task.type so it stops dumping HTML for code-gen tasks. Gated files changed; tag re-anchored onto the new code path. Matrix had been re-run against the new tag during Phase 6 article work; that run was not invalidated by this row (it produced the data the article cited). None
2026-04-25 9977e85 (this commit) Phase 8 harness expansion: added 8 new harnesses (tree_of_thoughts, multi_agent, react_with_replan, self_consistency, program_aided, tool_use_with_validation, streaming_react, cached_react); added run_python tool to tools.py; threaded per-call temperature kwarg through model.call and Harness._step_model (base.py); added jsonschema runtime dep. streaming_react is registered in the HARNESSES dict but EXCLUDED from HARNESSES_BY_TASK_TYPE per .planning/phases/08-expand-harness-family/08-05-VERIFY.md outcome FAIL: the configured Ollama backend cannot host glm-4.7-flash on this hardware (model declares 23.4 GiB system memory; host has 6.9 GiB available; failure mode differs from the predicted Ollama issue #13840 post-tool-call halt but the registration implication is identical). Matrix membership: html_extract=11 harnesses, code_gen=9 harnesses. Per-harness control-flow tests added; full pytest suite 87/87 GREEN at the new tag. No matrix had been re-run against the prior tag with the expanded registry, so this move invalidates nothing — the matrix re-runs that follow this move (scripts/run_full.py, scripts/run_code_benchmark.py) are user-gated and produce the data the Phase 8 article refresh consumes. None

What can still change post-freeze

If any of the gated files need to change after freeze, the tag must be moved (which visibly invalidates the prior matrix run) and the article must document the reason. No quiet edits.

Verification

At any time:

git diff harnesses-frozen HEAD -- src/harness_eng/harnesses/ src/harness_eng/tools.py src/harness_eng/model.py

An empty diff means the experiment is still valid. A non-empty diff either reflects a legitimate re-tag (with article-level justification) or a methodology violation.