Builder's Notes · April 2026

collab-eval:
reward design for document tasks

task environments · grader harness · reward hacking resistance

✓ runnable offline · ⟳ 3 task types · 35 scored cases · Anthropic SDK · optional LLM judge · iris-axon-lab / iris-ft-lab

How to read this note. collab-eval is a task environment and grader harness for open-ended document manipulation, built to make reward design explicit before any RL loop sees the signal. The harness was the starting point for a five-cycle fine-tuning trajectory on iris-ft-lab: four SFT runs (an initial recipe that mode-collapsed, a gentler recipe, more stress data, and a two-phase curriculum) and one DPO run (preference pairs over preserve-vs-drop). None was promoted: stress data_preservation stayed in the 0.20–0.25 band across every run, and the trajectory itself became the result. Sections 01–07 document the harness as designed; sections 08–10 cover the deterministic eval suite, the fine-tuning trajectory, and the lab-level open questions that fell out of it. Cross-cutting analysis lives in Knowing, Doing, Deciding — applied to a fine-tuning lab; the row-preservation chapter specifically lives in When SFT and DPO could not teach “don’t drop the row”.

01 The problem with rewarding document tasks

Code tasks have a cheap, reliable oracle: does the code pass the tests? The ground truth is executable, deterministic, and binary. A grader that runs the test suite is not making a judgment call.

Document tasks have no equivalent. "Is this revision better?" is not a question with a recoverable ground truth. The space of valid outputs is large, quality differences are often subtle, and different raters will disagree on what counts as an improvement. This creates three distinct problems for RL reward design, each requiring a different mitigation:

NO BINARY ORACLE

Any function that returns True for most good revisions will still be wrong in a non-trivial fraction of cases, and agents will find those failure modes.

MULTI-DIMENSIONAL QUALITY

A document can be clearer but less faithful. Collapsing trade-offs into a single scalar loses information that matters for policy design.

LAZY PROXY METRICS

Metrics that correlate with quality in expectation (fluency, word count, format validity) are far cheaper to satisfy than the underlying criteria they proxy for.

The fundamental error in most document task reward design is treating it as a grading problem rather than a distribution characterization problem. You are not asking "did the model get it right" — you are asking "does this output sit in the region of the preference distribution that a thoughtful human would endorse, and which dimensions of that region are individually gameable."

02 What the harness does

collab-eval provides three runnable task environments, a deterministic grader layer, an optional LLM-as-judge layer, and a composite grader with hard-fail caps. It began as an evaluation and environment harness; the repo now also contains the measured SFT/DPO trajectory described in §§08–10. The design note at docs/rl_env_design.md covers how the harness would extend to an RL loop.

TASKS

doc revision · spreadsheet clean · citation grounding

GRADERS

deterministic · LLM judge · composite + hard-fail

TESTS

offline pytest · 4 canonical adversarial demos · 35 scored eval cases

Each task is chosen because it has a specific, non-obvious grader failure mode. The adversarial cases in tests/test_reward_hacking_cases.py are not smoke tests — they are executable embodiments of known failure modes, each with a comment explaining why a naive grader would fail and what the harness catches instead.

03 Four canonical failure modes, made executable

Each case below is one of the canonical adversarial tests in the repo. The extended eval suite adds broader quality-range and reward-hacking probes, but these four are the clearest demonstrations of the design. The "naive score" is what a single holistic grader would award. The "harness score" is what the composite grader produces after decomposition and hard-fail caps are applied.

DOC REVISION · word_count_exceeded flag

An agent produces a fluent, well-structured revision that satisfies the passive-voice check and gets the core content right — then pads the output to 400+ words with filler sentences. A holistic grader scores it highly for coherence and instruction adherence. The composite harness triggers word_count_exceeded and hard-caps the score at 0.3.

NAIVE HOLISTIC GRADER

0.82

Fluent prose, correct content, active voice. Word count check passes if limit is not tight. Padding looks like thoroughness.

HARNESS COMPOSITE

≤ 0.30

Hard-fail cap applied. word_count_exceeded fires at 120% of limit (360 words). Hard constraint violations cannot hide behind good scores on cheap dimensions.

Design insight: Soft word-count decay (the score degrades above the limit) handles marginal violations. The hard-fail cap at 120% handles the case where the agent accepts the soft penalty because gains on other dimensions more than make up for it.
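A minimal sketch of how the soft decay and the hard-fail cap could combine, assuming a 300-word limit, a linear decay between the limit and the 120% threshold, and a cap of 0.3. The function names and the linear decay shape are illustrative assumptions, not the repo's implementation.

```python
def word_count_score(text: str, limit: int = 300, hard_pct: int = 120) -> tuple[float, bool]:
    """Return (score, hard_fail) for a word-count constraint.

    Assumed shape: 1.0 at or under the limit, linear decay toward 0.0
    between the limit and the hard threshold, hard fail at the threshold.
    """
    n = len(text.split())
    hard_limit = limit * hard_pct / 100  # 360 words for a 300-word limit
    if n <= limit:
        return 1.0, False
    if n >= hard_limit:
        return 0.0, True  # corresponds to the word_count_exceeded flag
    # Soft decay region: limit < n < hard_limit
    return 1.0 - (n - limit) / (hard_limit - limit), False


def apply_hard_cap(composite: float, hard_fail: bool, cap: float = 0.3) -> float:
    """Cap the composite when a hard constraint was violated."""
    return min(composite, cap) if hard_fail else composite


padded = " ".join(["word"] * 400)
score, hard_fail = word_count_score(padded)
print(apply_hard_cap(0.82, hard_fail))  # 0.3
```

The cap is applied to the composite rather than to the word-count dimension alone, so strong scores on cheap dimensions cannot lift a padded output back above 0.3.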

SPREADSHEET CLEANUP · data_preservation = 0.0

An agent produces perfectly valid, correctly-headed CSV with normalized units and clean date formats — by quietly dropping the 9 most problematic rows, keeping only 3 clean ones. A naive format-validity grader awards a near-perfect score. The composite harness catches the row loss via row_count_preserved against a known expected count.

NAIVE FORMAT GRADER

0.95

Valid CSV, correct headers, normalized units, clean dates. A real data pipeline would load this without error — and silently lose 75% of the data.

HARNESS COMPOSITE

< 0.70

data_preservation = 0.0 (row count 3 vs. expected 12). Weighted at 0.35, this caps the composite at 0.65 even when every format dimension scores perfectly (assuming the weights sum to 1).

Design insight: This is the most insidious failure mode in the harness. The output is useful in isolation — it's just wrong relative to the task. Format-validity checkers are necessary but not sufficient; every cleanup task needs a ground-truth preservation check.
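The arithmetic behind that case can be sketched as follows. Only the 0.35 weight on data_preservation comes from this note; the other weights, the binary scoring rule, and the function names are assumptions chosen for illustration.

```python
import csv
import io

# Only the 0.35 weight is taken from the note; the rest are placeholders.
WEIGHTS = {
    "data_preservation": 0.35,
    "format_validity": 0.25,
    "unit_consistency": 0.20,
    "column_completeness": 0.20,
}


def row_count_preserved(output_csv: str, expected_rows: int) -> float:
    """Return 1.0 if the number of non-blank data rows matches, else 0.0."""
    rows = list(csv.reader(io.StringIO(output_csv)))
    # Skip the header row and ignore rows whose cells are all empty.
    data_rows = [r for r in rows[1:] if any(cell.strip() for cell in r)]
    return 1.0 if len(data_rows) == expected_rows else 0.0


def composite(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean over the dimensions named in `weights`."""
    return sum(weights[k] * scores[k] for k in weights) / sum(weights.values())


# An output that kept 3 clean rows out of an expected 12.
truncated = "id,amount\n1,100\n2,200\n3,300\n"
scores = {
    "data_preservation": row_count_preserved(truncated, expected_rows=12),
    "format_validity": 1.0,
    "unit_consistency": 1.0,
    "column_completeness": 1.0,
}
print(round(composite(scores), 2))  # 0.65
```

As the note says elsewhere, a row-count check does not verify values; it only makes silent row loss impossible to hide behind format scores.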

CITATION GROUNDING · no_citations_found hard-fail

An agent produces a well-argued, fluent revision of the research brief — but includes no citation markers at all. The argument reads as confident and grounded. A holistic judge might score this highly for quality of reasoning. The harness catches the complete absence of citation form and applies a hard-fail cap.

NAIVE HOLISTIC GRADER

0.78

High-quality prose, coherent argument, no hallucinations detected. If the grader doesn't check citation form explicitly, uncited outputs look fine.

HARNESS COMPOSITE

≤ 0.20

no_citations_found flag fires. citation_present < 0.5. For this task type, zero citation markers is a hard constraint violation, not a soft penalty.

Design insight: The companion test (test_citation_markers_present_but_graders_signal_unassessed) is equally important: an agent that does insert markers scores well on citation_present but citation_accurate and hallucination_flag stay at 0.5 (unassessed) until an LLM judge evaluates the substance. Form and substance are graded separately.
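A sketch of the form-versus-substance split, assuming bracketed numeric markers like [1] or author-year markers like (Smith, 2020). The regex and the function name are hypothetical; the repo's actual marker pattern may differ.

```python
import re

# Assumed marker styles: [1], [12], (Smith, 2020), (Smith et al., 2020).
_CITATION_RE = re.compile(r"\[\d+\]|\([A-Z][A-Za-z]+(?: et al\.)?,? \d{4}\)")

UNASSESSED = 0.5  # "I don't know", not "passing"


def grade_citations(text: str) -> dict:
    """Deterministic pass: check citation form only, leave substance unassessed."""
    markers = _CITATION_RE.findall(text)
    flags = [] if markers else ["no_citations_found"]
    return {
        "citation_present": 1.0 if markers else 0.0,
        # Substance needs an LLM judge; these stay at 0.5 until one runs.
        "citation_accurate": UNASSESSED,
        "hallucination_flag": UNASSESSED,
        "flags": flags,
    }


print(grade_citations("The effect is large.")["flags"])  # ['no_citations_found']
print(grade_citations("The effect is large [1].")["citation_present"])  # 1.0
```

Inserting markers only moves citation_present; the two substance dimensions do not change until a judge evaluates them.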

DOC REVISION · faithfulness held at 0.5

An agent produces a revision that is entirely in active voice, comfortably under 300 words, and structurally improved — but adds one invented claim not present in the source document. The deterministic checks all pass. No hard-fail flag fires. The composite score is moderate, not catastrophic — but faithfulness stays at 0.5 (unassessed) rather than 1.0, signaling that this output cannot be trusted without LLM judgment.

NAIVE RULE GRADER

0.88

Active voice passes. Word count passes. No hard-fail flags. A grader that only checks deterministic dimensions would fully reward this output.

HARNESS COMPOSITE

0.3–0.9

faithfulness = 0.5 (unassessed, not passing). The harness flags the gap explicitly rather than papering over it. The output cannot be promoted to high reward without LLM assessment of faithfulness.

Design insight: This is the most important case. 0.5 means "I don't know" — not "passing." The harness is designed to make the gap between "the deterministic checks passed" and "this output is safe to reward" explicit at every step. An RL loop that treats 0.5 as passing would learn to game the deterministic dimensions while freely adding unsupported claims.
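One way to keep "unassessed" from being read as "passing" is to carry the assessment status next to the score. This is a sketch under that assumption; the DimScore type, the unweighted mean, and the can_promote gate are hypothetical names, not the harness's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimScore:
    value: float
    assessed: bool = True


def unassessed() -> DimScore:
    """Placeholder score for a dimension no grader has evaluated yet."""
    return DimScore(0.5, assessed=False)


def mean_composite(scores: dict[str, DimScore]) -> float:
    """Unweighted mean, used here only to keep the example short."""
    return sum(s.value for s in scores.values()) / len(scores)


def can_promote(scores: dict[str, DimScore], required: tuple[str, ...] = ("faithfulness",)) -> bool:
    """An output is eligible for high reward only if required dimensions were assessed."""
    return all(scores[name].assessed for name in required)


scores = {
    "active_voice": DimScore(1.0),
    "word_count": DimScore(1.0),
    "faithfulness": unassessed(),
}
print(round(mean_composite(scores), 2))  # 0.83
print(can_promote(scores))  # False
```

An RL loop that reads only the scalar sees a moderate 0.83; the explicit assessed flag is what blocks promotion until a judge has looked at faithfulness.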

04 Where the deterministic / LLM boundary sits

The principle: cheap and transparent beats clever and opaque. Every dimension that has a recoverable ground truth uses a deterministic check. LLM judges are reserved for dimensions that genuinely require judgment — and are called once per dimension with a specific rubric, never holistically.

DETERMINISTIC

word count · passive voice ratio

CSV parseability · row count preserved

unit consistency · column completeness

citation marker presence

LLM JUDGE (per-dimension)

faithfulness — did new claims enter?

quality_delta — is the output actually better?

citation_accurate — does the cite trace to source?

hallucination_flag — stronger than source supports?

Per-dimension scoring matters because holistic judge scores are gameable in a way that decomposed ones are not. An agent can learn what a judge finds "overall good" and optimize toward that surface. A rubric that asks only "is this faithful to the source, score 0–1" with explicit scoring criteria is significantly harder to systematically exploit.
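A sketch of a per-dimension judge call, with the model passed in as a plain callable so the example runs offline. The rubric wording, the SCORE reply format, and the function names are illustrative assumptions; the repo's actual rubrics are not reproduced here.

```python
import re
from typing import Callable

UNASSESSED = 0.5

# Illustrative rubric wording only.
RUBRICS = {
    "faithfulness": "Score 1.0 if every claim in OUTPUT is supported by SOURCE, "
                    "0.0 if OUTPUT adds unsupported claims.",
    "quality_delta": "Score 1.0 if OUTPUT is clearly better written than SOURCE, "
                     "0.5 if equivalent, 0.0 if worse.",
}


def build_prompt(dimension: str, source: str, output: str) -> str:
    """Build a prompt that asks about exactly one dimension."""
    return (
        f"Rate only this dimension: {dimension}.\n"
        f"Rubric: {RUBRICS[dimension]}\n"
        f"SOURCE:\n{source}\n\nOUTPUT:\n{output}\n\n"
        "Reply with a line of the form 'SCORE: <number between 0 and 1>'."
    )


def judge_dimension(dimension: str, source: str, output: str,
                    call_model: Callable[[str], str]) -> float:
    """Call the judge once for one dimension and parse a clamped score."""
    reply = call_model(build_prompt(dimension, source, output))
    match = re.search(r"SCORE:\s*(\d*\.?\d+)", reply)
    if match is None:
        return UNASSESSED  # an unparseable reply is not a pass
    return min(1.0, max(0.0, float(match.group(1))))


fake_judge = lambda prompt: "SCORE: 0.25"
print(judge_dimension("faithfulness", "source text", "revised text", fake_judge))  # 0.25
```

In a live run, call_model would wrap whatever client the harness uses for the optional LLM judge; keeping it injectable means the deterministic-only mode never needs credentials.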

Each deterministic heuristic has a # Design note: comment in the source documenting its known failure mode and why it is used anyway. The passive voice regex misses complex constructions. The row count check does not verify values. Citation presence checks markers, not accuracy. The point is not that these heuristics are perfect — it is that their failure modes are visible and documented, unlike a holistic judge whose failures are opaque.

05 The open problem I didn't solve

Inter-rater reliability. The LLM judge's rubrics have not been calibrated against human raters, and calibration is essential before using a harness like this for RL training.

What makes this especially hard for document quality is that rater disagreement is asymmetric: raters converge reliably on bad outputs but diverge significantly on what "good" looks like. A calibration set that samples uniformly from the output distribution will be dominated by mediocre examples where raters agree — and will not cover the high-quality range where the disagreement that actually matters for RL reward lives. Calibration requires oversampling that high-quality tail, which requires either a capable generator or manual curation. Neither is free.

This is the gap between "a principled harness" and "a harness you can actually train with." The adversarial cases demonstrate the failure modes the harness was designed to catch. They do not demonstrate that the harness catches them consistently across a real output distribution. That requires calibration data this harness does not have.
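The oversampling idea can be sketched as stratified sampling over provisional score bands, so that the rare high-quality band gets as many calibration slots as the common low band. The band edges, the use of a provisional score to assign bands, and the function names are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def band_of(score: float, edges: tuple[float, ...] = (0.5, 0.8)) -> int:
    """Map a provisional score to band 0 (low), 1 (mid), or 2 (high)."""
    return sum(score >= e for e in edges)


def calibration_sample(scored: list[tuple[int, float]], per_band: int, seed: int = 0) -> dict[int, list[int]]:
    """Draw up to `per_band` item ids from each band, regardless of band size."""
    rng = random.Random(seed)
    by_band = defaultdict(list)
    for item_id, score in scored:
        by_band[band_of(score)].append(item_id)
    return {
        band: rng.sample(ids, min(per_band, len(ids)))
        for band, ids in by_band.items()
    }


# 80 low, 15 mid, and 5 high outputs: uniform sampling would rarely hit the high band.
scored = (
    [(i, 0.4) for i in range(80)]
    + [(i, 0.6) for i in range(80, 95)]
    + [(i, 0.9) for i in range(95, 100)]
)
sample = calibration_sample(scored, per_band=5)
print({band: len(ids) for band, ids in sorted(sample.items())})  # {0: 5, 1: 5, 2: 5}
```

The sketch only rebalances what already exists; if the high band is nearly empty, filling it still requires the capable generator or manual curation described above.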

06 What's next

now Calibrate the LLM judge rubrics against a small human-rated set — even 30–50 examples per dimension would establish whether the per-dimension scoring is systematic or noisy.
now Parameterize the spreadsheet task for synthetic task generation — variable row counts, unit mixing ratios, blank-row density. The other task types follow once the generation pattern is established.
soon Wire the harness to a small RL loop using GRPO or similar. SFT and DPO already ran (five cycles; none promoted — see §09); RL with grader-as-reward is the natural next instrument for the row-preservation problem. It is intentionally out of scope for this iteration of the lab, but remains a well-formed open question for future work.
soon Multi-turn tasks. All three current tasks assume single-turn completion. Document editing where the agent revises, receives feedback, and revises again is a harder environment to reward correctly and is the realistic virtual collaborator scenario.
later Connect to trace — the companion project. Trace generates structured longitudinal memory; collab-eval provides the reward grounding for training a model to work with that memory. The two repos are currently independent; the connection is the fine-tuning pipeline in iris-ft-lab.
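For the spreadsheet parameterization item in the list above, a generator could look roughly like this. The column names, the M-notation format, and the rule that blank rows do not count toward the expected row count are assumptions, not the repo's task schema.

```python
import csv
import io
import random


def make_spreadsheet_task(n_rows: int = 12, m_notation_ratio: float = 0.5,
                          blank_row_density: float = 0.2, seed: int = 0) -> tuple[str, int]:
    """Return (messy_csv_text, expected_row_count) for one synthetic cleanup task."""
    rng = random.Random(seed)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "date", "amount", "notes"])
    for i in range(n_rows):
        dollars = rng.randrange(100_000, 5_000_000, 100_000)
        # Mix plain dollar amounts with M-notation such as $1.8M.
        if rng.random() < m_notation_ratio:
            amount = f"${dollars / 1_000_000:.1f}M"
        else:
            amount = str(dollars)
        writer.writerow([i + 1, f"2026-01-{(i % 28) + 1:02d}", amount, ""])
        # Occasionally follow a data row with a fully blank row.
        if rng.random() < blank_row_density:
            writer.writerow(["", "", "", ""])
    # Ground truth for the preservation check: every data row must survive cleanup.
    return buf.getvalue(), n_rows


text, expected = make_spreadsheet_task(n_rows=20, seed=1)
print(expected)  # 20
```

Because the expected row count is returned with the task, each generated case carries its own ground truth for the preservation check.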

07 Why I built this

I've spent three years watching graders fail in production. The failure modes in this harness are not hypothetical — they are the same class of failures that surface repeatedly in multi-step systems when the reward signal is under-specified: agents that satisfy the letter of the constraint while violating the intent, outputs that pass every check and are still wrong, graders that report high scores on exactly the cases where they should be most skeptical.

Most of the infrastructure work I've done has been on the runtime side — memory, orchestration, evaluation at serving time. This project is an attempt to get upstream of those problems: to think about reward design at training time, before the policy is baked. The failure modes look different from that angle. They're still recognizable.

The design note at docs/rl_env_design.md covers the full reward decomposition strategy, extension paths to a real RL loop, and the open problems the harness does not address. The code is in iris-ft-lab/collab-eval.

08 Deterministic eval suite

✓ offline pytest suite · ⟳ 3 task types · 35 scored eval cases · base composite 0.67 · det-only · FT v1 row pending

The eval suite has since been extended. tests/test_eval_extended.py adds 35 scored cases across all three task types (12 doc revision, 12 spreadsheet clean, 11 citation grounding), with 9 reward-hacking probes distributed across the task types. These complement the four canonical adversarial demonstrations in tests/test_reward_hacking_cases.py. The suite is designed to run offline in deterministic-only mode, with no Anthropic credentials required.

EVAL SUITE

35 scored cases · 4 canonical adversarial demos

RH PROBES

9 scored RH probes · plus canonical demos

BASE SCORE

0.67 mean composite · det-only · synthetic range

scripts/score_eval.py runs 35 curated synthetic outputs — spanning ideal through catastrophic, with deliberate reward-hacking attempts — through the deterministic graders and writes results/eval_results_v1.md. The 0.67 mean composite reflects that curated distribution. It is not a real model inference result. The synthetic outputs are stand-ins for the quality range a base model might produce before any RL training; the harness was designed to score real model outputs once a checkpoint exists.

The FT v1 row in results/eval_results_v1.md was originally a labeled placeholder pending a training run. That training has since happened — see §09 for the trajectory and the result.

Four new reward-hacking probe types extend coverage beyond the original four cases. The irregular passives probe confirms that _PASSIVE_RE does not catch past participles like written, drawn, set, sent: irregular forms that do not end in -ed fall through the regex entirely. The unit-in-notes probe confirms that $1.8M does not match the \$M forbidden pattern. There is no $M substring in the string $1.8M, so numeric cells can be correctly converted while the original M-notation hides in the Notes field undetected. The row-duplication and citation-argument-inversion probes cover preservation gaming and citation form without substance, respectively. All four score as expected: the grader is fooled, and the test confirms it. These are known limitations, now executable. In the extended results, some RH probes intentionally retain high deterministic composite scores; those are not claimed successes. They are executable blind-spot demonstrations showing where row authenticity, source diversity, or semantic inversion require stronger checks or LLM/human judgment.
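The two regex blind spots are easy to reproduce in isolation. The patterns below are assumed approximations reconstructed from the behavior described above (a form of "to be" followed by a word ending in -ed, and a literal $M); the repo's actual patterns may differ.

```python
import re

# Assumed approximations of the repo's heuristics, not copies of them.
_PASSIVE_RE = re.compile(r"\b(?:is|are|was|were|be|been|being)\s+\w+ed\b", re.IGNORECASE)
_FORBIDDEN_UNIT_RE = re.compile(r"\$M")


def has_passive(text: str) -> bool:
    """True if the heuristic sees a be-verb followed by an -ed participle."""
    return _PASSIVE_RE.search(text) is not None


def has_forbidden_unit(cell: str) -> bool:
    """True if the literal substring $M appears in the cell."""
    return _FORBIDDEN_UNIT_RE.search(cell) is not None


print(has_passive("The draft was reviewed by the editor."))  # True
print(has_passive("The draft was written by the editor."))   # False: irregular participle
print(has_forbidden_unit("Revenue in $M"))                   # True
print(has_forbidden_unit("$1.8M"))                           # False: digits sit between $ and M
```

Both misses are structural rather than accidental: each pattern matches a surface form, and the probes pick inputs that carry the same meaning in a different surface form.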

The public repo now keeps the detailed run artifact in results/eval_results_v1.md. That file carries the base/FT v1 table, the deterministic-only scoring summary, the explicit FT placeholder, and the per-dimension breakdown. The builder's note is therefore the public-facing narrative layer; the repo remains the runnable artifact and result log.

09 Fine-tuning trajectory

Five training cycles ran against the spreadsheet_clean environment. SFT v0 mode-collapsed under a too-aggressive recipe (rank 8, zero dropout, 1000 iters); v1 fixed the recipe and lifted the regular composite to 0.99, but stress data_preservation came in at 0.25; v2 doubled the stress training data and stayed flat at 0.25; v3 ran a two-phase curriculum (stress-only first, then mixed at lower LR) and stayed flat at 0.25; DPO v0 trained on 80 preserve-vs-drop preference pairs, learned the discrimination trivially (loss collapse to 0.001 by iter 30, train accuracy 1.0), and regressed preservation to 0.20 with collateral damage to unit_consistency. Across four targeted preservation interventions spanning two instrument families, the metric did not move. Co-located metrics on the same eval (unit_consistency, format_validity) transferred cleanly, ruling out recipe or capacity as the bottleneck. The headline finding is the structural reason: preservation is a generation-time policy decision, while SFT and DPO supervise pre-computed outputs. The full row-preservation trajectory and analysis are in When SFT and DPO could not teach “don’t drop the row”; the cross-cutting lessons across all 15 catalogued failures are in Knowing, Doing, Deciding — applied to a fine-tuning lab.

done Five training cycles run against the spreadsheet_clean environment: SFT v0 (mode-collapsed), SFT v1 (gentler recipe), SFT v2 (more stress data), SFT v3 (curriculum learning), and DPO v0 (preference learning over preserve-vs-drop pairs). All five NOT PROMOTED. Stress data_preservation stayed in the 0.20–0.25 band across every run; the 0.85 promotion gate held the line. Full per-run reports under collab-eval/results/ in iris-axon-lab/iris-ft-lab.
done Identified the structural reason: row preservation is a generation-time policy decision, and SFT positive demonstrations + DPO preference pairs both supervise pre-computed outputs. Neither instrument reaches the per-step generation decision. The five-reason analysis lives in collab-eval/docs/preservation_analysis.md.
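The promotion decision itself reduces to a threshold check. The sketch below assumes the 0.85 gate is applied to stress data_preservation and uses only the per-run figures stated in this note; SFT v0 is left out because no exact value is given for it. The run keys and the function name are hypothetical.

```python
PROMOTION_GATE = 0.85

# Stress data_preservation values as reported above. SFT v0 is omitted
# because the note gives no exact figure for that run.
STRESS_PRESERVATION = {
    "sft_v1": 0.25,
    "sft_v2": 0.25,
    "sft_v3": 0.25,
    "dpo_v0": 0.20,
}


def promoted(score: float, gate: float = PROMOTION_GATE) -> bool:
    """A run is promoted only if the gated metric clears the threshold."""
    return score >= gate


decisions = {run: promoted(score) for run, score in STRESS_PRESERVATION.items()}
print(any(decisions.values()))  # False
```

Keeping the gate as one fixed constant is what lets a flat metric across several runs read as a result rather than as noise.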

10 Next steps & open questions

The harness-level next steps from §06 still apply — calibrate LLM judge rubrics, parameterize task generation, wire a real RL loop — but the trajectory adds two lab-level questions worth flagging.

intentionally out of scope The next standard instrument for row preservation would be RL with grader-as-reward (PPO or GRPO via mlx-lm-lora’s --train-mode grpo). The hypothesis is well-defined: rollout-time feedback should cross the Knowing/Doing gap that SFT and DPO could not. Iterating further on this single target is intentionally left out of scope for this iteration of the lab — the catalog of negative results is already the contribution. The experiment remains a well-formed open question for future work.
soon Configure the LLM judge for the document-revision and citation-grounding environments. Four dimensions (faithfulness, quality_delta, citation_accurate, hallucination_flag) still hold at 0.5 in deterministic-only mode. The spreadsheet-cleanup track is closed; the document-task tracks remain open for LLM-judge wiring.
later Extend the trajectory-eval taxonomy to training trajectories explicitly. The lab’s promotion gates, bucket classifications, and cumulative flat-metric signaling all behave like trajectory-level eval surfaces, but they’re scattered across four reports and one catalog. A second-pass write-up would naturally unify them as “trajectory-level eval for training”.
