This is the row-preservation chapter of a five-cycle training trajectory run on iris-ft-lab, a small Apple Silicon fine-tuning lab. The harness it tests against is collab-eval, a document-task grader documented in “collab-eval — Reward Design for Document Tasks”. Across four SFT runs and one DPO run, none of which promoted, the failure was specific and structural: SFT and DPO supervise pre-computed outputs, but row preservation is a generation-time policy decision. The Knowing/Doing/Deciding framing of this and the lab’s other failures lives in “Knowing, Doing, Deciding — applied to a fine-tuning lab”. This post is the trajectory-specific narrative.
Cleaning a messy CSV is what a human spreadsheet operator does — fix the units, drop the
annotation rows, normalize the headers, keep all the real data. It’s one of the few document
tasks with a near-deterministic ground truth: rows in, rows out, units consistent. That makes
it a clean test bed for fine-tuning experiments. The collab-eval harness measures four
dimensions — data_preservation, format_validity, unit_consistency, completeness —
each with a hard-fail cap for catastrophic outputs. The promotion gate adds a fifth condition:
on a separate stress-eval split (15–25 row tables, all real data, gold preserves all rows),
data_preservation must reach at least 0.85.
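
The gate described above can be sketched in a few lines. This is a minimal illustration, not collab-eval's actual code: the names `EvalScores`, `should_promote`, and `REGULAR_FLOOR` are hypothetical, and the per-dimension bar on the regular eval is an assumed placeholder because the post does not state it. Only the 0.85 stress threshold comes from the text.

```python
from dataclasses import dataclass

# From the post: the stress-split data_preservation bar for promotion.
STRESS_PRESERVATION_GATE = 0.85

# Hypothetical placeholder: the per-dimension bar on the regular eval is not stated.
REGULAR_FLOOR = 0.9


@dataclass(frozen=True)
class EvalScores:
    """The four dimensions the harness measures on the regular eval."""
    data_preservation: float
    format_validity: float
    unit_consistency: float
    completeness: float


def should_promote(regular: EvalScores, stress_data_preservation: float) -> bool:
    """Promote only if all four regular dimensions and the stress condition pass."""
    regular_ok = all(
        score >= REGULAR_FLOOR
        for score in (
            regular.data_preservation,
            regular.format_validity,
            regular.unit_consistency,
            regular.completeness,
        )
    )
    stress_ok = stress_data_preservation >= STRESS_PRESERVATION_GATE
    return regular_ok and stress_ok
```

Under this sketch, an adapter with perfect regular scores still fails the gate whenever its stress data_preservation sits below 0.85, which is the blocking behavior the fifth condition is meant to provide.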
The most expensive failure isn’t malformed output; it’s well-formed output that quietly omits
real data. An adapter that produces clean CSV with format_validity = 1.0 while dropping half
the rows scores data_preservation = 0.5 — the exact “reward hacking” signature the harness
exists to detect. A virtual collaborator that crashes is recoverable; the user notices. A
virtual collaborator that silently drops rows is not — the user reads the clean output and
trusts it. This is why the harness’s promotion gate is calibrated against the stress eval, not
the regular eval: a model that nails 80 short tables but drops half the rows on a 20-row table
is exactly the failure mode worth blocking.
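
To make the signature concrete, here is a rough sketch of how a row-level preservation score could be computed. It assumes exact row matching after whitespace stripping, which may well differ from what collab-eval does; the function names and the toy table are illustrative only.

```python
import csv
import io
from collections import Counter


def parse_rows(csv_text: str) -> list[tuple[str, ...]]:
    """Parse CSV text into data rows (header excluded), stripping cell whitespace."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    rows = [tuple(cell.strip() for cell in row) for row in reader if row]
    return rows[1:]  # drop the header row


def data_preservation(gold_csv: str, output_csv: str) -> float:
    """Fraction of gold data rows that also appear in the output (multiset match)."""
    gold = Counter(parse_rows(gold_csv))
    out = Counter(parse_rows(output_csv))
    total = sum(gold.values())
    if total == 0:
        return 1.0  # nothing to preserve
    kept = sum(min(count, out[row]) for row, count in gold.items())
    return kept / total


def is_well_formed(csv_text: str) -> bool:
    """Crude format check: non-empty, and every row has the header's column count."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    rows = [row for row in reader if row]
    return bool(rows) and all(len(row) == len(rows[0]) for row in rows)


# Illustrative toy table, not data from the lab.
GOLD = "region,revenue_k\nnorth,120\nsouth,95\neast,143\nwest,88\n"
TRUNCATED = "region,revenue_k\nnorth,120\nsouth,95\n"

print(is_well_formed(TRUNCATED), data_preservation(GOLD, TRUNCATED))
```

The truncated output passes the format check while keeping only two of the four gold rows, so a format-only grader would not notice anything wrong.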
Four SFT runs (v0 mode-collapsed, v1 gentler recipe, v2 with 50% stress training data, v3
with curriculum learning) all lifted unit_consistency from 0.86 to 1.00. Format validity
stayed at 1.0. Composite stayed above 0.99. The pattern is consistent across the four runs:
SFT teaches what the gold demonstrates literally — every preserved gold output has converted
unit tokens, every preserved gold output has the right header. These are positive token-level
signals. The gradient flows; the model learns. Token-level supervision works for token-level
behaviors.
Across the four SFT runs, stress data_preservation was 0.20 → 0.25 → 0.25 → 0.25. Three
different interventions, identical outcome. The reason isn’t the recipe (v1 was gentler), the
data quantity (v2 doubled stress cases), or the curriculum order (v3 trained stress-only first,
then mixed at lower LR). The reason is structural: “preserve this row” is a decision the model
makes during generation — to keep producing tokens past row 12 when context grows long and
termination becomes increasingly attractive — and SFT supervision never reaches that decision.
The gold has many tokens demonstrating “convert to $K.” The gold has zero tokens explicitly
demonstrating “I am choosing to continue past this row.” Preservation is the absence of a
deletion the model would otherwise produce. SFT cannot supervise an absence.
DPO v0 trained on 80 preference pairs — chosen = preserve all rows, rejected = drop a random
fraction. Loss collapsed to 0.001 by iter 30. Train accuracy 1.0. Final reward margin 7.7. By
every training metric the run was textbook. Yet stress data_preservation dropped to 0.20,
below the SFT v3 baseline of 0.25. DPO learned the discrimination (“CSV with N rows beats CSV with
N−k rows”) trivially. But that discrimination is over pre-computed sequences. At inference
time, the model still has to decide whether to emit row 14 or stop. The DPO gradient sharpens
log-ratios over fixed outputs; it does not rewire generation-time decisions.
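
For reference, the standard DPO objective can be written as a small function over sequence log-probabilities. This is a sketch of the textbook formula, not the lab's training code; `beta=0.1` and the example log-probabilities are assumptions, and the function name is hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss_and_margin(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,  # assumed value; the post does not state the beta used
) -> tuple[float, float]:
    """Per-pair DPO loss and reward margin.

    Each argument is the summed log-probability of one complete, pre-written
    sequence under the policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    loss = softplus(-margin)  # equals -log(sigmoid(margin))
    return loss, margin


# Illustrative numbers only: the policy has shifted mass toward the chosen sequence.
loss, margin = dpo_loss_and_margin(-10.0, -90.0, -50.0, -50.0)
print(loss, margin)
```

Every input is the log-probability of a fixed sequence scored under teacher forcing. Nothing in the computation samples from the policy, so the loss can approach zero without any rollout ever being generated or graded.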
The next standard instrument is reinforcement learning with the grader as reward: the model rolls out a generation, the grader scores it, the policy gradient updates the generation-time decision directly. That instrument lands at the right layer. I’m not running it. The lesson the trajectory has already taught — the training instrument must match where the behavior is decided — is the result. Five experimental cycles produced one durable conclusion. Another cycle would test how much an RL run improves preservation; it would not change the conclusion. Five training runs, one durable result. The negative result is the contribution.
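
The rollout-score-update loop described above can be illustrated with a deliberately tiny REINFORCE sketch. It was not run in the lab and is not an LLM: the "policy" is a single scalar `theta` controlling the probability of continuing to the next row, and the "grader" is a stand-in that returns the fraction of rows emitted. All names and hyperparameters here are assumptions.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def rollout(theta: float, n_rows: int, rng: random.Random) -> tuple[int, bool]:
    """Emit rows until the policy samples 'stop' or all rows are out.

    Returns (rows_emitted, stopped_early).
    """
    p_continue = sigmoid(theta)
    for emitted in range(n_rows):
        if rng.random() >= p_continue:
            return emitted, True
    return n_rows, False


def train_toy_policy(
    steps: int = 2000, n_rows: int = 20, lr: float = 0.1, seed: int = 0
) -> tuple[float, float]:
    """REINFORCE with a running-mean baseline on the continue/stop decision.

    Returns the continue probability before and after training.
    """
    rng = random.Random(seed)
    theta = 0.0
    baseline = 0.0
    p_before = sigmoid(theta)
    for _ in range(steps):
        emitted, stopped = rollout(theta, n_rows, rng)
        reward = emitted / n_rows  # stand-in for the grader's preservation score
        p = sigmoid(theta)
        # d/dtheta of the trajectory log-prob: each 'continue' contributes (1 - p),
        # and the final 'stop', if one was sampled, contributes -p.
        grad_logp = emitted * (1.0 - p) - (p if stopped else 0.0)
        theta += lr * (reward - baseline) * grad_logp
        baseline += 0.05 * (reward - baseline)
    return p_before, sigmoid(theta)


p_before, p_after = train_toy_policy()
print(p_before, p_after)
```

The point of the toy is where the gradient lands: the update is computed from actions the policy itself sampled, so the stop-or-continue decision is what gets reinforced, rather than the likelihood of a pre-written sequence.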
Repo: iris-axon-lab/iris-ft-lab.