When SFT and DPO could not teach “don’t drop the row”

Iris Shen · April 2026 · iris-axon-lab.github.io

This is the row-preservation chapter of a five-cycle training trajectory run on iris-ft-lab, a small Apple Silicon fine-tuning lab. The harness it tests against is collab-eval, a document-task grader documented in collab-eval — Reward Design for Document Tasks. Across four SFT runs and one DPO run, none of which promoted, the failure was specific and structural: SFT and DPO supervise pre-computed outputs, but row preservation is a generation-time policy decision. The Knowing/Doing/Deciding framing of this and the lab’s other failures lives in Knowing, Doing, Deciding — applied to a fine-tuning lab. This post is the trajectory-specific narrative.

1. Why spreadsheet cleanup is a virtual-collaborator task

Cleaning a messy CSV is what a human spreadsheet operator does — fix the units, drop the annotation rows, normalize the headers, keep all the real data. It’s one of the few document tasks with a near-deterministic ground truth: rows in, rows out, units consistent. That makes it a clean test bed for fine-tuning experiments. The collab-eval harness measures four dimensions — data_preservation, format_validity, unit_consistency, completeness — each with a hard-fail cap for catastrophic outputs. The promotion gate adds a fifth condition: on a separate stress-eval split (15–25 row tables, all real data, gold preserves all rows), data_preservation must reach 0.85.
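
The stress-eval condition of the promotion gate can be sketched in a few lines. This is a minimal illustration, not collab-eval's actual code: the 0.85 threshold comes from the text above, while the function name `passes_stress_gate` and the choice to average per-table scores are assumptions made for the example.

```python
from statistics import mean

# Threshold stated in the post; everything else here is hypothetical.
STRESS_PRESERVATION_THRESHOLD = 0.85


def passes_stress_gate(stress_preservation_scores,
                       threshold=STRESS_PRESERVATION_THRESHOLD):
    """Return True if data_preservation on the stress split reaches the threshold.

    Assumption: the per-table scores are aggregated with a plain mean.
    The real harness may aggregate differently.
    """
    if not stress_preservation_scores:
        # No stress tables scored means nothing to promote on.
        return False
    return mean(stress_preservation_scores) >= threshold


# A run that keeps most rows on every stress table clears the gate.
print(passes_stress_gate([1.0, 0.9, 0.8]))
# A run that keeps a quarter of the rows does not.
print(passes_stress_gate([0.25, 0.25, 0.25]))
```

The point of a separate gate is that the other four dimensions can all look healthy while this one check fails.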

2. Why valid CSV can still be catastrophically wrong

The most expensive failure isn’t malformed output; it’s well-formed output that quietly omits real data. An adapter that produces clean CSV with format_validity = 1.0 while dropping half the rows scores data_preservation = 0.5 — the exact “reward hacking” signature the harness exists to detect. A virtual collaborator that crashes is recoverable; the user notices. A virtual collaborator that silently drops rows is not — the user reads the clean output and trusts it. This is why the harness’s promotion gate is calibrated against the stress eval, not the regular eval: a model that nails 80 short tables but drops half the rows on a 20-row table is exactly the failure mode worth blocking.
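
To make the failure concrete, here is a small sketch of how a truncated output can score perfectly on format while losing half the data. The two metric definitions below are simplified assumptions for illustration (column-count consistency for `format_validity`, fraction of gold rows present for `data_preservation`); the harness's real scoring rules may differ.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text into a list of rows, each a list of stripped cell strings."""
    return [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(text)) if row]


def format_validity(text):
    """Assumed definition: 1.0 if every row has the header's column count, else 0.0."""
    rows = parse_csv(text)
    if not rows:
        return 0.0
    width = len(rows[0])
    return 1.0 if all(len(row) == width for row in rows) else 0.0


def data_preservation(gold_text, output_text):
    """Assumed definition: fraction of gold data rows (header excluded) found in the output."""
    gold_rows = parse_csv(gold_text)[1:]
    output_rows = {tuple(row) for row in parse_csv(output_text)[1:]}
    if not gold_rows:
        return 1.0
    kept = sum(1 for row in gold_rows if tuple(row) in output_rows)
    return kept / len(gold_rows)


gold = "region,revenue_k\nnorth,120\nsouth,95\neast,143\nwest,88\n"
truncated = "region,revenue_k\nnorth,120\nsouth,95\n"  # well-formed, two rows missing

print(format_validity(truncated))          # 1.0
print(data_preservation(gold, truncated))  # 0.5
```

Nothing in the truncated output looks wrong on its own; only the comparison against the gold rows exposes the loss.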

3. What SFT fixed: format and unit signals

Four SFT runs (v0 mode-collapsed, v1 gentler recipe, v2 with 50% stress training data, v3 with curriculum learning) all lifted unit_consistency from 0.86 to 1.00. Format validity stayed at 1.0. Composite stayed above 0.99. The pattern is consistent across the four runs: SFT teaches what the gold demonstrates literally — every preserved gold output has converted unit tokens, every preserved gold output has the right header. These are positive token-level signals. The gradient flows; the model learns. Token-level supervision works for token-level behaviors.

4. What SFT could not fix: the generation boundary

Across the four SFT runs, stress data_preservation was 0.20 → 0.25 → 0.25 → 0.25. Three different interventions, identical outcome. The reason isn’t the recipe (v1 was gentler), the data quantity (v2 doubled stress cases), or the curriculum order (v3 trained stress-only first, then mixed at lower LR). The reason is structural: “preserve this row” is a decision the model makes during generation — to keep producing tokens past row 12 when context grows long and termination becomes increasingly attractive — and SFT supervision never reaches that decision. The gold has many tokens demonstrating “convert to $K.” The gold has zero tokens explicitly demonstrating “I am choosing to continue past this row.” Preservation is the absence of a deletion the model would otherwise produce. SFT cannot supervise an absence.
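
For reference, the standard teacher-forced SFT objective makes the gap visible. This is the textbook form, assumed rather than taken from the lab's training code; here $x$ is the messy input table, $y^{*}$ is the gold output of $T$ tokens, and $\pi_\theta$ is the model being trained:

$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log \pi_\theta\!\left(y^{*}_{t} \mid x,\; y^{*}_{<t}\right)
$$

Every term is conditioned on the gold prefix $y^{*}_{<t}$. At inference the model instead conditions on its own sampled prefix $\hat{y}_{<t}$, and the choice to continue or stop is made there, on sequences this loss never scored.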

5. What DPO learned too easily: fixed-output discrimination

DPO v0 trained on 80 preference pairs — chosen = preserve all rows, rejected = drop a random fraction. Loss collapsed to 0.001 by iter 30. Train accuracy 1.0. Final reward margin 7.7. By every training metric the run was textbook. Stress data_preservation dropped to 0.20 — below the SFT v3 baseline. DPO learned the discrimination (“CSV with N rows beats CSV with N−k rows”) trivially. But that discrimination is over pre-computed sequences. At inference time, the model still has to decide whether to emit row 14 or stop. The DPO gradient sharpens log-ratios over fixed outputs; it does not rewire generation-time decisions.
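
A rough sketch of both halves of that run: how a "preserve all rows" versus "drop a random fraction" pair might be built, and the standard per-pair DPO loss computed from sequence log-probabilities. The helper names, the drop-fraction range, and `beta=0.1` are assumptions for illustration, not the lab's actual settings.

```python
import math
import random


def make_preference_pair(header, rows, rng, min_drop=0.2, max_drop=0.6):
    """Build one pair: chosen keeps every row, rejected drops a random fraction.

    The drop-fraction range is a hypothetical choice for this sketch.
    """
    if not rows:
        raise ValueError("need at least one data row")
    drop_fraction = rng.uniform(min_drop, max_drop)
    n_drop = min(len(rows), max(1, round(drop_fraction * len(rows))))
    dropped = set(rng.sample(range(len(rows)), n_drop))
    kept = [row for i, row in enumerate(rows) if i not in dropped]

    def to_csv(data_rows):
        return "\n".join([header] + data_rows) + "\n"

    return {"chosen": to_csv(rows), "rejected": to_csv(kept)}


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair, returned together with its reward margin.

    All four inputs are log-probabilities of complete, pre-computed sequences.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        loss = math.log1p(math.exp(-margin))
    else:
        loss = -margin + math.log1p(math.exp(margin))
    return loss, margin


rows = [f"item{i},{i * 10}" for i in range(20)]
pair = make_preference_pair("name,value", rows, random.Random(0))
print(pair["chosen"].count("\n"), pair["rejected"].count("\n"))

loss, margin = dpo_pair_loss(-10.0, -87.0, -10.0, -10.0, beta=0.1)
print(round(margin, 2), loss < 0.001)
```

If the reported reward margin is this beta-scaled quantity, a margin of 7.7 puts the per-pair loss at roughly 0.0005, which fits the near-zero training loss. Note that no input to the loss is a sequence the model generated itself.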

6. Why I stop here

The next standard instrument is reinforcement learning with the grader as reward: the model rolls out a generation, the grader scores it, the policy gradient updates the generation-time decision directly. That instrument lands at the right layer. I’m not running it. The lesson the trajectory has already taught — the training instrument must match where the behavior is decided — is the result. Five experimental cycles produced one durable conclusion. Another cycle would test how much an RL run improves preservation; it would not change the conclusion. Five training runs, one durable result. The negative result is the contribution.


Repo: iris-axon-lab/iris-ft-lab.
