15 failure modes · K/D/D framework · trajectory eval
The iris-ft-lab repo is a small Apple Silicon fine-tuning lab with two parallel tracks. Track A is Trace memory extraction, a token-level schema-classification problem where SFT lifted the score from 0/12 to 9/12 on a 12-case gold eval and DPO took it to 12/12 with no regressions. Track B is collab-eval document tasks: a harness for spreadsheet cleanup, document revision, and citation grounding, documented in “collab-eval — Reward Design for Document Tasks”. Track B ran four SFT cycles and one DPO cycle. None promoted.
Across both tracks I catalogued fifteen distinct failures, each the kind of bug that takes a
session to diagnose and fix, in FAILURE_MODES.md.
This note steps back from the flat catalog and asks: what do those fifteen modes look like
through the Knowing/Doing/Deciding lens, and what does the pattern say about evaluating
long-running training trajectories?
The Agent Epistemic Integrity whitepaper proposes three dimensions of runtime agent failure. Knowing is belief management — what the agent holds true about the world, itself, and its history. Doing is capability management — what cumulative effects its actions have produced and what remains valid to invoke. Deciding is planning and goal reasoning — how it selects and sequences actions given the prior two.
The three dimensions are coupled: stale beliefs drive invalid decisions, which produce contradictory actions. The framework was developed for runtime agents, but the same coupling structure shows up in any long-running iterative process — including the meta-trajectory of building a training pipeline across many sessions.
Six of the lab’s failures were Knowing failures: wrong beliefs about tools
(mlx_lm fuse silently no-ops on 4-bit weights without --dequantize),
about file states (stale shards conflicting with new ones after re-fuse), or about data quality
(gold consistently demonstrating row-deletion rather than preserve-and-convert across 147 of
240 cases).
Three were Doing failures: harness timeouts (a 10-min Bash cap against a 75-min training run), prompt orchestration assumptions (a YAML described in a code block but never actually written to disk), and cumulative recipe effects (rank × dropout × iters compounding into mode collapse).
Six were Deciding failures: unreachable promotion gates (a required composite gain of +0.10 from a 0.96 base), wall-time hypothesis errors (run time estimated from iters when the seq² term dominates), and four interlocking modes documenting the cumulative exhaustion of two instrument families on a single target dimension. The 6/3/6 split is itself a finding: most pain accumulated at the upstream (Knowing) and downstream (Deciding) ends of the coupling chain, while Doing failures were loud and bounded to a single session.
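The unreachable-gate mode comes down to a one-line arithmetic check that could run before any training starts. The sketch below is illustrative only: it assumes the composite score is bounded above by 1.0 (the note does not state the ceiling), and the function names are hypothetical rather than taken from the lab's harness.

```python
def max_possible_gain(baseline: float, metric_max: float = 1.0) -> float:
    """Headroom left between the baseline score and the metric ceiling."""
    return metric_max - baseline


def gate_is_reachable(
    baseline: float,
    required_gain: float,
    metric_max: float = 1.0,  # assumption: composite is capped at 1.0
    tol: float = 1e-9,        # absorbs floating-point rounding
) -> bool:
    """True if baseline + required_gain still fits under the ceiling."""
    return required_gain <= max_possible_gain(baseline, metric_max) + tol


if __name__ == "__main__":
    # The gate from the note: +0.10 required on top of a 0.96 base.
    print(gate_is_reachable(0.96, 0.10))
    # A smaller requirement that leaves room under the ceiling.
    print(gate_is_reachable(0.96, 0.02))
```

Under that assumption a 0.96 base leaves about 0.04 of headroom, so any gate asking for more than that is decided before the first iteration.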
Knowing failures are the hardest to detect because they don’t surface as errors. With the
mlx_lm fuse defect in place, training produced a numerically perfect adapter (loss 0.693 → 0.001,
val accuracy 1.0, margin 7.8) on a meaningless foundation; only the §4.1 hard-stop rule
(baseline must reproduce 9/12 on the gold eval) caught it. Without that floor the next
session would have iterated on β and learning rate, chasing a phantom DPO bug.
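A baseline floor of this kind can be expressed as a guard that refuses to continue when the starting checkpoint does not reproduce its known score. This is a minimal sketch, not the lab's §4.1 implementation: the exception and function names are hypothetical, and it treats "reproduce 9/12" as "pass at least 9 of 12", which is an assumption.

```python
class BaselineFloorError(RuntimeError):
    """Raised when the starting checkpoint fails to reproduce its known score."""


def assert_baseline_floor(
    passed: int,
    total: int,
    expected_passed: int = 9,   # from the note: baseline must reproduce 9/12
    expected_total: int = 12,
) -> None:
    """Hard-stop before training if the baseline eval does not hold."""
    if total != expected_total:
        raise ValueError(
            f"gold eval has {total} cases, expected {expected_total}"
        )
    if passed < expected_passed:
        raise BaselineFloorError(
            f"baseline scored {passed}/{total}, expected at least "
            f"{expected_passed}/{expected_total}; check the fused weights "
            "before tuning anything else"
        )


if __name__ == "__main__":
    assert_baseline_floor(9, 12)       # passes silently
    try:
        assert_baseline_floor(0, 12)   # a silently broken base model
    except BaselineFloorError as err:
        print(f"hard stop: {err}")
```

The point of the guard is ordering: it runs before loss curves or preference margins are inspected, so a clean-looking training run cannot mask a broken foundation.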
Doing failures are the loudest. The Bash 10-min cap returned “training done” while the
process was still running, orphaned; once spotted, the fix was the run_in_background
polling pattern with an until grep -q "Iter N:" loop. Mode collapse during SFT v0
was visible from the first generated sample — outputs converged to 2444,1444,4444,4444
by row 12. v1’s gentler recipe (rank 4, dropout 0.05, fewer iters) fixed it in one cycle.
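The polling idea behind the until grep -q "Iter N:" loop can be sketched in Python as well. This is an illustrative equivalent, not the lab's actual script: the function name, the log path, and the timing parameters are hypothetical, and it assumes the training process writes its progress lines to a plain-text log file.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_marker(
    log_path: str,
    marker: str,
    poll_seconds: float = 30.0,
    timeout_seconds: Optional[float] = None,
) -> bool:
    """Poll a log file until `marker` appears; return False on timeout."""
    path = Path(log_path)
    deadline = (
        None if timeout_seconds is None
        else time.monotonic() + timeout_seconds
    )
    while True:
        # The log may not exist yet if the background job is still starting.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # Hypothetical usage: wait up to 90 minutes for the final iteration line.
    done = wait_for_marker("train.log", "Iter 600:", timeout_seconds=90 * 60)
    print("training finished" if done else "timed out waiting for marker")
```

Keeping the trailing colon in the marker matters: "Iter 60:" does not match a line that starts with "Iter 600:", so a partial run is not mistaken for a finished one.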
Deciding failures are the deepest. The cumulative-exhaustion verdict on row preservation
— three SFT recipes plus one DPO run, all flat at 0.25 stress data_preservation
— took five experimental cycles to surface. By the time the verdict landed, the right next
move was a class-level instrument switch, not within-class tuning. The companion writing post,
“When SFT and DPO could not teach ‘don’t drop the row’”, walks the row-preservation
trajectory in detail.
The trajectory-eval taxonomy
proposes five eval dimensions for long-running agents: epistemic drift rate (Knowing),
interruption and resumption fidelity (Doing), mid-task replanning quality (Deciding), goal
drift detection (Deciding), and multi-agent handoff fidelity (cross-cutting). The first four
each map to one Knowing/Doing/Deciding dimension; the fifth spans all three. A useful
translation back to training trajectories: the lab’s five-condition promotion gate (no
composite regression, no per-dimension regression, no RH-like increase, at least one
dimension improving by ≥ 0.02, and stress data_preservation ≥ 0.85) is essentially
trajectory-level evaluation. It catches what turn-level results-table
comparisons miss.
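The five conditions can be written down as named checks so that a failed promotion reports which condition blocked it. The sketch below is an assumption-laden illustration, not the lab's gate code: the EvalSummary fields, the dimension names, and the tolerance are hypothetical, and rh_like is kept as an opaque number because the note does not define it further.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EvalSummary:
    composite: float
    dims: Dict[str, float]            # per-dimension scores (names hypothetical)
    rh_like: float                    # "RH-like" measure, lower is better
    stress_data_preservation: float   # stress-set data_preservation score


def promotion_checks(
    baseline: EvalSummary,
    candidate: EvalSummary,
    min_dim_gain: float = 0.02,   # from the note: one dim improves >= 0.02
    stress_floor: float = 0.85,   # from the note: stress data_preservation >= 0.85
    tol: float = 1e-9,            # absorbs floating-point rounding
) -> Dict[str, bool]:
    """Evaluate the five gate conditions and return each result by name."""
    if baseline.dims.keys() != candidate.dims.keys():
        raise ValueError("baseline and candidate must score the same dimensions")
    deltas = {k: candidate.dims[k] - baseline.dims[k] for k in baseline.dims}
    return {
        "composite_no_regression": candidate.composite >= baseline.composite - tol,
        "dims_no_regression": all(d >= -tol for d in deltas.values()),
        "rh_like_no_increase": candidate.rh_like <= baseline.rh_like + tol,
        "one_dim_improves": any(d >= min_dim_gain - tol for d in deltas.values()),
        "stress_preservation_floor":
            candidate.stress_data_preservation >= stress_floor - tol,
    }


def is_promoted(baseline: EvalSummary, candidate: EvalSummary) -> bool:
    """A candidate promotes only if every gate condition holds."""
    return all(promotion_checks(baseline, candidate).values())
```

Returning the checks as a dictionary rather than a single boolean keeps the blocking condition visible, which is the information a plain results-table comparison discards.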
The bucket classification — A “promoted” / B “with progress” / C “with concern” / D “hypothesis refuted” — emerged precisely because binary turn-level PROMOTED/NOT-PROMOTED throws away the trajectory information. A flat metric across multiple varied configurations is a trajectory signal, and naming it as one ruled out three more weeks of within-class iteration.
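The flat-metric signal is simple enough to check mechanically. The following is a rough sketch under stated assumptions: the minimum run count and the tolerance are hypothetical thresholds, the run names are made up for illustration, and the check covers only the flat-metric condition, not the full A/B/C/D bucket assignment, whose criteria this note does not spell out.

```python
from typing import Dict


def is_flat_across_runs(
    scores: Dict[str, float],
    min_runs: int = 3,         # assumption: fewer runs is not yet a trajectory
    tolerance: float = 0.01,   # assumption: spread below this counts as flat
) -> bool:
    """True if a metric barely moved across several varied configurations."""
    if len(scores) < min_runs:
        return False
    values = list(scores.values())
    return max(values) - min(values) <= tolerance


if __name__ == "__main__":
    # Hypothetical run names; the 0.25 values mirror the stress
    # data_preservation score reported for the row-preservation runs.
    runs = {"sft_a": 0.25, "sft_b": 0.25, "sft_c": 0.25, "dpo_a": 0.25}
    if is_flat_across_runs(runs):
        print("flat across varied configs: consider switching instrument class")
```

The check is only meaningful when the configurations genuinely differ; identical recipes producing identical scores say nothing about the instrument.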
Training-instrument selection has a Knowing/Doing/Deciding structure of its own. SFT positive demonstrations operate on the Knowing layer — the model learns what tokens to emit given a prompt, derived from gold sequences. DPO preference pairs operate on the Knowing layer with discrimination — the model learns to assign higher likelihood to one fixed output than another. Neither operates on the Doing layer, where per-step generation decisions are made. RL with reward operates on Doing — rollout, observe, update.
The lab’s row-preservation failure was a Doing-layer behavior with Knowing-layer instruments applied; no amount of recipe tuning, data rebalancing, or curriculum ordering crossed that gap. The eval-tightening direction follows the same structure: Knowing-instrument failures need data audits and gate calibration; Doing-instrument failures need rollout-time observability; Deciding failures need bucket classifications and cumulative-trajectory tracking. None of this is novel individually. The framework gives a vocabulary for organizing it.
Two threads worth pulling, neither in this iteration. One: extend the
trajectory-eval taxonomy to training trajectories explicitly. The lab already has the patterns
— promotion gates, bucket classifications, cumulative flat-metric signaling — but they’re
scattered across four collab-eval reports and a FAILURE_MODES.md catalog. A
second-pass writing post unifying them as “trajectory-level eval for training” is
the natural follow-on to the original turn-level-vs-trajectory taxonomy.
Two: try the missing instrument. Row preservation as an RL-with-grader experiment is a small Mac-feasible run — 80–200 rollouts per update, the existing harness as the reward, KL-anchored at v3 SFT. The result would either confirm the lab’s diagnosis (RL crosses the Knowing/Doing gap; preservation lifts above 0.25) or surface a deeper structural issue (the grader itself rewards the wrong policy). Either outcome would close the loop on the negative-result trajectory documented here. I’m not running it now — the catalog is already the contribution — but the experimental hypothesis is well-defined enough that “open question” is a clean handoff state.
Five training cycles, fifteen failure modes, three dimensions. The framework was already there; the lab made it concrete.