Builder's Notes  ·  April 2026

knowing, doing, deciding:
applied to a fine-tuning lab

15 failure modes · K/D/D framework · trajectory eval

Fifteen catalogued failures from a small Apple Silicon fine-tuning lab, mapped onto the Agent Epistemic Integrity framework. The 6/3/6 split — six Knowing failures, three Doing, six Deciding — is itself a finding: most pain accumulated at the upstream and downstream ends of the coupling chain. This note steps back from the catalog to ask what the pattern says about training-instrument selection and how trajectory-level eval extends to training trajectories themselves.

01 The setup

The iris-ft-lab repo is a small Apple Silicon fine-tuning lab with two parallel tracks. Track A is Trace memory extraction — a token-level schema-classification problem where SFT promoted from 0/12 to 9/12 on a 12-case gold eval and DPO took the result to 12/12 with no regressions. Track B is collab-eval document tasks — a harness for spreadsheet cleanup, document revision, and citation grounding, documented in collab-eval — Reward Design for Document Tasks. Track B ran four SFT cycles and one DPO cycle. None promoted.

Across both tracks I catalogued fifteen distinct failures — the kind of bug that takes a session each to diagnose and fix — in FAILURE_MODES.md. This note steps back from the flat catalog and asks: what do those fifteen modes look like through the Knowing/Doing/Deciding lens, and what does the pattern say about evaluating long-running training trajectories?

02 The framework, briefly

The Agent Epistemic Integrity whitepaper proposes three dimensions of runtime agent failure. Knowing is belief management — what the agent holds true about the world, itself, and its history. Doing is capability management — what cumulative effects its actions have produced and what remains valid to invoke. Deciding is planning and goal reasoning — how it selects and sequences actions given the prior two.

The three dimensions are coupled: stale beliefs drive invalid decisions, which produce contradictory actions. The framework was developed for runtime agents, but the same coupling structure shows up in any long-running iterative process — including the meta-trajectory of building a training pipeline across many sessions.

03 The fifteen modes mapped

Six of the lab’s failures were Knowing failures: wrong beliefs about tools (mlx_lm fuse silently no-ops on 4-bit weights without --dequantize), about file states (stale shards conflicting with new ones after re-fuse), or about data quality (gold consistently demonstrating row-deletion rather than preserve-and-convert across 147 of 240 cases).

Three were Doing failures: harness timeouts (a 10-minute Bash cap against a 75-minute training run), prompt-orchestration assumptions (a YAML file described in a code block but never actually written to disk), and cumulative recipe effects (rank × dropout × iters compounding into mode collapse).

Six were Deciding failures: unreachable promotion gates (composite +0.10 from a 0.96 base), wall-time hypothesis errors (estimated by iters when seq² dominates), and four interlocking modes documenting the cumulative exhaustion of two instrument families on a single target dimension. The 6/3/6 split is itself a finding: most pain accumulated at the upstream (Knowing) and downstream (Deciding) ends of the coupling chain, while Doing failures were loud and one-session-bounded.
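The first two Deciding failures are cheap to check before a run starts. The sketch below is illustrative only. It assumes the composite score is capped at 1.0 (the ceiling is not stated above) and that wall time scales roughly as iters × seq², ignoring linear terms. Both function names are hypothetical.

```python
def gate_is_reachable(base: float, required_gain: float, ceiling: float = 1.0) -> bool:
    """True if the required gain fits under the metric's ceiling (assumed 1.0)."""
    return base + required_gain <= ceiling + 1e-9


def relative_wall_time(iters: int, seq_len: int) -> float:
    """Unitless cost proxy, assuming the seq_len**2 term dominates."""
    return float(iters) * float(seq_len) ** 2


# A +0.10 gate on a 0.96 base asks for 1.06 on a metric assumed to cap at 1.0.
print(gate_is_reachable(0.96, 0.10))  # False
# Halving iters while doubling sequence length doubles the cost proxy.
print(relative_wall_time(500, 2048) / relative_wall_time(1000, 1024))  # 2.0
```

Running both checks when a gate or a time budget is first written down would flag these two modes before any training cycle is spent on them.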

04 How each dimension surfaces

Knowing failures are the hardest to detect because they don’t surface as errors. The mlx_lm fuse defect trained a numerically-perfect adapter (loss 0.693 → 0.001, val accuracy 1.0, margin 7.8) on a meaningless foundation; only the §4.1 hard-stop rule — baseline must reproduce 9/12 on the gold eval — caught it. Without that floor the next session would have iterated on β and learning rate chasing a phantom DPO bug.
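The starting loss of 0.693 is ln 2, which is what the standard DPO sigmoid loss gives when the margin between chosen and rejected is zero. A clean descent from there says only that the pairs became separable, not that the base model was sound. The sketch below assumes the lab used that standard loss and treats the margin as already β-scaled. The floor check is a hypothetical rendering of the §4.1 rule, not the lab's actual code.

```python
import math


def dpo_loss(margin: float) -> float:
    """-log(sigmoid(margin)), computed stably as softplus(-margin)."""
    x = -margin
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def check_baseline_floor(passed: int, expected: int = 9, total: int = 12) -> None:
    """Hard stop: refuse to continue if the baseline does not reproduce its score."""
    if passed < expected:
        raise RuntimeError(
            f"baseline scored {passed}/{total}, expected at least {expected}/{total}"
        )


print(dpo_loss(0.0))          # ln 2, about 0.693
print(dpo_loss(7.8) < 0.001)  # True: a large margin drives the loss toward zero
check_baseline_floor(9)       # passes silently
```

The loss function has no access to the gold eval, so only a check like the second one can catch a base model that never received the SFT weights.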

Doing failures are the loudest. The 10-minute Bash cap reported training as done while the orphaned process was still running; once spotted, the fix was the run_in_background polling pattern with an until grep -q "Iter N:" loop. Mode collapse during SFT v0 was visible from the first generated sample: outputs converged to 2444,1444,4444,4444 by row 12. v1’s gentler recipe (rank 4, dropout 0.05, fewer iters) fixed it in one cycle.
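The polling fix was a shell loop. For consistency with the other sketches in this note, here is a rough Python equivalent. It assumes the training process appends progress lines such as "Iter 1000:" to a log file. The function name, the timeout, and the poll interval are hypothetical.

```python
import time
from pathlib import Path


def wait_for_marker(log_path, marker: str, timeout_s: float = 3600.0,
                    poll_s: float = 5.0) -> bool:
    """Poll a log file until `marker` appears; return False on timeout."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout_s
    while True:
        # Re-read the whole file each time; training logs are small.
        if path.exists() and marker in path.read_text(errors="replace"):
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(poll_s, remaining))
```

A call such as wait_for_marker("train.log", "Iter 1000:") would block until the final iteration is logged, so the harness never mistakes its own timeout for completion. The log file name here is illustrative.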

Deciding failures are the deepest. The cumulative-exhaustion verdict on row preservation (three SFT recipes plus one DPO run, all flat at 0.25 on stress data_preservation) took five experimental cycles to surface. By the time the verdict landed, the right next move was a class-level instrument switch, not within-class tuning. The companion post, When SFT and DPO could not teach “don’t drop the row”, walks through the row-preservation trajectory in detail.

05 Connection to trajectory-level eval

The trajectory-eval taxonomy proposes five eval dimensions for long-running agents: epistemic drift rate (Knowing), interruption and resumption fidelity (Doing), mid-task replanning quality (Deciding), goal drift detection (Deciding), and multi-agent handoff fidelity (cross-cutting). The first four each map to a single Knowing/Doing/Deciding dimension; the fifth spans all three. A useful translation back to training trajectories: the lab’s five-condition promotion gate is essentially trajectory-level evaluation. Its conditions are no composite regression, no per-dimension regression, no RH-like increase, at least one dimension improving by ≥ 0.02, and stress data_preservation ≥ 0.85. It catches what turn-level results-table comparisons miss.
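The five conditions can be written as one function that reports each check separately, so a failed promotion names its cause. This is a minimal sketch, not the lab's harness. The EvalSummary fields, the dimension names, and the small float tolerance are assumptions; only the two thresholds (0.02 and 0.85) come from the gate as described.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EvalSummary:
    composite: float
    dims: Dict[str, float]           # per-dimension scores, same keys in both runs
    rh_like: float                   # RH-like measure, as the lab defines it
    stress_data_preservation: float


def promotion_gate(base: EvalSummary, cand: EvalSummary,
                   min_dim_gain: float = 0.02,
                   preservation_floor: float = 0.85,
                   eps: float = 1e-9) -> Dict[str, bool]:
    """Evaluate the five conditions; promotion requires all of them."""
    return {
        "composite_no_regression": cand.composite >= base.composite - eps,
        "dim_no_regression": all(cand.dims[k] >= v - eps for k, v in base.dims.items()),
        "rh_like_no_increase": cand.rh_like <= base.rh_like + eps,
        "one_dim_improves": any(cand.dims[k] - v >= min_dim_gain - eps
                                for k, v in base.dims.items()),
        "stress_preservation_floor": cand.stress_data_preservation >= preservation_floor - eps,
    }


base = EvalSummary(0.96, {"format": 0.50, "grounding": 0.60}, 0.0, 0.25)
cand = EvalSummary(0.96, {"format": 0.55, "grounding": 0.60}, 0.0, 0.25)
checks = promotion_gate(base, cand)
print(all(checks.values()), [name for name, ok in checks.items() if not ok])
```

In this made-up example the candidate improves one dimension and regresses nowhere, yet still fails on the preservation floor alone. A single composite number would hide that.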

The bucket classification — A “promoted” / B “with progress” / C “with concern” / D “hypothesis refuted” — emerged precisely because binary turn-level PROMOTED/NOT-PROMOTED throws away the trajectory information. A flat metric across multiple varied configurations is a trajectory signal, and naming it as one ruled out three more weeks of within-class iteration.
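The flat-metric signal is simple enough to compute across runs. The sketch below assumes one scalar per run for the target dimension. The tolerance and the minimum run count are hypothetical knobs, not values the lab used.

```python
from typing import Sequence


def is_flat_across_runs(values: Sequence[float], tol: float = 0.01,
                        min_runs: int = 3) -> bool:
    """True if enough varied runs landed within `tol` of each other."""
    if len(values) < min_runs:
        return False  # too few runs to call it a trajectory signal
    return max(values) - min(values) <= tol


# Three SFT recipes plus one DPO run, all at 0.25 on stress data_preservation.
print(is_flat_across_runs([0.25, 0.25, 0.25, 0.25]))  # True
```

The check only means something when the configurations differ in substance. Four reruns of one recipe landing on the same score would say nothing about the instrument class.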

06 Training instruments through the K/D/D lens

Training-instrument selection has a Knowing/Doing/Deciding structure of its own. SFT positive demonstrations operate on the Knowing layer — the model learns what tokens to emit given a prompt, derived from gold sequences. DPO preference pairs operate on the Knowing layer with discrimination — the model learns to assign higher likelihood to one fixed output than another. Neither operates on the Doing layer, where per-step generation decisions are made. RL with reward operates on Doing — rollout, observe, update.

The lab’s row-preservation failure was a Doing-layer behavior with Knowing-layer instruments applied; no amount of recipe tuning, data rebalancing, or curriculum ordering crossed that gap. The eval-tightening direction follows the same structure: Knowing-instrument failures need data audits and gate calibration; Doing-instrument failures need rollout-time observability; Deciding failures need bucket classifications and cumulative-trajectory tracking. None of this is novel individually. The framework gives a vocabulary for organizing it.

07 What’s next

Two threads worth pulling, neither in this iteration. One: extend the trajectory-eval taxonomy to training trajectories explicitly. The lab already has the patterns — promotion gates, bucket classifications, cumulative flat-metric signaling — but they’re scattered across four collab-eval reports and a FAILURE_MODES.md catalog. A second-pass writing post unifying them as “trajectory-level eval for training” is the natural follow-on to the original turn-level-vs-trajectory taxonomy.

Two: try the missing instrument. Row preservation as an RL-with-grader experiment is a small Mac-feasible run — 80–200 rollouts per update, the existing harness as the reward, KL-anchored at v3 SFT. The result would either confirm the lab’s diagnosis (RL crosses the Knowing/Doing gap; preservation lifts above 0.25) or surface a deeper structural issue (the grader itself rewards the wrong policy). Either outcome would close the loop on the negative-result trajectory documented here. I’m not running it now — the catalog is already the contribution — but the experimental hypothesis is well-defined enough that “open question” is a clean handoff state.
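One common way to implement "KL-anchored" is to subtract a scaled log-probability ratio against the anchor model from the grader score. The sketch below shows that shaping only. It is an assumption about how the experiment might be set up, not a description of an existing run, and the function name and β value are hypothetical.

```python
from typing import Sequence


def kl_penalized_reward(grader_score: float,
                        policy_logprobs: Sequence[float],
                        ref_logprobs: Sequence[float],
                        beta: float = 0.1) -> float:
    """Grader score minus beta times the summed per-token log ratio.

    The summed log ratio over a sampled rollout is a single-sample
    estimate of the KL divergence from the anchor (here, the v3 SFT model).
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("per-token logprob sequences must have equal length")
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return grader_score - beta * log_ratio


# Policy identical to the anchor: no penalty, the reward is the grader score.
print(kl_penalized_reward(1.0, [-0.5, -0.6], [-0.5, -0.6]))  # 1.0
```

With this shaping, a rollout that scores well with the grader but drifts far from the v3 SFT distribution earns less than one that scores equally well while staying close.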

Five training cycles, fifteen failure modes, three dimensions. The framework was already there; the lab made it concrete.
