Notes — Iris Shen

Builder's notes

Knowing, Doing, Deciding — applied to a fine-tuning lab ↗

Fifteen catalogued failures from the iris-ft-lab fine-tuning lab map onto the Agent Epistemic Integrity framework with a 6/3/6 split. Most pain accumulated at the upstream and downstream ends of the coupling chain. What the pattern says about training-instrument selection and trajectory-level evaluation.

collab-eval — Reward Design for Document Tasks ↗

Builder's notes for collab-eval — a task environment and grader harness for open-ended document manipulation tasks, built to make reward design explicit. The core question: what does it take to build a grading system an RL loop can trust?

When SFT and DPO could not teach "don't drop the row" ↗

The row-preservation chapter of the collab-eval fine-tuning trajectory: four SFT runs and one DPO run, none promoted. Stress data_preservation flat at 0.20–0.25 across every cycle. SFT and DPO supervise pre-computed outputs; row preservation is a generation-time policy decision.

Trace — Initial Version ↗

Builder's notes for Trace — a local-first witness agent that reads your AI conversation history and generates weekly narrative dispatches. Not a summary or dashboard, but a short-form piece of writing that surfaces patterns, tensions, and the distance between intention and action across time.

Learner's notes

How Modern AI Reads: Three Ways to Solve the Attention Problem ↗

A learner's guide to the three main strategies modern AI uses to handle long context — dense attention, sliding window, and memory-augmented retrieval — and the tradeoffs each makes between speed, coverage, and cost.

On agentic systems

The hardest part of long-running tasks isn't execution — it's knowing when to stop, restart, or ask.
Orchestration failures are almost always memory failures in disguise.
"Tool calling" is not a skill. Knowing which tool, when, with what confidence — that's the skill.

On evaluation

If your eval takes two weeks of human annotation, it will never be run often enough to matter.
Swarm-based evals are underrated. Specialized critic agents catch what generalist reviewers miss.
Metrics lie. Distributions tell the truth.

Longer treatment: Trajectory-Level Eval Taxonomy →

On research and writing

The whitepaper that changes one person's mental model is worth more than the deck that impresses twenty.
Naming things precisely is 80% of the intellectual work.
Half-baked ideas, written down, are worth more than fully-baked ideas that stay in your head.

Updated irregularly · Last touched April 2026