A runnable grader and training lab for open-ended document tasks: document revision, spreadsheet cleanup, and citation grounding. It started as a reward-design harness and now includes a five-cycle training trajectory: SFT v0–v3 and DPO v0 on spreadsheet row preservation. None of the five cycles was promoted; the result is a negative finding about instrument mismatch: SFT/DPO over precomputed outputs did not reach a generation-time row-preservation policy. Code: iris-ft-lab/collab-eval ↗ · Builder note · Failure modes · Row-preservation writeup
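To make "row preservation" concrete, here is a minimal Python sketch of one way such a check could be scored. The function name `row_preservation_score`, the use of a key column, and the scoring rule (the fraction of source rows whose key still appears after cleanup) are all assumptions for illustration, not the lab's actual grader.

```python
from typing import Dict, List

Row = Dict[str, str]


def row_preservation_score(source: List[Row], cleaned: List[Row], key: str) -> float:
    """Fraction of source rows whose key value survives in the cleaned output.

    Hypothetical scoring rule: a row counts as preserved if its key value
    still appears somewhere in the cleaned sheet, whatever else changed.
    """
    if not source:
        return 1.0  # nothing to preserve
    cleaned_keys = {row[key] for row in cleaned if key in row}
    kept = sum(1 for row in source if row.get(key) in cleaned_keys)
    return kept / len(source)


# Hypothetical data: the cleanup fixed whitespace and casing but dropped row 2.
source = [
    {"id": "1", "name": " Ada "},
    {"id": "2", "name": "Grace"},
    {"id": "3", "name": "alan"},
]
cleaned = [
    {"id": "1", "name": "Ada"},
    {"id": "3", "name": "Alan"},
]

score = row_preservation_score(source, cleaned, key="id")
print(f"rows preserved: {score:.2f}")
```

A check like this scores a finished output. Per the finding above, scoring well on precomputed outputs is a different thing from a model holding the policy while it generates.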

A local-first longitudinal agent that reads AI conversation history and generates weekly narrative dispatches. Runs entirely on Ollama (qwen3:14b). Succeeds only when you feel witnessed — never when a task is completed. The design question it answers: what does memory look like when the goal is not recall but becoming?

Personal · 2026

Encoding memory retrieval and generation behavior as learned policy rather than prompted behavior. First promoted result in iris-ft-lab (Track A): SFT + DPO on Trace Layer 2 memory extraction — classifying episodic, semantic, procedural, and prospective tiers from raw trace text — reaching 12/12 on the gold eval. The hard case was aspiration-vs-commitment; 80 synthetic DPO preference pairs resolved all residual SFT errors with no regressions. Next: extending policy learning to memory retrieval preferences and generation constraints across longer Trace sessions.

Long-Running Task Evaluation
Microsoft M365 Copilot · 2026–present

Evaluation frameworks for agents that operate across many sessions rather than within a single one. Two artifacts: a trajectory-level eval taxonomy that classifies failures by when in the task arc they occur (not just whether the final output is correct), and an epistemic integrity framework for what agents should know, commit to, and be uncertain about across the session-to-long-running transition.

Microsoft M365 Copilot · 2025–present

Designed a four-tier memory taxonomy (semantic, episodic, procedural, prospective) for agentic systems serving tens of millions of users. The prospective tier — forward-looking task memory — is the original contribution. Paired with a two-stage generation architecture that reduced hallucination and GPU costs significantly.
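One way to picture the four tiers is as a small data structure. The sketch below is illustrative only: the class names `MemoryTier` and `MemoryItem`, the helper `group_by_tier`, and the example memories are assumptions, not the production schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class MemoryTier(Enum):
    SEMANTIC = "semantic"        # stable facts
    EPISODIC = "episodic"        # specific past events
    PROCEDURAL = "procedural"    # how something is done
    PROSPECTIVE = "prospective"  # forward-looking task memory


@dataclass(frozen=True)
class MemoryItem:
    text: str
    tier: MemoryTier


def group_by_tier(items: List[MemoryItem]) -> Dict[MemoryTier, List[MemoryItem]]:
    """Bucket memory items by tier, keeping their original order."""
    grouped: Dict[MemoryTier, List[MemoryItem]] = defaultdict(list)
    for item in items:
        grouped[item.tier].append(item)
    return dict(grouped)


# Hypothetical examples, one per tier.
items = [
    MemoryItem("The team reports budgets in euros.", MemoryTier.SEMANTIC),
    MemoryItem("Last Tuesday's review ran over by an hour.", MemoryTier.EPISODIC),
    MemoryItem("Budget files are exported as CSV before sharing.", MemoryTier.PROCEDURAL),
    MemoryItem("Send the revised budget to finance by Friday.", MemoryTier.PROSPECTIVE),
]

grouped = group_by_tier(items)
pending = [item.text for item in grouped.get(MemoryTier.PROSPECTIVE, [])]
print(pending)
```

Keeping the prospective tier separate means pending commitments can be queried on their own instead of being recovered from past events.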

Swarm-Based Evaluation Pipeline
Microsoft M365 Copilot · 2026–present

A multi-agent eval system with specialized roles: Proposer, Critic, Scorer, Synthesizer. Compressed two weeks of human annotation to one agent-day without sacrificing coverage. Built to run continuously, not just at release gates.

Tool & Skill Quality Framework ↗
Microsoft M365 Copilot · 2026–present

Three-layer framework for assessing agentic tool quality: contract quality, execution success rate, and orchestration triggering accuracy. Designed to be measurable, not just principled.
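As a rough sketch of what "measurable" could mean for each layer, the Python below computes one number per layer from logged tool calls. Everything here is an assumption for illustration: the `ToolCall` record, the required contract fields, the `search_files` tool, and the choice of simple ratios as metrics are not taken from the framework itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ToolCall:
    expected_trigger: bool  # should the orchestrator have invoked the tool?
    triggered: bool         # did it actually invoke the tool?
    succeeded: bool         # did the invocation succeed (only meaningful if triggered)?


def contract_quality(contract: Dict[str, object], required: Sequence[str]) -> float:
    """Layer 1 (assumed metric): share of required contract fields that are filled in."""
    if not required:
        return 1.0
    return sum(1 for name in required if contract.get(name)) / len(required)


def execution_success_rate(calls: List[ToolCall]) -> float:
    """Layer 2: successes divided by the calls that were actually executed."""
    executed = [c for c in calls if c.triggered]
    if not executed:
        return 0.0
    return sum(1 for c in executed if c.succeeded) / len(executed)


def triggering_accuracy(calls: List[ToolCall]) -> float:
    """Layer 3: share of cases where the trigger decision matched the expectation."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c.triggered == c.expected_trigger) / len(calls)


# Hypothetical contract for a hypothetical tool; "errors" is left undocumented.
contract = {
    "name": "search_files",
    "description": "Find files by name.",
    "parameters": {"query": "string"},
    "errors": "",
}
calls = [
    ToolCall(expected_trigger=True, triggered=True, succeeded=True),
    ToolCall(expected_trigger=True, triggered=True, succeeded=False),
    ToolCall(expected_trigger=True, triggered=False, succeeded=False),
    ToolCall(expected_trigger=False, triggered=False, succeeded=False),
]

report = {
    "contract_quality": contract_quality(contract, ("name", "description", "parameters", "errors")),
    "execution_success_rate": execution_success_rate(calls),
    "triggering_accuracy": triggering_accuracy(calls),
}
print(report)
```

Reporting the three numbers separately keeps the layers distinguishable: a tool with a clean contract can still fail at execution, and a tool that runs reliably can still be triggered at the wrong moments.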

Microsoft Research · prior

Built knowledge graphs at billion-node scale for academic literature. Adopted by the OECD and the Stanford AI Index as infrastructure for science-of-science research. Learned that knowledge representation is never just a data problem — it is always also a question of what you think knowledge is. More selected works: A web-scale scientific taxonomy; Science-of-science studies.

More to come · Updated irregularly · Last touched April 2026