My work focuses on the infrastructure behind long-running AI agents: memory, evaluation, orchestration, and runtime design.
How do intelligent systems know what they need to know, at the right moment — without being told?
Agent Epistemic Integrity →
How do we become who we are, across everything we've lived through and mostly forgotten?
Three careers, one question
Runtime layer at scale — memory, orchestration / tool and skill quality, planning, long-running task lifecycle. Tens of millions of users. The hard problems are not the models.
Billion-scale academic knowledge graphs, adopted by the OECD and the Stanford AI Index. Knowledge representation is always also a question of what you think knowledge is. Microsoft Academic Graph; a web-scale scientific taxonomy; science-of-science studies.
The gap between a formally correct solution and a useful one is almost always a design problem, not a math problem. Vehicle routing problem; inventory systems.
The thread: how a system knows what it needs to know — and acts reliably on it.
What I build and research
An architectural framework for how long-running agentic systems keep beliefs, actions, and commitments coherent and correctable across the knowing / doing / deciding axes. arXiv: 2606.04017 →
Local Apple Silicon fine-tuning lab. Trace track (SFT + DPO): tier accuracy 0/12 → 12/12. collab-eval track (four SFT runs, one DPO run, none promoted): row preservation is a generation-time policy that token-level supervision cannot reach. iris-ft-lab ↗ · failure modes → · collab-eval → · row preservation →
A local-first longitudinal agent that succeeds only when you feel witnessed — never when a task is completed. trace ↗ · builder's note →
Frameworks in formation, questions still unresolved — thinking out loud at the seams between levels.
What I think about
Memory
Memory as a first-class architectural primitive — not a bolt-on. Prospective memory as a steerability surface. The four-tier taxonomy: semantic, episodic, procedural, prospective.
Epistemic Integrity
How do long-running agentic tasks maintain reliable self-knowledge across the knowing / doing / deciding axes? What breaks when sessions end and agents restart? white paper → · practitioner's note →
Evaluation
Evaluation frameworks that don't lie to you — from trajectory-level eval taxonomies to grader harnesses that catch reward-design failure modes.
Orchestration
Why orchestration is harder than it looks — tool quality, skill triggering, the compound failure rate of pipelines nobody stress-tested end to end. tool and skill quality →
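The compound failure rate mentioned above can be made concrete with a small arithmetic sketch. It assumes each step fails independently, which real pipelines often violate, and the step count and the 95% per-step reliability are hypothetical numbers chosen for illustration, not measurements from any system described here. The function name `pipeline_success_rate` is likewise illustrative.

```python
def pipeline_success_rate(step_success_rates):
    """Probability that every step succeeds, assuming independent failures."""
    rate = 1.0
    for p in step_success_rates:
        rate *= p
    return rate


# Hypothetical pipeline: ten steps (tool calls, skill triggers, parsers),
# each succeeding 95% of the time when tested in isolation.
steps = [0.95] * 10
end_to_end = pipeline_success_rate(steps)

print(f"end-to-end success rate: {end_to_end:.3f}")
print(f"compound failure rate:   {1 - end_to_end:.3f}")
```

Under these assumptions, ten steps that each look reliable on their own complete together only about 60% of the time, which is why per-step tests alone say little about end-to-end behavior.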
Elsewhere
Research notes, half-baked ideas. Relentlessly overthought, definitely over-architected.