Iris Shen

Researcher · Systems Builder · Agent Infrastructure

My work focuses on the infrastructure behind long-running AI agents: memory, evaluation, orchestration, and runtime design.

On intelligent systems

How do intelligent systems know what they need to know, at the right moment — without being told?

Agent Epistemic Integrity →

On human becoming

How do we become who we are, across everything we've lived through and mostly forgotten?

trace ↗ · builder's note →

Three careers, one question

Now

Agentic AI Infrastructure

Runtime layer at scale — memory, orchestration / tool and skill quality, planning, long-running task lifecycle. Tens of millions of users. The hard problems are not the models.

Before

Knowledge Graphs · Microsoft Research

Billion-scale academic knowledge graphs, adopted by the OECD and the Stanford AI Index. Knowledge representation is always also a question of what you think knowledge is. Microsoft Academic Graph · A web-scale scientific taxonomy · Science-of-science studies

Before that

Operations Research · PhD, USC

The gap between a formally correct solution and a useful one is almost always a design problem, not a math problem. Vehicle routing problem · Inventory systems

The thread: how a system knows what it needs to know — and acts reliably on it.

What I build and research

Agent Epistemic Integrity ↗ Paper →

An architectural framework for how long-running agentic systems keep beliefs, actions, and commitments coherent and correctable across the knowing / doing / deciding axes. arXiv:2606.04017 →

FT-Lab — when fine-tuning works, and when it doesn't ↗ Project ↗

A local fine-tuning lab on Apple Silicon. Trace track (SFT + DPO): tier accuracy went from 0/12 to 12/12. collab-eval track (four SFT runs, one DPO run, none promoted): row preservation turned out to be a generation-time policy that token-level supervision cannot reach. iris-ft-lab ↗ · failure modes → · collab-eval → · row preservation →

Trace — witness a soul's emerging ↗ Project ↗

A local-first longitudinal agent that succeeds only when you feel witnessed — never when a task is completed. trace ↗ · builder's note →

Researcher and builder's notes Notes →

Frameworks in formation, questions still unresolved — thinking out loud at the seams between levels.

What I think about

Memory

Memory as a first-class architectural primitive — not a bolt-on. Prospective memory as a steerability surface. The four-tier taxonomy: semantic, episodic, procedural, prospective.

Epistemic Integrity

How do long-running agentic tasks maintain reliable self-knowledge across the knowing / doing / deciding axes? What breaks when sessions end and agents restart? white paper → · practitioner's note →

Evaluation

Evaluation frameworks that don't lie to you — from trajectory-level eval taxonomies to grader harnesses that catch reward-design failure modes.

Orchestration

Why orchestration is harder than it looks — tool quality, skill triggering, the compound failure rate of pipelines nobody stress-tested end to end. tool and skill quality →
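To make "compound failure rate" concrete, here is a minimal sketch. It assumes each step fails independently of the others, which real pipelines often violate, and the function name and the 95% figure are illustrative rather than taken from any measured system.

```python
import math


def pipeline_success_rate(step_success_rates):
    """Probability that every step succeeds, assuming independent failures."""
    return math.prod(step_success_rates)


# Hypothetical pipeline: ten steps, each succeeding 95% of the time.
rate = pipeline_success_rate([0.95] * 10)
print(f"end-to-end success: {rate:.3f}")  # end-to-end success: 0.599
```

Under that assumption, ten steps that each look reliable in isolation complete end to end only about 60% of the time, which is why per-step quality numbers can mislead.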

Elsewhere

GitHub ↗ · LinkedIn ↗ · Google Scholar ↗ · Writing ↗ · Email (iris.shen.ai at gmail dot com)

Research notes, half-baked ideas. Relentlessly overthought, definitely over-architected.