Practitioner Framework · Agentic Systems

The Agentic Runtime Layer:
A Framework for How Agents Know, Do, and Decide

A map of the problem space sitting between model capability and application design — with working solutions where they exist and honest placeholders where they don't.

April 2026 · Knowing · Doing · Deciding · Session → Persistent

There is a question practitioners keep asking in slightly different forms: Why do long-running agents fail in ways that short-lived ones don't? The answer isn't model quality. It isn't prompt engineering. It's that the architectural substrate — the runtime layer responsible for what an agent knows, what it can do, and how it decides — wasn't designed for duration. It was designed for a session.

This post offers a framework for thinking about that substrate systematically. Not a product design, not a roadmap — a map of the problem space, with working solutions where they exist and honest placeholders where they don't.

Runtime Architecture Is the Missing Layer

Most discourse on agentic AI organizes around two poles: model capability and application design. The model handles language understanding, reasoning, and generation. The application handles business logic, UX, and integration. What sits between them — the runtime layer — often isn't named at all.

That gap matters because agents are not stateless request-response systems. They accumulate context, invoke external capabilities, make commitments that outlive a single turn, and increasingly run for hours or days. Each of those properties introduces a failure class that neither the model nor the application layer is equipped to handle alone: context that goes stale, tools that are called but cannot be trusted, decisions that cannot be corrected.

The runtime layer is the locus of all three.


A 3×2 Grid

The runtime layer decomposes along two axes. The first is functional: what an agent does at runtime breaks into three domains — Knowing (what it holds as context), Doing (what capabilities it can invoke), and Deciding (how it governs the use of both over time). The second is operational scope: session-based agents operate within bounded, stateless interactions; persistent / long-running agents maintain state across time, users, and tasks.

The six cells that result are not equally understood. The session-based column has working solutions. The persistent column is largely frontier territory.

Domain · Session-Based · Persistent / Long-Running

Knowing
Session-based: Memory generation & retrieval within a session. The two-stage pipeline (async generation + inference-time selection) is now broadly understood. The open engineering problem is the source: generating memory from non-human-bot interactions — MCP tool outputs, autonomous task traces, multi-agent exchanges — rather than from conversation transcripts alone.
↗ Two-Stage Memory Architecture · ↗ Four-Tier Memory Taxonomy
Persistent: Cross-session belief validity; longitudinal coherence. When agents span sessions, staleness is the default state, not an edge case. The system must detect when standing memory no longer reflects the world, not just when a session ends. [Frontier]

Doing
Session-based: Tool/skill selection & execution within a task. E2E quality is a multiplicative function of three layers: contract quality (clean, non-overlapping descriptions), execution quality (the tool does what it claims), and orchestration-time selection. Optimizing only selection — where most current effort concentrates — yields limited gains if the upstream layers remain unaddressed.
↗ Quality Architecture for Skills & Tools
Persistent: Capability governance across tasks; evolving registries. Over long task horizons, tools version, deprecate, and fail in ways that session-scoped frameworks don't anticipate. Authorization semantics — did the intent that called this tool still hold three steps later? — become a first-class concern. [Frontier]

Deciding
Session-based: Planning & reasoning within a turn sequence. Three patterns dominate: ReAct (interleaved think/act/observe), Plan-then-Execute (upfront decomposition), and Reflexion (self-critique and replan). The less-understood part is the discipline separation: prompt engineering governs how the LLM is instructed; context engineering governs what it reasons over; harness engineering governs how the loop runs and recovers from failure. Conflating these produces failure modes that are nearly impossible to diagnose.
Persistent: Long-horizon steerability; goal validity & human interruption. The hardest cell. Steerability — as an architectural property of the runtime layer, distinct from training-time alignment — requires a live representation of the agent's intended future actions as the designated insertion point for human interrupts. Without it, correction is destructive rather than additive. [Frontier]

On terminology: "Long-running" dominates in engineering practice (SWE-Bench Pro, Anthropic's usage guidelines); "always-on" better captures the persistent-state intent but implies continuous execution that doesn't always hold. Persistent / long-running bridges both.


Where the Memory Innovation Actually Lives

The two-stage memory pipeline — asynchronous generation followed by inference-time selection — is now a shared engineering assumption in production systems. Separating generation from retrieval, and amortizing generation cost across many downstream queries, is well-understood. That part is no longer the frontier.
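As a concrete sketch, the two-stage split can be reduced to a few lines. Everything here is illustrative: the keyword filter stands in for an asynchronous LLM generation pass, and token overlap stands in for embedding-based selection; the class and method names are invented for this sketch.

```python
import re
import time
from dataclasses import dataclass, field

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

@dataclass
class Memory:
    text: str
    created_at: float = field(default_factory=time.time)

class MemoryStore:
    """Stage 1 (generation) runs off the hot path; stage 2 (selection) runs at inference time."""

    def __init__(self) -> None:
        self.memories: list[Memory] = []

    def generate(self, transcript: list[str]) -> None:
        # Stage 1, asynchronous in production: distill the transcript into
        # standing memories. A keyword heuristic stands in for an LLM pass.
        for turn in transcript:
            if "prefer" in turn or "deadline" in turn:
                self.memories.append(Memory(turn))

    def select(self, query: str, k: int = 3) -> list[str]:
        # Stage 2, inference time: rank stored memories against the query.
        # Token overlap stands in for embedding similarity.
        q = _tokens(query)
        ranked = sorted(self.memories, key=lambda m: len(q & _tokens(m.text)), reverse=True)
        return [m.text for m in ranked[:k]]
```

The amortization argument lives in the shape of the code: `generate` is paid once per transcript, while `select` is paid per query against already-distilled items.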

The open problem is the source of memory generation. When agents interact through human-bot conversation, every exchange is a candidate memory artifact — the transcript is the trace. But as agents increasingly operate through MCP tool calls, multi-agent exchanges, and autonomous task execution, the interaction record is no longer a conversation. It is a structured artifact stream: tool outputs, plan traces, intermediate states, inter-agent messages.

The key shift: Human-bot systems can afford to process the full transcript — it's computationally tractable and the signal is relatively dense. Non-human-bot interactions at scale cannot take the same approach: a raw tool trace is high-volume, low-signal-density, and often structurally ambiguous. The system needs to decide, agentically, what from this artifact stream is worth distilling into standing memory — and that decision requires understanding task intent, execution outcome, and downstream relevance. The published, production-grade architecture for this specific problem remains sparse in the literature, which is a genuine gap relative to the conversational memory case.
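One hedged sketch of what that decision could look like. The event types and static weights below are invented for illustration; a production system would score events agentically, with a model call that weighs task intent, execution outcome, and downstream relevance rather than a lookup table.

```python
# Illustrative signal weights per artifact-stream event type.
SIGNAL_WEIGHT = {
    "task_outcome": 1.0,     # what the task ultimately produced
    "error": 0.9,            # failures are durable lessons
    "commitment": 0.8,       # promises that outlive the task
    "plan_step": 0.3,        # usually reconstructable, rarely worth keeping
    "raw_tool_output": 0.1,  # high-volume, low-signal-density by default
}

def distill(trace: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only events worth promoting from a tool trace into standing memory."""
    return [e for e in trace if SIGNAL_WEIGHT.get(e["type"], 0.0) >= threshold]
```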

The four-tier memory taxonomy — semantic, episodic, procedural, and prospective — provides the target structure for this distillation. Prospective memory (deferred commitments, pending decisions, upcoming deadlines) is the tier most absent from deployed systems and the one that matters most as agents persist across sessions. The architecture and taxonomy are detailed in linked papers.


Three Layers, One Metric

Tool and skill selection is usually framed as a retrieval problem: given a query, rank available capabilities and invoke the best match. This framing misses two upstream failure modes that selection cannot compensate for.

The first is contract quality: if tool descriptions overlap, are ambiguous, or fail to specify scope, no retrieval mechanism produces clean selection — semantic noise propagates downstream. The second is execution quality: a correctly selected tool that fails silently, or succeeds on a narrower version of what it claims, degrades E2E task completion in ways nearly impossible to attribute post-hoc.

P(E2E success) = P(clean contract) × P(tool succeeds | called) × P(right tool selected)

All three layers are multiplicative. Optimizing only the third factor is the most common architectural mistake in production agentic systems. Recent benchmarks underscore the severity: across 550+ real-world tools, even frontier models achieve below 60% selection accuracy under realistic noisy conditions. The problem is not model capability — it is that upstream contract quality makes the selection problem unnecessarily hard. The full three-layer framework is in a linked paper.
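The arithmetic makes the point directly. The probabilities below are illustrative, not measured:

```python
def e2e_success(p_contract: float, p_execution: float, p_selection: float) -> float:
    """P(E2E success) = P(clean contract) × P(tool succeeds | called) × P(right tool selected)."""
    return p_contract * p_execution * p_selection

baseline       = e2e_success(0.80, 0.85, 0.60)  # ~0.41
selection_only = e2e_success(0.80, 0.85, 0.95)  # ~0.65: still capped by the upstream layers
all_three      = e2e_success(0.95, 0.97, 0.95)  # ~0.88
```

Driving selection from 0.60 to 0.95 while leaving the contracts and execution untouched still loses more than a third of tasks end to end, which is the multiplicative trap in one line.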


Planning, Reasoning, and the Control Loop

The deciding layer is the control flow that bridges Knowing → Doing → back to Knowing, iteratively, toward a goal. The three dominant patterns — ReAct, Plan-then-Execute, and Reflexion — are well-known. The less-examined issue is the discipline separation that determines which failure mode you're debugging when something goes wrong.
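For orientation, a minimal ReAct skeleton, the first of the three patterns. The `llm` callable and the tool registry are placeholders, not a real API:

```python
def react_loop(goal: str, llm, tools: dict, max_steps: int = 8):
    """Interleave think, act, observe until the model emits a terminal
    'finish' action or the step budget runs out."""
    scratchpad = [f"Goal: {goal}"]
    for _ in range(max_steps):
        thought, action, arg = llm(scratchpad)            # think
        scratchpad.append(f"Thought: {thought}")
        if action == "finish":
            return arg                                    # terminal step
        observation = tools[action](arg)                  # act
        scratchpad.append(f"Observation: {observation}")  # observe
    return None  # budget exhausted without a terminal step
```

Plan-then-Execute moves the decomposition before the loop; Reflexion wraps the loop in a critique-and-replan outer loop. All three share this think/act/observe core.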

Prompt engineering governs how the LLM is instructed. Context engineering governs what information it reasons over — what gets retrieved from the Knowing layer, how it's compressed, in what order it's injected. Harness engineering — Mitchell Hashimoto's term for the orchestration loop infrastructure — governs how the loop runs and recovers from failure. A system that makes wrong tool calls has a prompt engineering problem. A system that hallucinates because it lacks relevant facts has a context engineering problem. A system that gets stuck in infinite retry loops has a harness engineering problem. These look similar from the outside and require different interventions.
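Harness engineering in miniature, as a hedged sketch: a bounded-retry wrapper whose failure policy lives in the loop infrastructure, not in the prompt. The function name and signature are invented for this sketch.

```python
import time

def run_step_with_harness(step, max_retries: int = 3, backoff_s: float = 0.0):
    """Retries are bounded, so the infinite-retry failure mode is impossible
    by construction, and the final error surfaces instead of being silently
    swallowed by the loop."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return step()
        except Exception as exc:
            last_exc = exc
            time.sleep(backoff_s * (2 ** attempt))  # optional exponential backoff
    raise RuntimeError(f"step failed after {max_retries} attempts") from last_exc
```

No prompt change or context change can produce this guarantee, which is the practical content of the discipline separation: the retry bound is a property of the harness alone.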

The four-tier memory taxonomy maps directly onto this layer: semantic memory feeds the planner; episodic memory feeds the step-level reasoner; procedural memory is the tool schema the planner reasons over; and prospective memory is the plan itself — the agent's live representation of its intended future actions. This mapping is why prospective memory bridges the session-based and persistent Deciding cells: it is the primitive that makes long-horizon steerability tractable.


The Frontier Column

When agents maintain state across sessions, the problems in each row don't merely scale — they change qualitatively. Stale context becomes a structural hazard. Tool authorization outlives the intent that triggered it. Decisions must remain correctable at arbitrary points in long trajectories.

The core difficulty: in a session-based system, the context window is the audit surface — everything the agent knows, is doing, and has decided fits in one bounded artifact. In a persistent system, the agent's epistemic state is distributed across memory stores, tool execution histories, and partially completed plans, none of which automatically stays coherent as time passes, models update, or user intent evolves.

These cells are explored in a companion paper currently in preparation: Agent Epistemic Integrity: A Unified Framework for Knowing, Doing, and Deciding Across the Session-to-Always-On Transition. The core argument: the three domains are not independent problems with independent solutions. They are three faces of one invariant problem, and the session-to-persistent transition is the stress test that reveals what each domain truly requires.


Two Horizontal Layers

Model Layer
A source of instability, not a stable substrate

Foundation models change — through updates, fine-tuning, and replacement. A runtime architecture with epistemic logic embedded in prompts is re-engineered every time the model changes. Stable systems define epistemic properties at the system level — the uncertainty audit trail, the memory retrieval contract, the tool selection criteria. The model executes against those definitions; it does not embody them.
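One way to make that concrete, as a sketch with invented names: hold the epistemic properties in a system-level contract and derive the model-facing instructions from it, so a model swap re-renders text rather than re-engineering logic.

```python
# Hypothetical system-level contract: the runtime, not the prompt, owns these values.
RUNTIME_CONTRACT = {
    "memory_retrieval": {"max_items": 5, "max_age_days": 90},
    "tool_selection": {"min_score": 0.7},
    "escalation": {"confidence_floor": 0.5},
}

def render_instructions(contract: dict) -> str:
    """Derive model-facing instructions from the contract. Swapping the model
    re-renders this text; the epistemic logic itself never lives inside a
    hand-tuned prompt."""
    mr = contract["memory_retrieval"]
    ts = contract["tool_selection"]
    esc = contract["escalation"]
    return (
        f"Use at most {mr['max_items']} memories no older than {mr['max_age_days']} days. "
        f"Only call tools scoring at least {ts['min_score']}. "
        f"Escalate to a human when confidence falls below {esc['confidence_floor']}."
    )
```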

Economics Layer
A real-time optimization constraint

Inference costs are neither fixed nor ignorable. A principled runtime allocates compute as a function of confirmed grounding: steps with high epistemic confidence proceed with shallow deliberation; steps with high assumption density trigger deeper reasoning or human escalation. This is uncertainty-gated execution — a property of the architecture made operational, not a heuristic applied post-hoc.
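A sketch of the gate itself. The thresholds and the three-mode split are illustrative assumptions, not prescriptions from the framework:

```python
def deliberation_mode(confidence: float, assumption_count: int) -> str:
    """Uncertainty-gated execution: compute allocation is a function of
    confirmed grounding, decided per step by the runtime."""
    if confidence >= 0.9 and assumption_count == 0:
        return "shallow"     # high epistemic confidence: proceed cheaply
    if confidence >= 0.6 and assumption_count <= 2:
        return "deliberate"  # moderate grounding: spend more reasoning tokens
    return "escalate"        # high assumption density: human in the loop
```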


The Transition Is Underway

Production deployments are moving from single-turn assistants to persistent agents that manage calendars, draft and send communications, execute code, and take consequential actions over hours or days. The failure modes of these systems are not model failures. They are runtime failures — failures in the layer responsible for knowing, doing, and deciding.

The field currently has frameworks for memory (fragmented), frameworks for tool use (nascent), and frameworks for alignment and oversight (largely theoretical at the deployment layer). It does not have a unified account of how these three domains constrain one another, or why that constraint tightens specifically at the session-to-persistent boundary. The 3×2 grid is intended to make that constraint visible.


The session-based cells have working solutions. The persistent cells have the right questions. The Deciding row — at both timescales — is where the most consequential unsolved problems live.

An agent that pursues an outdated goal with high confidence and low per-step error rate is, from a runtime integrity standpoint, failing — even if each individual action scores well. The invariant problem is not any single cell. It is whether what the agent knows, does, and decides remains coherent, correctable, and aligned with user intent across time.

That property has a name: Agent Epistemic Integrity. The work of operationalizing it — across all six cells, both horizontal layers, and the session-to-persistent transition — is what comes next.

Research notes, half-baked ideas. Probably overthought, definitely over-architected.