Agent Epistemic Integrity: A Framework for Knowing, Doing, and Deciding Across the Session-to-Long-Running Transition

Iris Shen · April 2026 · iris-axon-lab.github.io

Abstract

As agentic systems move from single-session interactions to long-running operation, they encounter coupled failure modes that are poorly captured when memory, tool state, and steerability are treated as separate concerns. This paper introduces Agent Epistemic Integrity (AEI) as a conceptual architectural framework for reasoning about that coupling. AEI asks whether what an agent knows, does, and intends remains coherent, inspectable, and correctable across time.

The paper makes three claims. First, the session-to-long-running transition is the stress test that turns freshness, partial execution, and goal validity from hidden assumptions into explicit state-management problems. Second, prospective memory — a durable representation of forward commitments — provides a practical correction surface for human steering, though it is not by itself sufficient for safety or correctness. Third, long-running agents require trajectory-level evaluation rather than turn-level success alone.

The contribution here is primarily conceptual. This paper does not yet offer a complete formal state model, canonical schema, or proof system for AEI. Instead, it offers a systems framing that links persistent memory, capability state, and steerability into a single runtime integrity problem.

1 Introduction

Consider a long-running agent coordinating a vendor contract renewal over three weeks.

In week one, the agent is authorized to renew the contract by May 15 with a 10% cost-reduction target and no change to the existing SLA. In week two, another team adds a new security-certification requirement to the project documentation, but no one restates that requirement in the active chat. In week three, the agent prepares to retry an earlier proposal step without realizing that a prior email has already committed external state.

Nothing in this example is exotic. The agent can retrieve relevant text, call tools correctly, and still fail in a structurally predictable way. It may act on beliefs that were once true but are no longer current. It may repeat actions without reasoning over the side effects that already occurred. It may continue pursuing a goal whose validity conditions have changed without an explicit surface for correction.

This is the setting that motivates Agent Epistemic Integrity.

In session-bounded systems, many of these issues are partially hidden by the session boundary itself. Context is assumed to be fresh. Tool invocations are assumed to be locally bounded. Goals are assumed to remain valid for the duration of the interaction. As systems move into long-running operation — persistent assistants, multi-day workflows, background delegation, cross-session task execution — those assumptions stop holding. The result is not simply "more memory needed." It is the emergence of a coupled systems problem spanning memory, action, and steering.

The field has responded with capable partial frameworks. Memory architectures address retention and retrieval [CoALA, MemGPT]. Action frameworks address reasoning, tool use, and self-correction [ReAct, Reflexion]. Governance frameworks address oversight and interruptibility [OpenAI]. What the field lacks is a clear account of how these three domains constrain one another — and why that constraint becomes more visible, specifically and structurally, as systems transition from session-based to long-running operation.

This paper argues that the right unit of analysis is not memory alone, nor planning alone, nor governance alone, but their interaction. AEI is offered as a framework for that interaction. It makes three claims. First, the session-to-long-running transition turns freshness, partial execution, and goal validity from hidden assumptions into explicit state-management problems. Second, prospective memory provides a practical correction surface for human steering across that transition. Third, long-running agents require trajectory-level evaluation rather than turn-level success alone.

2 The Framework: Agent Epistemic Integrity

2.1 A Conceptual Architectural Property

Agent Epistemic Integrity (AEI) is the architectural property that an agent's active state remains sufficiently coherent, inspectable, and correctable for the actions it takes over time. The phrase "active state" is doing real work here. It includes not just retrieved facts, but the system's current uncertainty annotations, its understanding of what tools have already been used and with what effects, and its representation of what it is committed to doing next. A system may produce fluent outputs while lacking AEI if it cannot expose the state that justifies those outputs to inspection and correction.

AEI is therefore not a claim about omniscience. It is a claim about calibrated operation under incomplete information. A system can be uncertain and still maintain AEI if it behaves proportionally to that uncertainty. Conversely, a system can violate AEI even when individual answers sound correct, if it acts as though stale, partial, or weakly grounded state were fully reliable.

AEI is a system-level framing, not a model-level guarantee. It is not ensured by any individual component — not the retriever, not the planner, not the tool executor — and it cannot be evaluated on the basis of individual outputs. Its value is in directing architectural attention toward how the system represents epistemic state and exposes that representation for correction. No amount of fine-tuning produces AEI in a system whose architecture lacks the inspection and correction surfaces it requires.

2.2 Three Invariant Domains

The problem space of agentic systems decomposes, at the architectural level, into three domains.

Knowing is memory and context management — what the agent holds true about the world, itself, and its history. The problem is not storage but coherence over time: non-contradictory beliefs, recognition of supersession, and traceable provenance.

Doing is tool and capability management — what actions the agent can take, under what conditions, and what the cumulative effect of its invocations has been. The challenge is not enumeration but runtime state: capability availability is not binary, effects are not idempotent, and degradation is frequently unannounced.

Deciding is planning and reasoning — how the agent selects and sequences actions given its beliefs and capabilities. This spans single-step selection, plan decomposition, and the meta-level question of when to pause and solicit input.

These three domains are invariant in the sense that any agentic system must solve some version of each, and no solution in one substitutes for a solution in another. They correspond to the three components of a purposive system: representation, action, and deliberation.

2.3 The 3×2 Grid

The framework organizes these three domains against two deployment paradigms: the session-based paradigm, in which the agent operates within a bounded interaction window, and the long-running paradigm, in which the agent persists across arbitrary time horizons and may act during periods of reduced supervision. Figure 1 illustrates the six cells of the framework and the new architectural primitive each transition requires.

[Figure 1: the AEI 3×2 grid. Three domains (Knowing, Doing, Deciding) against two deployment paradigms (Session-Based and Long-Running), with the new architectural primitive each long-running cell requires — belief revision, capability state management, goal lifecycle management — and prospective memory as the unifying primitive across all three.]

Figure 1. The AEI 3×2 grid. The session-based column reflects what current architectures implicitly assume — and the assumptions each column silently encodes are shown in italic. The long-running column is where those assumptions fail, requiring new architectural primitives. Prospective memory (§3.4) is the unifying primitive across all three right-column cells. AEI is the architectural property governing the full grid.

2.4 Two Horizontal Constraints

Two concerns cut across all six cells and represent qualitatively distinct constraints on the AEI problem.

The Model Layer is the source of instability that any architecture must absorb. Foundation model behavior is not static: models are updated on schedules that may not be communicated to system operators. Any system capability that depends on specific model behavior — uncertainty calibration, tool-use syntax, reasoning patterns — is subject to silent regression. AEI requires that architectures treat model-layer instability as a design invariant, not an edge case.

The Economics Layer is the optimization constraint that any real deployment must satisfy. Context tokens are not free. Long-horizon tasks accumulate retrieval, reranking, and state-reconstruction costs across sessions. The economics layer does not ask which design maximizes epistemic integrity in principle; it asks which design maximizes integrity subject to a cost budget. This hard constraint shapes which architectures are actually deployable.

2.5 The Stable What / Volatile How Principle

The stable what is the invariant requirement — the architectural obligation that persists regardless of implementation. For Knowing: the agent must maintain non-contradictory beliefs and detect when they have been superseded. For Doing: it must track the cumulative state of its capability invocations and reason about safe resumption. For Deciding: it must maintain an explicit history of its intent states and surface changes for human review.

The volatile how is the mechanism — the specific implementation that satisfies the requirement today. A timestamp-weighted vector store, a Bayesian revision algorithm, a TTL-based staleness flag. Each is a reasonable answer to the stable what; none is the only answer. When models are upgraded, retrieval infrastructure changes, or better algorithms emerge, the how changes. The stable what does not.

A system organized around the stable what will survive those changes with its core behavior intact. A system organized around today's specific mechanism will require partial or complete redesign each time the mechanism evolves. AEI is offered as a stable what for the agentic domain: the mechanisms that satisfy it are a research and engineering agenda; the requirement itself is fixed.

3 The Session-to-Long-Running Transition

The session boundary is not a UX convenience. It is an architectural assumption baked into every layer of how agents are currently designed: how they retrieve context, how they invoke capabilities, and how they represent goals. Within a session, that assumption does its work silently — it absorbs complexity that would otherwise have to be handled explicitly. Remove it, and the complexity does not disappear. It surfaces, simultaneously, across all three domains of epistemic integrity.

A session is, in architectural terms, a freshness guarantee. When an agent begins a new session, its retrieved beliefs are implicitly current — staleness that predates the session boundary is outside the agent's concern. Its invoked tools are stateless with respect to prior runs — whatever happened before is not the agent's problem to reconcile. Its goal is stable by assumption — the user stated it moments ago, and nothing has had time to change it. The session is a coherence envelope: it bounds the space in which the agent must reason about time, consistency, and intent. Long-running operation tears that envelope away. The agent must supply its own coherence — continuously, across all three domains.

Domain   | Session-Scoped                                             | Long-Running                                                      | New Primitive Required
Knowing  | Retrieve what is relevant now; freshness assumed           | Detect staleness; revise superseded beliefs                       | Belief revision
Doing    | Select and invoke the right capability; state is ephemeral | Track cumulative state; reason about safe resumption and retry    | Capability state management
Deciding | Plan the next step; goal is stable                         | Track goals' validity conditions over time; surface intent drift  | Goal lifecycle management

3.1 Knowing: From Retrieval to Belief Revision

In a session-scoped system, the retrieval problem is a relevance problem: find the context most pertinent to the current query and surface it. Freshness is not a retrieval criterion because the session boundary enforces it structurally — nothing retrieved from a session that began moments ago can be meaningfully stale.

Long-running operation dissolves this guarantee. An agent running for days, weeks, or months carries beliefs about user preferences, organizational state, prior decisions, and external facts — beliefs that were accurate when formed and may not be accurate now. The agent has no external freshness signal to rely on. It must treat its own belief state as an object of ongoing epistemic management: tracking provenance, estimating decay, detecting inconsistency across newly arriving evidence, and revising beliefs when the evidence warrants.

This is a qualitatively different operation from retrieval. Retrieval asks which memory is most relevant. Belief revision asks which memories are still true — and what to do when the answer is no. Without a mechanism for belief revision, a long-running agent does not accumulate knowledge over time; it accumulates drift, surfacing outdated beliefs with the same confidence it would apply to fresh ones.

3.2 Doing: From Invocation to Capability State Management

The tool-use model underlying most current architectures is implicitly stateless. A capability is selected, invoked, and its result consumed within the scope of a single reasoning step. If the invocation fails, it either retries immediately or surfaces an error. There is no persistent record of partial progress, no accounting for what a prior invocation may have already accomplished, and no mechanism for reasoning about the relationship between current and prior invocations.

Within a session, this is largely acceptable — sessions are short, tools are fast, and failure modes are recoverable. Long-running operation changes the failure model completely. A task spanning hours or days may involve dozens of capability invocations. Partial completions are not aberrations; they are the normal operating condition. Retries do not reset cleanly — they must be aware of what has already been effected in the world. Tools that write external state — sending emails, modifying documents, updating records — are no longer idempotent in the session sense, and naive retry logic can compound errors rather than recover from them.

The required primitive is capability state management: an explicit, durable representation of what each tool invocation has accomplished, what remains, and what constraints govern resumption or retry. Without it, the long-running agent cannot distinguish between a task that has not started and one that is half complete, nor can it reason about whether a given action is safe to repeat.

3.3 Deciding: From Step Planning to Goal Lifecycle Management

In a session-scoped agent, the goal is a fixed point. The user stated it at the beginning of the session; it has not changed; the agent's job is to make progress toward it. Planning is a local operation — identify the next best action given the current state.

Long-running operation transforms the goal from a fixed point into a trajectory through time. A goal valid when issued may be partially satisfied, fully superseded, or simply expired by the time the agent acts on it. Organizational priorities shift. User circumstances change. Deadlines pass. New information renders prior goals incoherent. A long-running agent that cannot reason about the lifecycle of its own goals will faithfully execute against objectives that are no longer valid — completing tasks no one needed, optimizing for outcomes no one wants.

Goal lifecycle management is the primitive that addresses this: the agent must maintain an explicit history of its intent states, track the conditions under which each goal was issued, detect when those conditions have changed, and surface that detection for human review rather than silently continuing.
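The lifecycle check can be sketched as follows. This is an illustrative Python fragment, not a proposed API: `Goal`, `check_goal`, and the condition encoding are assumptions chosen to mirror the vendor-renewal example, and the key property is that drift is surfaced as a status for human review rather than acted on autonomously.

```python
from dataclasses import dataclass
from enum import Enum


class GoalStatus(Enum):
    ACTIVE = "active"
    NEEDS_REVIEW = "needs_review"   # validity conditions may be superseded
    EXPIRED = "expired"


@dataclass
class Goal:
    description: str
    issued_at: str                       # ISO date at issuance
    deadline: str                        # ISO date
    validity_conditions: dict[str, str]  # condition name -> value when issued
    status: GoalStatus = GoalStatus.ACTIVE


def check_goal(goal: Goal, observed: dict[str, str], today: str) -> GoalStatus:
    """Surface, rather than act on, detected drift: any observed condition
    that diverges from its value at issuance flags the goal for review."""
    if today > goal.deadline:  # ISO dates compare correctly as strings
        goal.status = GoalStatus.EXPIRED
    elif any(observed.get(k, v) != v for k, v in goal.validity_conditions.items()):
        goal.status = GoalStatus.NEEDS_REVIEW
    return goal.status
```

Note that `check_goal` only updates and reports status; re-scoping the goal remains a human decision, consistent with the steering model in §3.4.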

3.4 Prospective Memory as the Unifying Primitive

All three escalations — from retrieval to belief revision, from invocation to capability state management, from step planning to goal lifecycle management — share a common structural requirement. Each requires the agent to maintain a live, queryable model of its own intended future state: what it believes it will need to know, what it expects to do, and what it is committed to accomplishing. This is the function of prospective memory.

The term draws from cognitive psychology [Brandimonte], where it refers to the capacity to remember to perform an intended action at a future time or in response to a future cue. The engineering construct proposed here is related in motivation but broader in scope: a first-class, serializable, inspectable data structure that encodes not just pending actions but the validity conditions, provenance, and dependencies of the agent's forward commitments — and exposes them through a query interface to both the execution engine and the human oversight layer. The cognitive construct motivates the name; the architectural specification stands on its own terms.
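One way such a data structure could look, sketched in Python. The names (`Commitment`, `ProspectiveMemory`) and field choices are hypothetical; the point is that forward commitments carry validity conditions, provenance, and dependencies, and are serializable and queryable rather than implicit in a reasoning chain.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Commitment:
    """One prospective-memory entry: a forward commitment plus the
    metadata the text requires (validity conditions, provenance,
    dependencies), in a serializable, inspectable form."""
    action: str                      # what the agent intends to do
    due_by: str                      # ISO date
    validity_conditions: list[str]   # conditions under which this still holds
    provenance: str                  # who or what authorized it
    depends_on: list[str] = field(default_factory=list)  # ids of prior commitments
    flagged: bool = False            # marked potentially superseded


class ProspectiveMemory:
    def __init__(self) -> None:
        self._entries: dict[str, Commitment] = {}

    def record(self, cid: str, commitment: Commitment) -> None:
        self._entries[cid] = commitment

    def pending(self) -> list[str]:
        """Query interface: commitments still standing, not flagged for review."""
        return [cid for cid, c in self._entries.items() if not c.flagged]

    def serialize(self) -> str:
        """Durable, auditable representation for the oversight layer."""
        return json.dumps({cid: asdict(c) for cid, c in self._entries.items()})
```

Both the execution engine and the human oversight layer would consume the same `pending()` and `serialize()` views, which is what makes the structure a shared correction surface.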

Prospective memory is the natural insertion point for human steering. What a human most needs to inspect and correct is not the agent's past actions — which cannot be undone — but its forward intentions, which can. If those intentions are represented explicitly, they are queryable, auditable, and correctable before they commit. If they exist only implicitly within a chain of reasoning steps, human steering requires reconstructing the agent's intent from the outside — slow, error-prone, and architecturally fragile.

Steerability is not a property of the user interface. It is a property of the memory architecture. An agent designed from the start with prospective memory as a first-class primitive has structural support for steerability. An agent for which steering is added afterward is steerable only by workaround.

4 Implications for System Design

Current agentic systems are built around task completion. Their internal architecture reflects this: a planner emits steps, a tool executor runs them, and outputs accumulate until a termination condition is met. Epistemic state is implicit — carried in context, shaped by model behavior, and largely invisible to the system itself. When something goes wrong, post-hoc inspection of logs may reveal what happened but rarely why the agent was confident enough to proceed. This is not a logging gap. It is a design gap.

A system architected around epistemic integrity inverts the priority. Task output is still the product, but epistemic state is a first-class citizen of the runtime — tracked, surfaced, and made actionable at every layer. Three prescriptions follow, one per domain. A system that implements only a subset has added instrumentation, not epistemic integrity.

4.1 Prescription 1 (Knowing): Uncertainty as a First-Class Output

Every task output should be accompanied by an uncertainty audit trail — a structured, machine-readable record that distinguishes three epistemic modes across the trajectory: what the agent knew with high confidence, what it inferred from incomplete context, and what it assumed without verification. This is not logging. Logs record events; the uncertainty audit trail records the agent's epistemic posture at the moment of decision. These are different artifacts with different consumers.

Each substantive action in a task — a retrieval, a synthesis step, a tool invocation, a commitment to a subgoal — should emit a tagged epistemic annotation:

Tag       | Meaning                                             | Example annotation
confirmed | Grounded in retrieved or directly provided evidence | "Project deadline is April 30 — confirmed from calendar retrieval"
inferred  | Derived by reasoning from confirmed facts           | "Given the deadline and today's date, the task is running late"
assumed   | Taken as true without available grounding           | "Assuming vendor preferences from the prior session still apply"
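A minimal sketch of what such an annotation could look like in Python, together with one derived signal a downstream consumer might escalate on. The shape of `Annotation` and the name `assumption_density` are illustrative assumptions, not a specification.

```python
from dataclasses import dataclass
from typing import Literal

# The three epistemic modes from the table above.
Tag = Literal["confirmed", "inferred", "assumed"]


@dataclass
class Annotation:
    step: str   # which substantive action this annotates
    tag: Tag
    note: str   # human-readable justification


def assumption_density(trail: list[Annotation]) -> float:
    """Fraction of steps taken on unverified assumptions: one concrete
    signal a reviewer or orchestrator could use as an escalation trigger."""
    if not trail:
        return 0.0
    return sum(a.tag == "assumed" for a in trail) / len(trail)
```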

Downstream systems — human reviewers, orchestrating agents, risk management layers — can then operate on the audit trail as a first-class input: escalating on assumption density, routing high-inference steps to verification loops, or presenting the trail to the user as part of the task deliverable. In the long-running setting, where tasks span extended time horizons and may resume after model state has been reconstructed, the audit trail provides epistemic continuity that context windows alone cannot. A resuming agent that inherits an uncertainty audit trail knows not just where it left off but how confident to be about what it recorded.

4.2 Prescription 2 (Doing): Capability State as a Durable Runtime Artifact

Human interruption of agentic tasks is currently destructive by default. Without a well-defined insertion point, an interrupt either cancels in-flight work, corrupts task state, or is queued until the agent reaches a natural pause — at which point the moment for correction may have passed. The root cause is architectural: most systems have no explicit representation of what each tool invocation has accomplished, what remains, or what constraints govern safe resumption. Interruption has nowhere to land, and neither does a retry.

Capability state — the execution record of each tool invocation — should be a durable, queryable artifact maintained by the system rather than reconstructed from logs after the fact. At minimum, this record should encode: what the invocation was asked to do, what it has completed, what remains, whether resumption is safe or requires human review, and what side effects have already been committed to external systems. A system with capability state management can distinguish a safe retry from a dangerous one; without it, the agent is guessing.
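The minimum fields listed above can be sketched directly, along with one deliberately conservative resumption rule. This is a Python illustration under stated assumptions: `CapabilityState` and `resumption_policy` are invented names, and a real policy would be richer than "any committed side effect requires review".

```python
from dataclasses import dataclass, field
from enum import Enum


class Resumption(Enum):
    SAFE_TO_RETRY = "safe_to_retry"
    NEEDS_REVIEW = "needs_review"


@dataclass
class CapabilityState:
    """Durable record of one tool invocation, per the minimum fields in
    the text: what was requested, what completed, what remains, and what
    side effects have already been committed to external systems."""
    requested: str
    completed: list[str] = field(default_factory=list)
    remaining: list[str] = field(default_factory=list)
    external_side_effects: list[str] = field(default_factory=list)  # e.g. message ids


def resumption_policy(state: CapabilityState) -> Resumption:
    # Naive but explicit rule: once any external side effect is committed,
    # a blind retry is unsafe; a human (or a correction path) must decide.
    if state.external_side_effects:
        return Resumption.NEEDS_REVIEW
    return Resumption.SAFE_TO_RETRY
```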

This closes the Doing gap with the same structural logic that the uncertainty audit trail applies to Knowing. The form differs; the principle — making implicit state explicit and inspectable — is the same.

4.3 Prescription 3 (Deciding): Prospective Memory as the Steerability Surface

Prospective memory — the agent's live model of its own planned commitments — provides the steerability surface that the Deciding domain requires in the long-running setting. If the agent maintains an explicit, queryable representation of what it intends to do next (and why), then a human interrupt can target that representation directly: canceling a specific intended action, modifying a goal parameter, or injecting a new constraint before execution rather than after. This is the difference between corrective steering and emergency braking.

Engineering this surface requires that prospective memory be a durable, inspectable artifact — not an ephemeral planning state inside a single model call. It should be serializable, versioned, and accessible to both the human interface layer and the execution engine. An agent that cannot surface its own intentions cannot be steered; it can only be stopped.
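A correction targeting that surface can be sketched in a few lines of Python. The `Intent`/`steer` names and the dict-keyed store are illustrative assumptions; what matters is that the interrupt lands on an explicit representation of a forward intention before execution, not on in-flight work.

```python
from dataclasses import dataclass


@dataclass
class Intent:
    action: str
    params: dict
    cancelled: bool = False


def steer(intents: dict[str, Intent], intent_id: str,
          cancel: bool = False, **param_updates) -> Intent:
    """A human correction targets the intent representation directly:
    cancel it, or rewrite its parameters, before anything executes."""
    intent = intents[intent_id]
    if cancel:
        intent.cancelled = True
    intent.params.update(param_updates)
    return intent
```

For example, `steer(intents, "send-proposal", price="revised")` modifies a goal parameter in place, and `steer(intents, "send-proposal", cancel=True)` is corrective steering rather than emergency braking: the execution engine simply never picks the intent up.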

Existing evaluation frameworks are predominantly turn-level: they measure whether a given output was correct given the preceding input. For agentic trajectories, this is systematically misleading. A sequence of individually plausible steps can constitute a catastrophically flawed trajectory — one in which uncertainty accumulated silently, assumptions went unflagged, and goal validity was never reconfirmed despite shifting conditions. Epistemic integrity therefore requires trajectory-level evaluation: measuring not just whether outputs were correct, but whether the epistemic journey that produced them was sound.

4.4 Cross-Cutting: Uncertainty-Gated Execution and Model Independence

Model independence. When the uncertainty audit trail, capability state record, and prospective memory surface are defined at the system level — not inside the model — they persist across model upgrades, swaps, and fine-tuning cycles. The stable what does not change when the model changes; only the volatile how by which epistemic state is produced changes. This is the stable-what principle made operational against the model layer.

Uncertainty-gated execution. The audit trail provides a principled basis for compute allocation: steps where the agent has high confirmed grounding proceed with shallow deliberation; steps where assumption density is high trigger deeper reasoning or human escalation. This is not a heuristic — it is a property of the audit trail made actionable. A system with epistemic integrity is, by construction, a system that avoids wasteful deep deliberation on steps it is already equipped to take — directly addressing the economics constraint from §2.4.
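The gating rule can be made concrete in a few lines. This Python sketch assumes the tag vocabulary from §4.1; the function name, the numeric budgets, and the 0.3 threshold are all placeholder choices, not recommendations.

```python
def deliberation_budget(trail: list[str], shallow: int = 1, deep: int = 5,
                        threshold: float = 0.3) -> int:
    """Allocate reasoning depth from the audit trail: mostly-confirmed
    steps proceed with shallow deliberation; assumption-heavy steps get
    deep deliberation (or, in a fuller system, human escalation).
    Trail entries are the tags 'confirmed' / 'inferred' / 'assumed'."""
    if not trail:
        return deep  # no epistemic record at all: be conservative
    density = sum(tag == "assumed" for tag in trail) / len(trail)
    return deep if density >= threshold else shallow
```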

5 Worked Example: A Multi-Week Vendor Negotiation

To make the framework concrete, consider a long-running agent coordinating a vendor contract renewal over three weeks — drafting correspondence, scheduling meetings, maintaining state across conversations, and surfacing decisions for human review.

Week 1

The user authorizes a goal: Renew the XYZ contract by May 15; target 10% cost reduction; maintain current SLA terms. The agent writes this into prospective memory as a goal object with a deadline, two explicit validity conditions (cost target, SLA), and an issuance timestamp. The uncertainty audit trail records the initial state of each condition as confirmed.

Week 2

The user, in a meeting with a separate team, learns of a new internal requirement: XYZ must commit to a security certification as part of any renewal. The user updates a project document but does not explicitly re-instruct the agent.

In a session-bounded architecture, this update is invisible to the agent — it has no session in which to receive it. In a long-running architecture without AEI, the agent continues executing the original goal, oblivious to the new constraint. It surfaces a successfully negotiated renewal without the certification term. The user discovers the gap too late to renegotiate without relational cost.

With AEI, three things happen differently. Belief revision, triggered by the agent's next ingestion pass over the project document, detects inconsistency between the original goal's validity conditions and newly observed evidence. The prospective memory entry for the goal is flagged: original conditions are marked potentially superseded. The agent does not autonomously re-scope the goal — that is outside its authorization. Instead, it emits a steerability signal: "Original goal conditions may be superseded by new requirement in project document; confirm scope before proceeding." The uncertainty audit trail records the assumption the agent would otherwise have made (assumed: original goal conditions unchanged) as an explicit annotation, making the decision pathway legible to any reviewer.

Week 3

An earlier tool invocation — a proposal email sent to the vendor — committed external state. The user now requests a revised proposal with updated pricing. A naive retry would issue a duplicate proposal, confusing the vendor. Capability state, however, records that the email was sent, with a message ID, and that any follow-up requires either a correction message or an explicit retraction. The execution path branches accordingly. The capability state record is updated to reflect the revised commitment chain.

The scenario is deliberately mundane. That is the point. AEI does not address exotic failure modes. It addresses failure modes that become systemic when session-bounded architectures meet multi-week operation, and it is in mundane workflows that the costs compound fastest and with the least visibility.

6 Limitations and Open Problems

AEI is meant as a design framework, not a complete theory. Several limits should be made explicit.

6.1 This Paper Is Not Yet a Full Formal Specification

This draft does not provide a canonical state tuple, update algebra, or proof obligations for AEI. It offers a conceptual decomposition and a set of architectural consequences. A more formal specification — defining precise state models, transition semantics, and measurable integrity conditions — remains future work. The framework is offered as a systems framing that sharpens the design target; it does not yet deliver the formal machinery needed to verify whether a given system meets it.

6.2 Better State Surfaces Do Not Fix Goal Quality

AEI assumes there is some meaningful goal to maintain and revise. It does not solve the problem of poorly specified or misaligned goals at issuance time. A system with perfect AEI can faithfully pursue a goal that was misspecified from the start. AEI improves traceable execution; it does not substitute for intentional alignment.

6.3 Calibration Remains Empirically Open

The uncertainty audit trail presupposes that agents can produce meaningful uncertainty estimates. Recent work on large language model calibration [Kadavath] finds that even instruction-tuned models can express high confidence in incorrect conclusions and hedge excessively on well-grounded ones. An uncertainty trail is only as useful as the uncertainty estimates attached to it. AEI creates the infrastructure in which calibration matters; it does not guarantee that calibration is solved.

6.4 Multi-Agent Composition Is Harder Than Single-Agent Integrity

AEI is developed primarily for a single agent operating with a defined set of capabilities and a coherent principal hierarchy. Real deployments increasingly involve networks of specialized agents, sub-agents spawned dynamically, and orchestrators that are themselves model-driven. In such settings, epistemic integrity becomes a compositional property. One agent's clean state can still be polluted by another agent's stale or overconfident outputs. This extension is a necessary direction for future work.

6.5 Explicit State Introduces Cost and Attack Surface

Persisting belief state, capability state, and prospective memory adds storage, latency, synchronization, and security burdens. It also creates new objects that can be tampered with if poorly secured: a poisoned belief can propagate through the audit trail as a confirmed annotation, and a manipulated prospective memory entry can redirect behavior while appearing legitimate to a human auditor [Greshake]. Those costs and risks are real. They are part of the engineering tradeoff, not a reason to avoid making state explicit altogether.

6.6 The Cold Start Problem for Prospective Memory

How does a long-running agent bootstrap a coherent prospective memory representation when first deployed, when transitioning between tasks, or when recovering from unexpected state loss? The framework identifies prospective memory as the right primitive without specifying how its contents should be initialized, validated, or recovered. That operational specification — a schema and lifecycle protocol for prospective memory objects — is a concrete engineering challenge this paper leaves open.

6.7 Adoption Is a Coordination Problem

The primitives proposed here — audit trail, capability state record, prospective memory surface — are architectural, not standards. Their practical value depends on adoption across the ecosystem: tool developers who surface capability state, orchestrators that consume uncertainty annotations, evaluation platforms that operate on trajectory-level inputs. A single system that implements these primitives in isolation gains internal benefits but cannot participate in the cross-system integrity that multi-agent deployments require. The path from architectural principle to deployed standard is a coordination and incentive problem that the quality of the specification alone does not close.

7 Conclusion

As agentic systems move from session-bounded interactions to persistent operation, the field needs a clearer vocabulary for what actually breaks. The central problem is not simply that long-running agents need more memory. It is that what they know, what they have already done, and what they are still trying to do must remain coherent and correctable across time.

Agent Epistemic Integrity is offered as a framework for naming that requirement.

The framework is intentionally stronger than a memory taxonomy and intentionally weaker than a complete formal system. Its immediate value is architectural: it highlights why stale beliefs, partial side effects, and goal drift cannot be treated as independent edge cases once agents persist across sessions. Its longer-term value, if the framing proves useful, would be to guide schemas, benchmarks, and runtime interfaces that make long-running agents easier to inspect, interrupt, and trust.

The stable what / volatile how principle is the organizing commitment underneath the framework. The requirement to maintain coherent, correctable epistemic state across Knowing, Doing, and Deciding does not change as models improve or deployment contexts diversify. The mechanisms — retrieval strategies, uncertainty scoring, capability logs, prospective memory schemas — will keep evolving. Systems built around the stable requirement will remain architecturally coherent across that evolution. Systems built around today's implementation patterns will require reinvention each time those patterns change.

The long-running agent is not a future concern. It is a present deployment reality, and the systems running in that regime today are, in the main, session-bounded architectures operating beyond their design envelope. The failure modes this paper describes — stale beliefs acted upon as current, side effects not tracked, goals pursued without surfaces for correction — are happening now, without the architectural vocabulary to name them clearly. This paper is an attempt to supply that vocabulary.

References