Abstract
As agentic systems move into long-running operation, their characteristic failure modes stop being legible to the evaluation frameworks we inherited from the session era. Trace- and trajectory-level evaluation has begun to appear — ATBench, FinTrace, and production trace-grading and long-running harness work all push in this direction — but the remaining gap is not simply that eval traces need to be longer. The gap is that long-running agents need eval dimensions tied to state integrity: whether beliefs remain fresh, effects remain idempotent, goals remain valid, and state is preserved across handoffs. This paper proposes a five-dimension trajectory-level eval taxonomy organized around the three Agent Epistemic Integrity domains — Knowing, Doing, Deciding — plus a cross-cutting dimension for multi-agent coordination. For each dimension I give an operationalization sketch, a proposed metric with scoring treatment, and the failure mode it targets. The contribution is conceptual: a stable taxonomy for what should be measured. The volatile mechanisms — specific benchmarks, harnesses, and metric formalizations — are the research agenda that follows.
Consider an agent assigned a vendor contract renewal in early April. The mandate is explicit: close by May 15, ten percent cost reduction from current pricing, current SLA preserved. The agent works for three weeks.
At every turn, it does what a turn-level evaluator would ask it to do. It retrieves vendor history when planning outreach — turn passes. It invokes the pricing analysis tool on well-formed inputs and consumes the result correctly — turn passes. It drafts a proposal email that faithfully reflects the visible state of the conversation thread — turn passes. It schedules a call when scheduling is the obvious next step — turn passes. Forty-seven turns in, it produces a signed renewal.
The signed renewal is four percent below current pricing, not ten. The SLA language has softened — a reciprocal concession offered late in the exchange that the agent, absent a trajectory-level view, accepted as a locally reasonable trade. A security certification requirement, added to the project documentation in week two and never restated in the conversation, is absent from the contract. Every individual action the agent took made local sense. A turn-level eval rates the trajectory a success.
The failure is visible only at trajectory scope. It is not a gap in any one benchmark; it is a structural blind spot in how the field defines eval for this class of system. The agent did not fail at a step. It failed at the integrity of a sequence of steps: beliefs that were valid when formed and silently went stale; side effects that accumulated into a commitment chain no single turn could see; a goal whose validity conditions drifted without ever being re-surfaced for confirmation. These are failures of what I have elsewhere called Agent Epistemic Integrity [AEI]. They are also failures of evaluation methodology, and that is the subject of this paper.
This paper takes as its starting point the claim in [AEI] that long-running agents require trajectory-level evaluation rather than turn-level success alone. That claim was stated there but not developed. Here I develop it. Trace- and trajectory-level evaluation is no longer absent from the field: ATBench [ATBench] and FinTrace [FinTrace] are explicitly trajectory-level benchmarks, OpenAI’s agent evaluation tooling treats trace-level grading as a first-class operation [OpenAIEvals], and Anthropic’s harness work establishes long-horizon continuity as a production pattern [AnthropicHarness]. The contribution here is therefore narrower than “trajectories matter.” It is a state-integrity decomposition of what trajectory-level eval should measure for long-running agents specifically: four dimensions organized around the three AEI domains (Knowing, Doing, Deciding), plus a fifth, cross-cutting dimension for multi-agent handoff fidelity.
This paper is the evaluation-methodology counterpart to the runtime and AEI sequence: memory generation [TwoStage], memory taxonomy [FourTier], skill and tool quality [Quality], runtime structure [Runtime], and now trajectory-level eval.
The paper is conceptual. It does not yet supply a benchmark. Its contribution is to name, with enough precision to be implemented against, what a trajectory-level eval needs to measure. The five dimensions are: epistemic drift rate, interruption and resumption fidelity, mid-task replanning quality, goal drift detection, and multi-agent handoff fidelity. The how — specific harnesses, metric formalizations, adversarial trajectory generators — follows from the framework, not the other way around.
Section 2 develops three structural arguments for why turn-level evaluation fails for long-running agents, one per AEI domain. Section 3 proposes the five-dimension trajectory-level eval taxonomy that is this paper’s central contribution. Section 4 positions the taxonomy against current eval frameworks, grouped into closest trajectory-level neighbors and adjacent agent or tool evals. Section 5 addresses implementation considerations and sketches a harness blueprint. Section 6 states the research agenda.
Four terms appear repeatedly in this paper and are often used interchangeably in the broader literature. I separate them here to avoid ambiguity in the taxonomy that follows.
Turn is one local interaction: a user input, the agent’s reasoning and tool use, and its response. This is the unit of analysis for turn-level evaluation.
Trace is the observed event log of an agent’s execution: model calls, tool calls, inputs and outputs, guardrail decisions, and handoff events. This is the operative definition in production trace-grading infrastructure [OpenAIEvals].
Episode (or trial) is one benchmark run from task issuance to completion or termination; benchmarks report results aggregated over episodes.
Trajectory is the ordered evolution of the agent’s stateful objects over the episode: belief state, capability state, goal state, committed external effects, interruption and resumption events, and handoffs.
The distinction between trace and trajectory is the critical one. A trace is a flat record of events. A trajectory is the state object whose integrity those events reveal or fail to reveal. A trace is evidence; a trajectory is the object being evaluated. Turn-level eval grades individual turns within a trace. Trace-level grading — the direction production infrastructure is taking — grades properties of the event log. Trajectory-level eval, as proposed here, grades whether the stateful trajectory maintained integrity across the episode. The five dimensions in §3 define what that integrity requires.
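The trace/trajectory distinction can be made concrete with a small sketch. The schema below is an assumption introduced purely for illustration: `TraceEvent`, `TrajectoryState`, `fold_trace`, and the three event kinds are hypothetical names, not part of any cited tool. The trace is a flat list of events; the trajectory is the ordered sequence of state snapshots those events imply.

```python
from dataclasses import dataclass, field, replace
from typing import Any

@dataclass(frozen=True)
class TraceEvent:
    """One entry in the flat event log (the trace). Hypothetical schema."""
    step: int
    kind: str   # assumed kinds: "belief", "goal", "effect"
    key: str
    value: Any

@dataclass(frozen=True)
class TrajectoryState:
    """Snapshot of the agent's stateful objects at one step (assumed minimal)."""
    step: int
    beliefs: dict = field(default_factory=dict)
    goal: dict = field(default_factory=dict)
    committed_effects: tuple = ()

def fold_trace(trace):
    """Derive the trajectory (ordered state snapshots) from the trace."""
    state = TrajectoryState(step=0)
    trajectory = [state]
    for ev in sorted(trace, key=lambda e: e.step):
        if ev.kind == "belief":
            state = replace(state, step=ev.step,
                            beliefs={**state.beliefs, ev.key: ev.value})
        elif ev.kind == "goal":
            state = replace(state, step=ev.step,
                            goal={**state.goal, ev.key: ev.value})
        elif ev.kind == "effect":
            state = replace(state, step=ev.step,
                            committed_effects=state.committed_effects + ((ev.key, ev.value),))
        else:
            continue  # events that do not touch state leave the trajectory unchanged
        trajectory.append(state)
    return trajectory

trace = [
    TraceEvent(1, "goal", "target_discount", 0.10),
    TraceEvent(2, "belief", "vendor_price", 100),
    TraceEvent(3, "effect", "email_sent", "proposal-v1"),
]
trajectory = fold_trace(trace)
```

In this framing a trace grader would score the items of `trace`, while a trajectory-level eval would score properties of `trajectory`, for example whether a belief recorded at step 2 is still valid when an effect is committed at step 3.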
Turn-level eval inherits three silent assumptions from the session-bounded systems it was designed for. Each maps directly to one AEI domain, and each breaks as soon as the agent operates across an extended trajectory. The argument is not that turn-level evaluation is wrong within its intended scope. The argument is that its scope was never defined in a way that contemplated long-running operation, and its structural blind spots become visible only once we ask it to cover that regime.
Turn-level eval asks: did the agent retrieve and use the right information at time T? It cannot ask: did the agent’s beliefs remain valid from T0 to T50? Staleness is not a property of any single turn. It is a property of the delta between when a belief was formed and when it is acted upon. Turn-level eval sees beliefs only in the act of being used, and at that moment the question of whether the belief is still current is either obvious (if the retrieval itself produced it) or invisible (if the belief was carried forward from an earlier turn).
Three distinct staleness types sit under this umbrella, and treating them as one flattens the diagnostic picture the eval needs to support.
World-state staleness is the case the field most often recognizes: the agent holds a belief about external reality — a price, a schedule, a policy, a person’s role — that was accurate when formed and has changed in the intervening time. Goal-parameter staleness is subtler and closer to the AEI failure vignette: the belief about what the task is, or about the constraints on how it should be done, has been quietly superseded by an update the agent has not observed. Capability-contract staleness is rarest but most dangerous: the agent’s model of what a tool does or what side effects it produces is out of date, and actions it treats as safe are no longer safe. A trajectory-level eval must be able to inject all three and measure response separately. A turn-level eval cannot inject any of them meaningfully, because the injection point is outside the turn’s frame.
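As a minimal sketch of how a harness might keep the three staleness types separate, the code below tags each belief with a type and checks whether it was already superseded when the agent acted on it. All names (`StalenessType`, `Belief`, `is_stale_when_acted_on`, `stale_actions_by_type`) and the example beliefs are hypothetical; `superseded_at` is assumed to be known only to the harness, never to the agent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class StalenessType(Enum):
    WORLD_STATE = "world_state"
    GOAL_PARAMETER = "goal_parameter"
    CAPABILITY_CONTRACT = "capability_contract"

@dataclass
class Belief:
    key: str
    staleness_type: StalenessType
    formed_at: int                       # step at which the agent formed the belief
    superseded_at: Optional[int] = None  # step at which ground truth changed (harness-side)

def is_stale_when_acted_on(belief: Belief, acted_at: int) -> bool:
    """True if ground truth changed after formation and at or before the action."""
    return (belief.superseded_at is not None
            and belief.formed_at < belief.superseded_at <= acted_at)

def stale_actions_by_type(actions):
    """Count stale-belief actions per staleness type; `actions` is (belief, acted_at) pairs."""
    counts = {t: 0 for t in StalenessType}
    for belief, acted_at in actions:
        if is_stale_when_acted_on(belief, acted_at):
            counts[belief.staleness_type] += 1
    return counts

# Illustrative beliefs loosely modeled on the vendor-renewal vignette.
price = Belief("vendor_price", StalenessType.WORLD_STATE, formed_at=3, superseded_at=10)
cert = Belief("security_cert_required", StalenessType.GOAL_PARAMETER, formed_at=1, superseded_at=14)
tool = Belief("send_invoice_is_idempotent", StalenessType.CAPABILITY_CONTRACT, formed_at=2)

counts = stale_actions_by_type([(price, 8), (price, 12), (cert, 40), (tool, 30)])
```

Reporting the counts per type, rather than as one aggregate, is what keeps the diagnostic picture from being flattened.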
Turn-level eval assumes each tool invocation can be evaluated in isolation. In a session, this assumption is usually benign: sessions are short, tools are fast, and the space of compounding effects is narrow enough for ordinary test cases to cover. Long-running operation breaks the assumption in three ways.
First, effects accumulate. A tool invocation that is individually correct may be part of a sequence whose cumulative effect is incorrect — three successive partial payments that together exceed an authorization threshold, for instance, or a chain of permission grants that in aggregate produce an access pattern no single grant would be approved for. Second, the safety profile of an action is trajectory-dependent: the same action may be safe at T1 and unsafe at T2 because of what happened between. Third — and this is the failure mode turn-level eval is structurally least equipped to catch — retry logic that is correct within a session can be catastrophically incorrect when the agent resumes after interruption. A naive retry issues a duplicate email, double-files a form, or re-charges a payment, because the turn-level correctness of “send email” does not depend on whether the email was already sent in a previous, now-forgotten turn.
Turn-level eval tests individual invocations for individual correctness. Compounding side-effect errors are invisible at that resolution. They live in the relationship between invocations, which is exactly the object turn-level eval cannot see.
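The retry failure can be illustrated with a small sketch of a durable effect ledger keyed by an idempotency key. This is one common pattern, offered here as an assumption rather than as the mechanism [AEI] prescribes; `EffectLedger`, `execute_once`, and `send_email` are hypothetical names, and an in-memory dict stands in for durable storage.

```python
import hashlib
import json

class EffectLedger:
    """Record of committed external effects (a dict stands in for durable storage)."""

    def __init__(self):
        self._committed = {}

    @staticmethod
    def key_for(action: str, params: dict) -> str:
        # Content-derived idempotency key: same action and params give the same key.
        payload = json.dumps({"action": action, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def execute_once(self, action, params, effect_fn):
        """Run effect_fn unless this effect was already committed.

        Returns (result, executed), where executed is False on a replay.
        """
        key = self.key_for(action, params)
        if key in self._committed:
            return self._committed[key], False
        result = effect_fn(**params)
        self._committed[key] = result
        return result, True

sent = []  # stands in for the outside world

def send_email(to, subject):
    sent.append((to, subject))
    return f"sent:{to}"

ledger = EffectLedger()
params = {"to": "vendor@example.com", "subject": "Renewal"}
first = ledger.execute_once("send_email", params, send_email)
# Simulated resumption: the agent's in-context memory is gone, but the ledger survives.
retry = ledger.execute_once("send_email", params, send_email)
```

A content-derived key also suppresses legitimately repeated actions, so a real system would more likely use caller-supplied keys scoped to a task. The point of the sketch is only that the duplicate is detectable in the relationship between invocations, not in either invocation alone.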
A turn-level eval asks: given the goal, did the agent take the locally correct next step? It presupposes the goal. It cannot ask: is the goal the agent is pursuing still valid? Goal drift — the progressive divergence between the agent’s active objective and what the user currently wants — is undetectable at turn scope, and by the time a turn-level eval would flag an error because the agent has done something concretely wrong, the root cause is fifteen steps upstream and no longer visible.
The case is made worse by a property peculiar to drift: the individual steps that constitute drift are, by construction, locally optimal given the goal as the agent currently understands it. A turn-level eval that tested each step would grade them all as reasonable. The drift is visible only if the eval can look at the arc of the trajectory and ask whether the objective the agent is optimizing now is the objective the user would endorse now. That is not a turn-level question. It is not even a purely technical question — it requires the eval to maintain its own model of the user’s evolving intent, which is a research problem in itself. But the first step is to acknowledge that the question exists and is not being asked by current eval methodology.
| Domain | Turn-level assumption | Long-running reality | What becomes invisible |
|---|---|---|---|
| Knowing | Belief used in this turn is the belief to evaluate | Beliefs persist across turns and silently go stale | The delta between when a belief was formed and when it is acted upon |
| Doing | Each tool invocation can be scored in isolation | Effects compound; safety is trajectory-dependent; retries cross session boundaries | The relationship between invocations, including duplicate and unsafe retries |
| Deciding | The goal is fixed; evaluate the local step | Goal validity conditions drift over the trajectory | Whether the objective being optimized is still the one the user endorses |
If the failure modes of long-running agents live at trajectory scope, the eval dimensions that target those failure modes must live there too. I propose five. Four map to the three AEI domains (Knowing contributes one, Doing contributes one, Deciding contributes two — splitting explicit constraint change from implicit cumulative drift, because they surface differently and require different operationalizations). The fifth is cross-cutting and addresses a reality the AEI framework acknowledges but does not develop: the increasing prevalence of multi-agent execution, in which epistemic and capability state must be transferred faithfully across agent boundaries.
For each dimension I give four things: what it measures, an operationalization sketch concrete enough to argue with, a proposed metric, and the key failure mode the dimension is designed to surface.
| Dimension | Domain | Proposed metric | Key failure mode | Gap in existing evals |
|---|---|---|---|---|
| Epistemic drift rate | Knowing | Belief revision latency | Superseded-state persistence | World-state is static during eval |
| Interruption / resumption fidelity | Doing | Resumption consistency score | Phantom re-execution | Interruption and multi-session persistence not tested |
| Mid-task replanning quality | Deciding | Replan coherence score | Silent continuation | Mid-task constraint injection not supported |
| Goal drift detection | Deciding | Goal alignment latency + surface rate | Silent reoptimization | No model of evolving user intent |
| Multi-agent handoff fidelity | Cross-cutting | Handoff fidelity score (knowledge / capability / goal) | Handoff hallucination | Handoff treated as input, not as eval target |
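To show what a three-channel handoff fidelity score could look like, here is a deliberately simple sketch. The formula is an assumption of this example, not a formalization given in the paper: each channel (knowledge, capability, goal) is modeled as a flat dict, `preserved` is the fraction of source entries that arrive unchanged, and `fabricated` counts received entries that are new or altered. The function names and example state are hypothetical.

```python
def channel_fidelity(source: dict, received: dict) -> dict:
    """Compare what the sending agent held with what the receiving agent holds."""
    preserved = sum(1 for k, v in source.items()
                    if k in received and received[k] == v)
    fabricated = sum(1 for k, v in received.items()
                     if k not in source or source[k] != v)
    return {
        "preserved": preserved / len(source) if source else 1.0,
        "fabricated": fabricated,
    }

def handoff_fidelity(source_state: dict, received_state: dict) -> dict:
    """Score each channel separately so losses are attributable."""
    return {
        channel: channel_fidelity(source_state.get(channel, {}),
                                  received_state.get(channel, {}))
        for channel in ("knowledge", "capability", "goal")
    }

source_state = {
    "knowledge": {"vendor_price": 100, "sla": "99.9%"},
    "capability": {"email_sent": True},
    "goal": {"discount": 0.10, "deadline": "May 15"},
}
received_state = {
    "knowledge": {"vendor_price": 100, "sla": "99.5%"},  # altered in transfer
    "capability": {"email_sent": True},
    "goal": {"discount": 0.10},                          # deadline dropped
}
scores = handoff_fidelity(source_state, received_state)
```

Keeping omission (`preserved` below 1.0) apart from fabrication (`fabricated` above 0) matters because a dropped constraint and an invented one call for different remedies.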
The taxonomy proposed here is complementary to existing frameworks, not competitive with them. Trace- and trajectory-level evaluation is no longer missing from the field; the question is what shape it should take. Existing benchmarks now partially instantiate trajectory-aware evaluation, especially in safety and tool-calling domains. What they do not yet provide is an AEI-level decomposition of long-running state integrity: belief freshness, durable capability state, goal-validity lifecycle, and cross-agent state transfer. I organize the comparison into two groups.
ATBench [ATBench] is the closest academic neighbor to the framework proposed here. It is explicitly a trajectory-level safety benchmark — a thousand trajectories constructed to surface risks that emerge over long-context execution, with delayed-trigger scenarios designed to catch unsafe behavior that turn-level evaluation would miss. Its coverage is strongest on Doing and Deciding safety: unsafe tool invocations and risky decisions that compound. Its structural gap, relative to this paper, is that it does not decompose trajectory integrity into the five dimensions proposed here. Epistemic drift rate with controlled ground-truth injection, resumption consistency under interruption, and handoff fidelity are not first-class targets of the current ATBench design. The taxonomy is complementary: ATBench provides concrete adversarial trajectories; this paper provides the decomposition that structures what the benchmark should measure.
FinTrace [FinTrace] operationalizes holistic trajectory-level evaluation in the financial tool-calling domain, with a set of trajectory-level metrics over expert-annotated trajectories. Its framing — measure the arc, not the step — is exactly the framing this paper argues for in the general agentic case. Its scope is the financial domain and the tool-calling subset of Doing. The taxonomy generalizes that framing across Knowing, Doing, Deciding, and handoff, and gives FinTrace-style evaluation a domain-independent scaffold.
Production trace-grading infrastructure. OpenAI’s agent evaluation tooling defines traces as end-to-end records of model calls, tool calls, guardrails, and handoffs, and supports trace-level grading as a first-class operation [OpenAIEvals]. Anthropic’s harness work for long-running agents emphasizes continuity across many context windows and structured artifacts as the substrate for long-horizon execution [AnthropicHarness]. These are engineering patterns, not benchmarks, but they matter here because they establish trace-level grading and long-horizon harnesses as production realities, not research proposals. The taxonomy proposed in this paper gives that engineering direction a state-integrity target: what the grading should measure, and what properties the harness must expose to make the measurement tractable.
τ-bench [τ-bench] is strong on session-level tool-correctness in realistic domains. Its blind spot is the long-running regime: it does not test capability-state accumulation across resumptions, inject mid-trajectory world-state updates, or vary handoff completeness. Extending τ-bench in any of these directions is a category shift rather than a parameter adjustment.
GAIA [GAIA] tests reasoning and tool use in single-session settings with static world state. Its regime is session-bounded by design; there is no adjustment to the GAIA setup that would produce a measurement of belief revision latency or resumption consistency, because the phenomena those metrics target do not occur in a single session with static state.
AgentBench [AgentBench] evaluates agents across multiple environments with per-task success as the aggregate metric. Horizontal coverage is the contribution; trajectory depth is the blind spot.
When2Call [When2Call] addresses a question that is orthogonal to the taxonomy but important for it: restraint. Restraint is a trajectory-level property, because the correct decision not to act is visible only against the alternative trajectories it precludes. MCPVerse [MCPVerse] scales the tool-use benchmark surface; combined with trajectory-level metrics, it could make trajectory evaluation feasible at the scale at which current agentic systems operate.
The taxonomy also sits downstream of agent-architecture work rather than competing with it. ReAct [ReAct] and Reflexion [Reflexion] shaped the action/observation/self-correction loop that underlies how agents are built; CoALA [CoALA] and MemGPT [MemGPT] helped formalize memory and cognitive architecture for language agents; prospective-memory theory [Brandimonte] supplies the cognitive baseline for the forward-looking obligation primitive that AEI makes architectural. The present paper asks how those capabilities should be evaluated once the unit of analysis is a long-running trajectory rather than a single session or task — a question the architecture literature was not designed to answer.
| Framework | Eval unit | State injection | Interruption / resumption | Goal drift | Handoff fidelity | AEI coverage |
|---|---|---|---|---|---|---|
| ATBench | Trajectory | Partial | No | No | No | Doing / Deciding safety |
| FinTrace | Trajectory | No | No | No | No | Doing (tool calling) |
| OpenAI trace grading | Trace | No | No | No | Surface only | Engineering substrate |
| Anthropic harness | Episode | N/A | Engineering pattern, not standardized eval | No | No | Engineering substrate |
| τ-bench | Turn / session | No | No | No | No | Session Doing |
| GAIA | Episode (single session) | No | No | No | No | Session Knowing / Deciding |
| AgentBench | Episode | No | No | No | No | Breadth |
| When2Call | Turn (restraint) | No | No | No | No | Doing (refusal) |
| MCPVerse | Turn (tool call) | No | No | No | No | Doing (scale) |
| This paper | Trajectory (state integrity) | Yes (three staleness types) | Yes | Yes | Yes | Full AEI + cross-cutting |
The positioning claim is therefore modest and defensible: no single current benchmark covers the five dimensions proposed here, but several current benchmarks instantiate one or two of them partially, and the field’s direction of travel is toward trajectory-aware evaluation. The taxonomy is offered as a scaffold around which that direction of travel can be organized.
A taxonomy that cannot be implemented against is decorative. This section addresses four implementation considerations that are, in my judgment, the concrete research problems that gate moving from this framework to a deployable eval harness, and closes with a sketch of the harness architecture itself.
The epistemic drift rate dimension requires mid-trajectory ground-truth updates. Naive injection — a message to the agent saying “the price has changed” — collapses into testing reading comprehension rather than belief revision. The more interesting injection is indirect: a document the agent has access to is updated; a tool’s return values begin reflecting the new state; a peripheral channel carries the update in a form the agent must notice without being told to look. Designing injection protocols that exercise genuine belief revision rather than explicit instruction following is a research problem worth its own paper.
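A minimal sketch of indirect injection, under stated assumptions: the update arrives only through a tool's return value at a pre-registered step, and no message tells the agent anything changed. `InjectableTool` and `stale_belief_agent` are hypothetical names, and the toy agent, which re-queries on a fixed schedule, exists only to make the stale window visible.

```python
class InjectableTool:
    """Tool stub whose return value changes at pre-registered steps."""

    def __init__(self, initial, updates):
        self._initial = initial
        self._updates = sorted(updates.items())  # {step: new_value}

    def value_at(self, step):
        """Ground-truth value at `step` (harness-side view)."""
        value = self._initial
        for at_step, new_value in self._updates:
            if step >= at_step:
                value = new_value
        return value

    def __call__(self, step):
        """Agent-side view: what the tool returns when queried at `step`."""
        return self.value_at(step)

def stale_belief_agent(tool, steps, refresh_every):
    """Toy agent that re-queries the tool only every `refresh_every` steps."""
    belief, log = None, []
    for step in range(steps):
        if step % refresh_every == 0:
            belief = tool(step)
        log.append((step, belief, tool.value_at(step)))
    return log

price_tool = InjectableTool(initial=100, updates={12: 90})
log = stale_belief_agent(price_tool, steps=30, refresh_every=10)
stale_steps = [step for step, belief, truth in log if belief != truth]
```

Here the price changes at step 12 and the agent next refreshes at step 20, so it acts on a superseded price for steps 12 through 19. That window is the quantity a belief-revision metric would measure.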
Each dimension has a length threshold below which the measurement is not meaningful, where N is the trajectory length in steps. Epistemic drift rate requires a trajectory long enough for multiple ground-truth updates to be injected and responded to, with a meaningful observation window after each; I suggest N ≥ 40 as a working threshold. Goal drift detection requires enough cumulative interaction for intent shift to be plausibly inferable; I suggest N ≥ 20. Interruption and resumption fidelity requires multi-session persistence: minimally, two sessions separated by full in-context state loss. Handoff fidelity requires at least two agents in a pipeline and is more diagnostic with three or more. Below these thresholds, the measurement can appear to succeed for reasons orthogonal to what the dimension is designed to surface.
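These working thresholds can be encoded as a gate that a harness applies before scoring. The sketch below is illustrative: the dictionary keys and the function `measurable_dimensions` are hypothetical names, and mid-task replanning quality is omitted because no threshold is stated for it above.

```python
# Working thresholds as suggested in the text; treat them as tunable assumptions.
MIN_REQUIREMENTS = {
    "epistemic_drift_rate": {"steps": 40},
    "goal_drift_detection": {"steps": 20},
    "interruption_resumption_fidelity": {"sessions": 2},
    "multi_agent_handoff_fidelity": {"agents": 2},
}

def measurable_dimensions(steps: int, sessions: int, agents: int) -> list:
    """Return the dimensions whose minimum trajectory shape is satisfied."""
    shape = {"steps": steps, "sessions": sessions, "agents": agents}
    return [
        dim for dim, req in MIN_REQUIREMENTS.items()
        if all(shape[name] >= minimum for name, minimum in req.items())
    ]

short_run = measurable_dimensions(steps=25, sessions=1, agents=2)
long_run = measurable_dimensions(steps=60, sessions=3, agents=3)
```

A gate like this makes the failure explicit: a 25-step, single-session trajectory is simply not scored on epistemic drift rate or resumption fidelity, rather than reporting a number that looks like success.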
Turn-level LLM-as-judge protocols are well-studied: the judge is shown an input and an output and asked to score. Trajectory-level judging is a different object. The judge must reason about state across N steps, hold in mind what the user originally asked and how their stated or implied intent has shifted, and assess whether the agent’s cumulative behavior preserves intent under constraint change. This is a research problem — not an engineering problem — because the judge’s capability to do this assessment is itself a function of the same long-context and state-tracking limitations that motivate the taxonomy in the first place. An honest implementation will acknowledge that early trajectory-level evals will be partially judge-limited and will need to combine automated judging with human spot-checking until judging capability catches up.
Trajectory-level eval is dramatically more expensive per data point than turn-level. A single trajectory of N = 30 steps with three ground-truth injections and a handoff costs roughly an order of magnitude more to annotate than a comparable set of individual turns. This cost pressure is real, and the field will need evaluation architectures that partially automate trajectory assessment. The two-stage memory generation architecture I described in earlier work [TwoStage] is a pattern worth applying here: async, offline processing over the trajectory to extract structured evaluation signals, followed by online human or judge review of only the signals that warrant it. The same logic that made memory generation tractable at scale can make trajectory evaluation tractable at scale; the mechanism differs, but the architectural pattern carries over.
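The two-stage pattern applied to evaluation can be sketched as follows. This is an assumption-laden illustration, not the architecture from [TwoStage]: the step dictionaries, the two signal types, and the `needs_review` flag are all hypothetical. Stage one scans the full trajectory offline and emits structured signals; stage two forwards only the interpretive ones to a judge or human.

```python
def extract_signals(trajectory):
    """Stage 1 (offline): scan the whole trajectory and emit structured signals."""
    signals = []
    seen_effects = set()
    for step in trajectory:
        effect = step.get("effect")
        if effect is not None:
            if effect in seen_effects:
                # Closed-form finding: can be scored without review.
                signals.append({"step": step["i"], "type": "duplicate_effect",
                                "needs_review": False})
            seen_effects.add(effect)
        if step.get("goal_changed"):
            # Interpretive finding: whether the agent handled it well needs judgment.
            signals.append({"step": step["i"], "type": "goal_change",
                            "needs_review": True})
    return signals

def review_queue(signals):
    """Stage 2 (online): only interpretive signals reach a judge or human."""
    return [s for s in signals if s["needs_review"]]

trajectory = [
    {"i": 0, "effect": "email:proposal-v1"},
    {"i": 1},
    {"i": 2, "effect": "email:proposal-v1"},  # duplicate commit
    {"i": 3, "goal_changed": True},
]
signals = extract_signals(trajectory)
queue = review_queue(signals)
```

The cost saving comes from the ratio between the number of steps scanned in stage one and the number of signals that survive into the review queue.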
A harness capable of evaluating all five dimensions decomposes into ten components. Each component has a single purpose and maps to a definite subset of the taxonomy. The blueprint is deliberately spare: each component is architectural, not implementation-specific, and multiple implementations can satisfy the same component contract. This is the stable-what treatment applied to the harness itself.
| Component | Purpose | Dimensions supported |
|---|---|---|
| Scenario generator | Produce trajectories with pre-registered event structures, goal specifications, and validity conditions | All |
| Ground-truth ledger | Track authoritative world state, goal state, and user intent at each trajectory step | 3.1, 3.3, 3.4 |
| Event injector | Inject world-state updates, constraint changes, and implicit intent shifts at pre-registered points | 3.1, 3.3, 3.4 |
| Checkpoint / resumption manager | Interrupt agents mid-trajectory and restart them from durable state only, with in-context memory cleared | 3.2 |
| Side-effect sandbox | Capture external commitments so the harness can distinguish already-committed effects from safe-to-replay ones | 3.2 |
| Handoff corruptor | Produce adversarial handoff scenarios with partial or ambiguous state transfer | 3.5 |
| State probes | Query the agent’s belief, capability, and goal state at any trajectory point. Probes should read externalized runtime artifacts rather than prompting the agent to explain itself mid-run; otherwise the probe can perturb the trajectory being measured. | All |
| Deterministic graders | Score metrics with closed-form answers — latency, duplicate-action rate, side-effect commitment rate | 3.1, 3.2 |
| LLM judge | Score trajectory-scope properties requiring state reasoning — replan coherence, surface precision, handoff confabulation | 3.3, 3.4, 3.5 |
| Human adjudication queue | Review judge decisions flagged as ambiguous or low-confidence, with structured disagreement capture | 3.3, 3.4, 3.5 |
Two observations about the blueprint. First, the ground-truth ledger and the state probes are the load-bearing components: the scenario generator and event injector write into the ledger, and the graders, judge, and adjudication queue score the agent’s probed state against it. A harness whose ledger is weakly specified cannot evaluate any dimension soundly, because the eval reduces to comparing the agent’s state against an uncertain reference. Second, the deterministic-grader / LLM-judge split roughly tracks the taxonomy’s own split between closed-form and interpretive dimensions. This is a structural feature worth preserving: trajectories in which all dimensions can be scored deterministically are useful for regression testing; trajectories requiring judge adjudication are where the field’s current evaluation research should concentrate.
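The ledger, probe, and deterministic-grader relationship can be sketched in a few lines. Everything here is hypothetical: `GroundTruthLedger`, `belief_revision_latency`, and the representation of probes as a step-to-value mapping are assumptions made for illustration, and the latency definition (steps from injection to the first probe that matches the ledger) is one possible reading of the metric named in the taxonomy table.

```python
from typing import Optional

class GroundTruthLedger:
    """Authoritative value per key per step, written by the generator and injector."""

    def __init__(self):
        self._updates = {}  # key -> list of (step, value), kept sorted by step

    def write(self, key, step, value):
        self._updates.setdefault(key, []).append((step, value))
        self._updates[key].sort(key=lambda update: update[0])

    def read(self, key, step):
        """Latest value written at or before `step`, or None if none exists."""
        value = None
        for at_step, v in self._updates.get(key, []):
            if at_step <= step:
                value = v
        return value

def belief_revision_latency(ledger, probes, key, injected_at) -> Optional[int]:
    """Steps between an injected update and the first probe matching the ledger.

    `probes` maps step -> the agent's externalized belief for `key` at that step.
    Returns None if the agent never revises within the observed window.
    """
    for step in sorted(s for s in probes if s >= injected_at):
        if probes[step] == ledger.read(key, step):
            return step - injected_at
    return None

ledger = GroundTruthLedger()
ledger.write("vendor_price", 0, 100)
ledger.write("vendor_price", 12, 90)  # injected update

probes = {10: 100, 12: 100, 15: 100, 20: 90, 25: 90}
latency = belief_revision_latency(ledger, probes, "vendor_price", injected_at=12)
never = belief_revision_latency(ledger, {12: 100, 15: 100}, "vendor_price", injected_at=12)
```

The grader never asks the agent anything. It compares probe output with ledger state, which is why a weakly specified ledger undermines every score built on it.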
This paper proposes a trajectory-level eval taxonomy for long-running agentic workflows, organized around five dimensions: epistemic drift rate, interruption and resumption fidelity, mid-task replanning quality, goal drift detection, and multi-agent handoff fidelity. The first four map to the three AEI domains of Knowing, Doing, and Deciding. The fifth is cross-cutting and addresses the multi-agent extension the AEI framework acknowledges but does not develop. Each dimension targets a failure mode — superseded-state persistence, phantom re-execution, silent continuation, silent reoptimization, handoff hallucination — that is invisible at turn scope and only partially covered by current trajectory-aware benchmarks. The claim is not that no one is evaluating trajectories; it is that the field lacks a stable taxonomy for evaluating the integrity of the state that long-running trajectories carry.
The contribution is conceptual. I have argued for what should be measured, not for a specific measurement instrument. The next step is the harness: synthetic trajectory generation with controlled mid-task injections, adversarial handoff scenarios, trajectory-scale LLM-as-judge protocols, and the annotation architectures that make all of this tractable at the scale frontier systems operate at. That work is the research agenda that follows from this framework, not a prerequisite for it. The contribution here is the stable what: the requirement that long-running agents be evaluated at trajectory scope, decomposed along dimensions that reflect the state-integrity obligations that scope reveals. The volatile how — benchmarks, harnesses, metric formalizations — is the engineering and research agenda the framework organizes.