Evaluating Long-Running Agentic Workflows: A Trajectory-Level Eval Taxonomy

Iris Shen · April 2026 · iris-axon-lab.github.io

Abstract

As agentic systems move into long-running operation, their characteristic failure modes stop being legible to the evaluation frameworks we inherited from the session era. Trace- and trajectory-level evaluation has begun to appear — ATBench, FinTrace, and production trace-grading and long-running harness work all push in this direction — but the remaining gap is not simply that eval traces need to be longer. The gap is that long-running agents need eval dimensions tied to state integrity: whether beliefs remain fresh, effects remain idempotent, goals remain valid, and state is preserved across handoffs. This paper proposes a five-dimension trajectory-level eval taxonomy organized around the three Agent Epistemic Integrity domains — Knowing, Doing, Deciding — plus a cross-cutting dimension for multi-agent coordination. For each dimension I give an operationalization sketch, a proposed metric with scoring treatment, and the failure mode it targets. The contribution is conceptual: a stable taxonomy for what should be measured. The volatile mechanisms — specific benchmarks, harnesses, and metric formalizations — are the research agenda that follows.

1 Introduction: An Eval That Passes a Broken Trajectory

Consider an agent assigned a vendor contract renewal in early April. The mandate is explicit: close by May 15, ten percent cost reduction from current pricing, current SLA preserved. The agent works for three weeks.

At every turn, it does what a turn-level evaluator would ask it to do. It retrieves vendor history when planning outreach — turn passes. It invokes the pricing analysis tool on well-formed inputs and consumes the result correctly — turn passes. It drafts a proposal email that faithfully reflects the visible state of the conversation thread — turn passes. It schedules a call when scheduling is the obvious next step — turn passes. Forty-seven turns in, it produces a signed renewal.

The signed renewal is four percent below current pricing, not ten. The SLA language has softened — a reciprocal concession offered late in the exchange that the agent, absent a trajectory-level view, accepted as a locally reasonable trade. A security certification requirement, added to the project documentation in week two and never restated in the conversation, is absent from the contract. Every individual action the agent took made local sense. A turn-level eval rates the trajectory a success.

The failure is only visible at trajectory scope. And it is not a gap in the specific benchmark — it is a structural blind spot in how the field defines eval for this class of system. The agent did not fail at a step. It failed at the integrity of a sequence of steps: beliefs that were valid when formed and silently went stale; side effects that accumulated into a commitment chain no single turn could see; a goal whose validity conditions drifted without ever being re-surfaced for confirmation. These are failures of what I have elsewhere called Agent Epistemic Integrity [AEI]. They are also failures of evaluation methodology, and that is the subject of this paper.

This paper takes as its starting point the claim in [AEI] that long-running agents require trajectory-level evaluation rather than turn-level success alone. That claim was stated there but not developed. Here I develop it. Trace- and trajectory-level evaluation is no longer absent from the field: ATBench [ATBench] and FinTrace [FinTrace] are explicitly trajectory-level benchmarks, OpenAI’s agent evaluation tooling treats trace-level grading as a first-class operation [OpenAIEvals], and Anthropic’s harness work establishes long-horizon continuity as a production pattern [AnthropicHarness]. The contribution here is therefore narrower than “trajectories matter.” It is a state-integrity decomposition of what trajectory-level eval should measure for long-running agents specifically: four dimensions organized around the three AEI domains (Knowing, Doing, Deciding), extended by a fifth, cross-cutting dimension for multi-agent coordination fidelity.

This paper is the evaluation-methodology counterpart to the runtime and AEI sequence: memory generation [TwoStage], memory taxonomy [FourTier], skill and tool quality [Quality], runtime structure [Runtime], and now trajectory-level eval.

The paper is conceptual. It does not yet supply a benchmark. Its contribution is to name, with enough precision to be implemented against, what a trajectory-level eval needs to measure. The five dimensions are: epistemic drift rate, interruption and resumption fidelity, mid-task replanning quality, goal drift detection, and multi-agent handoff fidelity. The how — specific harnesses, metric formalizations, adversarial trajectory generators — follows from the framework, not the other way around.

Section 2 develops three structural arguments for why turn-level evaluation fails for long-running agents, one per AEI domain. Section 3 proposes the five-dimension trajectory-level eval taxonomy that is this paper’s central contribution. Section 4 positions the taxonomy against current eval frameworks, grouped into closest trajectory-level neighbors and adjacent agent or tool evals. Section 5 addresses implementation considerations and sketches a harness blueprint. Section 6 states the research agenda.

1.1 Evaluation Object and Terminology

Four terms appear repeatedly in this paper and are often used interchangeably in the broader literature. I separate them here to avoid ambiguity in the taxonomy that follows.

Turn is one local interaction: a user input, the agent’s reasoning and tool use, and its response. This is the unit of analysis for turn-level evaluation. Trace is the observed event log of an agent’s execution — model calls, tool calls, inputs and outputs, guardrail decisions, handoff events. Trace is the operative definition in production trace-grading infrastructure [OpenAIEvals]. Episode (or trial) is one benchmark run from task issuance to completion or termination; benchmarks report results aggregated over episodes. Trajectory is the ordered evolution of the agent’s stateful objects over the episode: belief state, capability state, goal state, committed external effects, interruption and resumption events, and handoffs.

The distinction between trace and trajectory is the critical one. A trace is a flat record of events. A trajectory is the state object whose integrity those events reveal or fail to reveal. A trace is evidence; a trajectory is the object being evaluated. Turn-level eval grades individual turns within a trace. Trace-level grading — the direction production infrastructure is taking — grades properties of the event log. Trajectory-level eval, as proposed here, grades whether the stateful trajectory maintained integrity across the episode. The five dimensions in §3 define what that integrity requires.
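The trace/trajectory distinction can be made concrete with a minimal sketch. It assumes a hypothetical event schema: the names `TraceEvent`, `TrajectoryState`, `replay`, and the three event kinds are illustrative and are not taken from any existing tracing tool. The trace is the flat list of events; the trajectory is the sequence of state snapshots obtained by folding those events.

```python
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class TraceEvent:
    step: int
    kind: str  # hypothetical kinds: "belief_update", "goal_update", "tool_call"
    payload: dict[str, Any]


@dataclass
class TrajectoryState:
    beliefs: dict[str, Any] = field(default_factory=dict)
    goal: dict[str, Any] = field(default_factory=dict)
    committed_effects: list[str] = field(default_factory=list)


def replay(trace: list[TraceEvent]) -> list[TrajectoryState]:
    """Fold a flat event log into one state snapshot per event."""
    state = TrajectoryState()
    snapshots = []
    for event in sorted(trace, key=lambda e: e.step):
        if event.kind == "belief_update":
            state.beliefs.update(event.payload)
        elif event.kind == "goal_update":
            state.goal.update(event.payload)
        elif event.kind == "tool_call" and event.payload.get("external_effect"):
            # Only tool calls flagged as externally committing change this field.
            state.committed_effects.append(event.payload["effect_id"])
        snapshots.append(deepcopy(state))  # snapshots must not alias each other
    return snapshots
```

A trace-level grader would score properties of the `TraceEvent` list; a trajectory-level grader in the sense used here would score properties of the snapshot sequence that `replay` returns.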

2 Why Turn-Level Eval Fails for Long-Running Agents

Turn-level eval inherits three silent assumptions from the session-bounded systems it was designed for. Each maps directly to one AEI domain, and each breaks as soon as the agent operates across an extended trajectory. The argument is not that turn-level evaluation is wrong within its intended scope. The argument is that its scope was never defined in a way that contemplated long-running operation, and its structural blind spots are visible only once we ask it to.

2.1 The Staleness Blindness Problem (Knowing)

Turn-level eval asks: did the agent retrieve and use the right information at time T? It cannot ask: did the agent’s beliefs remain valid from T0 to T50? Staleness is not a property of any single turn. It is a property of the delta between when a belief was formed and when it is acted upon. Turn-level eval sees beliefs only in the act of being used, and at that moment the question of whether the belief is still current is either obvious (if the retrieval itself produced it) or invisible (if the belief was carried forward from an earlier turn).

Three distinct staleness types sit under this umbrella, and treating them as one flattens the diagnostic picture the eval needs to support.

World-state staleness is the case the field most often recognizes: the agent holds a belief about external reality — a price, a schedule, a policy, a person’s role — that was accurate when formed and has changed in the intervening time. Goal-parameter staleness is subtler and closer to the AEI failure vignette: the belief about what the task is, or about the constraints on how it should be done, has been quietly superseded by an update the agent has not observed. Capability-contract staleness is rarest but most dangerous: the agent’s model of what a tool does or what side effects it produces is out of date, and actions it treats as safe are no longer safe. A trajectory-level eval must be able to inject all three and measure response separately. A turn-level eval cannot inject any of them meaningfully, because the injection point is outside the turn’s frame.

2.2 The Idempotency Illusion (Doing)

Turn-level eval assumes each tool invocation can be evaluated in isolation. In a session, this assumption is usually benign: sessions are short, tools are fast, and the space of compounding effects is narrow enough to be covered by test-case coverage. Long-running operation breaks the assumption in three ways.

First, effects accumulate. A tool invocation that is individually correct may be part of a sequence whose cumulative effect is incorrect — three successive partial payments that together exceed an authorization threshold, for instance, or a chain of permission grants that in aggregate produce an access pattern no single grant would be approved for. Second, the safety profile of an action is trajectory-dependent: the same action may be safe at T1 and unsafe at T2 because of what happened between. Third — and this is the failure mode turn-level eval is structurally least equipped to catch — retry logic that is correct within a session can be catastrophically incorrect when the agent resumes after interruption. A naive retry issues a duplicate email, double-files a form, or re-charges a payment, because the turn-level correctness of “send email” does not depend on whether the email was already sent in a previous, now-forgotten turn.

Turn-level eval tests individual invocations for individual correctness. Compounding side-effect errors are invisible at that resolution. They live in the relationship between invocations, which is exactly the object turn-level eval cannot see.
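The duplicate-retry case can be illustrated with a minimal sketch of an idempotency check. Everything here is an assumption for illustration: `EffectLedger`, `run_once`, and `send_email` are hypothetical names, and the ledger is held in memory, whereas a real long-running agent would need it to survive interruption.

```python
import hashlib
import json
from typing import Any, Callable


class EffectLedger:
    """Hypothetical record of committed external effects (in-memory here)."""

    def __init__(self) -> None:
        self._committed: dict[str, Any] = {}

    @staticmethod
    def key(action: str, args: dict[str, Any]) -> str:
        # Same action with the same arguments always maps to the same key.
        blob = json.dumps({"action": action, "args": args}, sort_keys=True)
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def run_once(self, action: str, args: dict[str, Any],
                 execute: Callable[..., Any]) -> tuple[Any, bool]:
        """Return (result, executed_now); skip execution if already committed."""
        k = self.key(action, args)
        if k in self._committed:
            return self._committed[k], False
        result = execute(**args)
        self._committed[k] = result
        return result, True


sent: list[str] = []


def send_email(to: str, body: str) -> str:
    sent.append(to)
    return "sent"


ledger = EffectLedger()
args = {"to": "vendor@example.com", "body": "Proposal"}
first = ledger.run_once("send_email", args, send_email)
retry = ledger.run_once("send_email", args, send_email)  # e.g. after resumption
```

In this sketch the retry returns the recorded result without sending a second email. Keying on action and arguments alone would also suppress a legitimately repeated action, so a real design would need a richer key; the point is only that the check lives in the relationship between invocations, not inside either one.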

2.3 The Goal Validity Horizon (Deciding)

A turn-level eval asks: given the goal, did the agent take the locally correct next step? It presupposes the goal. It cannot ask: is the goal the agent is pursuing still valid? Goal drift — the progressive divergence between the agent’s active objective and what the user currently wants — is undetectable at turn scope, and by the time a turn-level eval would flag an error because the agent has done something concretely wrong, the root cause is fifteen steps upstream and no longer visible.

The case is made worse by a property peculiar to drift: the individual steps that constitute drift are, by construction, locally optimal given the goal as the agent currently understands it. A turn-level eval that tested each step would grade them all as reasonable. The drift is visible only if the eval can look at the arc of the trajectory and ask whether the objective the agent is optimizing now is the objective the user would endorse now. That is not a turn-level question. It is not even a purely technical question — it requires the eval to maintain its own model of the user’s evolving intent, which is a research problem in itself. But the first step is to acknowledge that the question exists and is not being asked by current eval methodology.

Table 1. Turn-level eval assumptions and the long-running realities that break them. Each row corresponds to one AEI domain; each row describes a structural blind spot rather than a benchmark-specific gap.
| Domain | Turn-level assumption | Long-running reality | What becomes invisible |
| --- | --- | --- | --- |
| Knowing | Belief used in this turn is the belief to evaluate | Beliefs persist across turns and silently go stale | The delta between when a belief was formed and when it is acted upon |
| Doing | Each tool invocation can be scored in isolation | Effects compound; safety is trajectory-dependent; retries cross session boundaries | The relationship between invocations, including duplicate and unsafe retries |
| Deciding | The goal is fixed; evaluate the local step | Goal validity conditions drift over the trajectory | Whether the objective being optimized is still the one the user endorses |

3 The Trajectory-Level Eval Taxonomy

If the failure modes of long-running agents live at trajectory scope, the eval dimensions that target those failure modes must live there too. I propose five. Four map to the three AEI domains (Knowing contributes one, Doing contributes one, Deciding contributes two — splitting explicit constraint change from implicit cumulative drift, because they surface differently and require different operationalizations). The fifth is cross-cutting and addresses a reality the AEI framework acknowledges but does not develop: the increasing prevalence of multi-agent execution, in which epistemic and capability state must be transferred faithfully across agent boundaries.

For each dimension I give four things: what it measures, an operationalization sketch concrete enough to argue with, a proposed metric, and the key failure mode the dimension is designed to surface.

3.1 Epistemic Drift Rate (Knowing)

What it measures. The rate at which an agent’s active beliefs diverge from ground truth over a task trajectory.

Operationalization sketch. Construct a trajectory of length N, with N ≥ 40 as a working lower bound for this dimension. Inject controlled ground-truth updates at pre-registered trajectory points (for example T10, T20, and T30), and reserve a post-injection observation window of W ≥ 5 steps after each update so that belief revision latency can be measured. Run this separately for each of the three staleness types catalogued in §2.1 (world-state, goal-parameter, capability-contract) rather than treating staleness as monolithic.
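The schedule constraints above can be checked mechanically before a trajectory is run. The following is a minimal sketch; `validate_schedule` and its parameter names are hypothetical, and the defaults simply restate the working bounds N ≥ 40 and W ≥ 5.

```python
def validate_schedule(n_steps: int, injection_points: list[int],
                      window: int = 5, min_length: int = 40) -> list[str]:
    """Return a list of problems with a pre-registered injection schedule."""
    problems = []
    if n_steps < min_length:
        problems.append(
            f"trajectory length {n_steps} is below the working lower bound {min_length}")
    if sorted(set(injection_points)) != list(injection_points):
        problems.append("injection points must be strictly increasing")
    for point in injection_points:
        if not 1 <= point <= n_steps:
            problems.append(f"injection point T{point} is outside the trajectory")
        elif point + window > n_steps:
            # Too close to the end: latency could not be observed for W steps.
            problems.append(
                f"injection point T{point} leaves fewer than {window} observation steps")
    return problems
```

For the example schedule, `validate_schedule(40, [10, 20, 30])` returns an empty list, while moving the last injection to T38 is rejected because only two steps remain in which a revision could be observed.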

Proposed metric. Belief revision latency — the number of trajectory steps between a ground-truth update and the agent’s first action consistent with the update. Measured per staleness type, reported as a distribution rather than a single number, because the pathology of the distribution matters as much as its mean (an agent that revises beliefs within three steps on average but has a long tail of never-revised beliefs is qualitatively more dangerous than one with a higher mean and a thin tail).

Scoring treatment. The metric is censored at trajectory end if no update-consistent action occurs; censored values are reported separately from observed latencies rather than folded into the mean. Stale actions are severity-weighted by scope: internal-only reasoning counts at weight 1, actions that alter the plan at weight 2, and actions that commit external side effects at weight 3. The unweighted distribution is preserved for diagnostic resolution; the severity-weighted aggregate is the headline score. This distinguishes an agent that holds many stale beliefs without acting on them from one that commits externally on a single stale belief — two qualitatively different failures a naive latency mean would confound.
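A minimal sketch of this scoring treatment follows. The record layout (`InjectionOutcome`) and the choice of a simple weighted sum as the aggregate are assumptions made for illustration; the text fixes the weights 1, 2, and 3 but not the aggregation formula.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

# Severity weights by scope of the stale action, as stated in the text.
SEVERITY_WEIGHT = {"internal": 1, "plan": 2, "external": 3}


@dataclass
class InjectionOutcome:
    staleness_type: str                    # "world_state", "goal_parameter", ...
    injected_at: int                       # step of the ground-truth update
    first_consistent_step: Optional[int]   # None means censored at trajectory end
    stale_action_scopes: list[str] = field(default_factory=list)


def summarize(outcomes: list[InjectionOutcome]) -> dict[str, dict]:
    """Per staleness type: observed latencies, censored count, weighted stale actions."""
    report = defaultdict(
        lambda: {"observed_latencies": [], "censored": 0, "weighted_stale": 0})
    for o in outcomes:
        row = report[o.staleness_type]
        if o.first_consistent_step is None:
            row["censored"] += 1  # reported separately, never folded into a mean
        else:
            row["observed_latencies"].append(o.first_consistent_step - o.injected_at)
        # Assumed aggregate: sum of severity weights over stale actions.
        row["weighted_stale"] += sum(SEVERITY_WEIGHT[s] for s in o.stale_action_scopes)
    return dict(report)
```

Keeping `observed_latencies` as a list rather than a mean preserves the distribution, so a long tail of never-revised beliefs shows up in `censored` instead of disappearing into an average.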

Key failure mode. Superseded-state persistence — the agent continues to act on a superseded belief for the remainder of the trajectory, because the update was observationally accessible but not explicitly surfaced.

Why existing evals miss this. GAIA and τ-bench test single-session retrieval; world-state during evaluation is static. Neither injects mid-trajectory ground-truth updates, because both were designed under the implicit freshness assumption the AEI framework names explicitly. Constructing mid-trajectory updates is not a minor extension to these benchmarks — it is a category shift.

3.2 Interruption and Resumption Fidelity (Doing)

What it measures. Whether an agent can be interrupted mid-task and resume without state corruption, unsafe retries, or capability-state inconsistency.

Operationalization sketch. At a set of randomly-sampled trajectory midpoints, interrupt the agent, clear its in-context state entirely, and restart it with access only to its durable capability-state record and original goal. Observe whether it (a) retries actions that have already been completed, (b) correctly infers what remains from the capability-state record, (c) avoids re-committing side effects to external systems, and (d) correctly distinguishes actions that are safe to resume from actions that require confirmation. The sampled midpoints should be chosen adversarially rather than uniformly — points immediately after side-effect-bearing actions are the most diagnostic.
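One way to bias interrupt sampling toward the most diagnostic points is sketched below. The function name, the boolean-flag input, and the use of a seeded sampler are all assumptions for illustration; the only idea taken from the text is that interrupts should land immediately after side-effect-bearing actions.

```python
import random


def adversarial_interrupt_points(has_side_effect: list[bool], k: int,
                                 seed: int = 0) -> list[int]:
    """Pick up to k interrupt points, each directly after a side-effect-bearing step.

    has_side_effect[i] is True if step i committed an external effect.
    A returned value p means: interrupt before step p runs (0-based).
    """
    last = len(has_side_effect) - 1
    # The final step is excluded: nothing would remain to resume after it.
    candidates = [i + 1 for i, flagged in enumerate(has_side_effect)
                  if flagged and i < last]
    rng = random.Random(seed)  # seeded so the sampled schedule is reproducible
    return sorted(rng.sample(candidates, min(k, len(candidates))))
```

Sampling only from the candidate set, rather than uniformly over all steps, concentrates the eval budget on the moments where a naive retry is most likely to re-commit an effect.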

Proposed metric. Resumption consistency score, decomposed into three sub-scores: duplicate-action error rate (the fraction of already-completed actions the agent re-attempts; lower is better), remaining-work recall (the fraction of still-required actions it correctly identifies; higher is better), and side-effect recommitment error rate (the fraction of already-committed external effects the agent attempts to re-commit; lower is better). A headline resumption fidelity score can be derived as 1 − weighted error across the three sub-scores, with remaining-work recall entering as its complement (1 − recall) so that all three terms are error rates. Reporting sub-scores separately is important because they trade off and the aggregate is less informative than the decomposition.

Scoring treatment. Each sub-score has a natural denominator: duplicate-action error rate over the set of already-completed actions at the interrupt point; remaining-work recall over the set of remaining actions; side-effect recommitment error rate over the set of already-committed external effects. Rates are reported per trajectory and aggregated across trajectories; reporting only the aggregate hides trajectory-level pathology. False-positive aversion — the agent refusing to resume actions that are safe to resume — is captured as an inverse penalty on remaining-work recall, so a maximally timid agent that completes nothing does not post artificially clean error-rate scores.
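The three sub-scores and their denominators can be sketched for a single interrupt point as follows. The function and argument names are hypothetical, actions are represented as plain string identifiers, the equal default weights are an assumption, and remaining-work recall is converted to a miss rate before it enters the headline score.

```python
def _fraction(hits: set, pool: set, if_empty: float) -> float:
    """|hits ∩ pool| / |pool|, with a stated fallback when the pool is empty."""
    return len(hits & pool) / len(pool) if pool else if_empty


def resumption_scores(completed: set, remaining: set, committed_effects: set,
                      attempted: set, identified_remaining: set,
                      weights: tuple = (1 / 3, 1 / 3, 1 / 3)) -> dict:
    duplicate_rate = _fraction(attempted, completed, 0.0)           # lower is better
    recall = _fraction(identified_remaining, remaining, 1.0)        # higher is better
    recommit_rate = _fraction(attempted, committed_effects, 0.0)    # lower is better
    w_dup, w_miss, w_recommit = weights
    # Recall is a "higher is better" quantity, so its complement is used as the error.
    weighted_error = (w_dup * duplicate_rate
                      + w_miss * (1 - recall)
                      + w_recommit * recommit_rate)
    return {
        "duplicate_action_error_rate": duplicate_rate,
        "remaining_work_recall": recall,
        "side_effect_recommitment_error_rate": recommit_rate,
        "headline_fidelity": 1 - weighted_error,
    }
```

An agent that attempts nothing after resumption posts zero on both error rates but also zero recall, so its headline score is pulled down by the miss term rather than looking artificially clean.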

Key failure mode. Phantom re-execution — the agent re-runs an action that has already produced external side effects, because its capability-state record is incomplete, because it was never written to be durable, or because the agent does not consult it on resumption.

Connection to AEI. This dimension directly operationalizes Prescription 2 of the AEI whitepaper: capability state as a durable runtime artifact. A system whose capability state is ephemeral cannot score well on this dimension; a system whose capability state is durable and queryable can, if the agent is correctly architected to use it. The eval is therefore simultaneously a diagnostic of the architecture and of the agent’s behavior within it.

3.3 Mid-Task Replanning Quality (Deciding)

What it measures. Whether an agent produces coherent, validity-preserving replans when mid-task conditions change through an explicit, observable trigger event.

Operationalization sketch. Inject mid-task constraint changes at pre-registered trajectory midpoints — a new deadline, a revoked permission, a superseding instruction, a capability that becomes unavailable. Unlike the silent updates of §3.1, these are explicit and visible: they arrive as channel events the agent ought to notice. Measure three things: whether the agent detects the constraint change, whether it pauses and replans rather than continuing under the prior assumption, and whether the new plan is coherent with both the original goal and the injected constraint.

Proposed metric. Replan coherence score, produced by human or LLM-as-judge assessment of whether the new plan preserves original intent while correctly incorporating the new constraint. The judging protocol is itself nontrivial — see §5 — but the dimension can be defined and worked against before the judging is fully solved.

Scoring treatment. The judge evaluates three properties separately: constraint incorporation (is the injected constraint reflected in the new plan), intent preservation (does the plan still pursue the original goal where compatible with the new constraint), and internal coherence (is the plan self-consistent). All three must pass for the replan to score as coherent; partial passes are reported on the sub-property that failed rather than collapsed into a scalar. Trajectories in which no replan occurs because no replan was warranted are excluded from the denominator; trajectories in which a replan was warranted but did not occur are scored zero on all three sub-properties, so silent continuation is penalized rather than evaded.
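The scoring rules in this paragraph can be written down as a minimal sketch. `ReplanJudgement` is a hypothetical record of a judge's verdict for one trajectory; how the judge arrives at the three booleans is out of scope here. The text does not say how to treat a replan that occurs when none was warranted, so this sketch excludes every unwarranted case, which is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

SUB_PROPERTIES = ("constraint_incorporated", "intent_preserved", "internally_coherent")


@dataclass
class ReplanJudgement:
    replan_warranted: bool
    replan_occurred: bool
    constraint_incorporated: bool = False
    intent_preserved: bool = False
    internally_coherent: bool = False


def failed_sub_properties(j: ReplanJudgement) -> Optional[list[str]]:
    """None if excluded from the denominator; otherwise the sub-properties that failed."""
    if not j.replan_warranted:
        return None                      # excluded, not counted as a pass
    if not j.replan_occurred:
        return list(SUB_PROPERTIES)      # silent continuation fails all three
    return [name for name in SUB_PROPERTIES if not getattr(j, name)]


def coherence_rate(judgements: list[ReplanJudgement]) -> Optional[float]:
    """Fraction of scored trajectories in which all three sub-properties pass."""
    scored = [f for f in map(failed_sub_properties, judgements) if f is not None]
    if not scored:
        return None
    return sum(1 for failures in scored if not failures) / len(scored)
```

Returning the list of failed sub-properties, rather than a scalar, keeps partial passes attributable to the property that failed.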

Key failure mode. Silent continuation — the agent detects the new constraint, acknowledges it in its execution trace, explicit plan, or state log, and then continues on a trajectory that now violates it. This is the behavior most consistent with current tuning pressures: models rewarded for task completion learn to treat constraints as obstacles to work around rather than invariants to replan under.

3.4 Goal Drift Detection Over Long Horizons (Deciding)

What it measures. Whether an agent’s active objective remains aligned with the user’s current intent over multi-step trajectories in which intent shifts implicitly and cumulatively, without a single discrete trigger event.

Operationalization sketch. Construct trajectories where the user’s effective preference shifts gradually across N steps — a cost-reduction priority that gives way to speed priority as a deadline slips closer; a privacy-preserving preference that gives way to convenience as the user provides increasingly granular context; a quality-first mandate that quietly relaxes as time pressure mounts. The shifts are not announced. They must be inferred from the cumulative texture of the interaction. Measure: how many steps elapse before the agent’s actions reflect the new priority? And, critically: does the agent surface the inferred intent change to the user for confirmation, or does it silently reoptimize against the new objective without announcing that it has done so?

Proposed metrics. Two are needed, because they trade off. Goal alignment latency — how many steps between a meaningful shift in user intent and the agent’s first action consistent with it. Surface rate — what fraction of detected intent changes the agent explicitly surfaces for confirmation versus silently absorbs. An agent with low latency and low surface rate is responsive but epistemically opaque; an agent with high latency and high surface rate is slow but steerable. Neither dominates, and the evaluation should report both.

Scoring treatment. Surface rate measured naively rewards agents that ask for confirmation on everything, which is neither useful nor steerable. The metric therefore decomposes into surface precision (fraction of surfaced intent changes that correspond to genuine material shifts) and surface recall (fraction of genuine material shifts that are surfaced). Materiality is defined by whether the shift would, if unacted upon, produce a trajectory the user would not endorse at the eval endpoint — operationalized via pre-registered ground-truth intent labels at each trajectory point. Alignment latency is itself materiality-gated: latency on non-material drift is not penalized, so an agent is not punished for continuing through inconsequential noise. This scoring makes it impossible to post strong numbers by either silent smoothness or reflexive confirmation-asking; both collapse into the precision/recall frontier.
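A minimal sketch of the precision/recall decomposition and the materiality gate follows. It assumes that intent shifts carry identifiers, that the set of material shifts comes from the pre-registered ground-truth labels, and that anything the agent surfaces which matches no material shift has its own identifier; all function names are hypothetical.

```python
from typing import Optional


def surface_precision_recall(
        material: set[str], surfaced: set[str]
) -> tuple[Optional[float], Optional[float]]:
    """Precision and recall of surfaced intent changes against material shifts."""
    true_positives = len(material & surfaced)
    precision = true_positives / len(surfaced) if surfaced else None
    recall = true_positives / len(material) if material else None
    return precision, recall


def gated_latencies(latency_by_shift: dict[str, int], material: set[str]) -> list[int]:
    """Keep alignment latency only for material shifts; non-material drift is ignored."""
    return [latency for shift, latency in sorted(latency_by_shift.items())
            if shift in material]
```

With three material shifts, an agent that surfaces two of them plus two non-material changes gets precision 0.5 and recall 2/3. An agent that surfaces nothing gets recall 0, and one that asks about everything gets high recall at the cost of precision, which is the frontier the scoring treatment describes.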

Key failure mode. Silent reoptimization — the agent correctly infers that user intent has shifted and smoothly pivots, but without surfacing the pivot for confirmation. The user experiences this as the agent “reading their mind” when it succeeds and as inexplicable divergence when it fails; either outcome undermines the ability to steer the agent through the period in which the intent change is genuinely ambiguous.

Distinction from §3.3. Replanning quality tests response to an explicit constraint injection — a discrete event the agent ought to notice. Goal drift detection tests response to cumulative implicit shift without any single discrete trigger. The distinction matters because the architectural mechanisms that succeed at the first can silently fail the second: an agent that monitors channel events for constraint changes will not detect drift that never produces a channel event.

3.5 Multi-Agent Handoff Fidelity (Cross-Cutting)

What it measures. Whether goal state, capability state, and epistemic state are faithfully transferred when task execution is handed from one agent to another — or from agent to human and back — without the receiving agent filling transfer gaps with plausible but incorrect inferences.

Operationalization sketch. Construct multi-agent pipelines in which a task is decomposed and handed off at predetermined trajectory points. At each handoff, vary the completeness of state transfer along three axes: knowledge (what beliefs are passed, and with what uncertainty annotations), capability state (what record of completed and remaining actions is passed), and goal (what specification of the objective, including validity conditions, is passed). Test not only clean handoffs but adversarial ones in which the transferred state is deliberately incomplete or ambiguous. Measure whether the receiving agent detects the incompleteness, asks for clarification, or confabulates a plausible completion and proceeds.

Proposed metric. Handoff fidelity score, decomposed into three parallel sub-scores: knowledge fidelity (did the receiving agent inherit the correct belief state, including uncertainty annotations, and recognize gaps rather than fill them), capability-state fidelity (did it inherit the correct record of completed and remaining actions), goal fidelity (did it receive and correctly interpret the goal, including validity conditions and constraints). As with resumption fidelity, the decomposition matters — an agent that scores well on knowledge and capability state but poorly on goal fidelity fails in a qualitatively different way than the inverse.

Scoring treatment. Each sub-score is evaluated on an adversarial scenario set in which the transferred state is deliberately incomplete or ambiguous. At each incompleteness point, the receiving agent’s behavior is classified: detecting the incompleteness scores 1, requesting clarification scores 1, confabulating a plausible completion and proceeding scores 0. Receiving agents that request clarification on complete handoffs are not penalized on the fidelity sub-scores — that behavior is captured on a separate clarification-cost axis instead. This prevents the scoring from rewarding silent competence over surfaced uncertainty, which would be exactly the failure mode the dimension is designed to detect.
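The classification rule can be sketched as follows. The behavior labels, the axis names, and the definition of clarification cost as a simple ratio are assumptions for illustration; the text specifies only the 1/1/0 scoring and that clarification on complete handoffs is tracked on a separate axis.

```python
from collections import defaultdict

# Scores per classified behavior at an incompleteness point, as stated in the text.
BEHAVIOR_SCORE = {"detected": 1, "clarification_requested": 1, "confabulated": 0}


def handoff_fidelity(observations: list[tuple[str, str]]) -> dict[str, float]:
    """Mean score per axis ("knowledge", "capability", "goal") over incompleteness points."""
    per_axis = defaultdict(list)
    for axis, behavior in observations:
        per_axis[axis].append(BEHAVIOR_SCORE[behavior])
    return {axis: sum(scores) / len(scores) for axis, scores in per_axis.items()}


def clarification_cost(requests_on_complete: int, complete_handoffs: int) -> float:
    """Separate axis (assumed ratio): share of complete handoffs that drew a request."""
    return requests_on_complete / complete_handoffs if complete_handoffs else 0.0
```

Because `clarification_cost` is computed from a different scenario set than the fidelity sub-scores, an over-cautious receiving agent shows up there without dragging down its fidelity numbers.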

Key failure mode. Handoff hallucination — the receiving agent fills gaps in transferred state with plausible but incorrect inferences, producing coherent behavior that diverges from the original task intent. This is the multi-agent analogue of single-agent confabulation, and it compounds: in a pipeline of k handoffs, small confabulations at each handoff accumulate into a final state arbitrarily distant from the originating intent.
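A worked illustration of the compounding, under a deliberately simple assumption that is not made elsewhere in this paper: suppose each handoff independently introduces a confabulation with probability ε. The probability that the state reaches the end of the pipeline with no confabulation at all is then:

```latex
P(\text{no confabulation after } k \text{ handoffs}) = (1 - \varepsilon)^{k},
\qquad
\varepsilon = 0.05,\; k = 10 \;\Rightarrow\; 0.95^{10} \approx 0.60
```

Even a five percent per-handoff confabulation rate would, under this independence assumption, leave roughly a 40 percent chance that a ten-handoff pipeline ends with at least one confabulated element. Real handoffs are unlikely to be independent, so the figure is illustrative only.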

Connection to AEI. Current trace infrastructure can record and grade whether handoffs occurred, but it generally treats handoff as workflow behavior, not as a state-preservation object [OpenAIEvals]. The missing eval target is whether the receiving agent inherited belief state, capability state, and goal state faithfully enough to continue the trajectory without confabulation. No existing benchmark treats those three as a jointly-evaluated state-transfer integrity object; that is what this dimension addresses. It is also the multi-agent extension of all three AEI primitives simultaneously — the uncertainty audit trail, the capability-state record, and the prospective memory surface each need to survive the handoff intact.

Table 2. The five-dimension trajectory-level eval taxonomy. Each dimension targets a failure mode invisible to turn-level evaluation and maps to an AEI domain or, in the case of handoff fidelity, to all three simultaneously.
| Dimension | Domain | Proposed metric | Key failure mode | Gap in existing evals |
| --- | --- | --- | --- | --- |
| Epistemic drift rate | Knowing | Belief revision latency | Superseded-state persistence | World-state is static during eval |
| Interruption / resumption fidelity | Doing | Resumption consistency score | Phantom re-execution | Interruption and multi-session persistence not tested |
| Mid-task replanning quality | Deciding | Replan coherence score | Silent continuation | Mid-task constraint injection not supported |
| Goal drift detection | Deciding | Goal alignment latency + surface rate | Silent reoptimization | No model of evolving user intent |
| Multi-agent handoff fidelity | Cross-cutting | Handoff fidelity score (knowledge / capability / goal) | Handoff hallucination | Handoff treated as input, not as eval target |

4 Positioning Against Existing Eval Frameworks

The taxonomy proposed here is complementary to existing frameworks, not competitive with them. Trace- and trajectory-level evaluation is no longer missing from the field; the question is what shape it should take. Existing benchmarks now partially instantiate trajectory-aware evaluation, especially in safety and tool-calling domains. What they do not yet provide is an AEI-level decomposition of long-running state integrity: belief freshness, durable capability state, goal-validity lifecycle, and cross-agent state transfer. I organize the comparison into two groups.

4.1 Closest Trajectory-Level Neighbors

ATBench [ATBench] is the closest academic neighbor to the framework proposed here. It is explicitly a trajectory-level safety benchmark — a thousand trajectories constructed to surface risks that emerge over long-context execution, with delayed-trigger scenarios designed to catch unsafe behavior that turn-level evaluation would miss. Its coverage is strongest on Doing and Deciding safety: unsafe tool invocations and risky decisions that compound. Its structural gap, relative to this paper, is that it does not decompose trajectory integrity into the five dimensions proposed here. Epistemic drift rate with controlled ground-truth injection, resumption consistency under interruption, and handoff fidelity are not first-class targets of the current ATBench design. The taxonomy is complementary: ATBench provides concrete adversarial trajectories; this paper provides the decomposition that structures what the benchmark should measure.

FinTrace [FinTrace] operationalizes holistic trajectory-level evaluation in the financial tool-calling domain, with a set of trajectory-level metrics over expert-annotated trajectories. Its framing — measure the arc, not the step — is exactly the framing this paper argues for in the general agentic case. Its scope is the financial domain and the tool-calling subset of Doing. The taxonomy generalizes that framing across Knowing, Doing, Deciding, and handoff, and gives FinTrace-style evaluation a domain-independent scaffold.

Production trace-grading infrastructure. OpenAI’s agent evaluation tooling defines traces as end-to-end records of model calls, tool calls, guardrails, and handoffs, and supports trace-level grading as a first-class operation [OpenAIEvals]. Anthropic’s harness work for long-running agents emphasizes continuity across many context windows and structured artifacts as the substrate for long-horizon execution [AnthropicHarness]. These are engineering patterns, not benchmarks, but they matter here because they establish trace-level grading and long-horizon harness as production realities, not research proposals. The taxonomy proposed in this paper gives that engineering direction a state-integrity target: what the grading should measure, and what properties the harness must expose to make the measurement tractable.

4.2 Adjacent Agent and Tool Evals

τ-bench [τ-bench] is strong on session-level tool-correctness in realistic domains. Its blind spot is the long-running regime: it does not test capability-state accumulation across resumptions, inject mid-trajectory world-state updates, or vary handoff completeness. Extending τ-bench in any of these directions is a category shift rather than a parameter adjustment.

GAIA [GAIA] tests reasoning and tool use in single-session settings with static world state. Its regime is session-bounded by design; there is no adjustment to the GAIA setup that would produce a measurement of belief revision latency or resumption consistency, because the phenomena those metrics target do not occur in a single session with static state.

AgentBench [AgentBench] evaluates agents across multiple environments with per-task success as the aggregate metric. Horizontal coverage is the contribution; trajectory depth is the blind spot.

When2Call [When2Call] addresses a question that is orthogonal to the taxonomy but important for it: when an agent should decline to call a tool. Restraint is a trajectory-level property, because the correct decision not to act is visible only against the alternative trajectories it precludes. MCPVerse [MCPVerse] scales the tool-use benchmark surface; combined with trajectory-level metrics, it could make trajectory evaluation feasible at the scale at which current agentic systems operate.

The taxonomy also sits downstream of agent-architecture work rather than competing with it. ReAct [ReAct] and Reflexion [Reflexion] shaped the action/observation/self-correction loop that underlies how agents are built; CoALA [CoALA] and MemGPT [MemGPT] helped formalize memory and cognitive architecture for language agents; prospective-memory theory [Brandimonte] supplies the cognitive baseline for the forward-looking obligation primitive that AEI makes architectural. The present paper asks how those capabilities should be evaluated once the unit of analysis is a long-running trajectory rather than a single session or task — a question the architecture literature was not designed to answer.

4.3 Coverage Comparison

Table 3. Coverage comparison across adjacent evaluation frameworks. “Partial” indicates the framework supports the capability in a restricted form (e.g., delayed triggers rather than arbitrary mid-trajectory injection). The proposed taxonomy is the bottom row; it is not itself a benchmark, but it specifies what a benchmark covering all five dimensions would measure.
| Framework | Eval unit | State injection | Interruption / resumption | Goal drift | Handoff fidelity | AEI coverage |
| --- | --- | --- | --- | --- | --- | --- |
| ATBench | Trajectory | Partial | No | No | No | Doing / Deciding safety |
| FinTrace | Trajectory | No | No | No | No | Doing (tool calling) |
| OpenAI trace grading | Trace | No | No | No | Surface only | Engineering substrate |
| Anthropic harness | Episode | N/A | Engineering pattern, not standardized eval | No | No | Engineering substrate |
| τ-bench | Turn / session | No | No | No | No | Session Doing |
| GAIA | Episode (single session) | No | No | No | No | Session Knowing / Deciding |
| AgentBench | Episode | No | No | No | No | Breadth |
| When2Call | Turn (restraint) | No | No | No | No | Doing (refusal) |
| MCPVerse | Turn (tool call) | No | No | No | No | Doing (scale) |
| This paper | Trajectory (state integrity) | Yes (three staleness types) | Yes | Yes | Yes | Full AEI + cross-cutting |

The positioning claim is therefore modest and defensible: no single current benchmark covers all five dimensions proposed here, but several instantiate one or two of them partially, and the field’s direction of travel is toward trajectory-aware evaluation. The taxonomy is offered as a scaffold around which that direction of travel can be organized.

5 Implementation Considerations and a Harness Blueprint

A taxonomy that cannot be implemented against is decorative. This section addresses four implementation considerations that are, in my judgment, the concrete research problems that gate moving from this framework to a deployable eval harness, and closes with a sketch of the harness architecture itself.

5.1 Ground-Truth Injection Without Breaking Ecological Validity

The epistemic drift rate dimension requires mid-trajectory ground-truth updates. Naive injection — a message to the agent saying “the price has changed” — collapses into testing reading comprehension rather than belief revision. The more interesting injection is indirect: a document the agent has access to is updated; a tool’s return values begin reflecting the new state; a peripheral channel carries the update in a form the agent must notice without being told to look. Designing injection protocols that exercise genuine belief revision rather than explicit instruction following is a research problem worth its own paper.
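
A minimal sketch of the indirect style of injection described above, assuming the harness owns a world state that tools read from. The names `Injection`, `World`, `make_tool`, and the `vendor_price` key are hypothetical and purely illustrative. The point it shows is that the update reaches the agent only through a changed tool return value, never as a message addressed to it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Injection:
    """A pre-registered world-state update (hypothetical schema)."""
    step: int   # trajectory step at which the update takes effect
    key: str    # which fact in the world state changes
    value: Any  # the new authoritative value


@dataclass
class World:
    """Authoritative world state plus a record of which updates have landed."""
    state: Dict[str, Any]
    pending: List[Injection] = field(default_factory=list)
    applied: List[Injection] = field(default_factory=list)

    def advance_to(self, step: int) -> None:
        # Apply every injection whose step has been reached. Nothing is
        # sent to the agent: the change is visible only through tools.
        due = sorted((i for i in self.pending if i.step <= step),
                     key=lambda i: i.step)
        for inj in due:
            self.state[inj.key] = inj.value
            self.applied.append(inj)
        self.pending = [i for i in self.pending if i.step > step]


def make_tool(world: World, key: str) -> Callable[[], Any]:
    """Build a tool that reads the live world state on every call."""
    def tool() -> Any:
        return world.state[key]
    return tool


# Illustrative scenario: the vendor's price changes silently at step 12.
world = World(
    state={"vendor_price": 100},
    pending=[Injection(step=12, key="vendor_price", value=115)],
)
get_price = make_tool(world, "vendor_price")

world.advance_to(5)
price_before = get_price()  # still the original value
world.advance_to(12)
price_after = get_price()   # now reflects the injected update
```

Whether the agent re-queries the tool after step 12, and whether it revises downstream plans when it does, is what the dimension would then measure. The sketch covers only the tool-return channel; updated documents and peripheral channels would need their own injectors.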

5.2 Minimum Trajectory Lengths

Each dimension has a length threshold below which the measurement is not meaningful. Epistemic drift rate requires a trajectory long enough for multiple ground-truth updates to be injected and responded to, with a meaningful observation window after each — I suggest N ≥ 40 as a working threshold. Goal drift detection requires enough cumulative interaction for intent shift to be plausibly inferable — I suggest N ≥ 20. Interruption and resumption fidelity requires multi-session persistence — minimally two sessions separated by full in-context state loss. Handoff fidelity requires at least two agents in a pipeline and is more diagnostic with three or more. Below these thresholds, the measurement can appear to succeed for reasons orthogonal to what the dimension is designed to surface.
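
The thresholds above can be written down as a pre-flight check that a harness runs before scoring a trajectory. This is a sketch under stated assumptions: `TrajectorySpec` and `measurable_dimensions` are hypothetical names, "multiple ground-truth updates" is read as at least two, and mid-task replanning quality is left out because no threshold is suggested for it here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TrajectorySpec:
    steps: int       # N, the number of trajectory steps
    sessions: int    # sessions separated by full in-context state loss
    agents: int      # agents in the pipeline
    injections: int  # mid-trajectory ground-truth updates


def measurable_dimensions(spec: TrajectorySpec) -> Dict[str, bool]:
    """Report which dimensions this trajectory is long enough to measure."""
    return {
        # N >= 40 is the suggested working threshold; requiring at least
        # two injections is this sketch's reading of "multiple".
        "epistemic_drift_rate": spec.steps >= 40 and spec.injections >= 2,
        # N >= 20 is the suggested threshold for inferable intent shift.
        "goal_drift_detection": spec.steps >= 20,
        # Minimally two sessions with in-context state lost in between.
        "interruption_resumption_fidelity": spec.sessions >= 2,
        # At least two agents; three or more is more diagnostic.
        "handoff_fidelity": spec.agents >= 2,
    }


spec = TrajectorySpec(steps=30, sessions=2, agents=2, injections=3)
report = measurable_dimensions(spec)
```

For this example spec, the check marks epistemic drift rate as not measurable (30 steps is under the suggested 40) and the other three as measurable. A harness could use the report to suppress scores that would otherwise appear to succeed for the wrong reasons.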

5.3 LLM-as-Judge at Trajectory Scope

Turn-level LLM-as-judge protocols are well-studied: the judge is shown an input and an output and asked to score. Trajectory-level judging is a different object. The judge must reason about state across N steps, hold in mind what the user originally asked and how their stated or implied intent has shifted, and assess whether the agent’s cumulative behavior preserves intent under constraint change. This is a research problem — not an engineering problem — because the judge’s capability to do this assessment is itself a function of the same long-context and state-tracking limitations that motivate the taxonomy in the first place. An honest implementation will acknowledge that early trajectory-level evals will be partially judge-limited and will need to combine automated judging with human spot-checking until judging capability catches up.
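
One way to make the combination of automated judging and human spot-checking concrete is to sample a reproducible subset of judged trajectories for human review and track how often the two agree. The sketch below assumes binary pass/fail verdicts keyed by trajectory ID; the function names and the sampling scheme are illustrative assumptions, not a protocol this paper prescribes.

```python
import math
import random
from typing import Dict, List


def sample_for_spot_check(trajectory_ids: List[str], rate: float,
                          seed: int = 0) -> List[str]:
    """Pick a reproducible random subset of judged trajectories for humans."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    k = min(len(trajectory_ids), math.ceil(len(trajectory_ids) * rate))
    rng = random.Random(seed)  # seeded so the audit set can be re-derived
    return sorted(rng.sample(trajectory_ids, k))


def agreement_rate(judge: Dict[str, bool], human: Dict[str, bool]) -> float:
    """Fraction of spot-checked trajectories where judge and human agree."""
    shared = [tid for tid in human if tid in judge]
    if not shared:
        raise ValueError("no trajectories were scored by both judge and human")
    return sum(judge[tid] == human[tid] for tid in shared) / len(shared)


ids = [f"traj-{i}" for i in range(10)]
audit_set = sample_for_spot_check(ids, rate=0.5, seed=1)

judge_verdicts = {"traj-a": True, "traj-b": False, "traj-c": True}
human_verdicts = {"traj-a": True, "traj-b": True}
agreement = agreement_rate(judge_verdicts, human_verdicts)
```

A falling agreement rate on long trajectories would be one observable symptom of the judge-limited regime described above. Raw agreement is the simplest possible statistic; a real protocol would likely want a chance-corrected measure.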

5.4 Annotation Cost and Partial Automation

Trajectory-level eval is far more expensive per data point than turn-level eval. By my rough estimate, a single trajectory of N = 30 steps with three ground-truth injections and a handoff costs about an order of magnitude more to annotate than the same number of turns graded individually. This cost pressure is real, and the field will need evaluation architectures that partially automate trajectory assessment. The two-stage memory generation architecture I described in earlier work [TwoStage] is a pattern worth applying here: asynchronous, offline processing over the trajectory to extract structured evaluation signals, followed by online human or judge review of only the signals that warrant it. The same logic that made memory generation tractable at scale can make trajectory evaluation tractable at scale; the mechanism differs, but the architectural pattern carries over.
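
The two-stage pattern can be sketched as a cheap offline scan that emits structured signals, followed by a queue that forwards only flagged trajectories to a judge or human. Everything here is an illustrative assumption: the `Step` and `Signal` schemas, the action strings, and the single signal type (a repeated action that committed an external effect) stand in for whatever signals a real harness would extract.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    index: int
    action: str      # e.g. "send_email:proposal" (hypothetical label)
    committed: bool  # True if the step produced an external side effect


@dataclass
class Signal:
    kind: str
    step_indices: List[int]


def extract_signals(steps: List[Step]) -> List[Signal]:
    """Stage 1 (offline): structural scan with no judge or human involved."""
    signals: List[Signal] = []
    first_seen: Dict[str, int] = {}
    for step in steps:
        if not step.committed:
            continue
        if step.action in first_seen:
            # The same committed action appears again later in the run.
            signals.append(Signal("repeated_committed_action",
                                  [first_seen[step.action], step.index]))
        else:
            first_seen[step.action] = step.index
    return signals


def review_queue(trajectories: Dict[str, List[Step]]) -> Dict[str, List[Signal]]:
    """Stage 2 input: keep only trajectories that raised at least one signal."""
    queue: Dict[str, List[Signal]] = {}
    for tid, steps in trajectories.items():
        signals = extract_signals(steps)
        if signals:
            queue[tid] = signals
    return queue


trajectories = {
    "renewal-a": [Step(0, "call_tool:pricing", False),
                  Step(1, "send_email:proposal", True),
                  Step(2, "send_email:proposal", True)],
    "renewal-b": [Step(0, "call_tool:pricing", False),
                  Step(1, "send_email:proposal", True)],
}
queue = review_queue(trajectories)
```

Only `renewal-a` reaches the review queue, and the reviewer is pointed at steps 1 and 2 rather than at the whole trajectory. The cost saving comes from that narrowing, so the quality of stage 1 signals bounds the quality of the whole pipeline.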

5.5 Harness Blueprint

A harness capable of evaluating all five dimensions decomposes into ten components. Each component has a single purpose and maps to a definite subset of the taxonomy. The blueprint is deliberately spare: each component is architectural, not implementation-specific, and multiple implementations can satisfy the same component contract. This is the stable-what treatment applied to the harness itself.

Table 4. Harness component blueprint. Each component is defined by its purpose and the taxonomy dimensions it supports. The mapping is many-to-many: most dimensions require multiple components, and most components contribute to multiple dimensions.
| Component | Purpose | Dimensions supported |
| --- | --- | --- |
| Scenario generator | Produce trajectories with pre-registered event structures, goal specifications, and validity conditions | All |
| Ground-truth ledger | Track authoritative world state, goal state, and user intent at each trajectory step | 3.1, 3.3, 3.4 |
| Event injector | Inject world-state updates, constraint changes, and implicit intent shifts at pre-registered points | 3.1, 3.3, 3.4 |
| Checkpoint / resumption manager | Interrupt agents mid-trajectory and restart them from durable state only, with in-context memory cleared | 3.2 |
| Side-effect sandbox | Capture external commitments so the harness can distinguish already-committed effects from safe-to-replay ones | 3.2 |
| Handoff corruptor | Produce adversarial handoff scenarios with partial or ambiguous state transfer | 3.5 |
| State probes | Query the agent’s belief, capability, and goal state at any trajectory point. Probes should read externalized runtime artifacts rather than prompting the agent to explain itself mid-run; otherwise the probe can perturb the trajectory being measured. | All |
| Deterministic graders | Score metrics with closed-form answers: latency, duplicate-action rate, side-effect commitment rate | 3.1, 3.2 |
| LLM judge | Score trajectory-scope properties requiring state reasoning: replan coherence, surface precision, handoff confabulation | 3.3, 3.4, 3.5 |
| Human adjudication queue | Review judge decisions flagged as ambiguous or low-confidence, with structured disagreement capture | 3.3, 3.4, 3.5 |

Two observations about the blueprint. First, the ground-truth ledger and the state probes are the load-bearing components. The scenario generator and event injector write into the ledger; the graders, judge, and adjudication queue read from it for scoring; and the probes supply the agent-side state that is compared against it. A harness whose ledger is weakly specified cannot evaluate any dimension soundly, because the eval reduces to comparing the agent’s state against an uncertain reference. Second, the deterministic-grader / LLM-judge split roughly tracks the taxonomy’s own split between closed-form and interpretive dimensions. This is a structural feature worth preserving: trajectories in which all dimensions can be scored deterministically are useful for regression testing, while trajectories requiring judge adjudication are where the field’s current evaluation research should concentrate.
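
To illustrate how the ledger and the probes combine in a deterministic grader, the sketch below computes a revision latency for a single tracked fact: the number of steps between a ledger update and the first probe at which the agent's externalized belief matches the new value. This is one plausible reading of a latency metric, not the paper's formal definition. The dictionary-based ledger and probe log, and the function name, are assumptions made for illustration.

```python
from typing import Any, Dict, Optional


def revision_latency(ledger: Dict[int, Any], probes: Dict[int, Any],
                     update_step: int) -> Optional[int]:
    """Steps from a ledger update until the agent's belief first matches it.

    ledger: step -> authoritative value written at that step
    probes: step -> belief read from the agent's externalized state
    Returns None if no probe at or after the update shows the new value.
    """
    new_value = ledger[update_step]
    for step in sorted(probes):
        if step >= update_step and probes[step] == new_value:
            return step - update_step
    return None


# Ledger: the price is 100 from step 0 and becomes 115 at step 12.
ledger = {0: 100, 12: 115}
# Probe log: beliefs read from runtime artifacts, not by asking the agent.
probes = {10: 100, 14: 100, 18: 115, 22: 115}

latency = revision_latency(ledger, probes, update_step=12)
never = revision_latency(ledger, {10: 100, 14: 100}, update_step=12)
```

Here `latency` is 6 and `never` is `None`. The grader is only as good as its two inputs: an imprecise ledger makes `new_value` unreliable, and probes that prompt the agent rather than read artifacts can change the very latency being measured.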

6 Conclusion

This paper proposes a trajectory-level eval taxonomy for long-running agentic workflows, organized around five dimensions: epistemic drift rate, interruption and resumption fidelity, mid-task replanning quality, goal drift detection, and multi-agent handoff fidelity. The first four map to the three AEI domains of Knowing, Doing, and Deciding. The fifth is cross-cutting and addresses the multi-agent extension the AEI framework acknowledges but does not develop. Each dimension targets a failure mode — superseded-state persistence, phantom re-execution, silent continuation, silent reoptimization, handoff hallucination — that is invisible at turn scope and only partially covered by current trajectory-aware benchmarks. The claim is not that no one is evaluating trajectories; it is that the field lacks a stable taxonomy for evaluating the integrity of the state that long-running trajectories carry.

The contribution is conceptual. I have argued for what should be measured, not for a specific measurement instrument. The next step is the harness: synthetic trajectory generation with controlled mid-task injections, adversarial handoff scenarios, trajectory-scale LLM-as-judge protocols, and the annotation architectures that make all of this tractable at the scale frontier systems operate at. That work is the research agenda that follows from this framework, not a prerequisite for it. The contribution here is the stable what: the requirement that long-running agents be evaluated at trajectory scope, decomposed along dimensions that reflect the state-integrity obligations that scope reveals. The volatile how — benchmarks, harnesses, metric formalizations — is the engineering and research agenda the framework organizes.

References
