
Two-Stage Agentic Memory Generation Architecture

Why separating memory generation from retrieval is an economic and epistemic necessity — and where the real engineering challenge lies.

Feb 2026 · Memory Architecture · Enterprise AI · Systems Design
Abstract
The dominant pattern in production agentic memory systems — retrieve relevant context at query time, inject into the prompt — is architecturally fragile: it runs expensive reasoning inside the inference latency budget, treats generation and selection as one operation, and degrades under query volume. This paper argues for a two-stage separation: an asynchronous, query-independent generation phase that produces standing memory representations, and an inference-time selection phase that assembles context from those representations. The economic structure is fundamentally different — one generation run amortizes across many downstream queries — and the architecture is model-forward by construction: as foundation models improve, so does generation quality without re-engineering the retrieval layer. The paper further identifies the agentic retrieval problem for non-conversational artifact streams as the genuinely open frontier in memory generation, distinct from the well-studied conversational memory case.

What's Wrong with Retrieve-at-Query-Time

The retrieve-at-query-time pattern dominates production memory systems for pragmatic reasons: it is simple to implement, it handles the conversational case adequately, and it places the memory burden entirely on the retrieval layer, which is a well-studied problem. For session-based assistants handling a bounded conversation, it is often sufficient.

It fails in three ways as systems scale:

Latency budget: Retrieving and reasoning over raw source material — emails, documents, prior conversations — inside the inference latency budget means the user waits for memory assembly on every query. The expensive work happens in the most latency-sensitive window.

Signal density: Raw source material has low signal density. Most content in an email thread, a document, or a conversation transcript is not relevant to any given query. Retrieval over raw sources requires the model to do significant filtering and synthesis work at query time, and that work is duplicated on every subsequent query that touches the same source.

No amortization: If the same source material is relevant to 50 downstream queries, the retrieve-at-query-time pattern pays the synthesis cost 50 times. The two-stage architecture pays it once.


Two Stages with Distinct Responsibilities

Stage 1 — Generation · Async · Query-independent · Amortized
Runs offline, triggered by new source events (new email, completed task, meeting transcript, tool execution record). Produces standing memory representations — structured entries tagged by memory tier — without knowing which queries they will serve.
  • Reads raw source material from all available streams
  • Extracts and structures entries by memory tier (semantic / episodic / procedural / prospective)
  • Assigns provenance, authority weight, and confidence
  • Stores entries in the memory store; does not touch the context window
  • One run serves N downstream queries
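
A minimal sketch of the Stage 1 contract in Python; the tier names follow the list above, while `MemoryEntry`'s fields, the `store` interface, and the `extractor` wrapping the foundation-model call are illustrative assumptions, not a prescribed API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Tier(Enum):
    SEMANTIC = "semantic"        # stable facts about people, projects, systems
    EPISODIC = "episodic"        # what happened, when, and with what outcome
    PROCEDURAL = "procedural"    # how recurring tasks are carried out
    PROSPECTIVE = "prospective"  # commitments and deadlines to surface later


@dataclass
class MemoryEntry:
    content: str      # the distilled, high-signal statement
    tier: Tier
    provenance: str   # pointer back to the source artifact
    authority: float  # epistemic weight of the source (0..1)
    confidence: float # the extractor's own confidence (0..1)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def on_source_event(event, store, extractor) -> int:
    """Stage 1: runs asynchronously per source event, never at query time.

    The expensive reasoning happens here, once, and is amortized across
    every downstream query the resulting entries serve.
    """
    entries = extractor.generate_entries(event.raw_content)  # model call: expensive
    for entry in entries:
        store.put(entry)  # writes the memory store; never touches a context window
    return len(entries)
```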
Stage 2 — Selection · Inference-time · Query-conditioned · Fast
Runs inside the inference latency budget. Receives the current query and selects relevant entries from the pre-built memory store. Does not generate; does not read raw sources. Assembles a structured context payload for injection.
  • Queries memory store by tier, recency, relevance, and authority
  • Applies token budget constraints across tiers
  • Proactively surfaces prospective items regardless of query similarity
  • Assembles and injects context payload into the prompt
  • Falls back to a lightweight staleness gate when generation cadence lags
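
And a sketch of the Stage 2 side, continuing the `Tier` definitions above; `store.search`, `store.due_prospective`, and the `tokenizer` interface are assumptions, and the per-tier budget split is illustrative:

```python
# Hypothetical per-tier token budgets; the split is illustrative, not prescriptive.
TIER_BUDGETS = {
    Tier.SEMANTIC: 800,
    Tier.EPISODIC: 600,
    Tier.PROCEDURAL: 400,
    Tier.PROSPECTIVE: 200,
}


def select_context(query: str, store, tokenizer) -> str:
    """Stage 2: query-conditioned selection inside the latency budget.

    Reads only pre-built entries; performs no generation and never
    re-reads raw sources.
    """
    sections = []
    for tier, budget in TIER_BUDGETS.items():
        if tier is Tier.PROSPECTIVE:
            # Surfaced proactively: due commitments appear even when the
            # query never mentions them.
            candidates = store.due_prospective()
        else:
            # Assumed to rank by relevance, recency, and authority.
            candidates = store.search(query, tier=tier)
        picked, used = [], 0
        for entry in candidates:
            cost = len(tokenizer.encode(entry.content))
            if used + cost > budget:
                break  # enforce the per-tier token budget
            picked.append(entry.content)
            used += cost
        if picked:
            sections.append(f"[{tier.value}]\n" + "\n".join(picked))
    return "\n\n".join(sections)  # structured payload, ready for prompt injection
```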

The separation is not merely an engineering optimization. It reflects an epistemologically correct framing of the enterprise memory problem: the reasoning required to extract signal from noise — determining what from a large source artifact is worth preserving — is too expensive to perform inside the inference latency budget at scale. That reasoning belongs in Stage 1, where it can be scheduled, cached, and amortized.

Economics of the Two-Stage Structure

The core economic claim: one Stage 1 generation run serves many Stage 2 selection operations. The amortization ratio is the primary lever for inference cost management.

  • Generation cost: paid once per source event
  • Queries served from each generation run: N
  • Marginal cost per query at Stage 2: ~O(1)
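
In symbols, with C_gen the per-event generation cost, C_sel the per-query selection cost, and N the number of queries served per run:

    cost_per_query ≈ C_gen / N + C_sel

With hypothetical numbers (C_gen = $0.50, C_sel = $0.005, N = 50), the amortized cost is $0.015 per query, versus roughly the full $0.50 on every query when synthesis runs at query time.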

Why This Architecture Gets Better Automatically

The two-stage architecture is model-forward by construction. The mechanism is precise: Stage 1 generation quality is a direct function of the foundation model's reasoning capability. When the foundation model improves — through a model update, a capability jump, or a fine-tuned variant — Stage 1 produces better memory entries from the same source material, without any change to the Stage 2 retrieval layer or the memory store schema.

This contrasts with architectures that embed memory logic in prompt templates or retrieval heuristics: those must be re-engineered when the model changes, because the logic is coupled to a specific model's behavior. The two-stage architecture decouples memory quality from model specifics. The model executes the generation contract; it does not define it.

A practical implication: the two-stage architecture gives teams a clear upgrade path. When a better foundation model becomes available, Stage 1 can be re-run over historical source material to regenerate memory entries at higher quality — without touching Stage 2 or the downstream inference infrastructure. This is the operational meaning of model-forward.
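
A sketch of what that upgrade path can look like in practice, reusing the Stage 1 names from the earlier sketch; `source_archive` and `delete_by_provenance` are hypothetical:

```python
def backfill(source_archive, store, new_extractor, batch_size=100):
    """Re-run Stage 1 over historical sources after a model upgrade.

    Stage 2 and the store schema are untouched; only entry quality changes.
    """
    for batch in source_archive.iter_batches(batch_size):
        for event in batch:
            store.delete_by_provenance(event.id)  # retire old-model entries
            for entry in new_extractor.generate_entries(event.raw_content):
                store.put(entry)  # same schema, higher-quality entries
```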

Agentic Retrieval for Non-Conversational Sources

The two-stage architecture is well understood for the conversational case: a human-bot conversation produces a transcript, and Stage 1 reads it and extracts structured memory entries. The transcript is the trace, and the trace is human-readable, relatively dense with signal, and structurally familiar.

The open engineering problem is what happens when the source stream is not a conversation.

Conversational Sources

Human-Bot Interaction Transcripts

Dense with signal relative to volume. Human-generated intent is explicit. "Process everything" is computationally tractable and frequently correct. The state of the art here is reasonably mature.

Non-Conversational Sources · Open Frontier

Agentic Artifact Streams

MCP tool outputs, plan execution traces, multi-agent exchange logs, intermediate task states. High volume, low signal density, often structurally ambiguous. "Process everything" is expensive and produces high noise. The published architecture for this class is sparse.

For agentic artifact streams, Stage 1 cannot simply read and summarize. It must determine, agentically, what from this tool trace or plan execution record is worth preserving — and that determination requires understanding task intent, execution outcome, and downstream relevance across a heterogeneous stream.

This is a genuinely different retrieval problem from the conversational case. Enterprise-scale deployments that operate through non-human-bot infrastructure generate artifact streams that are orders of magnitude larger than conversation transcripts, with far lower signal density per token. The systems-level challenge — how to build a Stage 1 process that is both selective and comprehensive across these streams — does not yet have a canonical solution in the published literature.
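
One plausible shape for such a Stage 1, offered as a sketch rather than a canonical solution: a cheap triage pass scores each artifact for preservation-worthiness against the task's declared intent and outcome, and only the survivors reach the expensive extraction pass. Every name below (`trace`, `scorer`, the threshold value) is hypothetical:

```python
def triage_then_extract(trace, scorer, extractor, store, threshold=0.6):
    """Two-pass Stage 1 for agentic artifact streams (sketch).

    `trace` is a plan-execution record: a task intent, an outcome, and a
    high-volume list of artifacts (tool outputs, intermediate states).
    `scorer` wraps a cheap model call; `extractor` an expensive one.
    """
    kept = []
    for artifact in trace.artifacts:
        # Cheap pass: does this artifact matter, given what the task was
        # trying to do and how it ended?
        score = scorer.relevance(artifact, intent=trace.intent, outcome=trace.outcome)
        if score >= threshold:
            kept.append(artifact)
    # Expensive pass runs only on the surviving slice of the stream.
    for entry in extractor.generate_entries(kept, context=trace.intent):
        store.put(entry)
```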


What This Implies for System Builders

Treat Stage 1 as infrastructure, not a pipeline step

Stage 1 should be event-triggered (runs when new source material arrives), independently scalable (generation load is decoupled from inference load), and continuously monitored (the amortization ratio — how many queries each generation run serves — is your primary economic health metric). Systems that implement Stage 1 as a synchronous step in the inference path have defeated the architecture's purpose.
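
The health metric itself is trivial to compute; the point is to watch it continuously. A minimal sketch, with both counters assumed to come from whatever telemetry the deployment already has:

```python
def amortization_ratio(queries_served: int, generation_runs: int) -> float:
    """Stage 2 selections per Stage 1 run: the primary economic health metric.

    A ratio near 1 means generation cost is not being amortized and the
    system is degenerating toward retrieve-at-query-time economics.
    """
    return queries_served / max(generation_runs, 1)
```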

Source authority is a first-class input to Stage 1

Not all source material carries equal epistemic weight. A formal decision record carries higher authority than a brainstorming chat message; a calendar invite carries higher commitment weight than a verbal agreement in a transcript. Stage 1 must tag every generated memory entry with provenance and authority weight. Stage 2 uses these signals during selection to calibrate confidence. Systems that treat all sources as equivalent produce memory entries of wildly varying reliability that are indistinguishable at retrieval time.
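
A sketch of what authority tagging can look like, reusing `MemoryEntry` from the earlier sketch; the source types and weight values are illustrative assumptions, not calibrated constants:

```python
# Illustrative authority weights by source type; the values are assumptions.
SOURCE_AUTHORITY = {
    "decision_record": 0.95,     # formal, reviewed, binding
    "calendar_invite": 0.85,     # explicit, scheduled commitment
    "email": 0.60,
    "meeting_transcript": 0.50,  # verbal, unreviewed
    "brainstorm_chat": 0.30,     # exploratory, low commitment
}


def tag_authority(entry: MemoryEntry, source_type: str) -> MemoryEntry:
    """Stage 1 stamps every entry; Stage 2 reads the weight at selection time."""
    entry.authority = SOURCE_AUTHORITY.get(source_type, 0.40)  # conservative default
    return entry
```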

Design Stage 1 separately for conversational and agentic sources

The conversational case and the agentic artifact case have different signal densities, different structural vocabularies, and different notions of "what matters." A unified Stage 1 that treats both as raw text to summarize will underperform on both. The agentic artifact case in particular requires task-intent understanding to determine which intermediate states and tool outputs are worth preserving — this is closer to an evaluation problem than a summarization problem.
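
In code, the split reduces to routing at the top of Stage 1. A sketch, reusing `triage_then_extract` from the earlier agentic-stream example; `event.kind` and the extractor names are hypothetical:

```python
def stage1_dispatch(event, store, conv_extractor, agentic_extractor, scorer):
    """Route each source event to the Stage 1 pipeline built for its class."""
    if event.kind == "conversation":
        # Dense, human-readable transcript: read and extract directly.
        for entry in conv_extractor.generate_entries(event.raw_content):
            store.put(entry)
    else:
        # Tool traces, plan executions, multi-agent logs: triage first,
        # because "process everything" is expensive and noisy here.
        triage_then_extract(event.trace, scorer, agentic_extractor, store)
```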
