Two-Stage Agentic Memory Generation Architecture
Why separating memory generation from retrieval is an economic and epistemic necessity — and where the real engineering challenge lies.
What's Wrong with Retrieve-at-Query-Time
The retrieve-at-query-time pattern dominates production memory systems for pragmatic reasons: it is simple to implement, it handles the conversational case adequately, and it places the memory burden entirely on the retrieval layer, which is a well-studied problem. For session-based assistants handling a bounded conversation, it is often sufficient.
The pattern fails in three ways as systems scale:
Latency budget: Retrieving and reasoning over raw source material — emails, documents, prior conversations — inside the inference latency budget means the user waits for memory assembly on every query. The expensive work happens in the most latency-sensitive window.
Signal density: Raw source material has low signal density. Most content in an email thread, a document, or a conversation transcript is irrelevant to any given query. Retrieval over raw sources forces the model to do significant filtering and synthesis work at query time, work that is duplicated on every subsequent query that touches the same source.
No amortization: If the same source material is relevant to 50 downstream queries, the retrieve-at-query-time pattern pays the synthesis cost 50 times. The two-stage architecture pays it once.
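The amortization argument above can be made concrete with a toy cost model. The numbers below are hypothetical placeholders, not measurements; only the structure of the comparison comes from the text.

```python
# Illustrative cost model for the amortization argument.
# All values are hypothetical; only the structure matters.
SYNTHESIS_COST = 1.0   # cost of reasoning over raw sources once
SELECTION_COST = 0.02  # cost of selecting pre-structured entries
N_QUERIES = 50         # queries touching the same source material

# Retrieve-at-query-time: synthesis is repeated on every query.
retrieve_at_query = N_QUERIES * SYNTHESIS_COST

# Two-stage: synthesis happens once; only selection is paid per query.
two_stage = SYNTHESIS_COST + N_QUERIES * SELECTION_COST

print(retrieve_at_query)  # 50.0
print(two_stage)          # 2.0
```

The gap widens linearly with the number of queries that touch the same source, which is why the amortization ratio is the economic lever.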
Two Stages with Distinct Responsibilities
Stage 1 — Memory Generation
- Reads raw source material from all available streams
- Extracts and structures entries by memory tier (semantic / episodic / procedural / prospective)
- Assigns provenance, authority weight, and confidence
- Stores entries in the memory store; does not touch the context window
- One run serves N downstream queries

Stage 2 — Memory Selection
- Queries memory store by tier, recency, relevance, and authority
- Applies token budget constraints across tiers
- Proactively surfaces prospective items regardless of query similarity
- Assembles and injects context payload into the prompt
- Falls back to a lightweight staleness gate when generation cadence lags
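The Stage 2 responsibilities above can be sketched as a greedy selection pass over a memory store. This is a minimal sketch under assumed entry fields (`tier`, `relevance`, `authority`, `recency`, `tokens`); it is not a canonical schema.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    # Hypothetical entry shape; field names are assumptions.
    tier: str        # "semantic" | "episodic" | "procedural" | "prospective"
    text: str
    relevance: float # query-dependent score in [0, 1]
    authority: float # source authority weight in [0, 1]
    recency: float   # decayed freshness in [0, 1]
    tokens: int

def select(entries, budget_per_tier):
    """Greedy Stage 2 selection under per-tier token budgets."""
    payload = []
    for tier, budget in budget_per_tier.items():
        candidates = [e for e in entries if e.tier == tier]
        if tier == "prospective":
            # Prospective items surface regardless of query similarity.
            candidates.sort(key=lambda e: e.recency, reverse=True)
        else:
            candidates.sort(key=lambda e: e.relevance * e.authority * e.recency,
                            reverse=True)
        used = 0
        for e in candidates:
            if used + e.tokens <= budget:
                payload.append(e)
                used += e.tokens
    return payload

entries = [
    MemoryEntry("semantic", "a", 0.9, 0.9, 0.9, 100),
    MemoryEntry("semantic", "b", 0.2, 0.5, 0.5, 100),
    MemoryEntry("prospective", "c", 0.0, 0.5, 1.0, 50),
]
print([e.text for e in select(entries, {"semantic": 150, "prospective": 100})])
# ['a', 'c']
```

Note that the prospective entry is selected despite a relevance score of zero, which is the behavior the selection contract requires.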
The separation is not merely an engineering optimization. It reflects an epistemologically correct framing of the enterprise memory problem: the reasoning required to extract signal from noise — determining what from a large source artifact is worth preserving — is too expensive to perform inside the inference latency budget at scale. That reasoning belongs in Stage 1, where it can be scheduled, cached, and amortized.
The core economic claim: one Stage 1 generation run serves many Stage 2 selection operations. The amortization ratio is the primary lever for inference cost management.
[Diagram: one Stage 1 run per source event; each generation run serves N queries at Stage 2]
Why This Architecture Gets Better Automatically
The two-stage architecture is model-forward by construction. The mechanism is precise: Stage 1 generation quality is a direct function of the foundation model's reasoning capability. When the foundation model improves — through a model update, a capability jump, or a fine-tuned variant — Stage 1 produces better memory entries from the same source material, without any change to the Stage 2 retrieval layer or the memory store schema.
This contrasts with architectures that embed memory logic in prompt templates or retrieval heuristics: those must be re-engineered when the model changes, because the logic is coupled to a specific model's behavior. The two-stage architecture decouples memory quality from model specifics. The model executes the generation contract; it does not define it.
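The decoupling can be illustrated by treating the generation contract as data and the model as a swappable callable. The contract text, function names, and output format below are illustrative assumptions, not a published interface.

```python
from typing import Callable

# The generation contract: a fixed instruction and output format,
# independent of any particular model. The model executes the contract;
# it does not define it. (Format is illustrative.)
CONTRACT = (
    "Extract memory entries from the source below. Return one entry per "
    "line as: tier | provenance | confidence | text"
)

def run_stage1(source: str, model: Callable[[str], str]) -> list[str]:
    """Stage 1 generation against any model that maps prompt -> text."""
    raw = model(f"{CONTRACT}\n\n{source}")
    return [line for line in raw.splitlines() if line.strip()]

# Swapping the model requires no change to the contract or store schema.
stub_model = lambda prompt: "semantic | email:123 | 0.9 | Q3 deadline is Oct 1"
print(run_stage1("source text", stub_model))
```

A better model dropped into `model` yields better entries from the same source, with Stage 2 and the store untouched, which is the model-forward property the text describes.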
Agentic Retrieval for Non-Conversational Sources
The two-stage architecture is well understood for the conversational case: a human-bot conversation produces a transcript; Stage 1 reads it and extracts structured memory entries. The transcript is the trace, and the trace is human-readable, relatively dense with signal, and structurally familiar.
The open engineering problem is what happens when the source stream is not a conversation.
Human-Bot Interaction Transcripts
Dense with signal relative to volume. Human-generated intent is explicit. "Process everything" is computationally tractable and frequently correct. The state of the art here is reasonably mature.
Agentic Artifact Streams
MCP tool outputs, plan execution traces, multi-agent exchange logs, intermediate task states. High volume, low signal density, often structurally ambiguous. "Process everything" is expensive and produces high noise. The published architecture for this class is sparse.
For agentic artifact streams, Stage 1 cannot simply read and summarize. It must determine, agentically, what from this tool trace or plan execution record is worth preserving — and that determination requires understanding task intent, execution outcome, and downstream relevance across a heterogeneous stream.
This is a genuinely different retrieval problem from the conversational case. Enterprise-scale deployments that operate through non-human-bot infrastructure generate artifact streams that are orders of magnitude larger than conversation transcripts, with far lower signal density per token. The systems-level challenge — how to build a Stage 1 process that is both selective and comprehensive across these streams — does not yet have a canonical solution in the published literature.
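One plausible shape for a selective Stage 1 over artifact streams is a cheap triage pass that gates the expensive generation call. The artifact kinds, field names, and rules below are assumptions for illustration; they are not a proposed taxonomy.

```python
# Hypothetical triage pass: cheap structural checks decide which
# artifacts justify an expensive Stage 1 generation call.
def triage(artifact: dict) -> bool:
    if artifact["kind"] == "tool_output" and not artifact.get("task_relevant"):
        return False  # routine tool chatter: drop
    if artifact["kind"] == "intermediate_state":
        # Keep only states whose execution outcome is settled.
        return artifact.get("outcome") == "final"
    return True       # plans, decisions, errors: keep by default

stream = [
    {"kind": "tool_output", "task_relevant": False},
    {"kind": "intermediate_state", "outcome": "final"},
    {"kind": "plan_trace"},
]
kept = [a for a in stream if triage(a)]
print(len(kept))  # 2
```

The hard part, as the text notes, is that real triage needs task-intent understanding, not just structural rules; a static filter like this trades comprehensiveness for cost.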
What This Implies for System Builders
Treat Stage 1 as infrastructure, not a pipeline step
Stage 1 should be event-triggered (runs when new source material arrives), independently scalable (generation load is decoupled from inference load), and continuously monitored (the amortization ratio — how many queries each generation run serves — is your primary economic health metric). Systems that implement Stage 1 as a synchronous step in the inference path have defeated the architecture's purpose.
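Monitoring the amortization ratio can be as simple as two counters keyed by source. This is a minimal sketch; the class and event names are invented for illustration.

```python
from collections import defaultdict

class AmortizationMonitor:
    """Tracks queries served per generation run, per source (illustrative)."""
    def __init__(self):
        self.generation_runs = defaultdict(int)
        self.queries_served = defaultdict(int)

    def on_generation(self, source_id):
        # Event-triggered Stage 1 run completed for this source.
        self.generation_runs[source_id] += 1

    def on_query(self, source_id):
        # A Stage 2 selection drew on entries from this source.
        self.queries_served[source_id] += 1

    def ratio(self, source_id):
        runs = self.generation_runs[source_id]
        return self.queries_served[source_id] / runs if runs else 0.0

m = AmortizationMonitor()
m.on_generation("email:123")
for _ in range(50):
    m.on_query("email:123")
print(m.ratio("email:123"))  # 50.0
```

A ratio trending toward 1 signals that generation cost is no longer being amortized, which is the economic failure mode the text warns against.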
Source authority is a first-class input to Stage 1
Not all source material carries equal epistemic weight. A formal decision record carries higher authority than a brainstorming chat message; a calendar invite carries higher commitment weight than a verbal agreement in a transcript. Stage 1 must tag every generated memory entry with provenance and authority weight. Stage 2 uses these signals during selection to calibrate confidence. Systems that treat all sources as equivalent produce memory entries of wildly varying reliability that are indistinguishable at retrieval time.
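Provenance and authority tagging can be sketched as a lookup by source type at generation time. The weight values below are made-up placeholders; the point is only that every entry leaves Stage 1 carrying both signals.

```python
# Illustrative authority weights by source type; values are assumptions.
AUTHORITY = {
    "decision_record": 0.95,
    "calendar_invite": 0.85,
    "transcript": 0.50,
    "chat_message": 0.30,
}

def tag_entry(text: str, source_type: str, source_id: str) -> dict:
    """Stage 1: every generated entry carries provenance and authority."""
    return {
        "text": text,
        "provenance": f"{source_type}:{source_id}",
        "authority": AUTHORITY.get(source_type, 0.30),  # conservative default
    }

e = tag_entry("Ship v2 on Oct 1", "decision_record", "dr-42")
print(e["authority"])  # 0.95
```

Stage 2 can then multiply authority into its ranking score, so a formal decision record and a brainstorming message with identical relevance are no longer indistinguishable at retrieval time.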
Design Stage 1 separately for conversational and agentic sources
The conversational case and the agentic artifact case have different signal densities, different structural vocabularies, and different notions of "what matters." A unified Stage 1 that treats both as raw text to summarize will underperform on both. The agentic artifact case in particular requires task-intent understanding to determine which intermediate states and tool outputs are worth preserving — this is closer to an evaluation problem than a summarization problem.
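The split design above amounts to dispatching Stage 1 by source class. The handlers below are deliberately crude stand-ins (their names and filtering rules are hypothetical); what matters is that the two classes do not share a single summarize-everything path.

```python
# Sketch of per-source-class Stage 1 dispatch (handlers are hypothetical).
def generate_from_transcript(source: str) -> list[str]:
    # Dense signal: process everything is tractable here.
    return [f"episodic: {line}" for line in source.splitlines() if line]

def generate_from_artifact_stream(source: str) -> list[str]:
    # Sparse signal: filter by a task-intent marker before generating.
    return [f"procedural: {line}" for line in source.splitlines()
            if line.startswith("DECISION")]

HANDLERS = {
    "transcript": generate_from_transcript,
    "artifact_stream": generate_from_artifact_stream,
}

def stage1(source_class: str, source: str) -> list[str]:
    return HANDLERS[source_class](source)

print(stage1("artifact_stream", "STEP tool call\nDECISION retry with backoff"))
# ['procedural: DECISION retry with backoff']
```

In a real system the artifact-stream handler would be an agentic evaluation over task intent and outcome, as the text argues, rather than a string-prefix filter.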
Research notes, half-baked ideas. Probably overthought, definitely over-architected.