Conceptual Framework · Memory Architecture

A Four-Tier Memory Taxonomy for Enterprise Agentic Systems

Why semantic, episodic, and procedural memory are necessary but not sufficient — and what the missing tier reveals about how agents fail at scale.

Feb 2026 · Memory Architecture · Enterprise AI · Original Framework
Abstract
The dominant three-tier memory taxonomy used in AI agent research — semantic, episodic, and procedural — is insufficient for enterprise agentic systems. Systems built on three tiers drop a specific class of information at session boundaries: forward-looking commitments, pending decisions, and time-sensitive obligations. This paper introduces prospective memory as the missing fourth tier, grounds it in cognitive science literature, and argues that its absence is the primary architectural reason enterprise agents fail at handoff, delegation, and multi-session task continuity. Each tier is defined with distinct storage semantics, retrieval patterns, and expiry logic. The framework also identifies the session-to-persistent transition as the stress test that reveals which tiers a system's memory architecture actually implements versus which it assumes the model will handle implicitly.

What Falls Through the Cracks

Enterprise agentic systems lose information at a specific, predictable point: the session boundary. A user asks an agent to draft a proposal, send it by Thursday, and schedule a follow-up if there's no response by Friday. The agent executes the first action. When the session ends, the Thursday deadline and Friday follow-up condition are gone — not because the model forgot them, but because the system had no designated place to store forward-looking commitments.

This failure is not a retrieval quality problem. It is a taxonomy problem. Standard memory architectures don't have a tier for "things the agent is supposed to do or check in the future." That gap has a name in cognitive science: prospective memory — memory for intended future actions, as distinct from memory for past facts, events, or behavioral patterns.

The dominant frameworks in AI agent research (CoALA, MemGPT, the broader survey literature) enumerate three memory types: semantic (facts), episodic (events), and procedural (skills). CoALA adds a working memory tier as a fourth. This paper argues that working memory is an assembly artifact — the output of the retrieval process, not a storage tier — and that the genuinely missing tier is prospective, not working.

Four Tiers, Four Distinct Problems

Each tier answers a different question about the agent's cognitive state. The tiers are not alternatives — they are all necessary, and none is sufficient alone.

Tier 1 · Semantic Memory · WHO / WHAT

Compressed, decontextualized facts about the user's world: roles, relationships, domains of expertise, organizational context, communication preferences. These are durable beliefs that should persist across all sessions and be invalidated only when a fact explicitly changes.

Storage: Entity graph + confidence-weighted fact store
Retrieval: Always-on background injection; stable across sessions
Expiry: Invalidated by explicit fact change; low decay; conflict resolved by source authority
Failure mode: Agent asks the same clarifying questions every session; treats known facts as unknown
Example: "Alex is the project lead for Q3 infrastructure. You typically communicate with them by Slack, not email. They prefer decisions presented with tradeoffs, not just recommendations."
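The storage and conflict-resolution semantics above can be sketched as a small fact store. This is an illustrative sketch, not an implementation from the paper: the `AUTHORITY` ranking, the `Fact` fields, and the class names are all assumptions chosen to show source-authority conflict resolution on a keyed fact store.

```python
from dataclasses import dataclass

# Illustrative authority ranking; real systems would define their own.
AUTHORITY = {"calendar": 3, "email": 2, "chat": 1}

@dataclass
class Fact:
    subject: str
    predicate: str
    value: str
    confidence: float
    source: str  # e.g. "calendar", "email", "chat"

class SemanticStore:
    """Confidence-weighted fact store keyed on (subject, predicate)."""

    def __init__(self):
        self._facts = {}

    def assert_fact(self, fact: Fact):
        key = (fact.subject, fact.predicate)
        current = self._facts.get(key)
        # Conflict resolved by source authority: a lower-authority source
        # cannot silently overwrite a fact from a higher-authority one.
        if current and AUTHORITY[current.source] > AUTHORITY[fact.source]:
            return
        self._facts[key] = fact

    def get(self, subject: str, predicate: str):
        fact = self._facts.get((subject, predicate))
        return fact.value if fact else None
```

An equal-or-higher-authority source replaces the fact (the "invalidated by explicit fact change" path); anything weaker is ignored rather than merged.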
Tier 2 · Episodic Memory · WHAT HAPPENED

Time-indexed records of specific events, interactions, decisions, and outcomes — the agent's personal experience stream. Episodic memory provides the context for interpreting current requests in light of prior history, including unresolved threads and past commitments made by the user.

Storage: Timestamped event records with participant and decision tags
Retrieval: Query-conditioned, recency-boosted; full-text + semantic search
Expiry: Rolling window with recency-wins conflict resolution
Failure mode: Agent cannot connect current request to prior context; treats each session as a fresh start
Example: "On March 10 you met with the infra team. Budget concerns were raised about the Q3 timeline. No decision was reached. A follow-up was scheduled for the following week but not confirmed."
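The query-conditioned, recency-boosted retrieval pattern can be illustrated with a minimal scorer. This is a sketch under stated assumptions: term overlap stands in for the full-text + semantic search the tier actually calls for, and the 30-day half-life is an invented parameter, not a recommendation.

```python
import math
from dataclasses import dataclass

@dataclass
class Episode:
    timestamp: float  # seconds since epoch
    text: str
    tags: set

def recency_boosted_search(episodes, query_terms, now, half_life_days=30.0):
    """Rank episodes by query overlap, boosted by exponential recency decay.

    Score = (term overlap) * 2^(-age / half_life), so an episode's weight
    halves every `half_life_days` days. Non-matching episodes are dropped.
    """
    half_life = half_life_days * 86400
    scored = []
    for ep in episodes:
        overlap = len(query_terms & set(ep.text.lower().split()))
        if overlap == 0:
            continue
        age = max(0.0, now - ep.timestamp)
        recency = math.exp(-math.log(2) * age / half_life)
        scored.append((overlap * recency, ep))
    return [ep for _, ep in sorted(scored, key=lambda pair: -pair[0])]
```

The multiplicative form means recency breaks ties between equally relevant episodes, which is also how the tier's "recency-wins" conflict resolution behaves at retrieval time.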
Tier 3 · Procedural Memory · HOW TO ACT

Behavioral patterns distilled from repeated observed actions — the user's implicit preferences for how tasks should be executed. Procedural memory is stored as structured skill definitions inferred from behavioral signals, not as passive embeddings of prior text. It answers: "Given this task type, how does this user prefer it done?"

Storage: Skill definitions; updated on sufficient contradicting behavioral signals
Retrieval: Implicit injection on task-type match; not explicitly surfaced to user
Expiry: Slow decay; updated by frequency-of-contradiction, not recency
Failure mode: Agent re-learns user preferences every session; doesn't adapt to communication style or workflow patterns
Example: "When drafting emails to senior stakeholders, this user leads with the ask, keeps to three sentences max, and uses bullet points for action items. Never uses passive voice in subject lines."
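The frequency-of-contradiction update rule can be sketched as follows. The `Skill` class and the threshold of three are illustrative assumptions; the point is that the stored preference flips only after repeated contradicting behavioral signals, never on the most recent observation alone.

```python
from collections import Counter

class Skill:
    """One procedural preference, e.g. how to draft stakeholder emails.

    The preference is replaced only after `threshold` contradicting
    behavioral signals accumulate; a single recent counterexample
    is not enough (frequency-of-contradiction, not recency).
    """

    def __init__(self, preference, threshold=3):
        self.preference = preference
        self.threshold = threshold
        self._contradictions = Counter()

    def observe(self, observed_preference):
        if observed_preference == self.preference:
            # Consistent behavior resets accumulated contradictions.
            self._contradictions.clear()
            return
        self._contradictions[observed_preference] += 1
        if self._contradictions[observed_preference] >= self.threshold:
            self.preference = observed_preference
            self._contradictions.clear()
```

Contrast this with the episodic tier, where the most recent record wins: a one-off deviation should update episodic memory but leave procedural memory untouched.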
Tier 4 · Prospective Memory · WHAT'S COMING

Forward-looking commitments, deadlines, pending decisions, and intended future actions — memory for things the agent is supposed to do or check. This is the tier absent from most deployed systems and the tier most consequential for always-on operation. In cognitive science, prospective memory is well-established as a distinct memory system; its absence in AI agent architectures is the primary reason agents drop deferred tasks at session boundaries.

Storage: Commitment records with deadline, condition, and owning-intent fields; lifecycle: pending → active → resolved / expired
Retrieval: Proactive surfacing before deadlines; injected even when not queried; transitions to episodic on resolution
Expiry: Hard TTL plus grace period; explicit resolution by user or agent; expired items archived to episodic with outcome tag
Failure mode: Agent drops delegated tasks, missed deadlines, and conditional follow-ups at every session boundary — the most common enterprise complaint about AI agents
Example: "Proposal draft due Thursday EOD. If no client response by Friday 5pm, send a follow-up and notify the account lead. Budget approval decision expected from finance team by end of month — flag if not received."
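The lifecycle and expiry semantics above can be made concrete in a short sketch. This is illustrative only: the class names, the one-day surfacing window, and the archive representation are assumptions, but the state machine (pending → active → resolved / expired), the hard TTL plus grace period, and the archive-to-episodic-with-outcome-tag behavior follow the tier definition.

```python
from dataclasses import dataclass
from enum import Enum

class State(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    RESOLVED = "resolved"
    EXPIRED = "expired"

@dataclass
class Commitment:
    description: str
    deadline: float       # seconds since epoch: the hard TTL
    grace_period: float   # seconds past the deadline before expiry
    owning_intent: str
    state: State = State.PENDING

class ProspectiveStore:
    def __init__(self):
        self.items = []
        self.episodic_archive = []  # terminal items land here with an outcome tag

    def add(self, commitment):
        self.items.append(commitment)

    def resolve(self, commitment):
        commitment.state = State.RESOLVED

    def tick(self, now):
        """Advance lifecycles on a background cadence, not on query."""
        for c in self.items:
            # Assumed cadence: surface one day before the deadline.
            if c.state is State.PENDING and now >= c.deadline - 86400:
                c.state = State.ACTIVE
            if c.state in (State.PENDING, State.ACTIVE) and now > c.deadline + c.grace_period:
                c.state = State.EXPIRED
        # Archive terminal items to episodic memory, tagged with the outcome.
        for c in [c for c in self.items if c.state in (State.RESOLVED, State.EXPIRED)]:
            self.items.remove(c)
            self.episodic_archive.append((c, c.state.value))
```

Note that `tick` is driven by a clock, not a user query: expiry and surfacing happen even if no one ever asks about the commitment.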

Against Existing Frameworks

The table below positions the four-tier taxonomy against the most commonly cited alternatives in the agent memory literature.

| Framework | Tiers / Types | Missing | Notes |
| --- | --- | --- | --- |
| This paper | Semantic · Episodic · Procedural · Prospective | (none) | Prospective as a first-class storage tier with distinct retrieval and expiry semantics |
| CoALA (2023) | Semantic · Episodic · Procedural · Working | Prospective | Working memory is treated as a storage tier; this paper treats it as a retrieval artifact |
| MemGPT (2023) | In-context (working) · External (archival) | Semantic · Procedural · Prospective | Engineering-focused; storage topology over cognitive taxonomy |
| Survey literature (2024–25) | Semantic · Episodic · Procedural | Prospective | Standard three-tier framing; adequate for session-scoped agents, insufficient for persistent ones |
| Cognitive science baseline | Semantic · Episodic · Procedural · Prospective | (none) | All four tiers well-established in human memory research; AI agent literature has lagged in adopting prospective |

Prospective Memory as the Steerability Primitive

The prospective memory tier is not only a storage concern. It is the primitive that makes long-horizon agent steerability tractable.

Steerability — the ability for authorized parties to correct an agent's behavior mid-task — requires a live representation of what the agent intends to do next. Without prospective memory, the agent's future action plan is implicit: it lives in the model's forward pass, which cannot be inspected, paused, or corrected without terminating the task entirely. With prospective memory as a first-class system artifact, human interrupts have a well-defined insertion point. Correction becomes additive rather than destructive.
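The "well-defined insertion point" can be sketched as an operation over a queue of not-yet-executed intents. This is a hypothetical minimal example, not part of the paper's specification: `Intent`, its status strings, and substring matching are all invented for illustration. The point it shows is that correction edits the plan in place rather than terminating the task.

```python
from dataclasses import dataclass

@dataclass
class Intent:
    description: str
    status: str  # "pending" | "active" | "resolved"

def correct_intent(queue, match, new_description):
    """Additive correction: an authorized party rewrites a not-yet-executed
    intent in place. The running task is never killed to apply the change."""
    for intent in queue:
        if intent.status in ("pending", "active") and match in intent.description:
            intent.description = new_description
            return intent
    return None  # nothing matched; no destructive side effects either way
```

Because the future plan is a system artifact rather than hidden state in a forward pass, the same queue can also be inspected or paused without touching the model at all.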

This is the connection between memory architecture and governance that existing frameworks do not make explicit. The four-tier taxonomy is not just about recall quality — it is about whether a persistent agent's behavior can be understood and corrected by the humans responsible for it.

Companion artifact (for illustrative purposes): The retrieval and injection patterns described in this paper — particularly the always-on background injection of semantic memory and the proactive surfacing of prospective items before their deadlines — can be encoded as operational instructions in a SKILL.md-style artifact. Such a companion document would specify Stage 2 inference-time behavior: how each tier is queried, filtered, and assembled into a context window. It is distinct from the memory generation architecture (covered in a separate paper) and from the tier definitions here. This separation of concerns — taxonomy, generation, and retrieval as distinct artifacts — is itself a design recommendation.

What This Implies for System Builders

Treat all four tiers as first-class storage concerns, not retrieval strategies. Each tier requires its own storage representation, expiry semantics, and conflict resolution logic. Systems that implement one flat memory store and rely on retrieval quality to sort out the rest will consistently drop prospective items — not because retrieval fails, but because prospective records have no natural similarity to the current query and will never surface via similarity search alone.

Prospective memory requires proactive injection, not reactive retrieval. The failure mode is not "user asks about a deadline and the agent can't find it." The failure mode is "the deadline expires and the agent never notices." Prospective items must be surfaced on a cadence, not on demand. This requires a background monitoring process, not a retrieval pipeline.

Source authority is a first-class provenance attribute. A formal commitment in a calendar invite carries different epistemic weight than a commitment mentioned casually in a chat message. Memory generation must track source authority, and the retrieval layer must calibrate confidence accordingly. Systems that treat all sources as equivalent produce memory entries whose reliability varies wildly, with no way to tell the reliable ones apart.
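One simple way to calibrate is to combine the extractor's own confidence with a weight for where the statement came from. The weights below are illustrative assumptions, not recommended values:

```python
# Illustrative authority weights; a real system would tune these
# and likely learn them per deployment.
SOURCE_AUTHORITY = {"calendar_invite": 0.95, "email": 0.75, "chat_message": 0.45}

def calibrated_confidence(extraction_confidence, source):
    """Downweight a memory entry by the epistemic authority of its source.
    Unknown sources get a conservative default weight."""
    return extraction_confidence * SOURCE_AUTHORITY.get(source, 0.3)
```

The same extracted statement then carries different confidence depending on provenance, which is what lets the retrieval layer rank a calendar commitment above a casual chat remark.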

The session boundary is the taxonomy stress test. A simple diagnostic: run the system across a session boundary and check which commitments, decisions, and intended future actions survive. If the answer is "only the ones the user explicitly re-states," the system has a prospective memory gap, regardless of how sophisticated its other memory tiers are.
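The diagnostic can be automated as a set difference over commitment descriptions captured before and after a simulated session boundary. A sketch, assuming commitments are identified by string description (a real harness would compare structured records):

```python
def session_boundary_diagnostic(before, after):
    """Compare commitments known before a session boundary with those the
    system can surface after it. A non-empty result is a prospective
    memory gap, regardless of how the other tiers perform."""
    return sorted(set(before) - set(after))
```

Run with `before` extracted from the transcript and `after` collected by querying the fresh session; anything that survives only because the user re-stated it should be excluded from `after`.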
