Agent Memory Architectures

The first wave of agent demos pretended memory was a single thing: stuff context in, retrieve relevant parts, done. Production agents need three distinct memory layers, each with its own storage, its own retrieval, and its own write rules.

The three layers

Working memory — what the agent has been doing in this task. Lives in the active context window. Cleared when the task ends. This is the easy one.

Episodic memory — what the agent (or this user) did in past tasks. Logs of prior interactions, decisions, outcomes. Stored in a queryable database, retrieved by recency + relevance. Persistent across sessions.

Semantic memory — what the agent knows about the domain. The knowledge base. Stored in a vector store or structured DB, retrieved by similarity. Shared across users and sessions. Closest to traditional RAG.

Each has different read frequency, write rules, and update cadence. Conflating them is the most common architecture mistake.

Working memory

Implementation: the active context window plus a structured scratchpad. The scratchpad is explicitly typed — current goal, intermediate results, open questions — not the full chat history.

Trim aggressively. Once the raw trace exceeds roughly 4k tokens, we summarize the older turns into the scratchpad. The full trace goes to the audit log, not back into the prompt.
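As a rough illustration of the trimming rule above, here is a minimal sketch of a typed scratchpad with a trace budget. The names `Scratchpad`, `WorkingMemory`, `approx_tokens`, and `naive_summarize` are hypothetical, the token count is a crude character-based estimate rather than a real tokenizer, and `naive_summarize` stands in for whatever LLM summarization call you would actually use.

```python
from dataclasses import dataclass, field


def approx_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def naive_summarize(previous_summary: str, turns: list) -> str:
    # Placeholder for an LLM summarization call: keeps the first 40 chars of each turn.
    return (previous_summary + " " + " | ".join(t[:40] for t in turns)).strip()


@dataclass
class Scratchpad:
    goal: str
    intermediate_results: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)
    summary: str = ""


@dataclass
class WorkingMemory:
    scratchpad: Scratchpad
    trace_budget: int = 4000  # tokens of raw trace kept before summarizing
    trace: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def add_turn(self, turn: str, summarize=naive_summarize) -> None:
        self.trace.append(turn)
        self.audit_log.append(turn)  # the full trace always goes to the audit log
        over_budget = sum(approx_tokens(t) for t in self.trace) > self.trace_budget
        if over_budget and len(self.trace) > 1:
            # Fold everything except the latest turn into the scratchpad summary.
            older, self.trace = self.trace[:-1], self.trace[-1:]
            self.scratchpad.summary = summarize(self.scratchpad.summary, older)
```

Only `scratchpad` and the remaining `trace` would be rendered into the prompt; `audit_log` is never read back by the agent.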

Episodic memory

This is where most teams get it wrong. They either (a) ignore it, so the agent forgets the user every session, or (b) dump every prior interaction into context, drowning the model.

The pattern that works:

Store each task as a structured event: {user, timestamp, intent, result, metadata}
At task start, retrieve the last N events for this user plus any events whose embedding is similar to the current request
Surface them as a short “context” block, not raw transcripts: “User asked about X yesterday, you answered Y, they followed up with Z”

Episodic memory is what makes an agent feel like it knows you. Without it, every session is amnesia.
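The three steps of that pattern can be sketched as follows. This is an in-memory illustration under stated assumptions, not a production store: `Event` mirrors the `{user, timestamp, intent, result, metadata}` shape from the text, while the `embedding` field, the `min_sim` threshold, and the function names are hypothetical additions for the example.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    user: str
    timestamp: datetime
    intent: str
    result: str
    metadata: dict = field(default_factory=dict)
    embedding: list = field(default_factory=list)  # assumed to be precomputed


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_events(events, user, query_embedding, last_n=3, top_k=3, min_sim=0.5):
    mine = [e for e in events if e.user == user]  # never look at other users' events
    recent = sorted(mine, key=lambda e: e.timestamp, reverse=True)[:last_n]
    scored = sorted(mine, key=lambda e: cosine(e.embedding, query_embedding), reverse=True)
    similar = [e for e in scored[:top_k] if cosine(e.embedding, query_embedding) >= min_sim]
    # Merge both sets without duplicates, oldest first so the block reads chronologically.
    merged = {id(e): e for e in recent + similar}
    return sorted(merged.values(), key=lambda e: e.timestamp)


def render_context(events) -> str:
    # A short summary block, not raw transcripts.
    if not events:
        return ""
    lines = [
        f"- {e.timestamp:%Y-%m-%d}: user asked about {e.intent}; outcome: {e.result}"
        for e in events
    ]
    return "Prior interactions:\n" + "\n".join(lines)
```

In a real deployment the filtering and similarity search would run in the database rather than in Python; the point here is the shape of the data and of the resulting context block.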

Semantic memory

The knowledge base. Domain documents, internal wikis, product catalogs, policy documents. Standard RAG plumbing — see our vector store comparison.

Two things distinguish good semantic memory:

Chunking that matches retrieval intent. If users ask “how do I X”, chunks should be self-contained answers, not arbitrary 500-token slices. Test retrieval on the questions you actually expect.

Update discipline. Stale knowledge is worse than no knowledge. Either rebuild the index on every doc change, or track doc-version IDs so retrieval can include “this was updated 3 days ago” signals.
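One way to picture the doc-version approach is sketched below. It assumes the version ID is a content hash taken at index time and that each chunk carries its source document's last-update timestamp; `Chunk`, `doc_version`, `is_stale`, and `with_freshness` are hypothetical names for this example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Chunk:
    doc_id: str
    doc_version: str      # content hash of the source document at index time
    updated_at: datetime  # when the source document last changed
    text: str


def doc_version(content: str) -> str:
    # Assumption: a truncated SHA-256 of the document body is a good enough version ID.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]


def is_stale(chunk: Chunk, current_content: str) -> bool:
    # True when the document has changed since this chunk was indexed.
    return chunk.doc_version != doc_version(current_content)


def with_freshness(chunk: Chunk, now: datetime) -> str:
    # Prefix the chunk with an age signal the model can weigh.
    days = (now - chunk.updated_at).days
    return f"[{chunk.doc_id}, updated {days} day(s) ago]\n{chunk.text}"
```

A nightly job could call `is_stale` per document and re-embed only the chunks whose source changed, which is cheaper than rebuilding the whole index on every edit.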

When to write to memory

Working memory: continuously, structured.

Episodic memory: at task completion. Write the outcome, the user’s apparent intent, and any decisions worth replaying. Don’t write the full transcript.

Semantic memory: rarely. Domain knowledge updates on a schedule, not per-task. Letting an agent write to semantic memory based on user inputs is how you poison your own knowledge base.
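These write rules are simple enough to enforce in code rather than by convention. The sketch below is one possible guard, assuming writes come either from the agent or from a scheduled ingestion job; the `Layer` and `Writer` enums and both functions are hypothetical.

```python
from enum import Enum


class Layer(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"


class Writer(Enum):
    AGENT = "agent"
    SCHEDULED_JOB = "scheduled_job"


def may_write(layer: Layer, writer: Writer, task_complete: bool = False) -> bool:
    if layer is Layer.WORKING:
        return writer is Writer.AGENT                    # continuously, during the task
    if layer is Layer.EPISODIC:
        return writer is Writer.AGENT and task_complete  # only at task completion
    return writer is Writer.SCHEDULED_JOB                # semantic: never from the agent


def episodic_record(intent: str, outcome: str, decisions: list) -> dict:
    # Outcome, apparent intent, and decisions worth replaying. No transcript field.
    return {"intent": intent, "outcome": outcome, "decisions": list(decisions)}
```

Calling `may_write` at the storage boundary means a prompt-injected "remember this fact" cannot reach the shared knowledge base, whatever the model decides to do.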

The cross-layer retrieval pattern

At task start, our default retrieval looks like:

Pull last N episodic events for user (recency)
Pull top-K episodic events by embedding similarity to current request
Pull top-K semantic chunks by embedding similarity
Construct the prompt with: system instructions → semantic context → episodic context → current request

Each section is clearly demarcated so the model treats episodic (“what happened before”) differently from semantic (“what is true in the world”).
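A minimal sketch of that prompt assembly follows. The section titles and the `###` delimiter are assumptions chosen for the example; any consistent, unambiguous demarcation would serve the same purpose.

```python
def build_prompt(system: str, semantic_chunks: list, episodic_lines: list, request: str) -> str:
    # Order follows the text: system, then semantic, then episodic, then the request.
    sections = [
        ("SYSTEM INSTRUCTIONS", system),
        ("REFERENCE KNOWLEDGE (what is true in the world)", "\n".join(semantic_chunks)),
        ("PRIOR INTERACTIONS (what happened before)", "\n".join(episodic_lines)),
        ("CURRENT REQUEST", request),
    ]
    # Empty sections are dropped so the model never sees a bare header.
    return "\n\n".join(f"### {title}\n{body}" for title, body in sections if body)


prompt = build_prompt(
    system="You are a support agent.",
    semantic_chunks=["Refunds are available within 30 days of purchase."],
    episodic_lines=["- 2024-01-01: user asked about billing; outcome: explained invoice"],
    request="Can I still get a refund?",
)
```

Keeping the assembly in one function also gives a single place to enforce per-section token budgets later.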

Where it goes wrong

Mixing layers. Storing semantic knowledge in episodic logs. Storing user-specific events in the shared knowledge base. Both poison retrieval.

Unbounded growth. Episodic memory that retains every interaction forever. Either summarize old events into a higher-level “user profile” object, or expire by policy.

No privacy boundary. One user’s episodic memory leaking into another user’s retrieval. Always partition by user ID at the retrieval layer.

Vector-store hammer. Not everything belongs in a vector store. User profile attributes, current task state, structured policy — these are Postgres rows, not embeddings.
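For the unbounded-growth problem, the "summarize or expire" choice can be sketched as a compaction pass. This is an illustration only: the 90-day window is an arbitrary assumption, events are plain dicts with `user`, `timestamp`, and `intent` keys, and the "profile" here is just a per-user count of past intents rather than a real summary.

```python
from collections import Counter
from datetime import datetime, timedelta


def compact(events: list, now: datetime, max_age: timedelta = timedelta(days=90)):
    """Return (events kept verbatim, per-user profiles built from expired events)."""
    kept, profiles = [], {}
    for e in events:
        if now - e["timestamp"] <= max_age:
            kept.append(e)
        else:
            # Profiles are keyed by user, so compaction never mixes users together.
            profiles.setdefault(e["user"], Counter())[e["intent"]] += 1
    return kept, profiles
```

The resulting profile is structured data, so it fits in an ordinary database row next to the user record rather than in the vector store.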

What we ship by default

For agent engagements via our AI & LLM integration service:

Working memory in a typed scratchpad (~3k token budget, summarized aggressively)
Episodic memory in Postgres with embeddings via pgvector
Semantic memory in pgvector or Pinecone depending on scale
User-partitioned retrieval, audit-logged
Explicit write rules per layer; agents don’t write to semantic memory

Memory isn’t a feature — it’s an architecture. Get it right early.

An agent without memory feels like a chatbot. An agent with the right memory feels like a coworker. Our team installs production-grade memory across enterprise agents. Tell us about the workflow.
