LLMs are stateless. Agents aren’t. Here’s what sits in between.

Table of contents

  1. Introduction
  2. Why memory matters
  3. Taxonomy of agent memory
  4. Short-term memory: the context window
  5. Long-term memory: persistence beyond a session
  6. Episodic memory: what happened
  7. Semantic memory: structured knowledge
  8. Procedural memory: learned behaviors
  9. Memory storage backends
  10. Retrieval strategies
  11. Memory management: writing, updating, and forgetting
  12. Architectural patterns in practice
  13. Challenges and open problems
  14. Wrapping up

Introduction

LLMs are stateless by design. Each API call is independent — the model has no mechanism to remember what happened in a previous request. But somehow, the agents built on top of these models maintain context across long conversations, recall user preferences from weeks ago, and build on lessons from past tasks.

The gap is filled by external memory systems — architectures layered on top of LLMs that handle storage, retrieval, and reasoning over information across time. I want to dig into how these systems actually work under the hood.

Why memory matters

Without memory, an AI agent is limited to what fits in a single prompt. Every interaction starts from zero. The agent can’t reference prior conversations, user preferences disappear between sessions, and the same mistakes get repeated because there’s no mechanism to learn from them. Even within a single session, long tasks can blow past the context window (128K or 200K tokens), and early information just gets dropped.

Memory is what turns a stateless LLM into something that can do sustained work over time.

Taxonomy of agent memory

Agent memory systems borrow from cognitive science. The most common taxonomy breaks memory into these categories:

                        Agent Memory
                             |
            +----------------+----------------+
            |                |                |
       Short-Term       Long-Term        Working
       (In-Context)    (Persistent)      (Scratchpad)
                             |
              +--------------+--------------+
              |              |              |
          Episodic       Semantic      Procedural
       (Experiences)   (Knowledge)    (Skills/Plans)

Production agent systems typically combine several of these.

Short-term memory: the context window

How it works

The simplest form of agent memory is the conversation history passed to the LLM in each request. This is sometimes called “in-context memory.”

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Nice to meet you, Alex!"},
    {"role": "user", "content": "What is my name?"},
    # The model can answer "Alex" because the prior turn is in context.
]

Characteristics of in-context memory

Property       Detail
-------------  ---------------------------------------------
Capacity       Fixed (context window size, e.g. 200K tokens)
Latency        Zero retrieval cost — already in prompt
Persistence    Session-scoped; lost when context is cleared
Fidelity       Perfect recall of everything in the window

Context window management

When conversations exceed the context window, agents must decide what to keep. Common strategies include:

  • Sliding window: drop the oldest messages, keep the most recent N tokens.
  • Summarization: compress older messages into a summary that consumes fewer tokens.
  • Selective retention: use heuristics or a secondary model to score message importance and keep high-value turns.

# Simplified summarization-based compression
def compress_context(messages, max_tokens):
    while count_tokens(messages) > max_tokens:
        if len(messages) <= 2:
            break  # nothing left to compress beyond the system prompt and summary
        oldest_chunk = messages[1:5]  # skip the system prompt at index 0
        summary = llm.summarize(oldest_chunk)
        messages = [messages[0]] + [{"role": "system", "content": summary}] + messages[5:]
    return messages

Working memory (scratchpad)

A variant of short-term memory is the scratchpad — a dedicated section of the prompt where the agent can write intermediate reasoning, plans, or partial results. Frameworks like ReAct and chain-of-thought prompting use this implicitly. Some systems make it explicit:

<scratchpad>
- User wants to refactor the auth module
- I found 3 files: auth.py, middleware.py, tokens.py
- auth.py has a circular import with middleware.py
- Plan: extract shared types into a new types.py
</scratchpad>
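
One way to make this explicit in code is to keep the scratchpad as its own mutable block that the agent rewrites as it works and that gets re-rendered into the prompt each turn. A minimal sketch; the Scratchpad class and its tag format are illustrative, not any particular framework's API:

class Scratchpad:
    """A mutable block of working notes re-injected into the prompt each turn."""

    def __init__(self):
        self.notes: list[str] = []

    def write(self, note: str) -> None:
        # The agent appends intermediate findings or plan steps as it works.
        self.notes.append(note)

    def render(self) -> str:
        # Rendered as a tagged block so the model can tell it apart
        # from ordinary conversation history.
        body = "\n".join(f"- {note}" for note in self.notes)
        return f"<scratchpad>\n{body}\n</scratchpad>"

pad = Scratchpad()
pad.write("User wants to refactor the auth module")
pad.write("Plan: extract shared types into a new types.py")
prompt = pad.render() + "\n\nUser: where should the shared types live?"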

Long-term memory: persistence beyond a session

Long-term memory is any mechanism that lets an agent retain information across separate sessions or conversations. This is where things get architecturally interesting.

The core loop

Every long-term memory system follows a variation of this loop:

              +------------------+
              |   Agent Action   |
              +--------+---------+
                       |
              +--------v---------+
              |  Memory Writer   |  <-- Decides what to store
              +--------+---------+
                       |
              +--------v---------+
              |  Memory Store    |  <-- Vector DB, file system, graph DB
              +--------+---------+
                       |
              +--------v---------+
              |  Memory Retriever|  <-- Fetches relevant memories
              +--------+---------+
                       |
              +--------v---------+
              |  Context Builder |  <-- Injects memories into prompt
              +------------------+

The agent generates new information during a task. A memory writer module decides what is worth persisting. The data is stored in a memory store. On future requests, a retriever fetches relevant memories and a context builder injects them into the LLM prompt.
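
Wired together, the loop can be sketched in a few lines. Everything below is a toy stand-in: the keyword-overlap search is a placeholder for real embedding similarity, and the writer policy just stores everything.

from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    """Toy in-memory store; production systems use a vector, graph, or relational DB."""
    records: list[dict] = field(default_factory=list)

    def save(self, record: dict) -> None:
        self.records.append(record)

    def search(self, query: str, top_k: int = 3) -> list[dict]:
        # Naive keyword overlap stands in for embedding similarity.
        words = query.lower().split()
        scored = [(sum(w in r["content"].lower() for w in words), r) for r in self.records]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [r for score, r in ranked[:top_k] if score > 0]

def memory_turn(store: MemoryStore, user_message: str, last_agent_output: str) -> str:
    # Memory writer: decide what to persist (here, naively, everything).
    store.save({"content": last_agent_output})

    # Memory retriever: fetch memories relevant to the new request.
    relevant = store.search(user_message)

    # Context builder: inject retrieved memories into the next prompt.
    memory_block = "\n".join(f"- {m['content']}" for m in relevant)
    return f"Relevant memories:\n{memory_block}\n\nUser: {user_message}"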

Episodic memory: what happened

Episodic memory stores records of specific events or interactions. Think of it as the agent’s autobiography — what happened, when, and how it turned out.

What an episodic memory entry looks like

A typical entry looks like this:

{
  "id": "mem_abc123",
  "timestamp": "2026-03-08T14:32:00Z",
  "session_id": "sess_xyz",
  "event_type": "task_completed",
  "content": "Successfully deployed the user's Next.js app to Vercel after fixing a build error in next.config.js related to incorrect output directory.",
  "entities": ["Next.js", "Vercel", "next.config.js"],
  "outcome": "success",
  "embedding": [0.0123, -0.0456, ...]  // vector representation
}

When this matters

  • Avoiding repeated mistakes: “Last time I tried approach X on this codebase, it failed because of Y.”
  • Building on past work: “In a previous session, I set up the database schema. Let me retrieve those details.”
  • Learning from corrections: “The user corrected my formatting style three times — they prefer single quotes.”

Implementation

class EpisodicMemory:
    def __init__(self, vector_store, embedding_model):
        self.store = vector_store
        self.embed = embedding_model

    def record(self, event: str, metadata: dict):
        vector = self.embed(event)
        self.store.upsert(
            id=generate_id(),
            vector=vector,
            payload={"content": event, "timestamp": now(), **metadata}
        )

    def recall(self, query: str, top_k: int = 5) -> list:
        query_vector = self.embed(query)
        results = self.store.search(query_vector, top_k=top_k)
        return [r.payload for r in results]

Semantic memory: structured knowledge

Semantic memory stores factual knowledge and relationships. If episodic memory is “what happened,” semantic memory is “what I know.”

Representations it can take

It can take multiple forms:

1. Key-value facts

user_preferences:
  language: Python
  framework: FastAPI
  formatting: black
  test_runner: pytest

project_context:
  name: "acme-api"
  database: PostgreSQL
  deployment: AWS ECS

2. Knowledge graphs

(User) --[prefers]--> (Python)
(Project) --[uses]--> (PostgreSQL)
(PostgreSQL) --[hosted_on]--> (AWS RDS)
(auth_module) --[depends_on]--> (jwt_library)

3. Embedding-indexed documents

Chunks of documentation, code, or notes stored with vector embeddings for semantic search.
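
The indexing step follows the same pattern as the episodic example above: chunk, embed, upsert. A rough sketch, where the chunk sizes are arbitrary and vector_store / embedding_model stand for whatever backend and model you actually use:

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    # Fixed-size character windows with overlap, so a fact that spans a
    # boundary still lands whole in at least one chunk.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def index_document(doc_id: str, text: str, vector_store, embedding_model) -> None:
    for i, chunk in enumerate(chunk_text(text)):
        vector_store.upsert(
            id=f"{doc_id}_{i}",
            vector=embedding_model(chunk),
            payload={"doc_id": doc_id, "chunk_index": i, "content": chunk},
        )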

Knowledge graph implementation

class SemanticMemory:
    def __init__(self, graph_db):
        self.graph = graph_db

    def store_fact(self, subject: str, predicate: str, obj: str, confidence: float = 1.0):
        self.graph.add_edge(
            subject, obj,
            relation=predicate,
            confidence=confidence,
            updated_at=now()
        )

    def query(self, subject: str, predicate: str = None) -> list:
        edges = self.graph.get_edges(subject)
        if predicate:
            edges = [e for e in edges if e.relation == predicate]
        return edges

    def traverse(self, start: str, max_depth: int = 3) -> dict:
        """Multi-hop reasoning: follow relationships."""
        visited = {}
        queue = [(start, 0)]
        while queue:
            node, depth = queue.pop(0)
            if depth > max_depth or node in visited:
                continue
            visited[node] = self.graph.get_edges(node)
            for edge in visited[node]:
                queue.append((edge.target, depth + 1))
        return visited

Procedural memory: learned behaviors

Procedural memory encodes how to do things — successful strategies, plans, tool usage patterns, and workflows that the agent has picked up over time.

What these look like

- pattern: "database_migration"
  description: "When the user asks to change a database schema"
  steps:
    - "Read the current migration files to understand the schema"
    - "Generate a new migration using alembic revision --autogenerate"
    - "Review the generated migration for correctness"
    - "Apply with alembic upgrade head"
    - "Run tests to verify"
  learned_from: ["sess_001", "sess_014", "sess_022"]
  success_rate: 0.95

- pattern: "debug_test_failure"
  description: "When a test fails after code changes"
  steps:
    - "Read the full test error output"
    - "Identify the failing assertion"
    - "Trace the code path from test to implementation"
    - "Check for recent changes in the relevant files using git diff"
  learned_from: ["sess_003", "sess_008"]
  success_rate: 0.88

How procedural memories form

  1. Explicit instruction: The user tells the agent “always run linting before committing.” The agent stores this as a procedural rule.
  2. Reinforcement from outcomes: The agent tries two approaches. Approach A fails; approach B succeeds. The agent records approach B as the preferred procedure.
  3. Distillation from episodes: After many episodic memories involving similar tasks, a background process extracts common patterns into procedural memory.
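
The third path, distillation, usually runs offline as a batch job. A rough sketch, assuming an llm.generate helper like the ones used elsewhere in this post, episodes tagged with a task_type at write time, and a minimum cluster size that is entirely arbitrary:

def distill_procedures(episodes: list[dict], llm, min_cluster_size: int = 3) -> list[dict]:
    # Group episodes by a coarse task label attached when they were recorded.
    by_task: dict[str, list[dict]] = {}
    for ep in episodes:
        by_task.setdefault(ep.get("task_type", "unknown"), []).append(ep)

    procedures = []
    for task_type, group in by_task.items():
        successes = [ep for ep in group if ep.get("outcome") == "success"]
        if len(successes) < min_cluster_size:
            continue  # not enough evidence to generalize a procedure

        steps = llm.generate(
            f"Extract a reusable, step-by-step procedure for '{task_type}' "
            "from these successful episodes:\n" +
            "\n---\n".join(ep["content"] for ep in successes)
        )
        procedures.append({
            "pattern": task_type,
            "steps": steps,
            "learned_from": [ep["id"] for ep in successes],
            "success_rate": len(successes) / len(group),
        })
    return procedures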

Memory storage backends

The right storage backend depends on what kind of memory you’re storing and how you need to access it.

Vector databases

Best for episodic memory and semantic search over documents.

Vector databases store high-dimensional embeddings and support approximate nearest neighbor (ANN) search. When the agent needs to find memories similar to a query, it embeds the query and retrieves the closest vectors.

Query: "How did I fix the CORS error last time?"
  --> embed() --> [0.12, -0.34, 0.56, ...]
  --> ANN search in vector DB
  --> Returns: memory about configuring CORS headers in Express middleware

Common choices: Pinecone, Weaviate, Qdrant, ChromaDB, pgvector.

The main trade-off: great for fuzzy/semantic matching, poor for exact lookups or structured queries. And embedding quality directly impacts retrieval quality, which is easy to underestimate.

Relational databases

Best for structured facts, user profiles, session metadata.

When memory has a clear schema — user preferences, project configuration, entity relationships — a relational database does exactly what you’d expect.

SELECT preference_value
FROM user_preferences
WHERE user_id = 'u_123' AND preference_key = 'language';

Graph databases

Best for knowledge graphs, entity relationships, multi-hop reasoning.

Graph databases (Neo4j, Amazon Neptune) shine when the agent needs to traverse relationships.

// Find all dependencies of the auth module, two levels deep
MATCH (m:Module {name: "auth"})-[:DEPENDS_ON*1..2]->(dep)
RETURN dep.name, dep.version

File systems

Best for simple persistent memory, human-readable storage.

Some agent frameworks just store memory as plain files — Markdown, YAML, JSON. It’s simple, you can read it yourself, and you can version-control it.

~/.agent/memory/
  MEMORY.md          # High-level summary, always loaded
  preferences.md     # User preferences
  project-notes.md   # Project-specific context
  patterns.md        # Recurring patterns and solutions

The downside: no semantic search without additional indexing, and it scales poorly once the memory store gets large.

Hybrid approaches

Production systems often combine multiple backends:

+-------------------+     +-------------------+
|   Vector Store    |     |   Relational DB   |
| (Episodic recall) |     | (Structured facts)|
+--------+----------+     +--------+----------+
         |                          |
         +------------+-------------+
                      |
              +-------v--------+
              | Memory Router  |  <-- Decides which store to query
              +-------+--------+
                      |
              +-------v--------+
              | Context Builder|
              +----------------+
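
The router itself can start as nothing more than a few rules over the query. A minimal sketch; the keyword heuristics and the sql_store.get_preferences call are placeholders for whatever classification logic and schema you actually have:

class MemoryRouter:
    def __init__(self, vector_store, sql_store):
        self.vector_store = vector_store
        self.sql_store = sql_store

    def retrieve(self, query: str, user_id: str) -> list[dict]:
        results = []

        # Exact, schema-shaped lookups go to the relational store.
        if any(kw in query.lower() for kw in ("preference", "setting", "config")):
            results.extend(self.sql_store.get_preferences(user_id))

        # Fuzzy recall of past events goes through vector search.
        results.extend(self.vector_store.search(query, top_k=5))
        return results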

Retrieval strategies

Storing memories is the easy part. The hard part is getting the right ones back at the right time.

1. Similarity-based retrieval

The most common approach: embed the current query and find the nearest neighbors in the vector store.

def retrieve_by_similarity(query: str, top_k: int = 5):
    query_embedding = embed(query)
    return vector_store.search(query_embedding, top_k=top_k)

The limitation is obvious: semantic similarity doesn’t always equal relevance. A memory might be semantically close but contextually irrelevant.

2. Recency-weighted retrieval

Combine similarity with a time-decay factor so that recent memories are preferred.

def retrieve_with_recency(query: str, top_k: int = 5, decay_rate: float = 0.995):
    candidates = vector_store.search(embed(query), top_k=top_k * 3)
    for c in candidates:
        hours_ago = (now() - c.timestamp).total_seconds() / 3600
        c.score = c.similarity * (decay_rate ** hours_ago)
    candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:top_k]

3. Importance-based retrieval

Assign an importance score to each memory at write time. High-importance memories (e.g., user corrections, critical errors) get retrieval priority.

importance_heuristics = {
    "user_correction": 1.0,    # User explicitly corrected the agent
    "task_failure": 0.9,       # A task failed -- lesson learned
    "task_success": 0.5,       # Routine success
    "observation": 0.3,        # General observation
}
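
Blending that score with similarity at retrieval time mirrors the recency example above. A sketch, with an arbitrary 40/60 weighting:

def retrieve_with_importance(query: str, top_k: int = 5, importance_weight: float = 0.4):
    # Over-fetch, then re-score by blending similarity with the
    # importance assigned when the memory was written.
    candidates = vector_store.search(embed(query), top_k=top_k * 3)
    for c in candidates:
        importance = c.payload.get("importance", 0.3)
        c.score = (1 - importance_weight) * c.similarity + importance_weight * importance
    candidates.sort(key=lambda c: c.score, reverse=True)
    return candidates[:top_k]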

4. Generative retrieval (self-query)

The agent uses the LLM itself to formulate a retrieval query. Instead of directly embedding the user’s message, the agent first reasons about what information it needs.

User: "Can you set up the CI pipeline?"

Agent thinks: "I should check if I have any memories about:
  1. This project's existing CI configuration
  2. The user's preferred CI platform
  3. Past CI setups I've done for this user"

--> Generates 3 targeted queries for the memory store

def generative_retrieve(user_message: str, context: str):
    # Assumes llm.generate returns a list of query strings here.
    retrieval_queries = llm.generate(
        f"Given this message: '{user_message}' and context: '{context}', "
        f"generate 3 specific queries to search my memory store for relevant information."
    )
    results = []
    for query in retrieval_queries:
        results.extend(vector_store.search(embed(query), top_k=3))
    return deduplicate(results)

5. Hybrid retrieval (RAG fusion)

Combine multiple retrieval strategies and merge the results using reciprocal rank fusion.

from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list], k: int = 60) -> list:
    scores = defaultdict(float)
    for result_list in result_lists:
        for rank, item in enumerate(result_list):
            scores[item.id] += 1.0 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

Memory management: writing, updating, and forgetting

What to remember

Not everything should be stored. Saving everything creates noise that makes retrieval worse. Here’s a typical filter:

def should_memorize(event: dict) -> bool:
    # Always remember explicit user instructions
    if event["type"] == "user_instruction":
        return True

    # Always remember corrections
    if event["type"] == "user_correction":
        return True

    # Remember task outcomes
    if event["type"] in ("task_success", "task_failure"):
        return True

    # Filter out routine exchanges
    if event["type"] == "chitchat":
        return False

    # Use LLM judgment for ambiguous cases
    return llm.judge(f"Is this worth remembering long-term? {event['content']}")

Memory consolidation

Borrowing from how human memory consolidation works during sleep, agents can run periodic background processes to merge duplicate memories, extract patterns from repeated episodes into procedural memory, adjust confidence scores based on confirmation or contradiction, and prune anything stale or low-value.

def consolidate_memories(memories: list):
    # Cluster similar memories
    clusters = cluster_by_similarity(memories, threshold=0.9)

    for cluster in clusters:
        if len(cluster) > 1:
            # Merge into a single consolidated memory
            merged = llm.summarize([m.content for m in cluster])
            memory_store.upsert(merged)
            for m in cluster:
                memory_store.delete(m.id)

    # Prune old, low-importance memories
    stale = memory_store.query(
        filter={"importance": {"$lt": 0.3}, "age_days": {"$gt": 90}}
    )
    for m in stale:
        memory_store.delete(m.id)

Forgetting on purpose

Deliberate forgetting matters as much as remembering. A user says “forget my API key” — that needs to actually work. When new facts contradict old ones, something has to give. As the memory store grows, low-value memories need eviction. And PII creates real retention obligations that you can’t hand-wave away.
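
In code, that usually means two operations: a hard delete for explicit requests, and a supersede path for contradictions that keeps an audit trail instead of silently rewriting history. A sketch; the store methods are assumptions in the spirit of the earlier examples, not any particular product's API:

def forget(memory_store, criteria: dict) -> int:
    """Hard-delete every memory matching the criteria, e.g. {"contains": "API key"}."""
    matches = memory_store.query(filter=criteria)
    for m in matches:
        memory_store.delete(m.id)
    return len(matches)

def supersede(memory_store, old_id: str, new_content: str) -> None:
    # Mark the outdated fact inactive and link the replacement to it, so retrieval
    # can prefer the new fact while audits can still see what it replaced.
    memory_store.update(old_id, {"active": False})
    memory_store.upsert(
        id=generate_id(),
        payload={"content": new_content, "supersedes": old_id, "active": True},
    )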

Architectural patterns in practice

Pattern 1: MemGPT / self-managed memory

Inspired by operating system virtual memory, the MemGPT architecture gives the agent explicit tools to manage its own memory:

+--------------------------------------------+
|             LLM (Main Context)             |
|  +---------------+  +-------------------+  |
|  | System Prompt |  | Working Context   |  |
|  +---------------+  +-------------------+  |
|                                            |
|  Available Tools:                          |
|  - core_memory_save(key, value)            |
|  - core_memory_replace(key, old, new)      |
|  - archival_memory_insert(content)         |
|  - archival_memory_search(query)           |
|  - conversation_search(query)              |
+--------------------------------------------+
           |                    ^
           v                    |
  +------------------+  +------------------+
  | Archival Memory  |  | Recall Storage   |
  | (Vector DB)      |  | (Conversation DB)|
  +------------------+  +------------------+

The agent decides when and what to store, and actively retrieves information when needed. The catch: it only works if the LLM reliably uses its memory tools, and in practice, that reliability is… uneven.
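
In tool-calling terms, those memory operations are just functions exposed to the model. A stripped-down sketch of two of them, using an in-memory dict and list as stand-ins for the real core and archival stores (this is not MemGPT's actual implementation):

# Core memory: a small, always-in-context block the agent can edit directly.
core_memory: dict[str, str] = {"user_name": "Alex", "persona": "helpful coding assistant"}

# Stand-in for the vector-backed archival store.
archival_memory: list[str] = []

def core_memory_replace(key: str, old: str, new: str) -> str:
    """Tool: edit a field of core memory in place."""
    if key not in core_memory or old not in core_memory[key]:
        return f"Error: '{old}' not found in core memory field '{key}'."
    core_memory[key] = core_memory[key].replace(old, new)
    return f"Updated {key}."

def archival_memory_insert(content: str) -> str:
    """Tool: push information out of the context window into archival storage."""
    archival_memory.append(content)
    return "Stored."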

Pattern 2: RAG-based memory

This is probably the most common pattern in production today. Every user message triggers an automatic retrieval step before the LLM generates a response.

User Message
     |
     v
+-----------+
| Retriever | --> Query memory store with user message
+-----------+
     |
     v
+-----------+
|  Ranker   | --> Re-rank and filter retrieved memories
+-----------+
     |
     v
+-----------+
|  Prompt   | --> Inject top memories into system/context
|  Builder  |
+-----------+
     |
     v
+-----------+
|    LLM    | --> Generate response with memory-augmented context
+-----------+

Simpler to implement, and it doesn’t depend on the LLM remembering to use memory tools. The downside: it retrieves memories on every turn whether you need them or not, which can introduce noise.
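
The per-turn wiring is short. A sketch, with retriever, ranker, and llm as placeholder components and an arbitrary budget of five memories:

def answer_with_memory(user_message: str, retriever, ranker, llm, memory_budget: int = 5) -> str:
    # 1. Retrieve candidates using the raw user message as the query.
    candidates = retriever.search(user_message, top_k=memory_budget * 3)

    # 2. Re-rank and keep only what fits the budget.
    memories = ranker.rank(user_message, candidates)[:memory_budget]

    # 3. Build a prompt with a clearly delimited memory section.
    memory_block = "\n".join(f"- {m['content']}" for m in memories)
    prompt = (
        "You are a helpful assistant.\n\n"
        f"Relevant memories:\n{memory_block}\n\n"
        f"User: {user_message}"
    )

    # 4. Generate the response with memory-augmented context.
    return llm.generate(prompt)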

Pattern 3: Reflection-based memory

After completing a task or conversation, the agent runs a reflection step to extract and store lessons learned.

def post_task_reflection(conversation: list, outcome: str):
    reflection_prompt = f"""
    Review this completed task conversation and extract:
    1. Key decisions made and their outcomes
    2. User preferences observed
    3. Mistakes to avoid in the future
    4. Reusable patterns or procedures

    Conversation: {conversation}
    Outcome: {outcome}
    """
    insights = llm.generate(reflection_prompt)
    for insight in parse_insights(insights):
        memory_store.save(insight)

This tends to produce higher-quality memories. The agent has full context about what happened and can reason about what actually mattered, rather than trying to decide in the moment.

Pattern 4: Hierarchical memory

Organize memories at multiple levels of abstraction:

Level 0 (Core):     Always in context. User identity, critical rules.
                    Size: ~500 tokens. Updated rarely.

Level 1 (Project):  Loaded per-project. Architecture, conventions, key files.
                    Size: ~2000 tokens. Updated per session.

Level 2 (Episodic): Retrieved on demand. Past interactions, decisions.
                    Size: Unbounded. Searched via embeddings.

Level 3 (Archive):  Compressed old memories. Rarely accessed.
                    Size: Unbounded. Searched only as fallback.

class HierarchicalMemory:
    def build_context(self, query: str, project_id: str):
        context = []

        # Level 0: Always included
        context.append(self.core_memory.load())

        # Level 1: Project-specific
        context.append(self.project_memory.load(project_id))

        # Level 2: Relevant episodes
        episodes = self.episodic_memory.search(query, top_k=5)
        context.extend(episodes)

        # Level 3: Archive (only if Level 2 results are insufficient)
        if self.needs_deeper_search(episodes, query):
            archive_results = self.archive_memory.search(query, top_k=3)
            context.extend(archive_results)

        return context

Challenges and open problems

1. Memory staleness

Facts change over time. A user might switch from PostgreSQL to MySQL, or a project might be restructured. Stale memories can cause the agent to make incorrect assumptions.

Some mitigations: timestamp everything, prefer recent memories in retrieval, and periodically prompt the agent to verify old facts.

2. Retrieval precision vs. recall

Retrieving too few memories risks missing critical context. Retrieving too many wastes context window tokens and introduces noise.

Some mitigations: adaptive retrieval budgets, re-ranking with cross-encoders, and letting the agent request more context if initial results look thin.
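
One simple version of an adaptive budget: over-fetch once, then keep only what clears a relevance floor, so weak matches never spend context tokens. A sketch in the style of the earlier retrieval examples, with the threshold being an arbitrary assumption:

def adaptive_budget_retrieve(query: str, max_budget: int = 8, similarity_floor: float = 0.7):
    # Over-fetch, then inject only memories that clear the relevance floor,
    # capped by the budget -- weak matches are noise, not context.
    candidates = vector_store.search(embed(query), top_k=max_budget * 2)
    return [c for c in candidates if c.similarity >= similarity_floor][:max_budget]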

3. Memory poisoning

If an agent learns from unvalidated external sources, adversarial content could be injected into its memory and cause persistent misbehavior. This is a real threat, not a hypothetical one.

Some mitigations: provenance tracking (tag memories with their source), trust levels, and periodic memory audits.
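
Provenance tracking at write time can be lightweight. A sketch, with the source labels and trust values as arbitrary examples:

TRUST_LEVELS = {
    "user_message": 0.9,       # direct input from the authenticated user
    "agent_observation": 0.7,  # something the agent verified itself
    "web_content": 0.3,        # unvalidated external source
}

def record_with_provenance(memory_store, content: str, source: str, source_url: str | None = None):
    memory_store.save({
        "content": content,
        "source": source,
        "source_url": source_url,
        "trust": TRUST_LEVELS.get(source, 0.1),  # unknown sources get minimal trust
    })

# At retrieval time, low-trust memories can be excluded from sensitive decisions
# or surfaced for review during periodic memory audits.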

4. Privacy and compliance

Persistent memory creates real data retention obligations. Memories may contain PII, credentials, or sensitive business information.

The usual defenses apply: encryption at rest, access controls, automated PII detection, configurable retention policies, right-to-deletion APIs. But the surface area is larger than people tend to assume.

5. Evaluation is still messy

How do you measure whether an agent’s memory system is working well? Nobody has a great answer yet. Some metrics people are trying: whether the agent retrieves the right memory for a given query (recall accuracy), how often retrieved memories actually influence the response (utilization rate), what fraction of retrieved memories are outdated (staleness), and how often stored memories conflict with each other (contradictions). But good benchmarks are still scarce.

6. Scaling

As memory stores grow to millions of entries, retrieval latency and relevance both degrade. Hierarchical indexing, memory consolidation, and tiered storage become necessary.

Wrapping up

Memory is what turns a stateless language model into something that can actually persist and improve. The ideas come from all over — cognitive science gives us the episodic/semantic/procedural taxonomy, database systems give us vector search and graph traversal, operating systems give us the virtual memory analogy.

If you’re building an agent memory system, the decisions that matter most:

What to store. Not everything is worth remembering. Filter aggressively or you’ll drown in noise.

How to store it. Match the backend to the access pattern. Vectors for semantic search, structured DBs for exact lookups, graphs for relationships.

When and how to retrieve. This has the biggest impact on whether memory actually helps. Pure similarity search is a starting point, but combining it with recency and importance signals makes a real difference.

How to maintain it. Memory isn’t write-once. You need consolidation, conflict resolution, and pruning, or quality degrades over time.

How to forget. Sounds paradoxical, but deliberate forgetting is a feature, not a bug. For both quality and privacy.

In practice, the systems that work best are hybrids — mixing memory types, storage backends, and retrieval strategies. The tooling is still young, and I expect we’ll see a lot of churn before patterns stabilize. But the core ideas in this post are unlikely to change much.

“What are you without your memory?” - Rushi
