A-MEM: Agentic Memory for LLM Agents

Authors: Wujiang Xu, Zujie Liang, Kai Mei et al.

2025

TL;DR

A-MEM uses LLM-driven note construction, link generation, and memory evolution to build an agentic Zettelkasten-style memory, more than doubling GPT-4o-mini multi hop F1 on LoCoMo (27.02 vs 12.60).


THE PROBLEM

Rigid agent memories need 16,910 tokens per answer and still miss long range reasoning

Existing approaches such as the LoCoMo baseline and MemGPT require around 16,910 tokens per question, yet still struggle with complex long term reasoning.

These fixed memory workflows and predefined schemas prevent LLM agents from forming new connections, hurting multi hop and temporal QA performance in long conversations.

HOW IT WORKS

A-MEM — Agentic Zettelkasten for LLM agents

A-MEM centers on Note Construction, Link Generation, Memory Evolution, and Retrieve Relative Memory to build rich, interconnected memory notes with embeddings and links.

You can think of A-MEM as a self organizing Zettelkasten card catalog, where each card can rewrite itself and cross reference others as new cards arrive.

This agentic structure lets A-MEM grow and reshape its memory graph over time, enabling reasoning chains that a fixed context window or static RAG index cannot support.
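As a concrete sketch, each memory note can be modeled as a small record carrying the attributes described in the paper. The field names and types below are illustrative, not the authors' actual code:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNote:
    # One A-MEM memory note: raw content plus LLM-generated keywords,
    # tags, and a contextual description, an embedding for similarity
    # search, and links to related historical notes.
    content: str
    timestamp: str
    keywords: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)
    context: str = ""
    embedding: list[float] = field(default_factory=list)
    links: list[int] = field(default_factory=list)  # ids of linked notes
```

In the full system, an LLM call would populate `keywords`, `tags`, and `context` at construction time, and `links` would be filled in during link generation; both are stubbed out here.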

DIAGRAM

Memory lifecycle from interaction to evolved note graph

This diagram shows how A-MEM processes a new interaction through note construction, link generation, and memory evolution before future retrieval.

DIAGRAM

Evaluation pipeline and ablation design for A-MEM

This diagram shows how A-MEM is evaluated on LoCoMo and DialSim, including ablations removing Link Generation and Memory Evolution.

PROCESS

How A-MEM Handles a Long Term Conversation Session

  1. 01

    Note Construction

    A-MEM converts each interaction into a memory note with content, timestamp, keywords, tags, contextual description, and an embedding for later retrieval.

  2. 02

    Link Generation

    A-MEM retrieves top k similar notes using embeddings, then uses an LLM to decide which historical memories should be linked to the new note.

  3. 03

    Memory Evolution

    For each nearest neighbor, A-MEM may update its context, keywords, and tags using the new memory, replacing the old note with an evolved version.

  4. 04

    Retrieve Relative Memory

Given a query, A-MEM encodes it, finds the top k memories, and expands them via links so the agent receives a richer, interconnected context.
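The retrieval side of these steps can be sketched with plain cosine similarity. This is a toy version with hand-made embeddings standing in for a real encoder, and it omits the LLM calls that drive link generation and memory evolution:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_emb, notes, k=2):
    # Rank stored notes by embedding similarity to the query.
    ranked = sorted(notes, key=lambda n: cosine(query_emb, n["embedding"]),
                    reverse=True)
    return ranked[:k]

def retrieve_with_links(query_emb, notes, k=2):
    # Step 4: retrieve the top-k notes, then expand via their links so
    # the agent also sees connected historical memories.
    by_id = {n["id"]: n for n in notes}
    hits = top_k(query_emb, notes, k)
    expanded = {n["id"]: n for n in hits}
    for n in hits:
        for linked_id in n["links"]:
            expanded.setdefault(linked_id, by_id[linked_id])
    return list(expanded.values())
```

With k=1 and a query embedding close to one note, retrieval returns that note plus anything it links to, illustrating how links broaden the context beyond raw similarity.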

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Agentic memory system A-MEM

    A-MEM introduces autonomous Note Construction, Link Generation, and Memory Evolution, enabling long term interaction without predetermined memory operations or fixed schemas.

  • 02

    Agentic memory update mechanism

    A-MEM defines a two stage update where new memories trigger link generation and memory evolution, discovering higher order patterns across historical notes.

  • 03

    Comprehensive evaluation on long term QA

A-MEM is evaluated on LoCoMo and DialSim across six foundation models, reaching, for example, 3.45 F1 on DialSim versus 2.55 for the LoCoMo baseline and 1.18 for MemGPT.

RESULTS

By the Numbers

Multi Hop F1

27.02%

+17.87 over ReadAgent (GPT-4o-mini)

Temporal F1

45.85%

+20.33 over MemGPT (GPT-4o-mini)

Average token length

2,520 tokens

-14,390 vs LoCoMo baseline (GPT-4o-mini)

DialSim F1

3.45

+0.90 over the LoCoMo baseline and +2.27 over MemGPT

On the LoCoMo long term conversational QA benchmark, A-MEM improves GPT-4o-mini Multi Hop F1 from 9.15 (ReadAgent) to 27.02, while reducing average token length from 16,910 to 2,520. On DialSim, A-MEM reaches 3.45 F1 compared to 2.55 for the LoCoMo baseline and 1.18 for MemGPT, showing that agentic memory evolution yields better long range reasoning with far fewer tokens.


BENCHMARK

Experimental results on LoCoMo dataset of QA tasks using GPT-4o-mini

Multi Hop F1 on LoCoMo for GPT-4o-mini across memory systems.

BENCHMARK

Ablation study against GPT-4o-mini base model

Multi Hop F1 on LoCoMo for GPT-4o-mini with and without Link Generation and Memory Evolution.

KEY INSIGHT

The Counterintuitive Finding

Despite adding extra LLM calls for Link Generation and Memory Evolution, A-MEM cuts average tokens from 16,910 to 2,520 per question with GPT-4o-mini.

This is surprising because richer memory structures usually imply more context, yet A-MEM achieves better F1 while using roughly 85% fewer tokens than LoCoMo and MemGPT.
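The "roughly 85%" figure follows directly from the reported token counts:

```python
# Reported average tokens per question with GPT-4o-mini.
baseline_tokens = 16_910   # LoCoMo baseline
amem_tokens = 2_520        # A-MEM

savings = 1 - amem_tokens / baseline_tokens
print(f"A-MEM uses {savings:.1%} fewer tokens")  # about 85.1%
```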

WHY IT MATTERS

What this unlocks for the field

A-MEM shows that agentic, self evolving memory graphs can support complex multi hop and temporal reasoning in long conversations without massive context windows.

Builders can now attach A-MEM to lightweight models and still get strong long term behavior, enabling cheaper, scalable agents that learn from experience instead of just extending context length.


Related papers

BenchmarkAgent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al.

· 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.
