AMA: Adaptive Memory via Multi-Agent Collaboration

Authors: Weiquan Huang, Zixuan Wang, Hehai Lin, et al.

2026

TL;DR

AMA uses a multi-agent Constructor–Retriever–Judge–Refresher pipeline to adapt memory granularity, reaching an LLM Score of 0.805 on LoCoMo (vs 0.774 for Nemori) while using ~80% fewer tokens than FullContext.


THE PROBLEM

Static memory granularity breaks long-term agents by accumulating inconsistencies

Existing memory systems rely on rigid retrieval granularity, accumulation-heavy maintenance, and coarse-grained updates, creating a persistent mismatch with task-specific reasoning demands.

These static designs cause logical inconsistencies to accumulate over time, so long-term LLM agents face noisy retrieval, fragmented dependencies, and eventual reasoning failures in complex tasks.

HOW IT WORKS

Adaptive Memory via Multi-Agent Collaboration

AMA decomposes memory into four collaborating agents: the Constructor, Retriever, Judge, and Refresher, which build and manage Raw Text, Fact Knowledge, and Episode Memory across granularities.

You can think of AMA like a computer with RAM and disk plus a background integrity checker: the Retriever chooses which store to read, while the Judge and Refresher keep everything coherent.

This multi-agent design lets AMA dynamically align retrieval granularity with task demands and explicitly resolve conflicts, something a plain context window or static RAG pipeline cannot achieve.
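The routing idea can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the `MemoryBank` fields and the keyword heuristic in `route` are invented stand-ins for AMA's intent inference and routing function f_M.

```python
# Illustrative sketch of granularity-aware routing (toy stand-in for AMA).
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    raw_text: list[str] = field(default_factory=list)        # verbatim turns
    fact_knowledge: list[str] = field(default_factory=list)  # atomic facts
    episodes: list[str] = field(default_factory=list)        # session summaries

def route(query: str, bank: MemoryBank) -> list[str]:
    """Toy stand-in for f_M: pick one store per query intent."""
    q = query.lower()
    if "when" in q or "exact" in q:          # precise lookups -> raw text
        return bank.raw_text
    if "summarize" in q or "overall" in q:   # holistic asks -> episodes
        return bank.episodes
    return bank.fact_knowledge               # default -> atomic facts

bank = MemoryBank(raw_text=["[10:03] Ana: flight lands at 6pm"],
                  fact_knowledge=["Ana's flight lands at 6pm"],
                  episodes=["Ana discussed travel plans"])
print(route("When exactly does Ana land?", bank))  # -> the raw-text store
```

A static RAG pipeline would always hit the same store; the point of the sketch is that the store choice happens per query.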

DIAGRAM

AMA Inference and Retrieval Flow

This diagram shows how AMA processes an input utterance through query rewriting, intent routing, multi-granularity retrieval, and Judge feedback loops during inference.

DIAGRAM

AMA Evaluation and Ablation Design

This diagram shows how AMA is evaluated on LoCoMo and LongMemEval, including ablations over memory types and the Refresher.

PROCESS

How AMA Handles a Long-Context Session

  1. Constructor

    The Constructor parses u_t, W_t, and the conflict-free history H*_t into Raw Text Memory, Fact Knowledge Memory, and Episode Memory using structured fact templates and triggers.

  2. Retriever

    The Retriever rewrites u_t into u'_t, infers the intent vector B and K_dyn, and routes the query to Raw Text, Fact Knowledge, or Episode Memory based on f_M.

  3. Judge

    The Judge evaluates the candidate set H_t for relevance, triggers Retry when evidence density is low, and detects conflicts to produce C_err and the actions Pass, Retry, or Refresh.

  4. Refresher

    The Refresher deletes or updates the conflicting entries C_err to yield a consistent H*_t, which the Constructor then uses to synthesize new multi-granularity memories.
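The four steps above can be sketched as a single control loop. Everything in this sketch is a labeled assumption: the function names echo the paper's agents, but the toy bodies (keyword retrieval, a judge that flags one stale fact) are invented purely for illustration.

```python
# Hypothetical control loop mirroring the Constructor/Retriever/Judge/
# Refresher cycle above. Toy stand-ins only -- not AMA's implementation.
def ama_turn(utterance, memory, retrieve, judge, refresh, construct,
             max_retries=2):
    """One inference turn: Retriever -> Judge -> (Retry/Refresh) -> Constructor."""
    candidates = retrieve(utterance, memory)
    for _ in range(max_retries):
        verdict, conflicts = judge(utterance, candidates)
        if verdict == "pass":
            break
        if verdict == "refresh":                  # conflicting entries C_err
            memory = refresh(memory, conflicts)   # Refresher yields H*_t
        candidates = retrieve(utterance, memory)  # Retry on the cleaned store
    return construct(utterance, candidates, memory)  # Constructor writes back

# Toy agents: memory is a dict fact -> validity; "stale" marks a conflict.
retrieve = lambda u, m: [f for f in m if f in u]
judge = lambda u, c: ("pass", []) if c else ("refresh", ["stale"])
refresh = lambda m, conflicts: {k: v for k, v in m.items() if k not in conflicts}
construct = lambda u, c, m: {**m, u: True}

memory = {"stale": False}
memory = ama_turn("user asked about trip", memory,
                  retrieve, judge, refresh, construct)
print(sorted(memory))  # the stale entry is gone; the new turn is stored
```

The structural point survives the toy bodies: the Judge's verdict gates whether the Refresher mutates memory before the Constructor writes anything new.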

KEY CONTRIBUTIONS

Key Contributions

  • Multi-granularity memory paradigm

    AMA introduces Raw Text Memory, Fact Knowledge Memory, and Episode Memory via the Constructor and Retriever, achieving an LLM Score of 0.774 on LoCoMo with GPT-4o-mini.

  • Unified multi-agent framework

    AMA coordinates the Constructor, Retriever, Judge, and Refresher to orchestrate storage, adaptive routing, verification, and maintenance across long-context applications.

  • Logic-driven Refresher for updates

    AMA’s Refresher enforces consistency, enabling 0.897 accuracy on knowledge-update tasks in LongMemEval and preventing unchecked accumulation of outdated or conflicting facts.

RESULTS

By the Numbers

LLM Score

0.805

+0.031 over Nemori on LoCoMo with GPT-4.1-mini

F1

0.580

vs Nemori 0.521 on LoCoMo with GPT-4.1-mini

BLEU-1

0.492

LoCoMo overall with GPT-4.1-mini

Knowledge-update accuracy

0.897

LongMemEval knowledge-update vs Zep 0.744 and Nemori 0.615

On LoCoMo, which tests single-hop, multi-hop, temporal reasoning, and open-domain queries, AMA improves LLM Score from Nemori’s 0.774 to 0.805 with GPT-4.1-mini. On LongMemEval, AMA reaches 0.698 average accuracy and 0.897 on knowledge-update questions, showing that AMA maintains consistent long-term memory while reducing token usage.


BENCHMARK

Main results on the LoCoMo benchmark (GPT-4.1-mini, overall LLM Score)

LLM Score on LoCoMo for AMA and key memory baselines using GPT-4.1-mini.

BENCHMARK

Ablation studies on memory design (LoCoMo LLM Score)

LLM Score on LoCoMo for AMA ablations over Raw Text, Fact Knowledge, Episode Memory, and Refresher.

KEY INSIGHT

The Counterintuitive Finding

AMA with K_r = 2 uses only 3613 tokens and 3.910 seconds of latency, yet achieves an LLM Score of 0.774 on LoCoMo, versus FullContext’s 0.717 with 18625 tokens.

This is surprising because many assume feeding all context is always best, but AMA shows adaptive memory and conflict-aware pruning can beat full-context reasoning while using about 19% of the tokens.
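The "about 19% of the tokens" claim is simple arithmetic on the two figures reported above:

```python
# Token-budget arithmetic behind the full-context comparison.
ama_tokens, full_context_tokens = 3613, 18625  # figures reported above
ratio = ama_tokens / full_context_tokens
print(f"{ratio:.1%} of FullContext's tokens")  # -> 19.4%
print(f"{1 - ratio:.1%} saved")                # -> 80.6% saved
```

This is also where the TL;DR's "~80% fewer tokens" figure comes from.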

WHY IT MATTERS

What this unlocks for the field

AMA enables long-lived agents to maintain coherent, conflict-free memories across Raw Text, Fact Knowledge, and Episode summaries while dynamically choosing the right granularity per query.

Builders can now deploy LLM agents that handle LoCoMo-scale histories and knowledge updates efficiently, without relying on massive context windows or brittle static chunking strategies.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al.

· 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.
