MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents

Authors: Dongming Jiang, Yi Li, Guanpeng Li, Bingzhe Li

2026

TL;DR

MAGMA uses policy-guided traversal over multi-graph memory to reach a 0.700 overall judge score on LoCoMo, +0.110 over Nemori.



THE PROBLEM

Long-horizon agents fail when monolithic memory blurs temporal and causal structure

MAGMA targets memory-augmented agent systems whose monolithic stores entangle temporal, causal, and entity information, limiting interpretability and the alignment between query intent and retrieved evidence.

Under such designs, long-context reasoning degrades, yielding lower accuracy and misaligned retrieval on benchmarks like LoCoMo and LongMemEval.

HOW IT WORKS

MAGMA: Multi-Graph Agentic Memory Architecture

MAGMA centers on a Data Structure Layer with four Relation Graphs plus a Vector Database, orchestrated by an Intent-Aware Router and Adaptive Topological Retrieval.

You can think of MAGMA like a library with four separate card catalogs for semantic, temporal, causal, and entity relations, and a librarian that chooses which catalog to follow per question.

This design lets MAGMA trace explicit relational paths and construct structured prompts, enabling transparent, intent-aligned retrieval that a flat context window cannot provide.
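The library-and-card-catalog analogy can be made concrete. Below is a minimal sketch, with illustrative names only (the class, the intent labels, and the routing table are assumptions, not the paper's API): four adjacency maps, one per relation type, and a toy intent-aware router that decides which graph(s) to follow for a given query.

```python
from collections import defaultdict

class MultiGraphMemory:
    """Toy sketch of a Data Structure Layer with four relation graphs."""

    RELATIONS = ("semantic", "temporal", "causal", "entity")

    def __init__(self):
        # One adjacency map per relation type: node -> set of neighbor nodes.
        self.graphs = {r: defaultdict(set) for r in self.RELATIONS}

    def add_edge(self, relation, src, dst):
        self.graphs[relation][src].add(dst)

    def route(self, intent):
        # Intent-aware routing: map a coarse query intent to the relation
        # graphs most likely to hold the answer (assumed mapping).
        table = {
            "when": ["temporal"],
            "why": ["causal", "semantic"],
            "who": ["entity"],
            "what": ["semantic"],
        }
        return table.get(intent, ["semantic"])

    def neighbors(self, intent, node):
        # Follow only the routed graphs, like consulting one card catalog.
        out = set()
        for relation in self.route(intent):
            out |= self.graphs[relation][node]
        return out

mem = MultiGraphMemory()
mem.add_edge("temporal", "trip_to_paris", "louvre_visit")
mem.add_edge("causal", "rain", "cancelled_picnic")
print(mem.neighbors("when", "trip_to_paris"))  # {'louvre_visit'}
```

Keeping the four graphs as separate structures is what makes retrieval paths explainable: each hop can name the relation type it followed.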

DIAGRAM

Query-time Retrieval Flow in MAGMA

This diagram shows how MAGMA analyzes a user query, identifies anchors, traverses multi-graphs, and synthesizes a linearized context for the LLM.

DIAGRAM

MAGMA Evaluation and Ablation Pipeline

This diagram shows how MAGMA is built, evaluated on LoCoMo and LongMemEval, and ablated over relation types and policies.

PROCESS

How MAGMA Handles a Query

  1. Query Analysis and Decomposition

    MAGMA's Query Analyzer derives the intent class Tq, parsed temporal constraints, dense embeddings, and sparse keywords from the incoming query.

  2. Multi-Signal Anchor Identification

    MAGMA fuses semantic, lexical, and temporal signals with Reciprocal Rank Fusion to select robust anchor nodes as traversal entry points.

  3. Adaptive Traversal Policy

    MAGMA runs Heuristic Beam Search guided by the Adaptive Traversal Policy, weighting semantic, temporal, causal, and entity edges according to Tq.

  4. Narrative Synthesis via Graph Linearization

    MAGMA orders retrieved nodes by temporal or causal structure, adds provenance, and budgets tokens to form the linearized context Cprompt.
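Step 2's fusion is standard Reciprocal Rank Fusion, which can be sketched in a few lines. The k=60 constant is the common default from the RRF literature, not a value from the paper, and the example rankings are invented:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list votes 1/(k + rank) per node."""
    scores = {}
    for ranking in rankings:
        for rank, node in enumerate(ranking, start=1):
            scores[node] = scores.get(node, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Three signal-specific rankings over candidate memory nodes.
semantic = ["n3", "n1", "n2"]
lexical  = ["n1", "n3", "n4"]
temporal = ["n2", "n1", "n5"]
anchors = rrf([semantic, lexical, temporal])[:2]
print(anchors)  # n1 comes first: it ranks highly in all three lists
```

Because RRF only needs ranks, not comparable scores, it fuses dense, sparse, and temporal retrievers without any score calibration, which is why it suits the multi-signal anchor step.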

KEY CONTRIBUTIONS

Key Contributions

  • Multi-Graph Agentic Memory Architecture

    MAGMA introduces a Data Structure Layer with semantic, temporal, causal, and entity Relation Graphs plus a Vector Database, enabling disentangled long-horizon reasoning.

  • Adaptive Traversal Policy

    MAGMA defines an Adaptive Traversal Policy that scores transitions with structural alignment and semantic affinity, lifting the LoCoMo adversarial judge score to 0.742.

  • Dual-Stream Memory Evolution Mechanism

    MAGMA uses Synaptic Ingestion and Asynchronous Consolidation to keep latency at 1.47 seconds per query while maintaining a 0.700 overall LoCoMo judge score.
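The transition scoring in the Adaptive Traversal Policy can be sketched as an intent-conditioned combination of the two signals named above. Everything here is an assumption for illustration: the weight tables, the linear mixing with alpha, and the intent labels are not from the paper.

```python
def edge_score(edge_type, affinity, intent, alpha=0.5):
    """Score one graph transition for a given query intent Tq.

    Combines a structural-alignment weight (how well the edge type
    matches the intent) with a semantic-affinity term (assumed mixing).
    """
    weights = {
        "temporal_q": {"temporal": 1.0, "causal": 0.4, "semantic": 0.3, "entity": 0.2},
        "causal_q":   {"temporal": 0.4, "causal": 1.0, "semantic": 0.3, "entity": 0.2},
    }
    structural = weights.get(intent, {}).get(edge_type, 0.25)
    return alpha * structural + (1 - alpha) * affinity

# A beam-search step keeps the top-scoring transitions for the intent.
candidates = [("temporal", 0.6), ("causal", 0.9), ("semantic", 0.8)]
beam = sorted(candidates,
              key=lambda c: edge_score(*c, intent="causal_q"),
              reverse=True)[:2]
print([edge for edge, _ in beam])  # ['causal', 'semantic']
```

Reweighting edge types per intent is what lets the same memory serve a "when did X happen" query and a "why did X happen" query with different traversal behavior.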

RESULTS

By the Numbers

Overall

0.700 score

+0.110 over Nemori

Adversarial

0.742 score

+0.126 over A-MEM

Average

61.2%

+5.0 over Nemori on LongMemEval

Latency (s)

1.47 s

-0.27 vs Full Context query latency

On LoCoMo, which stresses multi-hop, temporal, and adversarial long-context reasoning, MAGMA reaches a 0.700 overall LLM-as-a-Judge score. On LongMemEval, with contexts averaging 100K tokens, MAGMA attains 61.2% average accuracy while keeping tokens per query between 0.7K and 4.2K.


BENCHMARK

LoCoMo Overall LLM-as-a-Judge Performance

Overall LLM-as-a-Judge score on LoCoMo across MAGMA and baseline memory systems.

BENCHMARK

LongMemEval Average Accuracy Comparison

Average accuracy on LongMemEval across methods with different context token budgets.

KEY INSIGHT

The Counterintuitive Finding

MAGMA uses only 0.7K–4.2K tokens per query on LongMemEval yet reaches 61.2% average accuracy, beating the 101K-token Full Context baseline at 55.0%.

This is surprising because many assume feeding everything to the LLM is best, but MAGMA shows that structured, selective retrieval can beat full context while using over 95% fewer tokens.
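The "over 95% fewer tokens" claim follows directly from the reported budgets, even taking MAGMA's worst case:

```python
# Token savings vs the Full Context baseline, using the figures above:
# ~101K tokens for Full Context, 4.2K at MAGMA's upper per-query budget.
full_context = 101_000
magma_worst = 4_200
savings = 1 - magma_worst / full_context
print(f"{savings:.1%}")  # 95.8%
```

At the lower 0.7K budget the reduction exceeds 99%.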

WHY IT MATTERS

What this unlocks for the field

MAGMA enables agents to reason over ultra-long histories using explicit semantic, temporal, causal, and entity graphs plus intent-aware traversal.

Builders can now design agentic systems that stay coherent across 100K+ token interactions while remaining fast and interpretable, instead of relying on opaque, monolithic vector memories.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al.

· 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.
