ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang et al.

2026

TL;DR

ActMem builds a causal memory knowledge graph with counterfactual reasoning, boosting QA accuracy on ActMemEval to 76.52% with DeepSeek-V3, versus 63.97% for LightMem (+12.55 points).


THE PROBLEM

Agents Retrieve but Fail to Reason over Long-Term Memory

Existing memory benchmarks like LongMemEval mainly test fact retrieval, not whether agents can reason over history to detect conflicts.

When users ask questions that call for action, agents relying on passive RAG can miss implicit hazards, producing conflicting or even dangerous recommendations.

HOW IT WORKS

ActMem — Actionable Memory via Causal Knowledge Graphs

ActMem’s core mechanism chains Memory Fact Extraction, Fact Clustering, Memory KG Construction, and Counterfactual-based Retrieval and Reasoning into a single actionable memory pipeline.

You can think of ActMem like a card catalog plus a cause–effect map: it files each event as a card, then draws arrows showing what causes what.

This structured causal graph lets ActMem infer implicit constraints and conflicts that a plain context window or vanilla RAG system would never surface.
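
To make that concrete, here is a minimal sketch of how a causal memory KG can surface an implicit conflict via a short graph walk. The node and edge layout and the helper names are our own illustration, not ActMem's actual API.

```python
# Minimal sketch of ActMem-style conflict detection over a memory KG.
# The graph layout (fact nodes, "semantic"/"causal" edge labels) and the
# helper names are illustrative assumptions, not the paper's exact API.

from dataclasses import dataclass, field

@dataclass
class MemoryKG:
    facts: dict[str, str] = field(default_factory=dict)  # fact_id -> text
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)  # fact_id -> [(neighbor, type)]

    def add_fact(self, fid: str, text: str) -> None:
        self.facts[fid] = text
        self.edges.setdefault(fid, [])

    def add_edge(self, src: str, dst: str, etype: str) -> None:
        self.edges[src].append((dst, etype))
        self.edges[dst].append((src, etype))

    def causal_neighborhood(self, seeds: list[str], hops: int = 2) -> set[str]:
        """Facts reachable from the seeds within `hops` edges."""
        frontier, seen = set(seeds), set(seeds)
        for _ in range(hops):
            frontier = {
                nbr for fid in frontier for nbr, _ in self.edges.get(fid, [])
            } - seen
            seen |= frontier
        return seen

kg = MemoryKG()
kg.add_fact("f1", "User adopted a cat last month.")
kg.add_fact("f2", "Lilies are toxic to cats.")
kg.add_fact("f3", "User plans to buy lilies for the living room.")
kg.add_edge("f1", "f2", "causal")  # owning a cat makes lily toxicity relevant
kg.add_edge("f2", "f3", "causal")  # toxicity conflicts with the purchase plan

# Starting from the fact matched by the query ("buy lilies?"), a 2-hop walk
# surfaces the implicit constraint f2 that flat RAG would likely miss.
print(kg.causal_neighborhood(["f3"]))  # {'f1', 'f2', 'f3'}
```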

DIAGRAM

Query Time Reasoning and Retrieval Flow in ActMem

This diagram shows how ActMem uses counterfactual reasoning and the memory KG to retrieve implicit constraints for a user query.

DIAGRAM

ActMemEval Dataset Construction Pipeline

This diagram shows how ActMemEval is synthesized from reasoning topologies, stylized dialogues, noise injection, and manual verification.

PROCESS

How ActMem Handles a Long-Term Dialogue Query

  1. Memory Fact Extraction

    ActMem converts raw dialogue turns into atomic facts, forming the base fact set F for later reasoning.

  2. Fact Clustering

    ActMem groups related facts into topic clusters with Qwen3 Embedding 8B, shrinking the search space for causal mining (a clustering sketch follows this list).

  3. Memory KG Construction

    ActMem adds semantic edges based on similarity thresholds and causal edges validated by PMI scores above 0.8 (see the PMI sketch below).

  4. Counterfactual-based Retrieval and Reasoning

    ActMem generates commonsense consequences k_cs for the user's query and uses them to refine retrieval, so the final answer accounts for implicit risks (see the retrieval sketch below).
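
The paper names Qwen3 Embedding 8B for step 2 but does not pin down the clustering algorithm, so the sketch below assumes a simple greedy cosine-threshold scheme; embed is a placeholder for the real embedding call and the 0.75 threshold is invented.

```python
# Minimal sketch of step 2 (Fact Clustering), assuming greedy
# cosine-threshold clustering; the paper names Qwen3 Embedding 8B but not
# the clustering algorithm, so `embed` and the 0.75 threshold are placeholders.

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a Qwen3-Embedding-8B call; returns unit vectors."""
    rng = np.random.default_rng(0)  # stand-in embeddings for runnability
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def cluster_facts(facts: list[str], tau: float = 0.75) -> list[list[str]]:
    vecs = embed(facts)
    clusters: list[tuple[np.ndarray, list[str]]] = []  # (centroid, members)
    for fact, v in zip(facts, vecs):
        best = max(clusters, key=lambda c: float(v @ c[0]), default=None)
        if best is not None and float(v @ best[0]) >= tau:
            best[1].append(fact)          # join the nearest cluster
        else:
            clusters.append((v, [fact]))  # open a new cluster
    return [members for _, members in clusters]

for topic in cluster_facts(["user adopted a cat", "cat is named Milo",
                            "user works night shifts"]):
    print(topic)
```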
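
Step 3's causal-edge filter can be read as a plain PMI test over co-occurrence counts. The 0.8 threshold comes from the paper; treating each dialogue session as the co-occurrence window, and using the natural log, are our assumptions.

```python
# Minimal sketch of step 3's PMI filter: candidate causal edges are kept
# only when two facts co-occur more often than chance. Treating a dialogue
# session as the co-occurrence window and using natural log are assumptions;
# the 0.8 threshold comes from the paper.

import math
from itertools import combinations

def pmi_edges(sessions: list[set[str]], threshold: float = 0.8) -> list[tuple[str, str, float]]:
    n = len(sessions)
    count = {}       # fact -> number of sessions containing it
    pair_count = {}  # (fact_a, fact_b) -> number of co-occurrences
    for facts in sessions:
        for f in facts:
            count[f] = count.get(f, 0) + 1
        for a, b in combinations(sorted(facts), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    edges = []
    for (a, b), c_ab in pair_count.items():
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )
        pmi = math.log((c_ab / n) / ((count[a] / n) * (count[b] / n)))
        if pmi > threshold:
            edges.append((a, b, pmi))
    return edges

sessions = [{"has_cat", "buys_lilies"}, {"has_cat", "buys_lilies"},
            {"travels"}, {"works_nights"}, {"plays_chess"}]
print(pmi_edges(sessions))  # PMI = ln(0.4 / 0.16) ~ 0.916 > 0.8, edge kept
```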
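
Step 4 can be sketched as query expansion: the query is augmented with LLM-generated commonsense consequences k_cs before retrieval. Everything here, from the llm_consequences stub to the toy lexical retriever, is illustrative rather than ActMem's implementation.

```python
# Minimal sketch of step 4: the query is expanded with commonsense
# consequences k_cs before retrieval. `llm_consequences` stands in for an
# LLM prompt ("what could go wrong if the user does X?"); all names are ours.

def llm_consequences(query: str) -> list[str]:
    """Hypothetical LLM call; returns counterfactual consequences k_cs."""
    return ["plant may be toxic to household pets",
            "plant may trigger known allergies"]

def retrieve(keys: list[str], memory: list[str], top_k: int = 3) -> list[str]:
    """Toy lexical retriever; a real system would use embedding search."""
    def score(fact: str) -> int:
        return sum(w in fact for key in keys for w in key.split())
    return sorted(memory, key=score, reverse=True)[:top_k]

memory = ["user adopted a cat", "lilies are toxic to cats",
          "user repainted the kitchen"]
query = "Should I buy lilies for the living room?"

keys = [query] + llm_consequences(query)  # counterfactual expansion
print(retrieve(keys, memory))
# -> ranks "lilies are toxic to cats" first even though the query never
#    mentions the cat, because k_cs introduced "toxic" and "pets".
```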

KEY CONTRIBUTIONS

Key Contributions

  • Actionable Memory Management Framework

    ActMem integrates Memory Fact Extraction, Fact Clustering, Memory KG Construction, and Counterfactual-based Retrieval and Reasoning, shifting the focus from recall capacity to memory utility for action.

  • Causal and Semantic Memory KG

    ActMem builds a memory KG with semantic edges and PMI-validated causal edges, enabling event-centric reasoning beyond flat RAG retrieval.

  • ActMemEval Benchmark

    The paper introduces ActMemEval, a benchmark of 246 verified samples with an average answer similarity of 0.232, targeting implicit constraints and causal conflicts.

RESULTS

By the Numbers

Retrieval Acc. (DeepSeek-V3): 71.66%, +14.78 points over LightMem

QA Acc. (DeepSeek-V3): 76.52%, +12.55 points over LightMem

Retrieval Acc. (GPT-4o-mini): 76.92%, +10.33 points over LightMem

QA Acc. (GPT-4o-mini): 52.22%, +11.73 points over Mem0

On ActMemEval, which stresses implicit constraints and causal conflicts, ActMem raises QA accuracy over LightMem from 63.97% to 76.52% with DeepSeek-V3 and from 40.49% to 52.22% with GPT-4o-mini, demonstrating that causal KGs plus counterfactual reasoning materially improve action-aware memory use.

BENCHMARK

Performance comparison on ActMemEval (DeepSeek-V3)

QA Acc. on ActMemEval for DeepSeek-V3-based memory frameworks.

KEY INSIGHT

The Counterintuitive Finding

On GPT-4o-mini, ActMem reaches 76.92% retrieval accuracy but only 52.22% QA accuracy, revealing a reasoning bottleneck despite strong recall.

This is surprising because higher retrieval accuracy usually predicts better QA; ActMem shows that weaker LLMs cannot fully exploit high-quality causal memories.

WHY IT MATTERS

What this unlocks for the field

ActMem unlocks memory-aware agents that can detect implicit conflicts, such as a toxic plant in a home with pets, by reasoning over causal KGs instead of raw logs.

Builders can now design assistants that proactively intervene in risky plans and maintain long-term logical consistency, rather than merely echoing retrieved snippets.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.
