ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang et al.

2026

TL;DR

ActMem builds a causal memory knowledge graph with counterfactual reasoning, boosting QA accuracy on ActMemEval to 76.52% with DeepSeek-V3, versus 63.97% for LightMem (+12.55 points).


THE PROBLEM

Agents Retrieve but Fail to Reason over Long-Term Memory

Existing memory benchmarks like LongMemEval mainly test fact retrieval, not whether agents can reason over history to detect conflicts.

When users ask questions that call for action, agents relying on passive RAG can miss implicit hazards, producing conflicting or even dangerous recommendations.

HOW IT WORKS

ActMem — Actionable Memory via Causal Knowledge Graphs

ActMem’s core mechanism chains Memory Fact Extraction, Fact Clustering, Memory KG Construction, and Counterfactual-based Retrieval and Reasoning into a single actionable memory pipeline.

You can think of ActMem like a card catalog plus a cause–effect map: it files each event as a card, then draws arrows showing what causes what.

This structured causal graph lets ActMem infer implicit constraints and conflicts that a plain context window or vanilla RAG system would never surface.
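
To make that concrete, here is a minimal sketch of how a causal memory KG can surface an implicit conflict via a short graph walk. The node and edge layout and the helper names are our own illustration, not ActMem's actual API.

```python
# Minimal sketch of ActMem-style conflict detection over a memory KG.
# The graph layout (fact nodes, "semantic"/"causal" edge labels) and the
# helper names are illustrative assumptions, not the paper's exact API.

from dataclasses import dataclass, field

@dataclass
class MemoryKG:
    facts: dict[str, str] = field(default_factory=dict)  # fact_id -> text
    edges: dict[str, list[tuple[str, str]]] = field(default_factory=dict)  # fact_id -> [(neighbor, type)]

    def add_fact(self, fid: str, text: str) -> None:
        self.facts[fid] = text
        self.edges.setdefault(fid, [])

    def add_edge(self, src: str, dst: str, etype: str) -> None:
        self.edges[src].append((dst, etype))
        self.edges[dst].append((src, etype))

    def causal_neighborhood(self, seeds: list[str], hops: int = 2) -> set[str]:
        """Facts reachable from the seeds within `hops` edges."""
        frontier, seen = set(seeds), set(seeds)
        for _ in range(hops):
            frontier = {
                nbr for fid in frontier for nbr, _ in self.edges.get(fid, [])
            } - seen
            seen |= frontier
        return seen

kg = MemoryKG()
kg.add_fact("f1", "User adopted a cat last month.")
kg.add_fact("f2", "Lilies are toxic to cats.")
kg.add_fact("f3", "User plans to buy lilies for the living room.")
kg.add_edge("f1", "f2", "causal")  # owning a cat makes lily toxicity relevant
kg.add_edge("f2", "f3", "causal")  # toxicity conflicts with the purchase plan

# Starting from the fact matched by the query ("buy lilies?"), a 2-hop walk
# surfaces the implicit constraint f2 that flat RAG would likely miss.
print(kg.causal_neighborhood(["f3"]))  # {'f1', 'f2', 'f3'}
```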

DIAGRAM

Query Time Reasoning and Retrieval Flow in ActMem

This diagram shows how ActMem uses counterfactual reasoning and the memory KG to retrieve implicit constraints for a user query.

DIAGRAM

ActMemEval Dataset Construction Pipeline

This diagram shows how ActMemEval is synthesized from reasoning topologies, stylized dialogues, noise injection, and manual verification.

PROCESS

How ActMem Handles a Long-Term Dialogue Query

  1. Memory Fact Extraction

    ActMem converts raw dialogue turns into atomic facts, forming the base fact set F for later reasoning.

  2. Fact Clustering

    ActMem groups related facts into topic clusters with Qwen3 Embedding 8B, shrinking the search space for causal mining (a clustering sketch follows this list).

  3. Memory KG Construction

    ActMem adds semantic edges based on similarity thresholds and causal edges validated by PMI scores above 0.8 (see the PMI sketch below).

  4. Counterfactual-based Retrieval and Reasoning

    ActMem generates commonsense consequences k_cs for the user's query and uses them to refine retrieval, so the final answer accounts for implicit risks (see the retrieval sketch below).
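
The paper names Qwen3 Embedding 8B for step 2 but does not pin down the clustering algorithm, so the sketch below assumes a simple greedy cosine-threshold scheme; embed is a placeholder for the real embedding call and the 0.75 threshold is invented.

```python
# Minimal sketch of step 2 (Fact Clustering), assuming greedy
# cosine-threshold clustering; the paper names Qwen3 Embedding 8B but not
# the clustering algorithm, so `embed` and the 0.75 threshold are placeholders.

import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder for a Qwen3-Embedding-8B call; returns unit vectors."""
    rng = np.random.default_rng(0)  # stand-in embeddings for runnability
    vecs = rng.normal(size=(len(texts), 8))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def cluster_facts(facts: list[str], tau: float = 0.75) -> list[list[str]]:
    vecs = embed(facts)
    clusters: list[tuple[np.ndarray, list[str]]] = []  # (centroid, members)
    for fact, v in zip(facts, vecs):
        best = max(clusters, key=lambda c: float(v @ c[0]), default=None)
        if best is not None and float(v @ best[0]) >= tau:
            best[1].append(fact)          # join the nearest cluster
        else:
            clusters.append((v, [fact]))  # open a new cluster
    return [members for _, members in clusters]

for topic in cluster_facts(["user adopted a cat", "cat is named Milo",
                            "user works night shifts"]):
    print(topic)
```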
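
Step 3's causal-edge filter can be read as a plain PMI test over co-occurrence counts. The 0.8 threshold comes from the paper; treating each dialogue session as the co-occurrence window, and using the natural log, are our assumptions.

```python
# Minimal sketch of step 3's PMI filter: candidate causal edges are kept
# only when two facts co-occur more often than chance. Treating a dialogue
# session as the co-occurrence window and using natural log are assumptions;
# the 0.8 threshold comes from the paper.

import math
from itertools import combinations

def pmi_edges(sessions: list[set[str]], threshold: float = 0.8) -> list[tuple[str, str, float]]:
    n = len(sessions)
    count = {}       # fact -> number of sessions containing it
    pair_count = {}  # (fact_a, fact_b) -> number of co-occurrences
    for facts in sessions:
        for f in facts:
            count[f] = count.get(f, 0) + 1
        for a, b in combinations(sorted(facts), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    edges = []
    for (a, b), c_ab in pair_count.items():
        # PMI(a, b) = log( p(a, b) / (p(a) * p(b)) )
        pmi = math.log((c_ab / n) / ((count[a] / n) * (count[b] / n)))
        if pmi > threshold:
            edges.append((a, b, pmi))
    return edges

sessions = [{"has_cat", "buys_lilies"}, {"has_cat", "buys_lilies"},
            {"travels"}, {"works_nights"}, {"plays_chess"}]
print(pmi_edges(sessions))  # PMI = ln(0.4 / 0.16) ~ 0.916 > 0.8, edge kept
```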
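
Step 4 can be sketched as query expansion: the query is augmented with LLM-generated commonsense consequences k_cs before retrieval. Everything here, from the llm_consequences stub to the toy lexical retriever, is illustrative rather than ActMem's implementation.

```python
# Minimal sketch of step 4: the query is expanded with commonsense
# consequences k_cs before retrieval. `llm_consequences` stands in for an
# LLM prompt ("what could go wrong if the user does X?"); all names are ours.

def llm_consequences(query: str) -> list[str]:
    """Hypothetical LLM call; returns counterfactual consequences k_cs."""
    return ["plant may be toxic to household pets",
            "plant may trigger known allergies"]

def retrieve(keys: list[str], memory: list[str], top_k: int = 3) -> list[str]:
    """Toy lexical retriever; a real system would use embedding search."""
    def score(fact: str) -> int:
        return sum(w in fact for key in keys for w in key.split())
    return sorted(memory, key=score, reverse=True)[:top_k]

memory = ["user adopted a cat", "lilies are toxic to cats",
          "user repainted the kitchen"]
query = "Should I buy lilies for the living room?"

keys = [query] + llm_consequences(query)  # counterfactual expansion
print(retrieve(keys, memory))
# -> ranks "lilies are toxic to cats" first even though the query never
#    mentions the cat, because k_cs introduced "toxic" and "pets".
```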

KEY CONTRIBUTIONS

Key Contributions

  • Actionable Memory Management Framework

    ActMem integrates Memory Fact Extraction, Fact Clustering, Memory KG Construction, and Counterfactual-based Retrieval and Reasoning, shifting the focus from recall capacity to memory utility for action.

  • Causal and Semantic Memory KG

    ActMem builds a memory KG with semantic edges and PMI-validated causal edges, enabling event-centric reasoning beyond flat RAG retrieval.

  • ActMemEval Benchmark

    The paper introduces ActMemEval, a benchmark of 246 verified samples with an average answer similarity of 0.232, targeting implicit constraints and causal conflicts.

RESULTS

By the Numbers

Retrieval Acc. (DeepSeek-V3): 71.66%, +14.78 points over LightMem

QA Acc. (DeepSeek-V3): 76.52%, +12.55 points over LightMem

Retrieval Acc. (GPT-4o-mini): 76.92%, +10.33 points over LightMem

QA Acc. (GPT-4o-mini): 52.22%, +11.73 points over Mem0

On ActMemEval, which stresses implicit constraints and causal conflicts, ActMem raises QA accuracy over LightMem from 63.97% to 76.52% with DeepSeek-V3 and from 40.49% to 52.22% with GPT-4o-mini, demonstrating that causal KGs plus counterfactual reasoning materially improve action-aware memory use.

BENCHMARK

Performance comparison on ActMemEval (DeepSeek-V3)

QA Acc. on ActMemEval for DeepSeek-V3-based memory frameworks.

KEY INSIGHT

The Counterintuitive Finding

On GPT-4o-mini, ActMem reaches 76.92% retrieval accuracy but only 52.22% QA accuracy, revealing a reasoning bottleneck despite strong recall.

This is surprising because higher retrieval accuracy usually predicts better QA; ActMem shows that weaker LLMs cannot fully exploit high-quality causal memories.

WHY IT MATTERS

What this unlocks for the field

ActMem unlocks memory-aware agents that can detect implicit conflicts, such as a toxic plant in a home with pets, by reasoning over causal KGs instead of raw logs.

Builders can now design assistants that proactively intervene in risky plans and maintain long-term logical consistency, rather than merely echoing retrieved snippets.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.
