D-Mem: A Dual-Process Memory System for LLM Agents

Authors: Zhixing You, Jiachen Yuan, Jason Cai

2026

TL;DR

D-Mem uses Multi-dimensional Quality Gating to route between Mem0∗ and Full Deliberation, reaching 53.5 F1 on LoCoMo while recovering 96.7% of the 55.3 F1 upper bound.

THE PROBLEM

Long-horizon agents lose fine-grained context due to lossy abstraction

D-Mem targets retrieval frameworks that “strip away potentially crucial contextual nuances,” leaving static retrieval unable to reconstruct logical chains lost during compression.

With LoCoMo dialogues averaging 24K tokens, simply feeding the full history into the context window is expensive and worsens the Lost-in-the-Middle effect, degrading deep reasoning and temporal logic.

HOW IT WORKS

D-Mem — Dual-process memory with gated deliberation

D-Mem centers on Mem0∗, Quality Gating, and Full Deliberation, combining incremental vector memory with a query-guided raw-history fallback.

You can think of Mem0∗ as fast RAM for routine recall, while Full Deliberation acts like a slower disk scan that rereads the entire log when needed.

This design lets D-Mem recover nuanced temporal and multi-hop dependencies that a plain context window or static top-K retrieval cannot preserve.
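
The RAM-versus-disk analogy also suggests a simple cost model: the fast path always runs, and the expensive reread is paid only when the gate escalates. A hypothetical sketch (the per-path costs and escalation rate below are illustrative assumptions, not values reported in the paper):

```python
def expected_tokens_per_query(fast_cost: float, slow_cost: float,
                              escalation_rate: float) -> float:
    """Expected token cost when the slow reread runs only on gate failure."""
    return fast_cost + escalation_rate * slow_cost

# Illustrative numbers only: if the fast path costs ~2k tokens, a full
# reread ~35k, and the gate escalates 30% of queries:
print(expected_tokens_per_query(2_000, 35_000, 0.3))  # 12500.0
```

The lower the escalation rate the gate can sustain without losing accuracy, the closer the system gets to fast-path cost at near-slow-path quality.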

DIAGRAM

Query-time dual-process flow in D-Mem

This diagram shows how D-Mem routes a query through Mem0∗, Quality Gating, and Full Deliberation at inference time.

DIAGRAM

Evaluation setup across LoCoMo and RealTalk

This diagram shows how D-Mem is evaluated on LoCoMo and RealTalk with GPT-4o-mini and Qwen3-235B-Instruct, including baselines and metrics.

PROCESS

How D-Mem Handles a LoCoMo Question

  1. Mem0∗: The System 1 Retrieval Foundation

    D-Mem uses Mem0∗ to retrieve the top 30 most similar memories C from the vector database and generate an initial answer Ainit for the query.

  2. Gated Deliberation Policies

    D-Mem applies Quality Gating to evaluate Ainit against the query and context along three dimensions: Relevance, Faithfulness and Consistency, and Completeness.

  3. Full Deliberation

    If Quality Gating fails, D-Mem triggers Full Deliberation to chunk the full conversation, extract scored facts, and filter them into an enhanced context C'.

  4. Answer Generation

    Using either C or C', D-Mem calls the backbone LLM to produce the final answer, which is evaluated with F1, BLEU, and LLM-as-a-Judge metrics.
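
The four steps above can be sketched as a single routine. This is a minimal, self-contained illustration: the components (`retrieve`, `generate`, `judge`, `deliberate`) stand in for D-Mem's actual modules, and the 0.7 threshold is an assumption, not a value from the paper.

```python
GATE_DIMENSIONS = ("relevance", "faithfulness_consistency", "completeness")

def quality_gate(scores: dict, threshold: float = 0.7) -> bool:
    """Pass only if every gating dimension clears the threshold."""
    return all(scores[d] >= threshold for d in GATE_DIMENSIONS)

def d_mem_answer(query, retrieve, generate, judge, deliberate):
    context = retrieve(query, k=30)      # Step 1: Mem0* top-30 memories C
    a_init = generate(query, context)    # ... and an initial answer Ainit
    if quality_gate(judge(query, context, a_init)):
        return a_init                    # Steps 2/4: gate passes, fast path
    enhanced = deliberate(query)         # Step 3: Full Deliberation -> C'
    return generate(query, enhanced)     # Step 4: final answer from C'
```

The key design point is that the expensive `deliberate` call sits behind the gate, so routine queries never pay for a full reread.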

KEY CONTRIBUTIONS

Key Contributions

  • The dual-process D-Mem framework

    D-Mem integrates Mem0∗, Quality Gating, and Full Deliberation to bridge efficient vector retrieval with exhaustive deliberate reading for long-horizon reasoning.

  • Full Deliberation as a robust baseline

    D-Mem’s Full Deliberation processes raw dialogue chunk by chunk, reaching 55.3 F1 and 78.4 LLM score on LoCoMo with GPT-4o-mini.

  • High Performance with Computational Efficiency

    D-Mem’s Quality Gating attains 53.5 F1 on LoCoMo with GPT-4o-mini, recovering 96.7% of Full Deliberation’s 55.3 F1 while using only 35.8% of its tokens.
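
The chunk-by-chunk reading behind Full Deliberation can be sketched as a chunk → extract → filter loop; the chunk size, scoring function, and keep threshold below are illustrative assumptions, not values from the paper.

```python
def full_deliberation(turns, extract_facts, chunk_size=20, min_score=0.5):
    """Reread the raw dialogue in chunks, keeping only well-scored facts."""
    enhanced_context = []
    for start in range(0, len(turns), chunk_size):
        chunk = turns[start:start + chunk_size]
        for fact, score in extract_facts(chunk):  # LLM-extracted, scored facts
            if score >= min_score:                # filter into C'
                enhanced_context.append(fact)
    return enhanced_context
```

Because every chunk of the raw history is read, nothing is lost to compression, which is why this path serves as the accuracy upper bound.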

RESULTS

By the Numbers

F1: 53.5 (+2.3 over Mem0∗)

LLM-as-a-Judge: 76.3 (+3.6 over Mem0∗)

BLEU: 43.1 (+2.1 over Mem0∗)

Tokens: 12,681 (35.8% of Full Deliberation's 35,435)

All numbers are on LoCoMo with GPT-4o-mini.

On the LoCoMo benchmark, which contains 10 dialogues averaging 24K tokens and 1,540 questions, D-Mem’s Quality Gating nearly matches Full Deliberation while greatly reducing tokens. This shows D-Mem can maintain long-term reasoning fidelity without paying the full 35,435-token cost per query.
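
The headline ratios follow directly from the reported values and can be checked in two lines:

```python
f1_gated, f1_full = 53.5, 55.3            # F1 on LoCoMo (GPT-4o-mini)
tokens_gated, tokens_full = 12_681, 35_435

print(f"recovered: {f1_gated / f1_full:.1%}")           # recovered: 96.7%
print(f"token cost: {tokens_gated / tokens_full:.1%}")  # token cost: 35.8%
```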

BENCHMARK

Overall F1 on LoCoMo with GPT-4o-mini

F1 on LoCoMo for D-Mem Quality Gating versus Mem0∗, Nemori, and Full Context.

KEY INSIGHT

The Counterintuitive Finding

D-Mem’s Quality Gating recovers 96.7% of Full Deliberation’s 55.3 F1 on LoCoMo while using only 12,681 tokens versus 35,435 tokens.

This is surprising because exhaustive Full Deliberation increases tokens and inference time by over 10×, yet D-Mem achieves nearly the same accuracy without paying that cost on every query.

WHY IT MATTERS

What this unlocks for the field

D-Mem makes it practical to combine lightweight vector memories with selective, high-fidelity raw-history reading for long-horizon agents.

Builders can now deploy agents that keep months of dialogue, answer temporal and multi-hop questions robustly, and still stay within tight latency and token budgets.

Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
