AtomMem: Learnable Dynamic Agentic Memory with Atomic Memory Operation

Authors: Yupeng Huo, Yaxi Lu, Zhong Zhang et al.

2026

TL;DR

AtomMem uses learnable atomic CRUD memory operations with RL-optimized policies to reach 58.8 average score, +2.1 over MemAgent on long-context and web benchmarks.

THE PROBLEM

Static memory workflows fail diverse tasks with one-size-fits-all rules

AtomMem targets static memory mechanisms that rely on hand-crafted workflows, which impose fixed rules instead of adaptive decisions.

LLM-based agents built on such workflows struggle on long-horizon tasks, where static schedules such as exponential forgetting can drop critical early cues and hurt downstream QA and web search performance.

HOW IT WORKS

AtomMem — Atomic CRUD Memory as Decision Making

AtomMem decomposes memory management into atomic CRUD operations over a dynamic set of entries $M_t$. It frames memory control as a POMDP and optimizes the policy with Group Relative Policy Optimization (GRPO).

You can think of AtomMem like a CPU managing RAM and disk: the scratchpad is fast working memory, while the FAISS vector database is long-term storage accessed via learned queries.

This design lets AtomMem learn when to Create, Read, Update, or Delete entries, so the agent maintains a compact, task-aligned memory beyond what a fixed context window or static workflow can provide.
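
To make the CRUD framing concrete, here is a minimal sketch of a dynamic memory set with the four atomic operations and a helper that applies several of them in one step. It is not the paper's implementation: the `MemoryStore` class, the integer entry ids, and the `apply_ops` helper are hypothetical names chosen for illustration, and Read is simplified to an id lookup rather than a learned similarity query.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class MemoryStore:
    """Toy stand-in for the dynamic memory set M_t (entry id -> text)."""
    entries: Dict[int, str] = field(default_factory=dict)
    next_id: int = 0

    def create(self, text: str) -> int:
        # Add a new entry and return its id.
        entry_id = self.next_id
        self.entries[entry_id] = text
        self.next_id += 1
        return entry_id

    def read(self, entry_id: int) -> Optional[str]:
        # Assumption: id lookup instead of a learned retrieval query.
        return self.entries.get(entry_id)

    def update(self, entry_id: int, text: str) -> bool:
        # Overwrite an existing entry; report whether it existed.
        if entry_id not in self.entries:
            return False
        self.entries[entry_id] = text
        return True

    def delete(self, entry_id: int) -> bool:
        # Remove an entry; report whether anything was removed.
        return self.entries.pop(entry_id, None) is not None


def apply_ops(store: MemoryStore, ops: List[Tuple[str, tuple]]) -> List[Any]:
    """Apply a sequence of (operation name, arguments) pairs in order."""
    handlers = {
        "create": store.create,
        "read": store.read,
        "update": store.update,
        "delete": store.delete,
    }
    results = []
    for name, args in ops:
        if name not in handlers:
            raise ValueError(f"unknown memory operation: {name}")
        results.append(handlers[name](*args))
    return results


if __name__ == "__main__":
    store = MemoryStore()
    print(apply_ops(store, [
        ("create", ("Paris is in France",)),
        ("update", (0, "Paris is the capital of France")),
        ("read", (0,)),
    ]))
```

In a learned setting, the list passed to `apply_ops` would be emitted by the policy rather than written by hand, which is the part a static workflow fixes in advance.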

DIAGRAM

AtomMem Inference Flow Across Environment and Memory

This diagram shows how AtomMem jointly chooses environment actions and CRUD memory operations at each step in the POMDP.

DIAGRAM

AtomMem Training and Evaluation Pipeline

This diagram shows how AtomMem is trained with GRPO on QA and web tasks and then evaluated on long-context benchmarks.

PROCESS

How AtomMem Handles a Long-Context QA Task

01
POMDP for Memory
AtomMem formalizes memory as part of the POMDP state, splitting each global state into an external environment state $s_{env}$ and an internal memory state $s_{mem}$ for decision making.
02
Memory Mechanism Implementation
AtomMem represents memory as a dynamic set $M_t$ of entries and exposes CRUD as the memory action space $A_{mem}$, so multiple operations can be composed within a single step.
03
Hybrid Memory Retrieval
AtomMem combines deterministic scratchpad retrieval with selective TopK similarity retrieval from storage, forming observations $o_t$ that mix the environment observation $o_{env}$, the scratchpad $m_{scr}$, and the retrieved entries $\hat{M}_t$.
04
Optimization Strategy
AtomMem trains the token-level policy with Group Relative Policy Optimization, distributing task-level advantages across CRUD tokens and environment actions.
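
The hybrid retrieval step above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: a bag-of-words cosine similarity stands in for the embedding model and FAISS index, and the names `embed`, `top_k`, and `build_observation` are hypothetical. The scratchpad is always included, while storage is searched only when the agent issues a query.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def embed(text: str) -> Counter:
    # Assumption: word counts replace a learned embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, storage: List[str], k: int) -> List[str]:
    # Rank stored entries by similarity to the query and keep the best k.
    q = embed(query)
    ranked = sorted(storage, key=lambda e: cosine(q, embed(e)), reverse=True)
    return ranked[:k]


def build_observation(
    o_env: str,
    scratchpad: str,
    storage: List[str],
    query: Optional[str] = None,
    k: int = 2,
) -> Dict[str, object]:
    # Scratchpad retrieval is deterministic; storage retrieval is selective.
    retrieved = top_k(query, storage, k) if query else []
    return {"o_env": o_env, "m_scr": scratchpad, "M_hat": retrieved}


if __name__ == "__main__":
    storage = [
        "alice was born in paris",
        "bob lives in berlin",
        "the eiffel tower is in paris",
    ]
    obs = build_observation(
        "document chunk 3", "goal: find birthplace", storage,
        query="where is paris",
    )
    print(obs["M_hat"])
```

Skipping the query leaves `M_hat` empty, which mirrors the idea that reading from long-term storage is a choice the policy makes rather than a fixed step.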

KEY CONTRIBUTIONS

Key Contributions

01
Atomic CRUD Memory Operations
AtomMem defines a complete CRUD action space over $M_t$, showing that any memory workflow can be composed from Create, Read, Update, and Delete primitives.
02
Hybrid Memory Retrieval Design
AtomMem combines a mandatory scratchpad with selective TopK vector retrieval. Ablations show that removing the scratchpad or the storage drops HotpotQA by up to 8.6 points.
03
RL Trained Memory Policy with GRPO
AtomMem uses Group Relative Policy Optimization to learn task-aligned memory usage, improving average score from 50.3 without RL to 58.8 with RL across benchmarks.
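
As a rough illustration of how a task-level reward can reach every CRUD and environment token, the sketch below computes group-relative advantages in the commonly used GRPO form (reward minus group mean, divided by group standard deviation) and copies each rollout's advantage to all of its tokens. The exact normalization and any clipping or KL terms used for AtomMem are not given in this summary, so treat the formula, the `eps` constant, and the function names as assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages: List[float], token_counts: List[int]) -> List[List[float]]:
    """Give every token in a rollout that rollout's scalar advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


if __name__ == "__main__":
    # Four rollouts for the same task: two answered correctly, two did not.
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(rewards)
    per_token = broadcast_to_tokens(advantages, token_counts=[3, 2, 4, 1])
    for row in per_token:
        print([round(a, 3) for a in row])
```

Because the advantage is shared across a rollout, memory tokens and environment-action tokens in a successful trajectory are reinforced together, without a separate reward for individual memory operations.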

RESULTS

By the Numbers

HotpotQA (200 docs): 77.8, +1.3 over MemAgent

2WikiMQA (800 docs): 62.5, +4.8 over MemAgent

Musique (800 docs): 48.5, +4.0 over MemAgent

Average Score: 58.8, +2.1 over MemAgent

The main table reports exact match scores on HotpotQA, 2WikiMultiHopQA, Musique, GAIA, and WebWalkerQA. AtomMem’s 58.8 average score shows that atomic CRUD with GRPO yields better long-context and web performance than MemAgent under the same Qwen3-8B backbone.

BENCHMARK

Results on long-context QA benchmarks and multi-turn web benchmarks

Average score across HotpotQA, 2WikiMQA, Musique, GAIA, and WebWalkerQA.

BENCHMARK

Ablation study of memory operations and memory components

HotpotQA exact match under different AtomMem ablations.

KEY INSIGHT

The Counterintuitive Finding

During RL training on QA tasks, AtomMem reduces Read frequency while increasing Create, Update, and Delete usage, yet overall performance rises by about 9 points.

This is surprising because many static designs assume more reading is always safer, but AtomMem shows that compact, aggressively maintained memory can be more effective than constant retrieval.

WHY IT MATTERS

What this unlocks for the field

AtomMem unlocks a general, task-agnostic CRUD interface where memory behavior is learned rather than hard-coded, enabling adaptive policies across tasks.

Builders can now treat memory as a trainable policy over scratchpad and storage, making it practical to deploy agents that manage long-horizon context without bespoke workflows for every environment.
