Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Authors: Yi Yu, Liuyi Yao, Yuexiang Xie et al.

arXiv 2026

TL;DR

Agentic Memory (AgeMem) unifies tool-based long-term and short-term memory control with step-wise GRPO, reaching a 54.31% average across five benchmarks versus 45.74% for A-Mem (+8.57 points).

THE PROBLEM

Fragmented agent memory hurts long-horizon reasoning and context efficiency

LLM agents with finite context windows rely on separately tuned long-term memory (LTM) and short-term memory (STM) modules, leading to fragmented memory construction and suboptimal performance on long-horizon reasoning tasks.

Trigger-based and agent-based memory controllers depend on handcrafted rules or auxiliary expert models, increasing system complexity and limiting end-to-end optimization for real-world agent workflows.

HOW IT WORKS

Agentic Memory — unified tool-based LTM and STM with step-wise GRPO

Agentic Memory (AgeMem) integrates memory management tools, a three-stage progressive RL strategy, and step-wise GRPO into the agent policy to jointly manage LTM and STM.

You can think of AgeMem as giving the agent its own RAM and disk controller, deciding when to cache, compress, or archive information instead of relying on fixed retrieval heuristics.

This unified control lets AgeMem learn when to store, retrieve, summarize, filter, update, or delete information in ways a plain context window or static RAG pipeline cannot.
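As a concrete illustration, this tool-based control can be sketched as a minimal memory interface. The tool names (ADD, RETRIEVE, SUMMARY, FILTER) follow the paper; the data structures, keyword matching, and truncation logic below are simplifying assumptions, not the paper's implementation.

```python
# Hypothetical sketch of AgeMem-style memory tools exposed to the policy.
# Tool names follow the paper; everything else is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    ltm: dict = field(default_factory=dict)   # long-term store: key -> text
    stm: list = field(default_factory=list)   # short-term context buffer

    def add(self, key: str, text: str) -> None:
        """ADD: persist an experience to long-term memory."""
        self.ltm[key] = text

    def retrieve(self, query: str) -> list:
        """RETRIEVE: naive keyword match; a real system would embed and rank."""
        return [v for v in self.ltm.values() if query.lower() in v.lower()]

    def summary(self, max_items: int = 3) -> None:
        """SUMMARY: compress the STM buffer (here: keep only recent items)."""
        self.stm = self.stm[-max_items:]

    def filter(self, predicate) -> None:
        """FILTER: drop STM entries the policy deems distracting."""
        self.stm = [m for m in self.stm if predicate(m)]

mem = AgentMemory()
mem.add("goal", "User wants the capital of France")
mem.stm.extend(["distractor A", "distractor B", "note: capital of France"])
mem.filter(lambda m: "distractor" not in m)
hits = mem.retrieve("capital")
```

The key design point is that these operations are ordinary tool calls, so the policy itself, rather than a fixed heuristic, learns when to invoke each one.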

DIAGRAM

Agentic Memory interaction flow across user, tools, and memories

This diagram shows how Agentic Memory (AgeMem) uses tool calls to coordinate long-term and short-term memory during multi-round interactions.

DIAGRAM

Three-stage progressive RL and evaluation pipeline for AgeMem

This diagram shows how Agentic Memory (AgeMem) is trained and evaluated with three-stage trajectories and step-wise GRPO across five benchmarks.

PROCESS

How Agentic Memory Handles a Three-stage Trajectory

  1. Stage 1: LTM construction

    Agentic Memory (AgeMem) interacts casually using memory management tools to populate long-term memory while logging experiences for the three-stage progressive RL strategy.

  2. Stage 2: STM control under distractors

    Agentic Memory (AgeMem) resets short-term context, faces distractors, and uses STM tools like SUMMARY and FILTER to manage context efficiently.

  3. Stage 3: integrated reasoning and memory coordination

    Agentic Memory (AgeMem) receives the final query, invokes RETRIEVE on long-term memory, and coordinates STM and LTM to answer accurately.

  4. Step-wise GRPO optimization

    Agentic Memory (AgeMem) applies step-wise GRPO to trajectories, broadcasting terminal rewards to earlier memory decisions for unified policy updates.
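The reward broadcasting in step 4 can be sketched as follows. The paper states that terminal rewards are propagated to earlier memory decisions; the exponential discounting used here is an illustrative assumption about how that propagation might be shaped.

```python
# Hedged sketch of step-wise reward broadcasting: the terminal reward from
# Stage 3 is propagated back through the trajectory so that every earlier
# memory decision receives a learning signal. Discounting is an assumption.
def broadcast_terminal_reward(num_steps: int, terminal_reward: float,
                              gamma: float = 0.95) -> list:
    """Return per-step rewards, discounting the terminal reward backward."""
    return [terminal_reward * gamma ** (num_steps - 1 - t)
            for t in range(num_steps)]

rewards = broadcast_terminal_reward(num_steps=4, terminal_reward=1.0)
# Earlier steps receive smaller (discounted) shares of the final reward.
```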

KEY CONTRIBUTIONS

Key Contributions

  • Agentic Memory unified framework

    Agentic Memory (AgeMem) introduces a unified agentic memory framework where memory management tools expose ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, and FILTER directly to the policy.

  • Three-stage progressive RL strategy

    Agentic Memory (AgeMem) uses a three-stage progressive RL strategy to separately acquire LTM storage, STM control, and then coordinated reasoning capabilities.

  • Step-wise GRPO for fragmented rewards

    Agentic Memory (AgeMem) designs step-wise GRPO to transform cross-stage dependencies into learnable signals, stabilizing RL over fragmented memory trajectories.
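As a rough sketch of the GRPO side of these contributions, group-relative advantage estimation standardizes rewards within a group of sampled rollouts, so each trajectory is credited relative to its peers. The function below follows the standard GRPO normalization recipe; its step-wise application to fragmented memory trajectories is the paper's contribution, not reproduced here.

```python
# Illustrative sketch of GRPO-style group advantage normalization: rewards
# from a group of sampled rollouts are standardized so each rollout's
# advantage is measured relative to the group mean.
def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages sum to ~0; above-average rollouts receive positive credit.
```

Because the normalization needs no learned value function, it remains cheap to apply at every step of a long memory trajectory.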

RESULTS

By the Numbers

Average

54.31%

+8.57 over A-Mem on Qwen3-4B-Instruct

ALFWorld

48.97%

+7.80 over Mem0 on Qwen3-4B-Instruct

SciWorld

59.48%

+8.10 over Mem0 on Qwen3-4B-Instruct

HotpotQA

55.49%

+7.01 over A-Mem on Qwen3-4B-Instruct

On ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, Agentic Memory (AgeMem) is evaluated for long-horizon reasoning and memory quality. The 54.31% average vs 45.74% for A-Mem shows that unified, tool-based memory with step-wise GRPO yields stronger agent performance across diverse environments.

BENCHMARK

Benchmark: Performance comparison across five benchmarks (Qwen3-4B-Instruct)

Average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA on Qwen3-4B-Instruct.

BENCHMARK

Benchmark: Memory Quality on HotpotQA

Memory Quality (MQ) score on HotpotQA using Qwen3-4B-Instruct.

KEY INSIGHT

The Counterintuitive Finding

Agentic Memory (AgeMem) increases average tool calls from 4.33 to 4.92 on Qwen2.5-7B-Instruct but still reduces prompt tokens by 3.1% on HotpotQA.

More memory operations usually suggest higher overhead, yet AgeMem shows that smarter STM tools and GRPO can shrink context while improving reasoning quality.

WHY IT MATTERS

What this unlocks for the field

Agentic Memory (AgeMem) unlocks agents that learn when to store, retrieve, summarize, filter, update, and delete information instead of following rigid memory heuristics.

Builders can now train a single LLM agent to manage both long-term and short-term memory end-to-end, enabling scalable long-horizon workflows without auxiliary expert controllers.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Agent Memory · Memory Architecture

General Agentic Memory Via Deep Research

B.Y. Yan, Chaofan Li et al.

arXiv 2025 · 2025

General Agentic Memory (GAM) combines a Memorizer, Researcher, page-store, and memory to keep full trajectories while constructing lightweight guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.