Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Authors: Yi Yu, Liuyi Yao, Yuexiang Xie et al.

arXiv 2026

TL;DR

Agentic Memory (AgeMem) unifies tool-based long-term and short-term memory control with step-wise GRPO, reaching a 54.31% average across five benchmarks versus 45.74% for A-Mem (+8.57 points).

THE PROBLEM

Fragmented agent memory hurts long-horizon reasoning and context efficiency

LLM agents with finite context windows rely on separately tuned long-term memory (LTM) and short-term memory (STM) modules, leading to fragmented memory construction and suboptimal performance on long-horizon reasoning tasks.

Trigger-based and agent-based memory controllers depend on handcrafted rules or auxiliary expert models, increasing system complexity and limiting end-to-end optimization for real-world agent workflows.

HOW IT WORKS

Agentic Memory — unified tool-based LTM and STM with step-wise GRPO

Agentic Memory (AgeMem) integrates memory management tools, a three-stage progressive RL strategy, and step-wise GRPO into the agent policy to jointly manage LTM and STM.

You can think of AgeMem as giving the agent its own RAM and disk controller, deciding when to cache, compress, or archive information instead of relying on fixed retrieval heuristics.

This unified control lets AgeMem learn when to store, retrieve, summarize, filter, update, or delete information in ways a plain context window or static RAG pipeline cannot.
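As a concrete illustration, this tool-based control can be sketched as a minimal memory interface. The tool names (ADD, RETRIEVE, SUMMARY, FILTER) follow the paper; the data structures, keyword matching, and truncation logic below are simplifying assumptions, not the paper's implementation.

```python
# Hypothetical sketch of AgeMem-style memory tools exposed to the policy.
# Tool names follow the paper; everything else is an illustrative assumption.
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    ltm: dict = field(default_factory=dict)   # long-term store: key -> text
    stm: list = field(default_factory=list)   # short-term context buffer

    def add(self, key: str, text: str) -> None:
        """ADD: persist an experience to long-term memory."""
        self.ltm[key] = text

    def retrieve(self, query: str) -> list:
        """RETRIEVE: naive keyword match; a real system would embed and rank."""
        return [v for v in self.ltm.values() if query.lower() in v.lower()]

    def summary(self, max_items: int = 3) -> None:
        """SUMMARY: compress the STM buffer (here: keep only recent items)."""
        self.stm = self.stm[-max_items:]

    def filter(self, predicate) -> None:
        """FILTER: drop STM entries the policy deems distracting."""
        self.stm = [m for m in self.stm if predicate(m)]

mem = AgentMemory()
mem.add("goal", "User wants the capital of France")
mem.stm.extend(["distractor A", "distractor B", "note: capital of France"])
mem.filter(lambda m: "distractor" not in m)
hits = mem.retrieve("capital")
```

The key design point is that these operations are ordinary tool calls, so the policy itself, rather than a fixed heuristic, learns when to invoke each one.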

DIAGRAM

Agentic Memory interaction flow across user, tools, and memories

This diagram shows how Agentic Memory (AgeMem) uses tool calls to coordinate long-term and short-term memory during multi-round interactions.

DIAGRAM

Three-stage progressive RL and evaluation pipeline for AgeMem

This diagram shows how Agentic Memory (AgeMem) is trained and evaluated with three-stage trajectories and step-wise GRPO across five benchmarks.

PROCESS

How Agentic Memory Handles a Three-stage Trajectory

  1. Stage 1: LTM construction

    Agentic Memory (AgeMem) interacts casually using memory management tools to populate long-term memory while logging experiences for the three-stage progressive RL strategy.

  2. Stage 2: STM control under distractors

    Agentic Memory (AgeMem) resets short-term context, faces distractors, and uses STM tools like SUMMARY and FILTER to manage context efficiently.

  3. Stage 3: integrated reasoning and memory coordination

    Agentic Memory (AgeMem) receives the final query, invokes RETRIEVE on long-term memory, and coordinates STM and LTM to answer accurately.

  4. Step-wise GRPO optimization

    Agentic Memory (AgeMem) applies step-wise GRPO to trajectories, broadcasting terminal rewards to earlier memory decisions for unified policy updates.
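The reward broadcasting in step 4 can be sketched as follows. The paper states that terminal rewards are propagated to earlier memory decisions; the exponential discounting used here is an illustrative assumption about how that propagation might be shaped.

```python
# Hedged sketch of step-wise reward broadcasting: the terminal reward from
# Stage 3 is propagated back through the trajectory so that every earlier
# memory decision receives a learning signal. Discounting is an assumption.
def broadcast_terminal_reward(num_steps: int, terminal_reward: float,
                              gamma: float = 0.95) -> list:
    """Return per-step rewards, discounting the terminal reward backward."""
    return [terminal_reward * gamma ** (num_steps - 1 - t)
            for t in range(num_steps)]

rewards = broadcast_terminal_reward(num_steps=4, terminal_reward=1.0)
# Earlier steps receive smaller (discounted) shares of the final reward.
```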

KEY CONTRIBUTIONS

Key Contributions

  • Agentic Memory unified framework

    Agentic Memory (AgeMem) introduces a unified agentic memory framework where memory management tools expose ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, and FILTER directly to the policy.

  • Three-stage progressive RL strategy

    Agentic Memory (AgeMem) uses a three-stage progressive RL strategy to separately acquire LTM storage, STM control, and then coordinated reasoning capabilities.

  • Step-wise GRPO for fragmented rewards

    Agentic Memory (AgeMem) designs step-wise GRPO to transform cross-stage dependencies into learnable signals, stabilizing RL over fragmented memory trajectories.
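As a rough sketch of the GRPO side of these contributions, group-relative advantage estimation standardizes rewards within a group of sampled rollouts, so each trajectory is credited relative to its peers. The function below follows the standard GRPO normalization recipe; its step-wise application to fragmented memory trajectories is the paper's contribution, not reproduced here.

```python
# Illustrative sketch of GRPO-style group advantage normalization: rewards
# from a group of sampled rollouts are standardized so each rollout's
# advantage is measured relative to the group mean.
def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

adv = group_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages sum to ~0; above-average rollouts receive positive credit.
```

Because the normalization needs no learned value function, it remains cheap to apply at every step of a long memory trajectory.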

RESULTS

By the Numbers

Average

54.31%

+8.57 over A-Mem on Qwen3-4B-Instruct

ALFWorld

48.97%

+7.80 over Mem0 on Qwen3-4B-Instruct

SciWorld

59.48%

+8.10 over Mem0 on Qwen3-4B-Instruct

HotpotQA

55.49%

+7.01 over A-Mem on Qwen3-4B-Instruct

On ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, Agentic Memory (AgeMem) is evaluated for long-horizon reasoning and memory quality. The 54.31% average vs 45.74% for A-Mem shows that unified, tool-based memory with step-wise GRPO yields stronger agent performance across diverse environments.

BENCHMARK

Benchmark: Performance comparison across five benchmarks (Qwen3-4B-Instruct)

Average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA on Qwen3-4B-Instruct.

BENCHMARK

Benchmark: Memory Quality on HotpotQA

Memory Quality (MQ) score on HotpotQA using Qwen3-4B-Instruct.

KEY INSIGHT

The Counterintuitive Finding

Agentic Memory (AgeMem) increases average tool calls from 4.33 to 4.92 on Qwen2.5-7B-Instruct but still reduces prompt tokens by 3.1% on HotpotQA.

More memory operations usually suggest higher overhead, yet AgeMem shows that smarter STM tools and GRPO can shrink context while improving reasoning quality.

WHY IT MATTERS

What this unlocks for the field

Agentic Memory (AgeMem) unlocks agents that learn when to store, retrieve, summarize, filter, update, and delete information instead of following rigid memory heuristics.

Builders can now train a single LLM agent to manage both long-term and short-term memory end-to-end, enabling scalable long-horizon workflows without auxiliary expert controllers.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Agent Memory · Memory Architecture

General Agentic Memory Via Deep Research

B.Y. Yan, Chaofan Li et al.

arXiv 2025 · 2025

General Agentic Memory (GAM) combines a Memorizer, Researcher, page-store, and memory to keep full trajectories while constructing lightweight guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.