Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Authors: Zexue He, Yu Wang, Churan Zhi et al.

2026

TL;DR

MEMORYARENA uses multi-session Memory-Agent-Environment loops with interdependent subtasks to show that state-of-the-art memory agents still achieve near-zero success rates on realistic long-horizon tasks.



THE PROBLEM

Memory benchmarks miss agentic failures despite near-saturated long-context scores

Existing long-context memory benchmarks like LoCoMo report near-saturated performance, yet they only test static recall without actions or environment dynamics.

When agents face multi-session tasks with interdependent subtasks, MEMORYARENA shows Task Success Rates dropping to 0.00–0.12, meaning agents fail to solve realistic long-horizon goals.

HOW IT WORKS

MEMORYARENA — Memory-Agent-Environment loops for multi-session tasks

MEMORYARENA centers on Memory-Agent-Environment Loops and a Multi-Session Working Flow, instantiated in environments such as Bundled Web Shopping and Group Travel Planning, to couple memorization with agentic actions and feedback.

You can think of MEMORYARENA like a computer with RAM and disk: sessions are processes, and the memory system is an external store that must persist state between runs.

This design lets MEMORYARENA expose failures that a plain context window cannot, because crucial information disappears once a session ends and must be explicitly written to and read from persistent memory.
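Under the RAM-and-disk analogy, a minimal sketch (hypothetical names, not MEMORYARENA's actual API) of why cross-session state must go through an explicit external store:

```python
# Illustrative sketch: sessions are isolated processes, so state
# survives only if it is explicitly written to an external store.

memory_store = {}  # persists across sessions, like disk


def session_1():
    # In-session context, like RAM: lost when the session ends.
    discovered = {"budget_left": 120, "chosen_item": "laptop stand"}
    # Without this explicit write, session 2 cannot see 'discovered'.
    memory_store["session_1_facts"] = discovered


def session_2():
    # The only way back to session 1's findings is an explicit read.
    facts = memory_store.get("session_1_facts", {})
    return facts.get("budget_left", "unknown")


session_1()
print(session_2())  # → 120
```

A plain context window hides this failure mode because nothing is ever forgotten within a single run; forcing the write/read through `memory_store` is what makes forgetting observable.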

DIAGRAM

Multi-session Memory-Agent-Environment loop across interdependent subtasks

This diagram shows how MEMORYARENA executes a sequence of interdependent subtasks as separate sessions, each updating and querying persistent memory.

DIAGRAM

Evaluation pipeline across four MEMORYARENA environments

This diagram shows how MEMORYARENA builds and evaluates multi-session tasks in four environments with GPT-5.1-mini plus different memory systems.

PROCESS

How MEMORYARENA Handles a Multi-Session Working Flow

  1. Task Composition and Data Preparation

    MEMORYARENA constructs Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and formal reasoning tasks with explicitly interdependent subtasks and verified causal chains.

  2. Single-Session Agent-Environment Interactions

    Within each subtask, MEMORYARENA lets the LLM Agent interact stepwise with the Environment, collecting actions and observations as a session trace.

  3. Multi-session Agent-Environment Interactions

    MEMORYARENA sequences subtasks so later sessions depend on earlier ones, making previous traces inaccessible except through the memory system.

  4. Final: Memory-Agent-Environment Loop

    At each subtask, MEMORYARENA calls RETRIEVE and UPDATE on the memory system, then evaluates Task Success Rate and Task Progress Score across all sessions.

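The four-step flow above can be sketched as a single loop over subtasks. The `KeyValueMemory`, `run_subtask`, and `multi_session_flow` names below are illustrative stand-ins, not MEMORYARENA's actual interfaces:

```python
# Hypothetical sketch of a Memory-Agent-Environment loop over
# interdependent subtasks: RETRIEVE before each session, UPDATE after.

class KeyValueMemory:
    """Toy persistent memory exposing RETRIEVE and UPDATE operations."""

    def __init__(self):
        self.records = []

    def retrieve(self, query):
        # Naive retrieval: return records sharing any word with the query.
        words = set(query.lower().split())
        return [r for r in self.records if words & set(r.lower().split())]

    def update(self, trace):
        self.records.append(trace)


def run_subtask(subtask, context):
    # Stand-in for a single-session agent-environment rollout that
    # collects actions and observations into a session trace.
    return f"{subtask}: done using {len(context)} retrieved memories"


def multi_session_flow(subtasks, memory):
    results = []
    for subtask in subtasks:
        context = memory.retrieve(subtask)   # RETRIEVE before acting
        trace = run_subtask(subtask, context)
        memory.update(trace)                 # UPDATE after the session
        results.append(trace)
    return results


memory = KeyValueMemory()
traces = multi_session_flow(
    ["compare laptop prices", "buy cheapest laptop", "buy laptop bag"],
    memory,
)
```

Because later subtasks can only see earlier sessions through `memory.retrieve`, a weak retrieval or update step starves every downstream session, which is exactly the failure mode the benchmark is built to surface.
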
KEY CONTRIBUTIONS

Key Contributions

  • MEMORYARENA: Agent Memory in Memory-Agent-Environment Loops

    MEMORYARENA formalizes Memory-Agent-Environment Loops and Multi-Session Working Flow to couple memorization with actions, feedback, and persistent memory across sessions.

  • Four interdependent evaluation environments

    MEMORYARENA provides Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and formal math and physics reasoning, totaling 766 tasks with an average of 57 action steps.

  • Unified benchmarking of memory paradigms

    MEMORYARENA evaluates long-context agents, RAG systems, and external memory agents like MemGPT, Mem0, GraphRAG, and ReasoningBank under the same multi-session setting.

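Benchmarking long-context, RAG, and external-memory agents under one multi-session setting implies a shared interface that each paradigm implements differently. The sketch below is a simplified hypothetical, not the actual MEMORYARENA, MemGPT, Mem0, GraphRAG, or ReasoningBank APIs:

```python
# Hypothetical unified interface for comparing memory paradigms under
# the same multi-session protocol.

from abc import ABC, abstractmethod


class MemoryParadigm(ABC):
    @abstractmethod
    def update(self, session_trace: str) -> None: ...

    @abstractmethod
    def build_context(self, subtask: str) -> str: ...


class LongContextMemory(MemoryParadigm):
    """Keep every trace and replay the full history each session."""

    def __init__(self):
        self.history = []

    def update(self, session_trace):
        self.history.append(session_trace)

    def build_context(self, subtask):
        return "\n".join(self.history)  # may exceed the context window


class RetrievalMemory(MemoryParadigm):
    """Store traces; surface only those sharing words with the subtask."""

    def __init__(self):
        self.store = []

    def update(self, session_trace):
        self.store.append(session_trace)

    def build_context(self, subtask):
        words = set(subtask.lower().split())
        hits = [t for t in self.store if words & set(t.lower().split())]
        return "\n".join(hits)


paradigms = [LongContextMemory(), RetrievalMemory()]
for p in paradigms:
    p.update("session 1: reserved hotel in Kyoto")
    p.update("session 2: booked train tickets")
```

For a later subtask like "confirm hotel booking", the long-context variant replays both traces while the retrieval variant surfaces only the hotel one; holding the protocol fixed and swapping the paradigm is what makes such differences comparable.
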
RESULTS

By the Numbers

Task Success Rate

0.12

GPT-5.1-mini long-context vs 0.00 for many memory agents on Bundled Web Shopping

Task Progress Score

0.79

Claude-Sonnet-4.5 long-context Task Progress Score on Bundled Web Shopping vs 0.41 memory-agent average

Soft Process Score

0.52

GPT-5.1-mini long-context soft Process Score on Group Travel Planning vs 0.38 all-method average

Average Steps per Task

57

Average action steps per MEMORYARENA task across environments

MEMORYARENA reports Task Success Rate, Task Progress Score, and soft Process Score across Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and Formal Reasoning. These numbers show that even strong long-context and memory-augmented agents struggle to maintain and reuse information across interdependent multi-session tasks.
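The gap between the two headline metrics can be illustrated under one plausible reading (the paper's exact definitions may differ): a task succeeds only if every interdependent subtask succeeds, while a progress-style score gives partial credit per subtask:

```python
# Illustrative metric computation; definitions here are assumed,
# not copied from the paper.

def task_success(subtask_outcomes):
    # Strict: one failed subtask sinks the whole multi-session task.
    return 1.0 if all(subtask_outcomes) else 0.0


def task_progress(subtask_outcomes):
    # Partial credit: fraction of subtasks completed.
    return sum(subtask_outcomes) / len(subtask_outcomes)


def benchmark(tasks):
    n = len(tasks)
    return {
        "success_rate": sum(task_success(t) for t in tasks) / n,
        "progress_score": sum(task_progress(t) for t in tasks) / n,
    }


# Three tasks of four subtasks each; only the first fully succeeds.
tasks = [
    [True, True, True, True],
    [True, True, True, False],
    [True, False, False, False],
]
scores = benchmark(tasks)
```

Under this reading, the strict success rate (1/3 here) sits well below the average progress (2/3 here), which mirrors how agents in the results above post respectable progress scores while task success stays near zero.
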


BENCHMARK

Main results for the task agent (GPT-5.1-mini) with long-context memory, memory agents, and RAG agents

Task Success Rate on Bundled Web Shopping for GPT-5.1-mini with different memory paradigms.

KEY INSIGHT

The Counterintuitive Finding

MEMORYARENA shows Group Travel Planning has near-zero Task Success Rate and Process Score for all methods, despite sophisticated memory systems.

This is surprising because agents that nearly saturate LoCoMo and other long-context benchmarks were expected to transfer those gains to realistic multi-session planning, but MEMORYARENA reveals they do not.

WHY IT MATTERS

What this unlocks for the field

MEMORYARENA unlocks a way to test whether memory actually supports long-horizon decision-making, not just static recall from long contexts.

Builders can now design and compare memory systems under realistic Memory-Agent-Environment Loops, targeting belief tracking and cross-session state management that were previously invisible to standard benchmarks.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.

Survey

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

Yaxiong Wu, Sheng Liang et al.

arXiv 2025

From Human Memory to AI Memory maps human memory categories onto AI memory using the 3D-8Q taxonomy with Personal Memory, System Memory, and the Three-Dimensional Eight-Quadrant Memory Taxonomy. The main result is that From Human Memory to AI Memory systematically organizes memory in LLM-driven AI systems across eight quadrants defined by object, form, and time, connecting them to human memory types.
