AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Authors: Cheng Jiayang, Dongyu Ru, Lin Qiu et al.

2026

TL;DR

AMemGym grounds long-horizon conversations in structured user state evolution to enable on-policy memory diagnosis, revealing a memory score of up to 0.291 for AWE-(2,4,30) and large rank shifts versus off-policy evaluation.



THE PROBLEM

Off-policy memory benchmarks mislead agent design with reuse bias

Existing memory benchmarks rely on static, off-policy data; AMemGym shows that AWE-(2,4,30)'s memory score shifts from 0.253 to 0.291 when the same system is evaluated on-policy.

This reuse bias means conversational assistants tuned on off-policy traces can select suboptimal memory configurations, harming long-horizon personalization quality in real interactive deployments.

HOW IT WORKS

AMemGym — structured state evolution for on-policy memory evaluation

AMemGym’s core mechanism chains Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation so assistants are tested against controlled user state trajectories and diagnostic feedback.

You can think of AMemGym like a game engine where the world state is a structured database, and the dialogue is the rendered scene exposing only parts of that state.

This grounding lets AMemGym probe write, read, and utilization failures in ways a plain context window benchmark cannot, because every conversational turn is tied back to precise latent user states.
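The game-engine analogy above can be sketched in a few lines of Python. The state schema, update method, and exposure logic here are invented stand-ins for illustration, not AMemGym's actual data structures:

```python
# Hypothetical sketch: a structured latent user state that evolves over
# periods, while each dialogue turn exposes only a slice of it.
class UserState:
    """The 'database' behind the conversation."""
    def __init__(self, variables):
        self.variables = dict(variables)

    def apply_update(self, key, value):
        """One scripted step of user state evolution."""
        self.variables[key] = value

def render_utterance(state, exposed_keys):
    """The dialogue is the 'rendered scene': only exposed_keys are visible."""
    facts = {k: state.variables[k] for k in exposed_keys}
    return f"User mentions: {facts}"

state = UserState({"diet": "vegetarian", "city": "Boston"})
state.apply_update("diet", "vegan")        # a later period changes the preference
turn = render_utterance(state, ["diet"])   # this turn exposes only 'diet'
print(turn)
```

Because every utterance is generated from a known latent state, the benchmark can later check exactly which facts the assistant should have captured at each point in the trajectory.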

DIAGRAM

Interactive conversation loop in AMemGym

This diagram shows how AMemGym’s simulated user, assistant, and structured state blueprint interact during each conversation period.

DIAGRAM

AMemGym evaluation and diagnosis pipeline

This diagram shows how AMemGym computes overall accuracy, normalized memory score, and write–read–utilization failure rates from interactions.
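One way to picture the write–read–utilization decomposition is as a cascade of checks on each wrong answer. The category rules below are assumptions for illustration, not the paper's exact criteria:

```python
# Illustrative decomposition of an error into failure types (assumed logic):
# write failure if the fact never entered memory, read failure if stored but
# not retrieved, utilization failure if retrieved but absent from the answer.
def classify_failure(fact, memory, retrieved, answer):
    if fact not in memory:
        return "write"        # never stored
    if fact not in retrieved:
        return "read"         # stored but not retrieved
    if fact not in answer:
        return "utilization"  # retrieved but ignored in the answer
    return "correct"

print(classify_failure("diet=vegan", set(), set(), ""))                    # write
print(classify_failure("diet=vegan", {"diet=vegan"}, set(), ""))           # read
print(classify_failure("diet=vegan", {"diet=vegan"}, {"diet=vegan"}, ""))  # utilization
```

Aggregating these labels over all question–answer pairs yields the per-system failure rates the diagram refers to.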

PROCESS

How AMemGym Handles a Long-Horizon Conversation Session

  1. Structured Data Generation

    AMemGym runs Structured Data Generation to sample user profiles, build a global state schema, and script user state evolution trajectories for each persona.

  2. On-Policy Interaction

    During On-Policy Interaction, AMemGym’s LLM-simulated user exposes selected state variables through grounded utterances while the assistant updates its memory.

  3. Evaluation Metrics

    After each evolution period, AMemGym’s Evaluation Metrics module asks state-dependent questions and computes overall accuracy and normalized memory scores.

  4. Diagnostic Evaluation

    AMemGym’s Diagnostic Evaluation decomposes errors into write, read, and utilization failures by comparing assistant state queries against the structured state trajectory.
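The four steps above can be strung together as a toy session loop. TinyAssistant and the scripted periods and questions below are invented stand-ins for AMemGym's simulated user and memory systems:

```python
# Minimal sketch of the four-stage session loop (all names invented).
class TinyAssistant:
    """Toy memory policy: store every exposed fact verbatim."""
    def __init__(self):
        self.memory = {}

    def observe(self, utterance):          # step 2: write during interaction
        key, value = utterance.split("=")
        self.memory[key] = value

    def answer(self, question):            # read + utilize at evaluation time
        return self.memory.get(question)

def run_session(periods, questions, assistant):
    scores = []
    for pid, utterances in periods.items():  # step 1: scripted state evolution
        for u in utterances:
            assistant.observe(u)
        qa = questions[pid]                  # step 3: per-period evaluation
        scores.append(sum(assistant.answer(q) == a for q, a in qa) / len(qa))
    return scores                            # step 4: input to diagnosis

periods = {1: ["diet=vegan"], 2: ["diet=keto", "city=Austin"]}
questions = {1: [("diet", "vegan")], 2: [("diet", "keto"), ("city", "Austin")]}
print(run_session(periods, questions, TinyAssistant()))  # [1.0, 1.0]
```

A weaker write policy (for example, one that drops the second mention of "diet") would lose points in period 2, which is exactly the kind of drift across evolution periods the benchmark is designed to surface.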

KEY CONTRIBUTIONS

Key Contributions

  • Interactive environment for on-policy evaluation

    AMemGym introduces an interactive environment where Structured Data Generation and On-Policy Interaction yield scalable, evaluation-aligned long-horizon conversations for memory benchmarking.

  • Comprehensive diagnosis of memory systems

    AMemGym’s Evaluation Metrics and Diagnostic Evaluation decompose performance into write, read, and utilization failures, revealing why systems like AWE and RAG behave differently.

  • Proof of concept for agent self-evolution

    Using AMemGym’s feedback, an Agentic Write (In-Context) policy improves memory score from 0.172 to 0.197 and reduces write failures from 0.293 to 0.263 under Complete Feedback.
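A minimal sketch of what feedback-driven self-evolution could look like for a prompt-based write policy. The feedback format and the rule-appending update are assumptions for illustration, not the paper's Agentic Write (In-Context) mechanism:

```python
# Hypothetical: nudge a prompt-based write policy toward its dominant
# failure mode as reported by diagnostic feedback (invented update rule).
def evolve_write_policy(rules, failure_rates):
    worst = max(failure_rates, key=failure_rates.get)
    fixes = {
        "write": "Record every stated preference or state change immediately.",
        "read": "Query memory before answering any user-specific question.",
        "utilization": "Quote retrieved memory entries verbatim in the answer.",
    }
    return rules + [fixes[worst]]

rules = ["Summarize each conversation turn."]
rules = evolve_write_policy(rules, {"write": 0.293, "read": 0.08, "utilization": 0.05})
print(rules[-1])
```

The reported gains (memory score 0.172 to 0.197, write failures 0.293 to 0.263) suggest that even coarse diagnostic signals like these can steer a policy's in-context instructions in a useful direction.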

RESULTS

By the Numbers

On-policy memory score

0.291

+0.088 over LLM

Off-policy memory score

0.253

vs AWE-(2,4,30) on-policy 0.291

LLM memory score

0.203

gpt-4.1-mini baseline without external memory

Meta-eval state exposure

99.1%

state exposure quality with 96.8% Gwet’s AC1

These numbers come from AMemGym’s base configuration, which tests long-horizon personalization with 10 evolution periods and 128K+ token contexts. The 0.291 memory score for AWE-(2,4,30) versus 0.203 for LLM shows AMemGym can clearly separate agentic memory systems from plain long-context LLMs under interactive evaluation.

BENCHMARK

On-policy vs Off-policy Memory Scores on AMemGym

Normalized memory score on AMemGym base configuration for AWE-(2,4,30), RAG-(2,4,30), LLM, and AWI.

KEY INSIGHT

The Counterintuitive Finding

AMemGym shows that AWE-(2,4,30) ranks first on-policy with 0.291 memory score but only 0.253 off-policy, dropping three ranks under static evaluation.

This is surprising because static long-context benchmarks assume dialogue understanding proxies interactive performance, yet AMemGym reveals reuse bias can invert configuration rankings for the same assistant.

WHY IT MATTERS

What this unlocks for the field

AMemGym gives researchers a controllable, on-policy playground to stress-test how assistants write, read, and use memory over many conversational periods.

Builders can now tune and evolve memory agents directly against interactive feedback, rather than guessing from static logs that hide reuse bias and long-context failures.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.
