AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Authors: Cheng Jiayang, Dongyu Ru, Lin Qiu et al.

2026

TL;DR

AMemGym grounds long-horizon conversations in structured user state evolution to enable on-policy memory diagnosis, revealing a memory score of up to 0.291 for AWE-(2,4,30) and large rank shifts versus off-policy evaluation.



THE PROBLEM

Off-policy memory benchmarks mislead agent design with reuse bias

Existing memory benchmarks rely on static, off-policy data; AMemGym shows that AWE-(2,4,30)'s memory score shifts from 0.253 to 0.291 when the same system is evaluated on-policy.

This reuse bias means conversational assistants tuned on off-policy traces can select suboptimal memory configurations, harming long-horizon personalization quality in real interactive deployments.

HOW IT WORKS

AMemGym — structured state evolution for on-policy memory evaluation

AMemGym’s core mechanism chains Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation so assistants are tested against controlled user state trajectories and diagnostic feedback.

You can think of AMemGym like a game engine where the world state is a structured database, and the dialogue is the rendered scene exposing only parts of that state.

This grounding lets AMemGym probe write, read, and utilization failures in ways a plain context window benchmark cannot, because every conversational turn is tied back to precise latent user states.
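The game-engine analogy above can be sketched in a few lines of Python. The state schema, update method, and exposure logic here are invented stand-ins for illustration, not AMemGym's actual data structures:

```python
# Hypothetical sketch: a structured latent user state that evolves over
# periods, while each dialogue turn exposes only a slice of it.
class UserState:
    """The 'database' behind the conversation."""
    def __init__(self, variables):
        self.variables = dict(variables)

    def apply_update(self, key, value):
        """One scripted step of user state evolution."""
        self.variables[key] = value

def render_utterance(state, exposed_keys):
    """The dialogue is the 'rendered scene': only exposed_keys are visible."""
    facts = {k: state.variables[k] for k in exposed_keys}
    return f"User mentions: {facts}"

state = UserState({"diet": "vegetarian", "city": "Boston"})
state.apply_update("diet", "vegan")        # a later period changes the preference
turn = render_utterance(state, ["diet"])   # this turn exposes only 'diet'
print(turn)
```

Because every utterance is generated from a known latent state, the benchmark can later check exactly which facts the assistant should have captured at each point in the trajectory.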

DIAGRAM

Interactive conversation loop in AMemGym

This diagram shows how AMemGym’s simulated user, assistant, and structured state blueprint interact during each conversation period.

DIAGRAM

AMemGym evaluation and diagnosis pipeline

This diagram shows how AMemGym computes overall accuracy, normalized memory score, and write–read–utilization failure rates from interactions.
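One way to picture the write–read–utilization decomposition is as a cascade of checks on each wrong answer. The category rules below are assumptions for illustration, not the paper's exact criteria:

```python
# Illustrative decomposition of an error into failure types (assumed logic):
# write failure if the fact never entered memory, read failure if stored but
# not retrieved, utilization failure if retrieved but absent from the answer.
def classify_failure(fact, memory, retrieved, answer):
    if fact not in memory:
        return "write"        # never stored
    if fact not in retrieved:
        return "read"         # stored but not retrieved
    if fact not in answer:
        return "utilization"  # retrieved but ignored in the answer
    return "correct"

print(classify_failure("diet=vegan", set(), set(), ""))                    # write
print(classify_failure("diet=vegan", {"diet=vegan"}, set(), ""))           # read
print(classify_failure("diet=vegan", {"diet=vegan"}, {"diet=vegan"}, ""))  # utilization
```

Aggregating these labels over all question–answer pairs yields the per-system failure rates the diagram refers to.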

PROCESS

How AMemGym Handles a Long-Horizon Conversation Session

  1. Structured Data Generation

    AMemGym runs Structured Data Generation to sample user profiles, build a global state schema, and script user state evolution trajectories for each persona.

  2. On-Policy Interaction

    During On-Policy Interaction, AMemGym’s LLM-simulated user exposes selected state variables through grounded utterances while the assistant updates its memory.

  3. Evaluation Metrics

    After each evolution period, AMemGym’s Evaluation Metrics module asks state-dependent questions and computes overall accuracy and normalized memory scores.

  4. Diagnostic Evaluation

    AMemGym’s Diagnostic Evaluation decomposes errors into write, read, and utilization failures by comparing assistant state queries against the structured state trajectory.
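The four steps above can be strung together as a toy session loop. TinyAssistant and the scripted periods and questions below are invented stand-ins for AMemGym's simulated user and memory systems:

```python
# Minimal sketch of the four-stage session loop (all names invented).
class TinyAssistant:
    """Toy memory policy: store every exposed fact verbatim."""
    def __init__(self):
        self.memory = {}

    def observe(self, utterance):          # step 2: write during interaction
        key, value = utterance.split("=")
        self.memory[key] = value

    def answer(self, question):            # read + utilize at evaluation time
        return self.memory.get(question)

def run_session(periods, questions, assistant):
    scores = []
    for pid, utterances in periods.items():  # step 1: scripted state evolution
        for u in utterances:
            assistant.observe(u)
        qa = questions[pid]                  # step 3: per-period evaluation
        scores.append(sum(assistant.answer(q) == a for q, a in qa) / len(qa))
    return scores                            # step 4: input to diagnosis

periods = {1: ["diet=vegan"], 2: ["diet=keto", "city=Austin"]}
questions = {1: [("diet", "vegan")], 2: [("diet", "keto"), ("city", "Austin")]}
print(run_session(periods, questions, TinyAssistant()))  # [1.0, 1.0]
```

A weaker write policy (for example, one that drops the second mention of "diet") would lose points in period 2, which is exactly the kind of drift across evolution periods the benchmark is designed to surface.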

KEY CONTRIBUTIONS

Key Contributions

  • Interactive environment for on-policy evaluation

    AMemGym introduces an interactive environment where Structured Data Generation and On-Policy Interaction yield scalable, evaluation-aligned long-horizon conversations for memory benchmarking.

  • Comprehensive diagnosis of memory systems

    AMemGym’s Evaluation Metrics and Diagnostic Evaluation decompose performance into write, read, and utilization failures, revealing why systems like AWE and RAG behave differently.

  • Proof of concept for agent self-evolution

    Using AMemGym’s feedback, an Agentic Write (In-Context) policy improves memory score from 0.172 to 0.197 and reduces write failures from 0.293 to 0.263 under Complete Feedback.
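A minimal sketch of what feedback-driven self-evolution could look like for a prompt-based write policy. The feedback format and the rule-appending update are assumptions for illustration, not the paper's Agentic Write (In-Context) mechanism:

```python
# Hypothetical: nudge a prompt-based write policy toward its dominant
# failure mode as reported by diagnostic feedback (invented update rule).
def evolve_write_policy(rules, failure_rates):
    worst = max(failure_rates, key=failure_rates.get)
    fixes = {
        "write": "Record every stated preference or state change immediately.",
        "read": "Query memory before answering any user-specific question.",
        "utilization": "Quote retrieved memory entries verbatim in the answer.",
    }
    return rules + [fixes[worst]]

rules = ["Summarize each conversation turn."]
rules = evolve_write_policy(rules, {"write": 0.293, "read": 0.08, "utilization": 0.05})
print(rules[-1])
```

The reported gains (memory score 0.172 to 0.197, write failures 0.293 to 0.263) suggest that even coarse diagnostic signals like these can steer a policy's in-context instructions in a useful direction.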

RESULTS

By the Numbers

On-policy memory score

0.291

+0.088 over LLM

Off-policy memory score

0.253

vs AWE-(2,4,30) on-policy 0.291

LLM memory score

0.203

gpt-4.1-mini baseline without external memory

Meta-eval state exposure

99.1%

state exposure quality with 96.8% Gwet’s AC1

These numbers come from AMemGym’s base configuration, which tests long-horizon personalization with 10 evolution periods and 128K+ token contexts. The 0.291 memory score for AWE-(2,4,30) versus 0.203 for LLM shows AMemGym can clearly separate agentic memory systems from plain long-context LLMs under interactive evaluation.

BENCHMARK

On-policy vs Off-policy Memory Scores on AMemGym

Normalized memory score on AMemGym base configuration for AWE-(2,4,30), RAG-(2,4,30), LLM, and AWI.

KEY INSIGHT

The Counterintuitive Finding

AMemGym shows that AWE-(2,4,30) ranks first on-policy with 0.291 memory score but only 0.253 off-policy, dropping three ranks under static evaluation.

This is surprising because static long-context benchmarks assume dialogue understanding proxies interactive performance, yet AMemGym reveals reuse bias can invert configuration rankings for the same assistant.

WHY IT MATTERS

What this unlocks for the field

AMemGym gives researchers a controllable, on-policy playground to stress-test how assistants write, read, and use memory over many conversational periods.

Builders can now tune and evolve memory agents directly against interactive feedback, rather than guessing from static logs that hide reuse bias and long-context failures.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.
