MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Authors: Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang et al.

2026

TL;DR

MEMORYCD pairs real long-horizon Amazon user histories with cross-domain memory sources to show that 14 long-context LLMs combined with 6 memory methods still score low on lifelong personalization (ROUGE ≤ 0.222, NDCG@3 ≤ 0.610).



THE PROBLEM

Memory benchmarks ignore lifelong cross-domain users and long contexts

Most existing memory benchmarks are synthetic and short; MEMORYCD instead comprises 400K real sessions with 1K+ context lengths and genuine user feedback.

These limitations mean long-context LLM agents fail to capture cross-domain preferences, hurting rating prediction, item ranking, and personalized review generation grounded in real behavior.

HOW IT WORKS

MEMORYCD — cross-domain long-context user memory benchmark

MEMORYCD constructs a long-term behavioral record pool Mu, defines four personalization tasks, and evaluates six memory methods including Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem.

You can think of MEMORYCD as a stress test where user histories are a huge disk, memory methods are index structures, and LLMs are the CPU querying them.

This design lets MEMORYCD probe how different memory mechanisms organize and route information beyond a plain context window, directly measuring downstream user satisfaction.
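To make the disk/index/CPU analogy concrete, here is a minimal, hypothetical sketch of a memory method as an index structure over a user's session pool Mu. This is not MEMORYCD's or any listed method's actual code; the class name, keyword-overlap scoring, and example sessions are all illustrative assumptions standing in for real retrieval mechanisms like those in Mem0 or A-Mem:

```python
from dataclasses import dataclass

@dataclass
class Session:
    domain: str
    text: str

class KeywordMemoryIndex:
    """Toy 'index structure' over a user's session pool M_u.

    Stores every session (the 'disk') and retrieves the sessions that
    share the most words with a query (a crude stand-in for whatever
    retrieval a real memory method performs before the LLM 'CPU' reads
    the result).
    """

    def __init__(self) -> None:
        self.pool: list[Session] = []

    def add(self, session: Session) -> None:
        self.pool.append(session)

    def retrieve(self, query: str, k: int = 2) -> list[Session]:
        q = set(query.lower().split())
        # Rank sessions by word overlap with the query, highest first.
        scored = sorted(
            self.pool,
            key=lambda s: len(q & set(s.text.lower().split())),
            reverse=True,
        )
        return scored[:k]

# Cross-domain setting: sessions from several Amazon domains feed one pool.
mem = KeywordMemoryIndex()
mem.add(Session("Books", "loved this fantasy novel with vivid world building"))
mem.add(Session("Electronics", "headphones had weak bass so returned them"))
mem.add(Session("Home and Kitchen", "sturdy pan that heats evenly"))

hits = mem.retrieve("recommend a fantasy novel", k=1)
print(hits[0].domain)  # → Books
```

The cross-domain setting then amounts to letting one pool mix sessions from all domains, so a query in one domain can surface evidence from another.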

DIAGRAM

Single vs cross-domain memory flow in MEMORYCD

This diagram shows how MEMORYCD routes single-domain and cross-domain user histories into tasks for personalization evaluation.

DIAGRAM

MEMORYCD evaluation pipeline over LLMs and memory methods

This diagram shows how MEMORYCD combines 14 long-context LLMs with 6 memory methods across four domains and four tasks.

PROCESS

How MEMORYCD Handles a Lifelong Personalization Evaluation

  1. Memory evaluation setting

    MEMORYCD defines user memory Mu from Amazon Review histories and chooses single-domain or cross-domain memory source settings for each user.

  2. Four personalization tasks

    MEMORYCD instantiates personalized rating prediction, personalized item ranking, personalized review summarization, and personalized review generation for each Mu.

  3. Long-context prompting and memory methods

    MEMORYCD feeds Mu into long-context prompting and memory systems like Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem with 14 LLM backbones.

  4. End-to-end evaluation

    MEMORYCD computes MAE, RMSE, NDCG@K, ROUGE-L, and BLEU-1 to quantify how well agents approximate real user behavior across domains.
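The metrics in the final step are all standard; a generic sketch (not MEMORYCD's evaluation harness, and with made-up ratings and relevance grades) shows how MAE, RMSE, and NDCG@K quantify the rating and ranking tasks:

```python
import math

def mae(pred: list[float], true: list[float]) -> float:
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred: list[float], true: list[float]) -> float:
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def ndcg_at_k(relevances: list[float], k: int = 3) -> float:
    """NDCG@K for items listed in the *predicted* ranking order.

    `relevances[i]` is the graded relevance of the item the agent
    ranked at position i; DCG is normalized by the ideal ordering.
    """
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical agent output for one user: predicted vs. true star ratings.
pred = [4.0, 3.5, 5.0]
true = [5.0, 3.0, 5.0]
print(round(mae(pred, true), 3))   # → 0.5
print(round(rmse(pred, true), 3))  # → 0.645

# Ranking task: relevance grades of items in the agent's ranked order.
print(round(ndcg_at_k([1, 0, 2], k=3), 3))  # → 0.76
```

ROUGE-L and BLEU-1 for the two text tasks work analogously but over token overlap between generated and real user reviews.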

KEY CONTRIBUTIONS

Key Contributions

  • Cross-domain long-context memory benchmark

    MEMORYCD constructs 400K sessions with 1K+ context length from Amazon Review, forming Mu that spans 12 domains with real user feedback.

  • End-to-end personalization tasks

    MEMORYCD defines four tasks—rating prediction, item ranking, review summarization, and review generation—to measure user decision making and text style alignment.

  • Comprehensive LLM and memory evaluation

    MEMORYCD evaluates 14 frontier long-context LLMs with six memory systems including Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem across Books, Electronics, Home and Kitchen, and Personal Care.

RESULTS

By the Numbers

RMSE (Books rating)

0.624

0.192 lower RMSE than GPT-5-Nano on Books rating prediction

NDCG@3 (Electronics ranking)

0.604

0.247 higher NDCG@3 than GPT-5-Nano on Electronics ranking

ROUGE (Books generation)

0.162

0.052 higher ROUGE than GPT-5-Nano on Books review generation

ROUGE (Home and Kitchen generation)

0.178

0.071 higher ROUGE than GPT-5-Nano on Home and Kitchen generation

MEMORYCD reports these metrics on Amazon Review based Books, Electronics, and Home and Kitchen domains, showing GPT-5 and Gemini-2.5 Pro still leave large gaps to real user behavior despite long-context memory.


BENCHMARK

Performance comparison of frontier LLMs in single-domain with long-context memory prompting (Books)

RMSE on Books rating prediction under long-context prompting.

KEY INSIGHT

The Counterintuitive Finding

On Books, long-context GPT-5 achieves the best rating-prediction MAE (0.330), yet its review-generation ROUGE is only 0.132.

This breaks the assumption that better numeric preference modeling automatically yields equally strong personalized language generation quality for the same users.
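ROUGE-L, the generation metric behind this finding, rewards longest-common-subsequence overlap with the user's real review, so an agent can model numeric preferences well while still missing the user's wording. A minimal sketch of the standard LCS-based ROUGE-L F-score (the example sentences are invented, not from the benchmark):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F-score: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# A fluent but generic review shares little subsequence with the real one.
ref = "the pan heats evenly and feels sturdy"
gen = "this pan is a solid product overall"
print(round(rouge_l_f(gen, ref), 3))  # → 0.143
```

The generated sentence is perfectly plausible, yet only one token survives the LCS, which is exactly how a strong rating predictor can still post a low generation score.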

WHY IT MATTERS

What this unlocks for the field

MEMORYCD enables systematic testing of how different memory mechanisms and cross-domain sources impact real user aligned personalization over years of behavior.

Builders can now benchmark lifelong personal assistants, choose memory designs like MemoryBank or A-Mem, and tune cross-domain transfer rather than relying on short synthetic dialogs.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories on Utility, Confidence, Novelty, Recency, and Type Prior, combined by a learned linear admission policy (Algorithm 1: A-MAC Memory Admission). On the LoCoMo benchmark, A-MAC achieves F1 0.583 at 2644 ms latency, improving F1 by 0.042 and cutting latency by 1187 ms compared to A-Mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026 · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
