MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Authors: Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang et al.

2026

TL;DR

MEMORYCD pairs real long-horizon Amazon user histories with cross-domain memory sources to show that 14 long-context LLMs combined with 6 memory methods still score low on lifelong personalization (ROUGE ≤ 0.222, NDCG@3 ≤ 0.610).



THE PROBLEM

Memory benchmarks ignore lifelong cross-domain users and long contexts

Most existing memory benchmarks are synthetic and short; MEMORYCD instead comprises 400K real sessions with 1K+ context lengths and genuine user feedback.

These limitations mean long-context LLM agents fail to capture cross-domain preferences, hurting rating prediction, item ranking, and personalized review generation grounded in real behavior.

HOW IT WORKS

MEMORYCD — cross-domain long-context user memory benchmark

MEMORYCD constructs a long-term behavioral record pool Mu, defines four personalization tasks, and evaluates six memory methods including Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem.

You can think of MEMORYCD as a stress test where user histories are a huge disk, memory methods are index structures, and LLMs are the CPU querying them.

This design lets MEMORYCD probe how different memory mechanisms organize and route information beyond a plain context window, directly measuring downstream user satisfaction.
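To make the disk/index/CPU analogy concrete, here is a minimal, hypothetical sketch of a memory method as an index structure over a user's session pool Mu. This is not MEMORYCD's or any listed method's actual code; the class name, keyword-overlap scoring, and example sessions are all illustrative assumptions standing in for real retrieval mechanisms like those in Mem0 or A-Mem:

```python
from dataclasses import dataclass

@dataclass
class Session:
    domain: str
    text: str

class KeywordMemoryIndex:
    """Toy 'index structure' over a user's session pool M_u.

    Stores every session (the 'disk') and retrieves the sessions that
    share the most words with a query (a crude stand-in for whatever
    retrieval a real memory method performs before the LLM 'CPU' reads
    the result).
    """

    def __init__(self) -> None:
        self.pool: list[Session] = []

    def add(self, session: Session) -> None:
        self.pool.append(session)

    def retrieve(self, query: str, k: int = 2) -> list[Session]:
        q = set(query.lower().split())
        # Rank sessions by word overlap with the query, highest first.
        scored = sorted(
            self.pool,
            key=lambda s: len(q & set(s.text.lower().split())),
            reverse=True,
        )
        return scored[:k]

# Cross-domain setting: sessions from several Amazon domains feed one pool.
mem = KeywordMemoryIndex()
mem.add(Session("Books", "loved this fantasy novel with vivid world building"))
mem.add(Session("Electronics", "headphones had weak bass so returned them"))
mem.add(Session("Home and Kitchen", "sturdy pan that heats evenly"))

hits = mem.retrieve("recommend a fantasy novel", k=1)
print(hits[0].domain)  # → Books
```

The cross-domain setting then amounts to letting one pool mix sessions from all domains, so a query in one domain can surface evidence from another.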

DIAGRAM

Single vs cross-domain memory flow in MEMORYCD

This diagram shows how MEMORYCD routes single-domain and cross-domain user histories into tasks for personalization evaluation.

DIAGRAM

MEMORYCD evaluation pipeline over LLMs and memory methods

This diagram shows how MEMORYCD combines 14 long-context LLMs with 6 memory methods across four domains and four tasks.

PROCESS

How MEMORYCD Handles a Lifelong Personalization Evaluation

  1. Memory evaluation setting

    MEMORYCD defines user memory Mu from Amazon Review histories and chooses single-domain or cross-domain memory source settings for each user.

  2. Four personalization tasks

    MEMORYCD instantiates personalized rating prediction, personalized item ranking, personalized review summarization, and personalized review generation for each Mu.

  3. Long-context prompting and memory methods

    MEMORYCD feeds Mu into long-context prompting and memory systems like Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem with 14 LLM backbones.

  4. End-to-end evaluation

    MEMORYCD computes MAE, RMSE, NDCG@K, ROUGE-L, and BLEU-1 to quantify how well agents approximate real user behavior across domains.
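The metrics in the final step are all standard; a generic sketch (not MEMORYCD's evaluation harness, and with made-up ratings and relevance grades) shows how MAE, RMSE, and NDCG@K quantify the rating and ranking tasks:

```python
import math

def mae(pred: list[float], true: list[float]) -> float:
    """Mean absolute error between predicted and true ratings."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def rmse(pred: list[float], true: list[float]) -> float:
    """Root mean squared error between predicted and true ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

def ndcg_at_k(relevances: list[float], k: int = 3) -> float:
    """NDCG@K for items listed in the *predicted* ranking order.

    `relevances[i]` is the graded relevance of the item the agent
    ranked at position i; DCG is normalized by the ideal ordering.
    """
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical agent output for one user: predicted vs. true star ratings.
pred = [4.0, 3.5, 5.0]
true = [5.0, 3.0, 5.0]
print(round(mae(pred, true), 3))   # → 0.5
print(round(rmse(pred, true), 3))  # → 0.645

# Ranking task: relevance grades of items in the agent's ranked order.
print(round(ndcg_at_k([1, 0, 2], k=3), 3))  # → 0.76
```

ROUGE-L and BLEU-1 for the two text tasks work analogously but over token overlap between generated and real user reviews.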

KEY CONTRIBUTIONS

Key Contributions

  • Cross-domain long-context memory benchmark

    MEMORYCD constructs 400K sessions with 1K+ context length from Amazon Review, forming Mu that spans 12 domains with real user feedback.

  • End-to-end personalization tasks

    MEMORYCD defines four tasks—rating prediction, item ranking, review summarization, and review generation—to measure user decision making and text style alignment.

  • Comprehensive LLM and memory evaluation

    MEMORYCD evaluates 14 frontier long-context LLMs with six memory systems including Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem across Books, Electronics, Home and Kitchen, and Personal Care.

RESULTS

By the Numbers

RMSE (Books rating)

0.624

0.192 lower RMSE than GPT-5-Nano on Books rating prediction

NDCG@3 (Electronics ranking)

0.604

0.247 higher NDCG@3 than GPT-5-Nano on Electronics ranking

ROUGE (Books generation)

0.162

0.052 higher ROUGE than GPT-5-Nano on Books review generation

ROUGE (Home and Kitchen generation)

0.178

0.071 higher ROUGE than GPT-5-Nano on Home and Kitchen generation

MEMORYCD reports these metrics on Amazon Review based Books, Electronics, and Home and Kitchen domains, showing GPT-5 and Gemini-2.5 Pro still leave large gaps to real user behavior despite long-context memory.


BENCHMARK

Performance comparison of frontier LLMs in single-domain with long-context memory prompting (Books)

RMSE on Books rating prediction under long-context prompting.

KEY INSIGHT

The Counterintuitive Finding

On Books, long-context GPT-5 achieves the best rating-prediction MAE (0.330), yet its review-generation ROUGE is only 0.132.

This breaks the assumption that better numeric preference modeling automatically yields equally strong personalized language generation quality for the same users.
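ROUGE-L, the generation metric behind this finding, rewards longest-common-subsequence overlap with the user's real review, so an agent can model numeric preferences well while still missing the user's wording. A minimal sketch of the standard LCS-based ROUGE-L F-score (the example sentences are invented, not from the benchmark):

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (
                dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
            )
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F-score: harmonic mean of LCS precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

# A fluent but generic review shares little subsequence with the real one.
ref = "the pan heats evenly and feels sturdy"
gen = "this pan is a solid product overall"
print(round(rouge_l_f(gen, ref), 3))  # → 0.143
```

The generated sentence is perfectly plausible, yet only one token survives the LCS, which is exactly how a strong rating predictor can still post a low generation score.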

WHY IT MATTERS

What this unlocks for the field

MEMORYCD enables systematic testing of how different memory mechanisms and cross-domain sources impact real user aligned personalization over years of behavior.

Builders can now benchmark lifelong personal assistants, choose memory designs like MemoryBank or A-Mem, and tune cross-domain transfer rather than relying on short synthetic dialogs.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

· 2026

A-MAC scores candidate memories on Utility, Confidence, Novelty, Recency, and Type Prior, combined by a learned linear admission policy (Algorithm 1: A-MAC Memory Admission). On the LoCoMo benchmark, A-MAC achieves F1 0.583 at 2644 ms latency, improving F1 by 0.042 and cutting latency by 1187 ms compared to A-Mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026 · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
