Category

Personalization

Personalized memory for AI assistants — user preference learning, persona modeling, and adaptive long-horizon dialogue.

10 papers

Benchmark · Long-Term Memory

A-MBER: Affective Memory Benchmark for Emotion Recognition

Deliang Wen, Ke Sun, Yu Wang

· 2026

A-MBER builds multi-session conversational scenarios via a staged pipeline of persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging. On A-MBER, a structured memory system reaches 0.69 judgment accuracy, 0.66 retrieval accuracy, and 0.65 explanation accuracy, versus 0.34, 0.29, and 0.31 for a no-memory baseline.

Benchmark · Long-Term Memory

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim et al.

· 2026

BenchPreS combines Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-judge framework to test context-aware preference selectivity in persistent-memory LLMs. On the benchmark, GPT-5.2 reaches an 87.33% Appropriate Application Rate yet still incurs a 40.95% Misapplication Rate, compared with Gemini 3 Pro's 86.48% Misapplication Rate.

RAG · Benchmark · Long-Term Memory

MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Shu Wang, Edwin Yu et al.

· 2026

MemMachine combines short-term, long-term, and profile memory with a Retrieval Agent to store raw conversational episodes and retrieve clustered context around nucleus matches. On LoCoMo, MemMachine scores 0.9169 with gpt-4.1-mini while using about 80% fewer input tokens than Mem0, and reaches 93.0% on LongMemEval-S with GPT-5-mini.
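The nucleus-and-cluster retrieval idea can be sketched as below; the lexical similarity function, the fixed window expansion, and all names are illustrative assumptions, not MemMachine's actual implementation:

```python
from dataclasses import dataclass

def jaccard(a: str, b: str) -> float:
    """Toy lexical similarity, standing in for a real embedding model."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

@dataclass
class Episode:
    turn: int
    text: str

def retrieve_cluster(episodes, query, window=1):
    """Find the best-matching 'nucleus' episode, then return it together
    with its neighboring turns as clustered context."""
    nucleus = max(range(len(episodes)),
                  key=lambda i: jaccard(episodes[i].text, query))
    lo, hi = max(0, nucleus - window), min(len(episodes), nucleus + window + 1)
    return episodes[lo:hi]
```

Returning neighbors around the nucleus, rather than the single best hit, is one plausible way to preserve the ground-truth conversational context the paper emphasizes.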

Benchmark · Agent Memory

MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Weizhi Zhang, Xiaokai Wei et al.

· 2026

MemoryCD builds a user memory pool M_u from lifelong Amazon Review histories and evaluates long-context prompting, Mem0, LoCoMo, ReadAgent, MemoryBank, and A-Mem across rating, ranking, and personalized text tasks. On Books and Home & Kitchen, GPT-5 reaches RMSE 0.551–0.624 and NDCG@3 up to 0.610, while Gemini-2.5 Pro peaks at ROUGE-L 0.222 for generation, revealing substantial remaining gaps to real user behavior.

Benchmark

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Zhen Tan, Jun Yan et al.

ACL 2025 · 2025

Reflective Memory Management (RMM) uses a memory bank, retriever, reranker, and LLM to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.
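The two reflection phases can be illustrated with a toy memory bank; the topic-keyed storage, overlap scoring, and scalar weight update below are simplified stand-ins for RMM's retriever, reranker, and RL-based refinement, and every name is hypothetical:

```python
class ReflectiveMemory:
    def __init__(self):
        self.bank = {}     # topic -> list of stored memory snippets
        self.weights = {}  # topic -> learned usefulness weight

    def prospective_reflect(self, topic, summary):
        """Prospective Reflection (sketch): store a topic-keyed summary
        of the finished session for future retrieval."""
        self.bank.setdefault(topic, []).append(summary)
        self.weights.setdefault(topic, 1.0)

    def retrieve(self, query, k=2):
        """Score each memory by word overlap scaled by its topic weight,
        then return the top-k (a stand-in for retrieve-then-rerank)."""
        scored = []
        for topic, items in self.bank.items():
            w = self.weights[topic]
            for m in items:
                overlap = len(set(query.split()) & set(m.split()))
                scored.append((w * overlap, m, topic))
        scored.sort(reverse=True)
        return scored[:k]

    def retrospective_reflect(self, topic, was_cited, lr=0.1):
        """Retrospective Reflection (sketch): nudge a topic's weight up or
        down depending on whether its memories were actually used in the
        response, standing in for the RL retrieval refinement."""
        if topic in self.weights:
            self.weights[topic] += lr if was_cited else -lr
```

The feedback signal (`was_cited`) mirrors the paper's idea of refining retrieval from the dialogue agent's own usage of memories rather than from external labels.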

Benchmark · Agent Memory

Memoria: A Scalable Agentic Memory Framework for Personalized Conversational AI

Samarth Sarin, Lovepreet Singh et al.

· 2025

Memoria augments LLM chats with structured conversation logging, a dynamic user persona maintained as a knowledge graph, session-level memory for real-time context, and retrieval for context-aware responses, providing persistent, interpretable memory. On LongMemEval's single-session-user and knowledge-update subsets, Memoria reaches 87.1% and 80.8% accuracy respectively, surpassing A-Mem (OpenAI) while using much shorter prompts.
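A minimal sketch of a KG-backed persona, assuming a triple-store structure (Memoria's actual schema is not given here); the single extraction rule is deliberately naive and stands in for structured conversation logging:

```python
# Assumed structure: the persona is a set of
# (subject, relation, object) triples updated per turn.
persona = set()

def log_turn(user_msg: str):
    """Naive triple extraction, a stand-in for Memoria's
    structured logging; only handles a 'likes' pattern."""
    if " likes " in user_msg:
        subj, obj = user_msg.split(" likes ", 1)
        persona.add((subj.strip(), "likes", obj.strip(". ")))

def persona_context() -> str:
    """Serialize the persona graph into a short, prompt-ready context,
    which is how a KG persona can keep prompts compact."""
    return "; ".join(f"{s} {r} {o}" for s, r, o in sorted(persona))
```

Serializing only the triple store, rather than replaying full transcripts, is one way such a design can achieve the much shorter prompts the summary mentions.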

Benchmark · Agent Memory

PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Bowen Jiang, Yuan Yuan et al.

· 2025

PersonaMem-v2 combines implicit personas, RL with long-context reasoning, RL with agentic memory, and a user privacy-aware design to train Qwen3-4B with GRPO on implicit user preferences drawn from long, noisy histories. PersonaMem-v2 achieves 55.2% MCQ and 60.7% open-ended accuracy on the benchmark, surpassing GPT-5-Chat's 45.6% and 46.2% while using a 2k-token agentic memory instead of full 32k–128k contexts.

Benchmark · Long-Term Memory

Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Sangyeop Kim, Yohan Lee et al.

· 2025

PREMem builds long-term dialogue memory by combining episodic memory extraction, pre-storage memory reasoning, semantic clustering, a persistent memory pool, and an inference phase over enriched memory fragments. PREMem reaches a 71.4 LLM-as-a-judge score on LongMemEval with a GPT-4.1 base, a +15.5 gain over HippoRAG 2 and +9.6 over A-Mem.
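The pre-storage idea, shifting inference from query time to write time, can be sketched as follows; the rule-based enrichment and the `pre_storage_reason` helper are hypothetical stand-ins for PREMem's LLM reasoning pass:

```python
def pre_storage_reason(episode: str) -> dict:
    """Derive enriched fragments at write time (here via a naive rule,
    standing in for an LLM pass), so query-time inference is cheap."""
    fragments = [episode]
    if "moved to" in episode:
        city = episode.split("moved to")[-1].strip(". ")
        fragments.append(f"user currently lives in {city}")
    return {"raw": episode, "enriched": fragments}

# Persistent memory pool of enriched episodes.
pool = [pre_storage_reason(e) for e in [
    "I moved to Berlin.",
    "I love hiking on weekends.",
]]

def answer(query: str):
    """Inference phase: match the query against enriched fragments only,
    so no multi-hop reasoning is needed at read time."""
    words = query.lower().split()
    return [f for m in pool for f in m["enriched"]
            if any(w in f for w in words)]
```

The payoff is that a question like "where does the user live" hits the pre-computed fact directly, instead of requiring the reader to re-derive it from the raw episode.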