AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Authors: Jianfei Xiao, Xiang Yu, Chengbing Wang et al.

2026

TL;DR

AlpsBench uses a four-task real-dialogue pipeline with human-verified structured memories to expose severe personalization gaps, e.g., Gemini-3 Flash reaches only 51.67 on extraction and 0.6895 PA.



THE PROBLEM

Personalization benchmarks ignore real dialogue and full memory lifecycle

Existing personalization benchmarks either skip personalized information management or rely on synthetic dialogues, creating a distribution gap from real-world conversations.

This gap means LLMs are not tested on implicit preferences or realistic long-term interactions, so deployed assistants mishandle memory extraction, updating, retrieval, and preference alignment.

HOW IT WORKS

AlpsBench — four-task real-dialogue personalization benchmark

AlpsBench introduces four coordinated tasks: Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over human–LLM dialogues with structured memories.

You can think of AlpsBench like a lab test for an assistant’s memory system, checking how it writes, edits, looks up, and uses a personal card catalog of user traits.

This design lets AlpsBench probe capabilities that a plain context window cannot, such as conflict resolution in updates and distinguishing role-play from real-world persona in utilization.

DIAGRAM

AlpsBench evaluation tasks over a single user session

This diagram shows how AlpsBench runs the four evaluation tasks on a user’s dialogue history and structured memories.

DIAGRAM

AlpsBench four-step curation pipeline

This diagram shows how AlpsBench constructs the benchmark from WildChat logs through extraction, human annotation, and task construction.

PROCESS

How AlpsBench Handles a Personalization Evaluation Lifecycle

  1. Task 1: Personalized Information Extraction

     AlpsBench feeds raw dialogue history into Personalized Information Extraction to produce structured memories with type, label, value, and confidence fields.

  2. Task 2: Personalized Information Update

     AlpsBench combines existing memories with new dialogue in Personalized Information Update, labeling each change as Retention, Addition, or Modification.

  3. Task 3: Personalized Information Retrieval

     AlpsBench passes a query plus one positive and many negative memories into Personalized Information Retrieval to test recall under distractors.

  4. Task 4: Personalized Information Utilization

     AlpsBench evaluates responses in Personalized Information Utilization along Persona Awareness, Preference Following, Virtual-Reality Awareness, Constraint Following, and Emotional Intelligence.
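The first two lifecycle steps can be sketched as data structures. The field names below (type, label, value, confidence) and the three update labels come from the task descriptions above, but the concrete schema, field types, and the label-matching heuristic are assumptions for illustration, not AlpsBench's published format.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Memory:
    """One structured memory extracted from dialogue (Task 1 style).

    Field names follow the paper's description; the value types and
    example categories are assumptions.
    """
    type: str          # e.g. "preference" or "persona" (hypothetical categories)
    label: str         # short key, e.g. "favorite_language"
    value: str         # extracted content, e.g. "Python"
    confidence: float  # extractor confidence in [0, 1]


class UpdateAction(Enum):
    """The three change labels used in Task 2."""
    RETENTION = "retention"
    ADDITION = "addition"
    MODIFICATION = "modification"


def classify_update(store: dict[str, Memory], new: Memory) -> UpdateAction:
    """Label how a newly extracted memory relates to the existing store.

    A minimal heuristic: same label and same value -> Retention,
    same label but different value -> Modification, unseen label -> Addition.
    """
    existing = store.get(new.label)
    if existing is None:
        return UpdateAction.ADDITION
    if existing.value == new.value:
        return UpdateAction.RETENTION
    return UpdateAction.MODIFICATION
```

In practice the benchmark asks the model itself to emit these labels; a rule like `classify_update` only illustrates the decision the model is being graded on.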

KEY CONTRIBUTIONS

Key Contributions

  • AlpsBench benchmark introduction

    AlpsBench introduces a real-dialogue personalization benchmark with 2,500 long-term WildChat interaction sequences, each between 6 and 249 turns, plus human-verified structured memories.

  • Four-task evaluation framework

    AlpsBench defines Personalized Information Extraction, Update, Retrieval, and Utilization, covering memory extraction, manipulation actions, retrieval recall, and multi-dimensional alignment scores.

  • Comprehensive analysis of frontier LLMs

    AlpsBench benchmarks GPT-5.2, DeepSeek Reasoner, Gemini-3 Flash, Qwen3-max, Claude-Sonnet-4.5, Llama-4 Maverick, GPT-4.1-mini, and multiple memory systems, revealing ceilings like 51.67 extraction F1 and 0.7542 retrieval recall at 1,000 distractors.

RESULTS

By the Numbers

Task 1 Extraction: 51.67 (+29.60 over Llama-4 Maverick)

Task 2 Update: 81.49 (+30.24 over Claude-Sonnet-4.5)

Task 3 Retrieval (100 distractors): 0.9569 (+0.0767 over GPT-4.1-mini)

Persona Awareness (PA): 0.7246 (+0.3228 over GPT-4.1-mini)

On AlpsBench, Task 1 tests extraction F1, Task 2 tests update accuracy, Task 3 tests retrieval recall under distractors, and Task 4 tests utilization dimensions. These results show AlpsBench can separate systems that retrieve well, like DeepSeek Reasoner at 0.9569 recall, from those that still fail at extraction and persona awareness.
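Two of these metrics can be sketched directly. The matching rules below (exact tuple match for extraction F1, top-k membership for retrieval recall over one positive plus distractors) are assumptions about the evaluation protocol for illustration, not the paper's exact scoring code.

```python
def extraction_f1(predicted: set[tuple], gold: set[tuple]) -> float:
    """Micro F1 over predicted vs. gold memories (Task 1 style).

    Memories are compared as hashable tuples, e.g. (type, label, value);
    exact matching is an assumed simplification of the real scorer.
    """
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)           # true positives: exact matches
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(ranked_ids: list[str], positive_id: str, k: int) -> float:
    """Task 3 style recall: 1.0 if the single positive memory appears in
    the top-k of a ranking over the positive plus many distractors."""
    return 1.0 if positive_id in ranked_ids[:k] else 0.0
```

Under this reading, the 0.7542 recall at 1,000 distractors reported above would be `retrieval_recall` averaged over queries as the distractor pool grows.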


BENCHMARK

Tasks 1–4: experimental evaluation results for general-purpose LLMs (shown: Task 1 Extraction)

Task 1 Extraction scores on AlpsBench for selected general-purpose LLMs.

KEY INSIGHT

The Counterintuitive Finding

Even with explicit memory mechanisms, EverMemOS reaches only 2.68 Emotional Intelligence in English, below GPT-4.1-mini’s 2.79 without external memory.

This is surprising because adding more memory is expected to help empathy, but AlpsBench shows current designs bias toward task completion over emotionally resonant interaction.

WHY IT MATTERS

What this unlocks for the field

AlpsBench gives researchers a unified, real-dialogue benchmark to stress-test extraction, updating, retrieval, and utilization under implicit preferences and noisy histories.

Builders can now compare memory systems apples-to-apples, identify whether failures come from missing traits, bad updates, weak retrieval, or misaligned responses, and target the right layer.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

A-MBER: Affective Memory Benchmark for Emotion Recognition

Deliang Wen, Ke Sun, Yu Wang

· 2026

A-MBER builds multi-session conversational scenarios via a staged pipeline of persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging. On A-MBER, a structured memory system reaches 0.69 judgment accuracy, 0.66 retrieval, and 0.65 explanation versus 0.34, 0.29, and 0.31 for a no-memory baseline.

Long-Term Memory

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim et al.

· 2026

BenchPreS combines Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to test context-aware preference selectivity in persistent-memory LLMs. BenchPreS shows GPT-5.2 reaches 87.33% Appropriate Application Rate on BenchPreS while still having a 40.95% Misapplication Rate compared to Gemini 3 Pro’s 86.48% Misapplication Rate.
