AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Authors: Jianfei Xiao, Xiang Yu, Chengbing Wang et al.

2026

TL;DR

AlpsBench uses a four-task real-dialogue pipeline with human-verified structured memories to expose severe personalization gaps, e.g., Gemini-3 Flash reaches only 51.67 on extraction and 0.6895 PA.



THE PROBLEM

Personalization benchmarks ignore real dialogue and full memory lifecycle

Existing personalization benchmarks either skip personalized information management or rely on synthetic dialogues, creating a distribution gap from real-world conversations.

This gap means LLMs are not tested on implicit preferences or realistic long-term interactions, so deployed assistants mishandle memory extraction, updating, retrieval, and preference alignment.

HOW IT WORKS

AlpsBench — four-task real-dialogue personalization benchmark

AlpsBench introduces four coordinated tasks: Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over human–LLM dialogues with structured memories.

You can think of AlpsBench like a lab test for an assistant’s memory system, checking how it writes, edits, looks up, and uses a personal card catalog of user traits.

This design lets AlpsBench probe capabilities that a plain context window cannot, such as conflict resolution in updates and distinguishing role-play from real-world persona in utilization.

DIAGRAM

AlpsBench evaluation tasks over a single user session

This diagram shows how AlpsBench runs the four evaluation tasks on a user’s dialogue history and structured memories.

DIAGRAM

AlpsBench four-step curation pipeline

This diagram shows how AlpsBench constructs the benchmark from WildChat logs through extraction, human annotation, and task construction.

PROCESS

How AlpsBench Handles a Personalization Evaluation Lifecycle

  1. Task 1: Personalized Information Extraction

     AlpsBench feeds raw dialogue history into Personalized Information Extraction to produce structured memories with type, label, value, and confidence fields.

  2. Task 2: Personalized Information Update

     AlpsBench combines existing memories with new dialogue in Personalized Information Update, labeling each change as Retention, Addition, or Modification.

  3. Task 3: Personalized Information Retrieval

     AlpsBench passes a query plus one positive and many negative memories into Personalized Information Retrieval to test recall under distractors.

  4. Task 4: Personalized Information Utilization

     AlpsBench evaluates responses in Personalized Information Utilization along Persona Awareness, Preference Following, Virtual-Reality Awareness, Constraint Following, and Emotional Intelligence.
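The first two lifecycle steps can be sketched as data structures. The field names below (type, label, value, confidence) and the three update labels come from the task descriptions above, but the concrete schema, field types, and the label-matching heuristic are assumptions for illustration, not AlpsBench's published format.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Memory:
    """One structured memory extracted from dialogue (Task 1 style).

    Field names follow the paper's description; the value types and
    example categories are assumptions.
    """
    type: str          # e.g. "preference" or "persona" (hypothetical categories)
    label: str         # short key, e.g. "favorite_language"
    value: str         # extracted content, e.g. "Python"
    confidence: float  # extractor confidence in [0, 1]


class UpdateAction(Enum):
    """The three change labels used in Task 2."""
    RETENTION = "retention"
    ADDITION = "addition"
    MODIFICATION = "modification"


def classify_update(store: dict[str, Memory], new: Memory) -> UpdateAction:
    """Label how a newly extracted memory relates to the existing store.

    A minimal heuristic: same label and same value -> Retention,
    same label but different value -> Modification, unseen label -> Addition.
    """
    existing = store.get(new.label)
    if existing is None:
        return UpdateAction.ADDITION
    if existing.value == new.value:
        return UpdateAction.RETENTION
    return UpdateAction.MODIFICATION
```

In practice the benchmark asks the model itself to emit these labels; a rule like `classify_update` only illustrates the decision the model is being graded on.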

KEY CONTRIBUTIONS

Key Contributions

  • AlpsBench benchmark introduction

    AlpsBench introduces a real-dialogue personalization benchmark with 2,500 long-term WildChat interaction sequences, each between 6 and 249 turns, plus human-verified structured memories.

  • Four-task evaluation framework

    AlpsBench defines Personalized Information Extraction, Update, Retrieval, and Utilization, covering memory extraction, manipulation actions, retrieval recall, and multi-dimensional alignment scores.

  • Comprehensive analysis of frontier LLMs

    AlpsBench benchmarks GPT-5.2, DeepSeek Reasoner, Gemini-3 Flash, Qwen3-max, Claude-Sonnet-4.5, Llama-4 Maverick, GPT-4.1-mini, and multiple memory systems, revealing ceilings like 51.67 extraction F1 and 0.7542 retrieval recall at 1,000 distractors.

RESULTS

By the Numbers

Task 1 Extraction: 51.67 (+29.60 over Llama-4 Maverick)

Task 2 Update: 81.49 (+30.24 over Claude-Sonnet-4.5)

Task 3 Retrieval (100 distractors): 0.9569 (+0.0767 over GPT-4.1-mini)

Persona Awareness (PA): 0.7246 (+0.3228 over GPT-4.1-mini)

On AlpsBench, Task 1 tests extraction F1, Task 2 tests update accuracy, Task 3 tests retrieval recall under distractors, and Task 4 tests utilization dimensions. These results show AlpsBench can separate systems that retrieve well, like DeepSeek Reasoner at 0.9569 recall, from those that still fail at extraction and persona awareness.
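Two of these metrics can be sketched directly. The matching rules below (exact tuple match for extraction F1, top-k membership for retrieval recall over one positive plus distractors) are assumptions about the evaluation protocol for illustration, not the paper's exact scoring code.

```python
def extraction_f1(predicted: set[tuple], gold: set[tuple]) -> float:
    """Micro F1 over predicted vs. gold memories (Task 1 style).

    Memories are compared as hashable tuples, e.g. (type, label, value);
    exact matching is an assumed simplification of the real scorer.
    """
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)           # true positives: exact matches
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def retrieval_recall(ranked_ids: list[str], positive_id: str, k: int) -> float:
    """Task 3 style recall: 1.0 if the single positive memory appears in
    the top-k of a ranking over the positive plus many distractors."""
    return 1.0 if positive_id in ranked_ids[:k] else 0.0
```

Under this reading, the 0.7542 recall at 1,000 distractors reported above would be `retrieval_recall` averaged over queries as the distractor pool grows.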


BENCHMARK

Tasks 1–4: experimental evaluation results for general-purpose LLMs (shown: Task 1 Extraction)

Task 1 Extraction scores on AlpsBench for selected general-purpose LLMs.

KEY INSIGHT

The Counterintuitive Finding

Even with explicit memory mechanisms, EverMemOS reaches only 2.68 Emotional Intelligence in English, below GPT-4.1-mini’s 2.79 without external memory.

This is surprising because adding more memory is expected to help empathy, but AlpsBench shows current designs bias toward task completion over emotionally resonant interaction.

WHY IT MATTERS

What this unlocks for the field

AlpsBench gives researchers a unified, real-dialogue benchmark to stress-test extraction, updating, retrieval, and utilization under implicit preferences and noisy histories.

Builders can now compare memory systems apples-to-apples, identify whether failures come from missing traits, bad updates, weak retrieval, or misaligned responses, and target the right layer.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

A-MBER: Affective Memory Benchmark for Emotion Recognition

Deliang Wen, Ke Sun, Yu Wang

· 2026

A-MBER builds multi-session conversational scenarios via a staged pipeline of persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging. On A-MBER, a structured memory system reaches 0.69 judgment accuracy, 0.66 retrieval, and 0.65 explanation versus 0.34, 0.29, and 0.31 for a no-memory baseline.

Long-Term Memory

BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Sangyeon Yoon, Sunkyoung Kim et al.

· 2026

BenchPreS combines Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to test context-aware preference selectivity in persistent-memory LLMs. BenchPreS shows GPT-5.2 reaches 87.33% Appropriate Application Rate on BenchPreS while still having a 40.95% Misapplication Rate compared to Gemini 3 Pro’s 86.48% Misapplication Rate.
