AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Jianfei Xiao, Xiang Yu et al. · 2026
AlpsBench comprises four tasks (Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization) built on 2,500 WildChat dialogues with human-verified structured memories. Among the reported results, Gemini-3 Flash scores 51.67 on Task 1 (Extraction), while DeepSeek Reasoner reaches a retrieval recall of 0.9569 with 100 distractors.
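To make the retrieval figure concrete, here is a minimal sketch of how recall over a candidate pool with distractors could be computed. It assumes recall means the fraction of gold (relevant) memories that appear among the retrieved ones; the benchmark's exact metric definition is not given in this summary, and the function name `retrieval_recall`, the memory IDs, and the pool layout are all hypothetical.

```python
def retrieval_recall(retrieved_ids, gold_ids):
    """Fraction of gold memory IDs that appear among the retrieved IDs.

    Assumed definition, for illustration only; not taken from AlpsBench.
    """
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    hits = gold & set(retrieved_ids)
    return len(hits) / len(gold)


# Hypothetical candidate pool: 2 gold memories mixed with 100 distractors.
gold = ["m1", "m2"]
distractors = [f"d{i}" for i in range(100)]

# A hypothetical retriever output that finds one gold memory and two distractors.
retrieved = ["m1", "d7", "d42"]

example_recall = retrieval_recall(retrieved, gold)
print(example_recall)
```

Under this definition, retrieved distractors do not lower recall directly; they only matter when they crowd gold memories out of the retrieved set.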