BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Authors: Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong et al.

2026

TL;DR

BenchPreS pairs communication contexts with user profiles and scores models with Misapplication Rate and Appropriate Application Rate, showing that even GPT-5.2 still misapplies preferences in 40.95% of cases.



THE PROBLEM

Persistent-memory LLMs over-apply user preferences, with Misapplication Rate reaching 86.48%

On BenchPreS, Misapplication Rate (MR) reaches as high as 86.48% for some models, and even GPT-5.2 still misapplies preferences in 40.95% of cases.

In third-party communication tasks like formal emails and legal letters, this causes inappropriate personalization that conflicts with social norms and institutional expectations.

HOW IT WORKS

BenchPreS — Context-aware preference selectivity benchmark

BenchPreS introduces Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to evaluate when preferences should be applied or suppressed.

Think of BenchPreS like a social rules engine layered over persistent memory, deciding which stored preferences are like formal attire versus casual clothes for each situation.

This setup lets BenchPreS probe behaviors that a plain context window cannot, revealing whether persistent-memory systems treat preferences as context-dependent signals instead of global rules.
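The selectivity behavior described above can be pictured as a gating function over stored preferences. The sketch below is hypothetical, not the authors' implementation: `select_preferences`, `toy_rule`, and the attribute names are illustrative stand-ins, with the `applicable` callable playing the role of the gold label g(t, a).

```python
# Hypothetical sketch of context-aware preference selectivity.
# A stored preference is applied only when an applicability rule
# says the current context allows it.

def select_preferences(profile, context, applicable):
    """Return only the preferences that fit the current context.

    profile    -- dict mapping attribute name -> preferred value
    context    -- (recipient, task) pair, as in BenchPreS Contexts
    applicable -- callable(context, attribute) -> bool, playing the
                  role of the gold label g(t, a)
    """
    return {
        attr: value
        for attr, value in profile.items()
        if applicable(context, attr)
    }

# Toy rule: emojis are fine with friends, never in formal mail.
def toy_rule(context, attr):
    recipient, task = context
    if attr == "uses_emojis":
        return recipient == "friend"
    return True

profile = {"uses_emojis": True, "preferred_language": "English"}
formal = ("professor", "formal_email")
print(select_preferences(profile, formal, toy_rule))
# -> {'preferred_language': 'English'}
```

The point of the gate is that a preference is a context-dependent signal, not a global rule: the same profile yields different applied preferences for a formal email versus a chat with a friend.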

DIAGRAM

Context–memory interaction in BenchPreS

This diagram shows how BenchPreS pairs contexts and user profiles, then evaluates preference application with an LLM-as-Judge.

DIAGRAM

BenchPreS evaluation pipeline

This diagram shows how BenchPreS constructs instances and computes Misapplication Rate and Appropriate Application Rate.

PROCESS

How BenchPreS Handles a Context-Aware Evaluation Session

  1. Problem Formulation

     BenchPreS defines communication Contexts as recipient–task pairs and associates each user in User Profiles with Preference Attributes for personalization.

  2. Data Construction

     BenchPreS builds User Profiles with around 152 attributes and 5 Preference Attributes, and pairs them with 39 Contexts across five domains.

  3. Gold Labeling

     BenchPreS uses Gold Labeling to assign a gold label g(t, a) to each context–attribute pair, keeping only cases where applicability is clear and socially unambiguous.

  4. Evaluation Protocols

     BenchPreS runs the LLM-as-Judge framework over 1,950 instances to compute Misapplication Rate and Appropriate Application Rate from generated responses.
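The two metrics in step 4 can be sketched from the judge's binary decisions. This is a hedged reconstruction, not the paper's exact code: `mr_aar` and the toy data are assumptions, with MR read as the rate of applying a preference where the gold label says it should be suppressed, and AAR as the rate of applying it where it should apply.

```python
# Sketch of how Misapplication Rate (MR) and Appropriate Application
# Rate (AAR) could be computed from per-instance judge decisions; the
# exact formulas used by BenchPreS may differ.

def mr_aar(instances):
    """instances: list of (gold_should_apply, judged_applied) booleans."""
    applied_when_should = [a for g, a in instances if g]
    applied_when_should_not = [a for g, a in instances if not g]
    # MR: fraction of should-NOT-apply cases where the model applied anyway.
    mr = (sum(applied_when_should_not) / len(applied_when_should_not)
          if applied_when_should_not else 0.0)
    # AAR: fraction of should-apply cases where the model did apply.
    aar = (sum(applied_when_should) / len(applied_when_should)
           if applied_when_should else 0.0)
    return mr, aar

# Toy run: 2 of 4 suppress-cases misapplied, 3 of 4 apply-cases applied.
toy = [(True, True), (True, True), (True, True), (True, False),
       (False, True), (False, True), (False, False), (False, False)]
mr, aar = mr_aar(toy)
print(f"MR={mr:.2%}  AAR={aar:.2%}")  # MR=50.00%  AAR=75.00%
```

Splitting the instances by gold label is what makes the pair of metrics informative: a model can score high AAR simply by always applying preferences, but only suppressing them where g(t, a) says so keeps MR low.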

KEY CONTRIBUTIONS

Key Contributions

  • BenchPreS benchmark for preference selectivity

    BenchPreS formalizes context-aware preference selectivity using Contexts, User Profiles, and Preference Attributes, yielding 1,950 attribute-level evaluation instances across 39 contexts and 10 users.

  • Misapplication Rate and Appropriate Application Rate

    BenchPreS introduces Misapplication Rate (MR) and Appropriate Application Rate (AAR) to jointly capture over-application and correct application of preferences.

  • Analysis of reasoning and prompt-based defenses

    BenchPreS uses the LLM-as-Judge framework to show reasoning variants increase both MR and AAR, while mitigation prompts reduce MR but leave residual misapplication.

RESULTS

By the Numbers

MR ↓

40.95%

-45.53 pp vs Gemini 3 Pro

AAR ↑

87.33%

-1.36 pp vs Gemini 3 Pro

AAR - MR ↑

46.38

+44.17 over Gemini 3 Pro

Models evaluated

10

Frontier and open models on BenchPreS

BenchPreS evaluates 10 LLMs on Misapplication Rate and Appropriate Application Rate over 1,950 instances. BenchPreS shows GPT-5.2 achieves the largest AAR − MR gap of 46.38 while still misapplying preferences in 40.95% of cases.

BENCHMARK


Quantitative Results across 10 frontier LLMs

Misapplication Rate (MR) on BenchPreS.

KEY INSIGHT

The Counterintuitive Finding

BenchPreS finds Gemini 3 Pro reaches an 88.69% Appropriate Application Rate but also an 86.48% Misapplication Rate, lying near the y = x line.

This is surprising because stronger preference adherence was expected to improve selectivity, yet BenchPreS shows it mainly scales preference application globally instead of suppressing inappropriate preferences.

WHY IT MATTERS

What this unlocks for the field

BenchPreS gives researchers a concrete way to measure when persistent-memory systems should ignore user preferences instead of blindly following them.

With BenchPreS, builders can design training and prompting schemes that explicitly target low Misapplication Rate and high Appropriate Application Rate in real agent deployments.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu et al.

2026

AlpsBench combines Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over 2,500 WildChat dialogues with human-verified structured memories. AlpsBench shows, for example, that Gemini-3 Flash scores 51.67 on Task 1 Extraction while DeepSeek Reasoner reaches 0.9569 retrieval recall with 100 distractors on AlpsBench.
