BenchPreS: A Benchmark for Context-Aware Personalized Preference Selectivity of Persistent-Memory LLMs

Authors: Sangyeon Yoon, Sunkyoung Kim, Hyesoo Hong et al.

2026

TL;DR

BenchPreS pairs communication contexts with user profiles and scores models with Misapplication Rate and Appropriate Application Rate, showing that even GPT-5.2 still misapplies preferences in 40.95% of cases.



THE PROBLEM

Persistent-memory LLMs over-apply user preferences, with Misapplication Rate reaching 86.48%

On BenchPreS, Misapplication Rate (MR) reaches as high as 86.48% for some models, and even GPT-5.2 still misapplies preferences in 40.95% of cases.

In third-party communication tasks like formal emails and legal letters, this causes inappropriate personalization that conflicts with social norms and institutional expectations.

HOW IT WORKS

BenchPreS — Context-aware preference selectivity benchmark

BenchPreS introduces Contexts, User Profiles, Preference Attributes, Gold Labeling, and an LLM-as-Judge framework to evaluate when preferences should be applied or suppressed.

Think of BenchPreS like a social rules engine layered over persistent memory, deciding which stored preferences are like formal attire versus casual clothes for each situation.

This setup lets BenchPreS probe behaviors that a plain context window cannot, revealing whether persistent-memory systems treat preferences as context-dependent signals instead of global rules.
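The selectivity behavior described above can be pictured as a gating function over stored preferences. The sketch below is hypothetical, not the authors' implementation: `select_preferences`, `toy_rule`, and the attribute names are illustrative stand-ins, with the `applicable` callable playing the role of the gold label g(t, a).

```python
# Hypothetical sketch of context-aware preference selectivity.
# A stored preference is applied only when an applicability rule
# says the current context allows it.

def select_preferences(profile, context, applicable):
    """Return only the preferences that fit the current context.

    profile    -- dict mapping attribute name -> preferred value
    context    -- (recipient, task) pair, as in BenchPreS Contexts
    applicable -- callable(context, attribute) -> bool, playing the
                  role of the gold label g(t, a)
    """
    return {
        attr: value
        for attr, value in profile.items()
        if applicable(context, attr)
    }

# Toy rule: emojis are fine with friends, never in formal mail.
def toy_rule(context, attr):
    recipient, task = context
    if attr == "uses_emojis":
        return recipient == "friend"
    return True

profile = {"uses_emojis": True, "preferred_language": "English"}
formal = ("professor", "formal_email")
print(select_preferences(profile, formal, toy_rule))
# -> {'preferred_language': 'English'}
```

The point of the gate is that a preference is a context-dependent signal, not a global rule: the same profile yields different applied preferences for a formal email versus a chat with a friend.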

DIAGRAM

Context–memory interaction in BenchPreS

This diagram shows how BenchPreS pairs contexts and user profiles, then evaluates preference application with an LLM-as-Judge.

DIAGRAM

BenchPreS evaluation pipeline

This diagram shows how BenchPreS constructs instances and computes Misapplication Rate and Appropriate Application Rate.

PROCESS

How BenchPreS Handles a Context-Aware Evaluation Session

  1. Problem Formulation

     BenchPreS defines communication Contexts as recipient–task pairs and associates each user in User Profiles with Preference Attributes for personalization.

  2. Data Construction

     BenchPreS builds User Profiles with around 152 attributes and 5 Preference Attributes, and pairs them with 39 Contexts across five domains.

  3. Gold Labeling

     BenchPreS uses Gold Labeling to assign a gold label g(t, a) to each context–attribute pair, keeping only cases where applicability is clear and socially unambiguous.

  4. Evaluation Protocols

     BenchPreS runs the LLM-as-Judge framework over 1,950 instances to compute Misapplication Rate and Appropriate Application Rate from generated responses.
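The two metrics in step 4 can be sketched from the judge's binary decisions. This is a hedged reconstruction, not the paper's exact code: `mr_aar` and the toy data are assumptions, with MR read as the rate of applying a preference where the gold label says it should be suppressed, and AAR as the rate of applying it where it should apply.

```python
# Sketch of how Misapplication Rate (MR) and Appropriate Application
# Rate (AAR) could be computed from per-instance judge decisions; the
# exact formulas used by BenchPreS may differ.

def mr_aar(instances):
    """instances: list of (gold_should_apply, judged_applied) booleans."""
    applied_when_should = [a for g, a in instances if g]
    applied_when_should_not = [a for g, a in instances if not g]
    # MR: fraction of should-NOT-apply cases where the model applied anyway.
    mr = (sum(applied_when_should_not) / len(applied_when_should_not)
          if applied_when_should_not else 0.0)
    # AAR: fraction of should-apply cases where the model did apply.
    aar = (sum(applied_when_should) / len(applied_when_should)
           if applied_when_should else 0.0)
    return mr, aar

# Toy run: 2 of 4 suppress-cases misapplied, 3 of 4 apply-cases applied.
toy = [(True, True), (True, True), (True, True), (True, False),
       (False, True), (False, True), (False, False), (False, False)]
mr, aar = mr_aar(toy)
print(f"MR={mr:.2%}  AAR={aar:.2%}")  # MR=50.00%  AAR=75.00%
```

Splitting the instances by gold label is what makes the pair of metrics informative: a model can score high AAR simply by always applying preferences, but only suppressing them where g(t, a) says so keeps MR low.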

KEY CONTRIBUTIONS

Key Contributions

  • BenchPreS benchmark for preference selectivity

    BenchPreS formalizes context-aware preference selectivity using Contexts, User Profiles, and Preference Attributes, yielding 1,950 attribute-level evaluation instances across 39 contexts and 10 users.

  • Misapplication Rate and Appropriate Application Rate

    BenchPreS introduces Misapplication Rate (MR) and Appropriate Application Rate (AAR) to jointly capture over-application and correct application of preferences.

  • Analysis of reasoning and prompt-based defenses

    BenchPreS uses the LLM-as-Judge framework to show reasoning variants increase both MR and AAR, while mitigation prompts reduce MR but leave residual misapplication.

RESULTS

By the Numbers

MR ↓

40.95%

-45.53 pp vs Gemini 3 Pro

AAR ↑

87.33%

-1.36 pp vs Gemini 3 Pro

AAR - MR ↑

46.38

+44.17 over Gemini 3 Pro

Models evaluated

10

Frontier and open models on BenchPreS

BenchPreS evaluates 10 LLMs on Misapplication Rate and Appropriate Application Rate over 1,950 instances. BenchPreS shows GPT-5.2 achieves the largest AAR − MR gap of 46.38 while still misapplying preferences in 40.95% of cases.

BENCHMARK


Quantitative Results across 10 frontier LLMs

Misapplication Rate (MR) on BenchPreS.

KEY INSIGHT

The Counterintuitive Finding

BenchPreS finds Gemini 3 Pro reaches an 88.69% Appropriate Application Rate but also an 86.48% Misapplication Rate, lying near the y = x line.

This is surprising because stronger preference adherence was expected to improve selectivity, yet BenchPreS shows it mainly scales preference application globally instead of suppressing inappropriate preferences.

WHY IT MATTERS

What this unlocks for the field

BenchPreS gives researchers a concrete way to measure when persistent-memory systems should ignore user preferences instead of blindly following them.

With BenchPreS, builders can design training and prompting schemes that explicitly target low Misapplication Rate and high Appropriate Application Rate in real agent deployments.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu et al.

2026

AlpsBench combines Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over 2,500 WildChat dialogues with human-verified structured memories. AlpsBench shows, for example, that Gemini-3 Flash scores 51.67 on Task 1 Extraction while DeepSeek Reasoner reaches 0.9569 retrieval recall with 100 distractors on AlpsBench.
