PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

Authors: Bowen Jiang, Yuan Yuan, Maohao Shen et al.

2025

TL;DR

PersonaMem-v2 uses RL-trained agentic memory over implicit user personas to reach 55–60% personalization accuracy with a 2k-token memory, using 16× fewer tokens than full histories.


THE PROBLEM

Frontier LLMs Misread Implicit Personas at 37–48% Accuracy

Frontier LLMs on PERSONAMEM-V2 reach only 37–48% accuracy on implicit personalization, with GPT-5 variants at just 40–55%.

PersonaMem-v2 targets implicit user personas in long, noisy histories, where misreading subtle preferences leads to misaligned responses that fail to resonate with the user.

HOW IT WORKS

PersonaMem-v2 — RL with Agentic Memory over Implicit Personas

PersonaMem-v2 combines three pieces: the PERSONAMEM-V2: IMPLICIT PERSONAS benchmark, RL with long-context reasoning, and RL with agentic memory, training Qwen3-4B with GRPO on implicit user preferences.

You can think of PersonaMem-v2 as a brain with a hippocampus-like agentic memory and a long-term diary, compressing sprawling chats into a compact, readable persona.

PersonaMem-v2’s agentic memory framework lets Qwen3-4B remember evolving user signals in a 2k-token memory, something a plain context window cannot efficiently maintain over 32k–128k tokens.
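As a rough sketch (not the paper's actual API), the agentic memory update can be viewed as folding conversation chunks into one bounded, human-readable persona string. Here `summarize_update` is a placeholder for the trained Qwen3-4B memory-update call, and a word-level cap stands in for the 2k-token budget:

```python
MEMORY_BUDGET = 2048  # stand-in for the 2k-token memory cap

def summarize_update(memory: str, chunk: str) -> str:
    """Placeholder for the model call that merges new user signals
    into memory; here we append and truncate to keep the sketch runnable."""
    merged = (memory + " " + chunk).strip()
    # Keep only the most recent signals once the budget is exceeded.
    return " ".join(merged.split()[-MEMORY_BUDGET:])

def build_persona_memory(history_chunks: list[str]) -> str:
    """Fold chat chunks into one bounded memory, in causal order."""
    memory = ""
    for chunk in history_chunks:  # past chunks are processed before later ones
        memory = summarize_update(memory, chunk)
    return memory
```

The key property this preserves is that memory size stays constant no matter how long the history grows, which is what makes it cheaper than re-reading a 32k–128k-token context.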

DIAGRAM

Agentic Memory Update and Query Answering Flow

This diagram shows how PersonaMem-v2 chunks long histories, updates agentic memory causally, and then answers in-situ queries using only the final memory.
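A minimal sketch of the chunking step shown in the diagram, assuming contiguous equal-size chunks and the up-to-8-chunk limit mentioned in this summary (the paper's actual chunking may differ):

```python
def chunk_history(tokens: list[str], max_chunks: int = 8) -> list[list[str]]:
    """Split a token sequence into at most `max_chunks` contiguous chunks."""
    if not tokens:
        return []
    size = -(-len(tokens) // max_chunks)  # ceiling division: no token dropped
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

Each chunk is then fed to the memory-update step in order, so later updates can revise conclusions drawn from earlier chunks.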

DIAGRAM

PERSONAMEM-V2 Data and Evaluation Pipeline

This diagram shows how PersonaMem-v2 builds personas, conversations, and Q&A, then evaluates Qwen3-4B and GPT-5 with MCQ and open-ended scoring.

PROCESS

How PersonaMem-v2 Handles a Multi-Session Personalization Query

  1. Multi-Session Realistic Conversation Histories

     PersonaMem-v2 uses multi-session realistic conversation histories to turn 20000 user preferences into multi-turn chats spanning up to 128000 tokens per context.

  2. In-Situ User Queries

     PersonaMem-v2 appends in-situ user queries as MCQ and open-ended prompts at the end of histories, ensuring answers require implicit persona understanding.

  3. RL with Long-Context Reasoning

     PersonaMem-v2 trains Qwen3-4B with long-context reasoning RL using GRPO, mixing 80% MCQ and 20% open-ended samples for verifiable rewards.

  4. RL with Agentic Memory

     PersonaMem-v2 applies RL with agentic memory to update a 2048-token memory across up to 8 chunks, then answers queries using only this human-readable memory.
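The 80/20 MCQ/open-ended training mix from step 3 can be sketched as a simple batch sampler; the function name, pool arguments, and seeding are illustrative assumptions, not the paper's training code:

```python
import random

def sample_training_batch(mcq_pool, open_pool, batch_size, mcq_frac=0.8, seed=0):
    """Draw a batch that is ~80% MCQ (easily verifiable reward)
    and ~20% open-ended samples, then shuffle them together."""
    rng = random.Random(seed)
    n_mcq = round(batch_size * mcq_frac)
    batch = [rng.choice(mcq_pool) for _ in range(n_mcq)]
    batch += [rng.choice(open_pool) for _ in range(batch_size - n_mcq)]
    rng.shuffle(batch)
    return batch
```

Weighting toward MCQ keeps most of the GRPO reward signal verifiable, while the open-ended share prevents the policy from overfitting to multiple-choice format.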

KEY CONTRIBUTIONS

Key Contributions

  • PERSONAMEM-V2: IMPLICIT PERSONAS

    PersonaMem-v2 introduces PERSONAMEM-V2: IMPLICIT PERSONAS with 1000 personas, 26000 preferences, 335 topics, and context windows up to 128000 tokens for realistic personalization.

  • RL with Long-Context Reasoning

    PersonaMem-v2 uses RL with long-context reasoning and GRPO on 18000 training samples, enabling Qwen3-4B-GRPO to reach 53.8% MCQ and 56.0% open-ended accuracy.

  • Agentic Memory Framework

    PersonaMem-v2 proposes an agentic memory framework that maintains a 2048-token human-readable memory, achieving 55.2% MCQ and 60.7% open-ended accuracy with 16× fewer tokens.
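At inference, only the final memory (never the raw history) enters the prompt; the assembly below is an illustrative sketch, and the ~16× figure follows directly from the token counts reported in this summary:

```python
FULL_HISTORY_TOKENS = 32_000  # lower end of the 32k-128k histories
MEMORY_TOKENS = 2_048         # the agentic memory budget

def answer_prompt(memory: str, query: str) -> str:
    """Build the answering prompt from the compact memory plus the
    in-situ query; the full history is deliberately excluded."""
    return f"Persona memory:\n{memory}\n\nUser query:\n{query}"

# 32000 / 2048 is about 15.6, which the summary reports as ~16x savings.
savings = FULL_HISTORY_TOKENS / MEMORY_TOKENS
```

Because prompt length is fixed at the memory budget, per-query latency and cost stay flat even as the underlying history grows toward 128k tokens.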

RESULTS

By the Numbers

MCQ accuracy

55.2%

+9.6 over GPT-5-Chat

Open-Ended accuracy

60.7%

+14.5 over GPT-5-Chat

Context tokens

2048 tokens

16× fewer than 32000-token histories

Frontier LLM range

40–55%

frontier GPT-5 variants on implicit personalization

PersonaMem-v2 is evaluated on the PERSONAMEM-V2 benchmark, which tests implicit personalization over long, noisy histories using MCQ and open-ended queries. The main result shows PersonaMem-v2’s agentic-memory Qwen3-4B surpasses GPT-5-Chat while using a compact 2048-token memory instead of full 32000-token histories.

BENCHMARK

Performance of Qwen3-4B and GPT-5 on PERSONAMEM-V2

MCQ accuracy on the PERSONAMEM-V2 implicit personalization benchmark.

KEY INSIGHT

The Counterintuitive Finding

PersonaMem-v2 observes no significant accuracy gain when shortening context from 128k to 32k tokens by removing irrelevant conversations.

This is surprising because many assume longer context is the bottleneck, but PersonaMem-v2 shows reasoning, not context length, limits implicit personalization.

WHY IT MATTERS

What this unlocks for the field

PersonaMem-v2 unlocks compact, human-readable agentic memory that tracks evolving implicit personas with only 2048 tokens per user.

Builders can now deploy scalable, low-latency personalized agents that remember what matters over months of interaction without streaming entire 32000–128000-token histories.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma · 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
