Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

Authors: Yuxuan Cai, Jie Zhou, Qin Chen, Liang He

2026

TL;DR

PROACTAGENT uses PROACTRL paired-branch process rewards over a structured EXPERIENCE BASE to reach 73.50% SR on SciWorld, +18.0 points over GRPO+Reflexion.



THE PROBLEM

Lifelong agents rely on passive retrieval and stall at 55.50% SR

Existing lifelong agents rely on static initialization or continuous retrieval, leaving baselines such as GRPO+Reflexion capped at 55.50% SR on SciWorld.

This passive control causes redundant retrieval, context overload, and missed knowledge gaps, so agents in SciWorld, AlfWorld, and StuLife waste steps and still fail long-horizon tasks.

HOW IT WORKS

PROACTAGENT — Experience-Enhanced Online Evolution with PROACTRL

PROACTAGENT combines Experience-Enhanced Online Evolution (EXPONEVO), a structured EXPERIENCE BASE, and Proactive Reinforcement Learning-based Retrieval (PROACTRL), which treats retrieval as a learnable policy action.

Think of the EXPERIENCE BASE as a typed card catalog and EXPONEVO as a librarian that continuously re-shelves, annotates, and reprioritizes cards based on how useful they were during recent tasks.

This design lets PROACTAGENT selectively retrieve factual memory, episodic memory, and behavioral skills exactly when needed, instead of relying on a fixed context window or always-on retrieval.
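The card-catalog picture can be made concrete. The sketch below is a minimal illustration, not the paper's implementation: the `ExperienceEntry` fields, the cosine-times-priority scoring, and the `"Sd"` label standing in for S∆ are all assumptions based on the description above.

```python
import math
from dataclasses import dataclass

@dataclass
class ExperienceEntry:
    kind: str        # "Mf" (factual), "Me" (episodic), "S+"/"S-"/"Sd" (skills)
    text: str
    embedding: list  # vector representation used for similarity search
    priority: float = 1.0

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ExperienceBase:
    """Card catalog of typed experience entries."""
    def __init__(self):
        self.entries = []

    def add(self, entry):
        self.entries.append(entry)

    def retrieve(self, query_emb, kinds, top_k=1):
        # Rank entries of the requested types by similarity x priority,
        # so recently useful ("re-shelved") cards surface first.
        scored = [(cosine(query_emb, e.embedding) * e.priority, e)
                  for e in self.entries if e.kind in kinds]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:top_k]]
```

A retrieval action can then name exactly the memory type it needs, e.g. `base.retrieve(q, kinds={"Mf"})` for factual memory only.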

DIAGRAM

Paired-Branch Retrieval Control with PROACTRL

This diagram shows how PROACTAGENT uses PROACTRL to branch trajectories with and without retrieval and compute process rewards for proactive control.
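The branch-and-compare idea can be sketched in a few lines. Scoring the retrieval decision by the sign of the rollout margin, with a ±α reward, follows the description in this explainer; the exact shaping (magnitude scaling, tie handling) is an assumption.

```python
def process_reward(return_with, return_without, alpha=0.1):
    """Paired-branch process reward for one retrieval decision.

    From the same state, one branch retrieves and one does not; the
    rollout margin delta = R_with - R_without decides the reward.
    """
    delta = return_with - return_without
    if delta > 0:
        return alpha    # retrieval helped here: reinforce it
    if delta < 0:
        return -alpha   # retrieval hurt (noise/overload): discourage it
    return 0.0          # no measurable effect (assumed tie handling)
```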

DIAGRAM

Online Evaluation and Ablation Pipeline for PROACTAGENT

This diagram shows how PROACTAGENT is evaluated across SciWorld, AlfWorld, and StuLife with ablations over EXPONEVO, PROACTRL, and the EXPERIENCE BASE.

PROCESS

How PROACTAGENT Handles an Online Lifelong Task

  1. Experience-Enhanced Online Evolution

    EXPONEVO runs as PROACTAGENT interacts with tasks, collecting trajectories and closing the loop between acting, experience accumulation, and policy optimization.

  2. Structured EXPERIENCE BASE Construction

    PROACTAGENT organizes new trajectories into Mf, Me, S+, S−, and S∆, storing factual memory, episodic memory, and behavioral skills with embeddings and priority scores.

  3. Proactive Reinforcement Learning-based Retrieval

    PROACTRL treats retrieval as an explicit action, branching trajectories with and without retrieval to compute process rewards that supervise when and what PROACTAGENT retrieves.

  4. GRPO Optimization with Process Rewards

    PROACTAGENT optimizes its policy using the GRPO objective, combining environment rewards, PROACTRL process rewards, and efficiency penalties for repeated or long trajectories.
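The reward combination in step 4 can be sketched end to end. This is an illustrative reduction, not the paper's objective: the penalty coefficient `lam`, the `step_budget`, and the within-group normalization are assumptions about how environment reward, process rewards, and efficiency penalties combine under GRPO.

```python
def trajectory_return(env_reward, process_rewards, n_steps,
                      lam=0.01, step_budget=30):
    # Environment outcome plus step-level process rewards, minus an
    # efficiency penalty once the trajectory exceeds the step budget.
    penalty = lam * max(0, n_steps - step_budget)
    return env_reward + sum(process_rewards) - penalty

def grpo_advantages(returns):
    # GRPO-style advantages: normalize each rollout's return within
    # its group of rollouts for the same task.
    mu = sum(returns) / len(returns)
    std = (sum((r - mu) ** 2 for r in returns) / len(returns)) ** 0.5
    if std == 0:
        std = 1.0
    return [(r - mu) / std for r in returns]
```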

KEY CONTRIBUTIONS

Key Contributions

  • PROACTAGENT Framework

    PROACTAGENT unifies Experience-Enhanced Online Evolution (EXPONEVO) with a structured EXPERIENCE BASE so that memory and policy co-evolve from interaction history, reaching 73.50% SR on SciWorld.

  • Proactive Reinforcement Learning-based Retrieval

    PROACTRL models retrieval as a policy action and uses paired-branch process rewards to give step-level supervision, adding or subtracting α based on rollout margins ∆i.

  • Typed EXPERIENCE BASE with Skills and Memories

    PROACTAGENT partitions experience into Mf, Me, S+, S−, and S∆, enabling a single query to return complementary factual evidence and behavioral guidance across SciWorld, AlfWorld, and StuLife.
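The "single query, complementary results" idea can be illustrated with a hypothetical partitioned store. The entries and static relevance scores below are invented for illustration, and `"Sd"` stands in for S∆; the point is only that one query fans out across all five partitions.

```python
# Hypothetical partitioned store: each key maps to (text, relevance) pairs.
experience = {
    "Mf": [("the stove is in the kitchen", 0.9)],
    "Me": [("episode 12: activated the stove before adding water", 0.7)],
    "S+": [("check the thermometer before heating", 0.8)],
    "S-": [("avoid opening containers while heating", 0.6)],
    "Sd": [("unlike earlier tasks, this one needs an explicit activate step", 0.5)],
}

def query(store, wanted=("Mf", "Me", "S+", "S-", "Sd"), top_k=1):
    """One query fans out across all partitions so the agent receives
    complementary factual evidence and behavioral guidance at once."""
    out = {}
    for kind in wanted:
        ranked = sorted(store.get(kind, []), key=lambda p: p[1], reverse=True)
        out[kind] = [text for text, _ in ranked[:top_k]]
    return out
```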

RESULTS

By the Numbers

  • SciWorld SR: 73.50% (+18.00 over GRPO+Reflexion)
  • SciWorld Rounds: 18.38 (−9.14 rounds vs GRPO+Reflexion)
  • AlfWorld SR: 71.28% (+4.10 over GRPO+Reflexion)
  • StuLife StuGPA: 19.26 (+5.75 over GRPO+Reflexion)

On SciWorld, AlfWorld, and StuLife, which test long-horizon scientific, embodied, and student-life tasks, PROACTAGENT shows that proactive retrieval plus online co-evolution yields higher success rates and shorter trajectories than GRPO+Reflexion and other baselines.


BENCHMARK

Main results across three lifelong agent benchmarks

Success Rate (SR) comparison on SciWorld for PROACTAGENT and key baselines.

BENCHMARK

Ablation study on SciWorld

Success Rate (SR) for PROACTAGENT and ablated variants on SciWorld.

KEY INSIGHT

The Counterintuitive Finding

PROACTAGENT’s 3B variant reaches 53.50% SR on SciWorld, nearly matching the 7B GRPO+Reflexion at 55.00% while using 14.35 versus 27.52 rounds.

This is surprising because it shows proactive retrieval and process rewards can offset a 4B parameter gap, contradicting the assumption that scaling capacity alone is the main path to better lifelong agents.

WHY IT MATTERS

What this unlocks for the field

PROACTAGENT gives agents a learned sense of when to consult long-term experience, using PROACTRL and the EXPERIENCE BASE instead of fixed retrieval schedules.

Builders can now deploy smaller models that still handle long-horizon SciWorld, AlfWorld, and StuLife tasks by teaching retrieval as a first-class action rather than just expanding context windows.


