MemRL: Self-Evolving Agents via Runtime Reinforcement Learning on Episodic Memory

Authors: Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou et al.

2026

TL;DR

MemRL combines Two-Phase Retrieval with runtime Q-value updates on an Intent-Experience-Utility memory, reaching a 0.979 success rate on ALFWorld versus 0.921 for MemP (+0.058).



THE PROBLEM

Runtime agents stuck with passive retrieval and catastrophic forgetting

MemRL targets deployed agents: fine-tuning them is computationally expensive and prone to catastrophic forgetting, while existing memory methods rely on passive semantic matching that often retrieves noise.

In these settings, frozen-weights systems cannot exploit runtime feedback, so they repeatedly reuse unhelpful experiences and fail to achieve continuous post-deployment improvement.

HOW IT WORKS

MemRL — Intent-Experience-Utility memory with Two-Phase Retrieval

MemRL centers on a structured Intent-Experience-Utility memory bank, a Two-Phase Retrieval mechanism, and a Runtime Utility Update rule that learns Q-values for each memory item.

You can think of MemRL as a brain with a stable cortex (the frozen LLM) and a plastic hippocampus (the memory bank) that tags each episode with a learned usefulness score.

This design lets MemRL selectively reuse high-utility experiences, enabling self-evolution and stable improvement that a plain context window or static RAG cannot provide.
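
To make the memory structure concrete, here is a minimal Python sketch of an Intent-Experience-Utility memory bank. The class and field names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One Intent-Experience-Utility triplet (field names are illustrative)."""
    intent: str           # what the task or user was trying to do
    experience: str       # the episode trace: actions, observations, outcome
    q_value: float = 0.0  # learned utility estimate, updated at runtime
    use_count: int = 0    # how many times this item has been retrieved

class MemoryBank:
    """Episodic memory kept separate from the frozen LLM's weights."""
    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def add(self, intent: str, experience: str) -> None:
        self.items.append(MemoryItem(intent, experience))
```

Because utility lives in the memory items rather than in the model weights, the backbone stays frozen while the bank keeps adapting.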

DIAGRAM

Runtime interaction loop between user, MemRL memory, and environment

This diagram shows how MemRL interacts with the environment over time, retrieving experiences, generating actions, and updating utilities from rewards.

DIAGRAM

Evaluation pipeline and ablation design for MemRL

This diagram shows how MemRL is evaluated across benchmarks and ablations, from datasets through baselines to runtime and transfer metrics.

PROCESS

How MemRL Handles a Runtime Continuous Learning Session

  1. Intent-Experience-Utility Triplet

    MemRL encodes the current user intent and stores it with an experience trace and a utility value Q in the Intent-Experience-Utility triplet memory.

  2. Two-Phase Retrieval

    MemRL first recalls candidates by semantic similarity, then re-ranks them with learned Q-values to form a value-aware context for the frozen LLM.

  3. Runtime Utility Update

    After the environment returns a reward, MemRL applies a Monte Carlo-style update to adjust Q for each experience that was used.

  4. Runtime Learning Loop

    MemRL repeats retrieval, generation, and Q updates across tasks, enabling Runtime Continuous Learning without any backbone weight updates; a minimal sketch of this loop follows the list.
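
Putting the four steps together, here is a minimal sketch of the runtime loop. It assumes Phase 1 scores candidates by embedding dot product and Phase 2 mixes similarity with Q via an illustrative weight `beta`; the running-average Monte Carlo update is the textbook form, and `embed`, `llm`, and `env` are hypothetical callables. The paper's exact scoring and update formulas may differ.

```python
import numpy as np

def two_phase_retrieve(query_emb, memory, embed, k_sim=20, k_final=4, beta=0.5):
    """Phase 1: recall candidates by embedding similarity.
    Phase 2: re-rank by a mix of similarity and learned Q (beta is illustrative)."""
    scored = [(m, float(np.dot(query_emb, embed(m.intent)))) for m in memory.items]
    candidates = sorted(scored, key=lambda x: -x[1])[:k_sim]
    reranked = sorted(candidates,
                      key=lambda x: -(beta * x[1] + (1 - beta) * x[0].q_value))
    return [m for m, _ in reranked[:k_final]]

def runtime_step(task_intent, memory, embed, llm, env):
    """One loop iteration: retrieve, generate with the frozen LLM, update Q."""
    retrieved = two_phase_retrieve(embed(task_intent), memory, embed)
    action = llm(task_intent, context=retrieved)  # frozen weights; value-aware context
    reward = env.step(action)                     # episode-level reward signal
    for m in retrieved:                           # Monte Carlo-style utility update:
        m.use_count += 1                          # Q becomes the running mean of the
        m.q_value += (reward - m.q_value) / m.use_count  # returns seen when used
    return reward
```

Note that only `q_value` and `use_count` change across tasks; retrieval improves while the LLM itself never sees a gradient.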

KEY CONTRIBUTIONS

Key Contributions

  • Runtime learning with Model-Memory decoupling

    MemRL introduces a runtime learning framework that decouples stable reasoning from plastic memory via the Intent-Experience-Utility triplet, reconciling the stability-plasticity dilemma without tuning weights.

  • Two-Phase Retrieval and Utility-Driven Update

    MemRL combines Two-Phase Retrieval with a Utility-Driven Update rule, so retrieval is guided by learned Q-values rather than by semantic similarity alone.

  • Stability analysis and empirical gains

    MemRL proves that its Q estimates are unbiased with bounded variance (a standard-form sketch follows this list) and achieves an average +0.028 transfer success gain over MemP, reaching 0.979 on ALFWorld exploration.
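
The stability claim is intuitive under the standard incremental Monte Carlo estimator, sketched below in its textbook form; the paper's exact update rule may differ in detail.

```latex
% Standard incremental Monte Carlo utility update (textbook form; the
% paper's exact rule may differ). After the n-th use of a memory item,
% with episode return R_n:
\[
  Q_n \;=\; Q_{n-1} + \frac{1}{n}\bigl(R_n - Q_{n-1}\bigr)
      \;=\; \frac{1}{n}\sum_{i=1}^{n} R_i .
\]
% Q_n is then the sample mean of observed returns, so it is unbiased and
% its variance is bounded and shrinks as the item is reused:
\[
  \mathbb{E}[Q_n] = \mathbb{E}[R],
  \qquad
  \operatorname{Var}(Q_n) = \frac{\operatorname{Var}(R)}{n}.
\]
```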

RESULTS

By the Numbers

Exploration Success Rate

0.979

+0.058 over MemP on ALFWorld exploration transfer

OS Task Success Rate

0.746

+0.026 over MemP in Lifelong Agent Bench OS transfer

Average Runtime CSR

0.798

+0.038 CSR over MemP across four benchmarks

Q Success Correlation

0.861

Pearson r between Q estimate bins and downstream task success rate

These metrics come from BigCodeBench, Lifelong Agent Bench, ALFWorld, and HLE, which respectively test code generation, OS and database tasks, embodied exploration, and knowledge-frontier reasoning. The gains show that MemRL converts runtime rewards into better retrieval policies, improving both adaptation and transfer under frozen backbones. The reported Q-success correlation can be checked with a short analysis like the sketch below.
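
To illustrate how a Q-success correlation like the reported r = 0.861 is computed, here is a short sketch using scipy. The bin values are hypothetical placeholders, not the paper's data.

```python
from scipy.stats import pearsonr

# Hypothetical per-bin statistics (placeholders, NOT the paper's data):
# the mean Q estimate of each bin vs. the downstream success rate of
# tasks that retrieved items from that bin.
bin_mean_q = [0.12, 0.31, 0.48, 0.65, 0.83]
bin_success_rate = [0.41, 0.52, 0.63, 0.74, 0.88]

r, p = pearsonr(bin_mean_q, bin_success_rate)
print(f"Pearson r = {r:.3f} (p = {p:.3g})")
```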

BENCHMARK

Transfer Learning results on BigCodeBench, Lifelong Agent Bench and ALFWorld

Success Rate comparison for MemRL and baselines on ALFWorld exploration transfer.

KEY INSIGHT

The Counterintuitive Finding

MemRL’s learned Q critic retains failure memories: roughly 12% of items in the highest-Q bin come from failed episodes, yet that bin's downstream success rate reaches 88.1%.

This is surprising because you might expect only successful episodes to be high-value, but MemRL shows that near-miss failures can still provide strategically useful guidance.

WHY IT MATTERS

What this unlocks for the field

MemRL unlocks frozen-backbone agents that still self-evolve online by learning, via reinforcement learning on memory, which experiences to retrieve.

Builders can now deploy long-lived agents that improve across BigCodeBench, ALFWorld, and OS tasks without any weight updates, avoiding catastrophic forgetting and heavy fine-tuning pipelines.


Related papers

Agent Memory · Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al. · 2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Long-Term Memory

Advancing Open-source World Models

Robbyant Team, Zelin Gao et al. · arXiv 2026

LingBot-World combines a Data Engine, Fundamental World Model, Action-Conditioned World Model, and Post-Training causal adaptation to turn a 28B-parameter video generator into a real-time interactive world simulator. On the VBench benchmark, LingBot-World achieves a dynamic degree of 0.8857 versus 0.7612 for Yume-1.5, while also improving imaging quality to 0.6683.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al. · 2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.
