Memento: Fine-tuning LLM Agents without Fine-tuning LLMs

Authors: Huichi Zhou, Yihang Chen, Siyuan Guo et al.

arXiv 2025

TL;DR

Memento uses a learned case-selection policy over an episodic Case Bank to fine-tune LLM agents without weight updates, reaching 87.88% Pass@3 on GAIA validation.



THE PROBLEM

LLM agents cannot adapt online without costly fine-tuning

Existing LLM agents either rely on static, handcrafted workflows or require expensive gradient updates to LLM parameters, blocking low-cost continual adaptation.

For deep research tasks like GAIA and DeepResearcher, this means deployed agents stay static and cannot learn from successes or failures in changing environments, limiting reliability.

HOW IT WORKS

Memento — Memory-based MDP with case-based reasoning

Memento’s core mechanism connects a Planner and an Executor with three memories (Case Memory, Subtask Memory, and Tool Memory) through a memory-based MDP and a neural case-selection policy.

You can think of Case Memory like a hippocampus-backed card catalog: each solved task becomes an indexed episode that Memento can recall and reuse for similar future problems.

This design lets Memento adapt behaviour online through case retrieval and Q-learning over the Case Bank, instead of relying on a fixed context window or updating LLM weights.
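To make this concrete, here is a minimal sketch of a Case Bank with a Read/Write interface, assuming cosine-similarity retrieval over embedded task states. The Case and CaseBank names, their fields, and the embedding scheme are illustrative assumptions, not Memento’s actual implementation.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Case:
    """One episodic case: the task state, the plan that solved it, and the reward."""
    state_embedding: np.ndarray  # vector encoding of the task/state text
    plan: str                    # decomposed action plan the Planner produced
    reward: float                # outcome signal, e.g. 1.0 success / 0.0 failure


class CaseBank:
    """Append-only episodic memory; Read is similarity search over stored states."""

    def __init__(self) -> None:
        self.cases: list[Case] = []

    def write(self, case: Case) -> None:
        # Online update: every finished trajectory becomes a retrievable case.
        self.cases.append(case)

    def retrieve(self, query: np.ndarray, k: int = 4) -> list[Case]:
        # Cosine similarity between the query state and every stored case state.
        if not self.cases:
            return []
        mat = np.stack([c.state_embedding for c in self.cases])
        sims = mat @ query / (
            np.linalg.norm(mat, axis=1) * np.linalg.norm(query) + 1e-8
        )
        top = np.argsort(sims)[::-1][:k]
        return [self.cases[int(i)] for i in top]
```

Because the bank is append-only and retrieval is a pure read, the agent can keep learning from new episodes without ever touching the underlying model.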

DIAGRAM

Case-based planning and execution loop in Memento

This diagram shows how Memento alternates between Case-Based Planning and Tool-Based Execution while reading and writing episodic cases.

DIAGRAM

Training loop for the case selection policy in Memento

This diagram shows how Memento uses soft Q-learning and episodic memory to update the case retrieval policy without changing LLM weights.

PROCESS

How Memento Handles a Deep Research Task

  1. Stage 1: Case-Based Planning

    The Planner queries Case Memory for the K most similar cases in the Case Bank and uses them with GPT-4.1 to generate a decomposed plan.

  2. Stage 2: Tool-Based Execution

    The Executor reads subtasks from Subtask Memory, consults Tool Memory, and calls MCP tools like search, crawl, and code to complete each subtask.

  3. Case Memory Management

    After task completion, Memento uses Write to append the final state, action plan, and reward to Case Memory, updating the Case Bank online.

  4. Soft Q-Learning for the CBR Agent

    Memento samples transitions from the replay buffer and episodic memory to train the Q-function or kernel network that drives the case retrieval policy; a minimal sketch of the full loop follows this list.
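Putting the four stages together, the sketch below runs one Memento-style episode against the hypothetical CaseBank above. The embed, plan_with_llm, execute_subtask, and score functions are stubs standing in for the real text encoder, the GPT-4.1 Planner, MCP tool calls, and the benchmark grader; only the Write-back to the Case Bank mirrors the flow described in the paper.

```python
import numpy as np

# Stub components so the sketch runs end to end; in Memento these would be a
# real encoder, a GPT-4.1 Planner, MCP tools (search, crawl, code), and a
# task grader. All four are illustrative assumptions.
def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def plan_with_llm(task: str, exemplars: str) -> list[str]:
    return [f"search for: {task}", f"synthesise answer for: {task}"]

def execute_subtask(subtask: str) -> str:
    return f"result of {subtask}"

def score(task: str, results: list[str]) -> float:
    return 1.0  # stand-in for GAIA-style exact-match grading

def run_episode(task: str, bank: CaseBank, k: int = 4) -> float:
    """One episode: retrieve similar cases, plan, execute tools, write back."""
    query = embed(task)
    cases = bank.retrieve(query, k=k)          # Stage 1: case-based planning
    exemplars = "\n".join(f"plan: {c.plan} (reward {c.reward})" for c in cases)
    subtasks = plan_with_llm(task, exemplars)
    results = [execute_subtask(s) for s in subtasks]  # Stage 2: tool execution
    reward = score(task, results)
    # Case memory management: the finished episode becomes a new case online.
    bank.write(Case(state_embedding=query, plan=" | ".join(subtasks), reward=reward))
    return reward
```

Each call to run_episode both exploits past cases and grows the bank, which is exactly what lets the agent improve across tasks without a gradient step.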

KEY CONTRIBUTIONS

Key Contributions

  • Memory-Based Markov Decision Process

    Memento formalises deep research agents as a memory-based MDP with Case Memory, Subtask Memory, and Tool Memory guiding a Planner–Executor architecture.

  • Neural Case Selection Policy

    Memento learns a soft Q-learning based case retrieval policy over the Case Bank, using a kernel network or neural Q-function without updating LLM parameters; a rough sketch follows this list.

  • Deep Research Agent Memento

    Memento achieves 87.88% Pass@3 on GAIA validation, 79.40% on the GAIA test set, and 66.6% F1 and 80.4% PM on DeepResearcher, adding 4.7–9.6 points on out-of-distribution tasks.
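To illustrate the second contribution, the sketch below replaces pure similarity ranking with a learned scorer: a bilinear Q-function over (state, case) embedding pairs, a softmax (soft Q-style) sampling rule, and a regression update toward episode reward. The bilinear form, temperature, and bandit-style update are simplifying assumptions; the paper’s kernel network and full soft Q-learning objective are richer than this.

```python
import numpy as np


class CaseSelectionPolicy:
    """Soft case retrieval: p(c | s) ~ exp(Q(s, c) / tau), with Q(s, c) = c^T W s."""

    def __init__(self, dim: int = 64, tau: float = 0.5, lr: float = 1e-2):
        self.W = np.zeros((dim, dim))  # bilinear weights of the Q-function
        self.tau = tau                 # softmax temperature
        self.lr = lr                   # learning rate for the scorer only

    def q_values(self, state: np.ndarray, case_embs: np.ndarray) -> np.ndarray:
        # One Q-value per stored case: each row of case_embs dotted with W @ state.
        return case_embs @ (self.W @ state)

    def sample(self, state: np.ndarray, case_embs: np.ndarray) -> int:
        # Soft (Boltzmann) policy over the Case Bank instead of a greedy argmax.
        q = self.q_values(state, case_embs) / self.tau
        p = np.exp(q - q.max())
        p /= p.sum()
        return int(np.random.choice(len(p), p=p))

    def update(self, state, case_embs, chosen: int, reward: float) -> None:
        # Regress Q(s, c_chosen) toward the observed episode reward.
        # Only these weights learn; the LLM's parameters are never updated.
        q = self.q_values(state, case_embs)
        td_error = q[chosen] - reward
        self.W -= self.lr * td_error * np.outer(case_embs[chosen], state)
```

Because the only trainable parameters live in W, adaptation stays cheap: new experience reshapes retrieval behaviour while the frozen LLM is left untouched.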

RESULTS

By the Numbers

  • GAIA validation Pass@3: 87.88% (+0.61 over Alita’s 87.27 average score)

  • GAIA test accuracy: 79.40% (+2.32 over Aworld’s 77.08 average score)

  • DeepResearcher Avg F1: 66.6% (+14.8 over DeepResearcher’s 51.8)

  • DeepResearcher Avg PM: 80.4% (+19.9 over DeepResearcher’s 60.5)

On GAIA, a long-horizon tool-use benchmark, and DeepResearcher’s seven open-domain QA datasets, these results show that Memento’s case-based continual learning can match or exceed training-based agents without LLM fine-tuning.


BENCHMARK

Performance comparison on DeepResearcher benchmarks

Average F1 across seven open-domain QA datasets from the DeepResearcher benchmark.

BENCHMARK

Effect of case-based reasoning on SimpleQA

Accuracy on SimpleQA factual QA with and without case-based reasoning.

KEY INSIGHT

The Counterintuitive Finding

Memento with a fast GPT-4.1 planner and an o3 executor reaches 70.91% Pass@1 on GAIA validation, beating a slower o3 planner at 63.03%.

This is surprising because slower, more deliberative planners are expected to reason better, yet Memento shows that concise planning plus specialised execution beats overthinking.

WHY IT MATTERS

What this unlocks for the field

Memento unlocks continual, online improvement of LLM agents via case-based reinforcement learning over external memory, without any gradient updates to LLM weights.

Builders can now deploy deep research agents that learn from every trajectory in real time, scaling to open ended tasks without retraining or rebuilding prompts from scratch.


