Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Authors: Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei

2026

TL;DR

Memex(RL) trains Memex’s Indexed Experience Memory to compress trajectories into indexed summaries, boosting task success on modified ALFWorld from 24.22% to 85.61% while shrinking peak context.



THE PROBLEM

Long-horizon agents overflow context and lose evidence

Memex(RL) targets long-horizon agents whose peak working context grows to 16,934.46 tokens, more than double the 8,000-token context penalty threshold, degrading decisions.

In modified ALFWorld, Memex(RL) shows that without structured memory, tool-heavy workflows either exceed context budgets or rely on lossy summaries that drop crucial logs, IDs, and tool outputs.

HOW IT WORKS

Memex and Indexed Experience Memory

Memex(RL) centers on Indexed Experience Memory, combining IndexedSummary, CompressExperience, ReadExperience, and an external experience store D with ContextStatus-driven control.

Think of Memex(RL) as RAM plus disk: the indexed summary is fast working memory, while the key–value store D is a long-term archive addressed by stable indices.

This design lets Memex(RL) keep a compact, pointer-heavy context yet deterministically dereference exact past artifacts, something a plain context window or fuzzy semantic retrieval cannot guarantee.
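The RAM-plus-disk analogy above can be sketched as a tiny key–value memory. This is an illustrative sketch, not the paper's actual API: the class and method names are assumptions, mapping CompressExperience and ReadExperience onto dictionary writes and lookups.

```python
# Illustrative sketch of Indexed Experience Memory (names are hypothetical,
# not the paper's API). The working context holds short indexed summaries;
# the external store D maps each index to the full archived artifact.

class IndexedExperienceMemory:
    def __init__(self):
        self.store = {}      # external experience store D: index -> full content
        self.summary = []    # in-context IndexedSummary: (index, one-line gist)

    def compress(self, index, content, gist):
        """CompressExperience: archive the full content, keep only a pointer."""
        self.store[index] = content
        self.summary.append((index, gist))

    def read(self, index):
        """ReadExperience: deterministically dereference an exact past artifact."""
        return self.store[index]

mem = IndexedExperienceMemory()
mem.compress("tool_call_3", "full 2,000-token grep output ...", "grep results for config paths")
assert mem.read("tool_call_3").startswith("full 2,000-token grep output")
```

Because retrieval is an exact dictionary lookup rather than fuzzy semantic search, a compressed artifact is always recoverable verbatim, which is the guarantee the paragraph above contrasts with plain context windows.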

DIAGRAM

Memex agent loop with indexed compression and retrieval

This diagram shows how Memex(RL) runs the Memex agent loop from context monitoring through compression, retrieval, tool calls, and final Finish.

DIAGRAM

MemexRL training and evaluation pipeline

This diagram shows how Memex(RL) samples rollouts, computes memory-aware rewards, updates the policy, and evaluates Memex on modified ALFWorld.

PROCESS

How Memex(RL) Handles a Long-Horizon Tool-Use Episode

  1. Memex Agent Loop Initialization

    Memex(RL) initializes M = [m_0, u], empties the external experience store D, and sets the answer placeholder before any memory operations.

  2. ContextStatus and Tool Decisions

    At each step, Memex(RL) appends ContextStatus(M, τ) to the working context, then the policy π_agent emits thinking z_t and a tool call c_t, which may be CompressExperience, ReadExperience, or Finish.

  3. CompressExperience Operation

    When c_t is CompressExperience, Memex(RL) writes each (index, content) pair into D and rewrites the working context to [m_0, u, IndexedSummary] to shrink the prompt.

  4. ReadExperience and Finish

    When c_t is ReadExperience, Memex(RL) dereferences D[index] and appends it to the context; when c_t is Finish(y), Memex(RL) returns the final answer y.
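The four steps above can be sketched as a single control loop. This is a hedged toy implementation under stated assumptions: the word-count token proxy, the scripted policy, and the tool-call argument shapes are all illustrative, not the paper's actual formats.

```python
# Toy sketch of the Memex agent loop (tool names follow the text above;
# token counting and the scripted policy are illustrative assumptions).

def context_status(context, threshold=8000):
    tokens = sum(len(m.split()) for m in context)  # crude word-count proxy for tokens
    return f"ContextStatus: {tokens} tokens (threshold {threshold})"

def run_episode(policy, m0, u, max_steps=50):
    context, D, summary = [m0, u], {}, []          # working memory M and store D
    for _ in range(max_steps):
        tool, arg = policy(context + [context_status(context)])
        if tool == "CompressExperience":
            for index, content, gist in arg:       # archive full artifacts in D
                D[index] = content
                summary.append(f"[{index}] {gist}")
            context = [m0, u] + summary            # rewrite M = [m0, u, IndexedSummary]
        elif tool == "ReadExperience":
            context.append(D[arg])                 # dereference D[index] into context
        elif tool == "Finish":
            return arg                             # return the final answer y
        else:
            context.append(f"{tool}({arg}) -> output")  # ordinary tool call
    return None

# Usage with a scripted stand-in policy: compress, then read back, then finish.
script = iter([
    ("CompressExperience", [("step1", "long tool log ...", "log of step 1")]),
    ("ReadExperience", "step1"),
    ("Finish", "done"),
])
answer = run_episode(lambda ctx: next(script), "system prompt m0", "user task u")
```

The key structural point the sketch preserves is that CompressExperience rewrites the context rather than appending to it, which is what keeps the prompt bounded across long episodes.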

KEY CONTRIBUTIONS

Key Contributions

  • Indexed Experience Memory

    Memex(RL) formalizes Indexed Experience Memory as an in-context IndexedSummary plus an external experience store D, enabling explicit dereferencing instead of lossy summaries.

  • MemexRL reinforcement learning framework

    Memex(RL) introduces a GRPO-style RL framework with context overflow, redundant tool call, and format penalties, and segmented trajectories aligned with CompressExperience boundaries.

  • Theoretical and empirical analysis of Memex loop

    Memex(RL) proves that bounded dereferencing can match full-context optimal policies and empirically raises ALFWorld success from 24.22% to 85.61% while cutting peak context by about 6,300 tokens.
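The three penalty terms named in the second contribution can be sketched as a memory-aware reward. The penalty weights below are made-up placeholders, not the paper's values; only the penalty categories (context overflow, redundant tool calls, format errors) and the 8,000-token threshold come from the text.

```python
# Illustrative memory-aware reward in the spirit of the MemexRL penalties
# described above. Weights w_ctx, w_red, w_fmt are hypothetical placeholders.

def memory_aware_reward(success, peak_tokens, redundant_calls, format_errors,
                        threshold=8000, w_ctx=1e-4, w_red=0.1, w_fmt=0.2):
    r = 1.0 if success else 0.0
    r -= w_ctx * max(0, peak_tokens - threshold)   # context-overflow penalty
    r -= w_red * redundant_calls                   # redundant tool-call penalty
    r -= w_fmt * format_errors                     # format penalty
    return r

# A successful episode that stays under the 8,000-token threshold keeps full reward;
# overflowing the threshold or calling tools redundantly erodes it.
full = memory_aware_reward(True, 7500, 0, 0)
penalized = memory_aware_reward(True, 9000, 1, 0)
```

In a GRPO-style setup, rewards like this would be computed per rollout and compared within a sampled group, so the agent is pushed toward policies that succeed while staying under the threshold.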

RESULTS

By the Numbers

Task success rate

85.61%

+61.39 points over Memex without RL (24.22%)

Peak working context

9,634.47 tokens

-6,299.99 tokens vs Memex without RL (16,934.46 tokens)

Context threshold

8,000 tokens

Penalty threshold used during Memex(RL) training

Context window size

32,000 tokens

Total context window available to Memex(RL) during training

On a modified ALFWorld benchmark with hidden admissible commands and truncated summaries, Memex(RL) shows that Indexed Experience Memory can dramatically raise task success while keeping peak working context near the 8,000-token penalty threshold.


BENCHMARK

Effectiveness of MemexRL on Modified ALFWorld

Task success rate (%) for Memex(RL) versus the same Memex agent without RL training.

KEY INSIGHT

The Counterintuitive Finding

Memex(RL) reduces the mean CompressExperience calls per episode from about 6.5 to about 3, while ReadExperience calls rise from about 1 to around 6–7.

This is surprising because one might expect more compression to save context, but Memex(RL) instead learns to compress less often and rely on precise retrieval from Indexed Experience Memory.

WHY IT MATTERS

What this unlocks for the field

Memex(RL) unlocks long-horizon LLM agents that keep a small, indexed working state while still accessing exact historical tool outputs and code snippets on demand.

Builders can now design agents that scale to dozens or hundreds of steps under tight context budgets without sacrificing decision quality or relying solely on lossy summarization.


