M+: Extending MemoryLLM with Scalable Long-Term Memory

Authors: Yu Wang, Dmitry Krotov, Yuanzhe Hu et al.

2025

TL;DR

M+ adds a co-trained long-term latent memory on top of MemoryLLM’s short-term pool, extending knowledge retention from under 20k to over 160k tokens at a similar GPU memory footprint.



THE PROBLEM

Latent-memory LLMs forget information beyond 20k tokens

MemoryLLM compresses history into a 1B-parameter latent memory pool but “struggles to retain knowledge beyond 20k tokens.”

This limitation hurts long-context tasks such as LongBook-QA and knowledge retention, where important facts can appear over 100k tokens earlier, causing answer accuracy to collapse.

HOW IT WORKS

M+: Long-term latent memory with a co-trained retriever

M+ combines short-term memory θ, long-term memory Θ, a retriever with fq and fk, and a Multi-LoRA design on top of Llama-3.1-8B.

You can think of θ as fast RAM and Θ as slower but much larger disk, with the retriever acting like an index that pulls only relevant blocks.

This architecture lets M+ recall and reuse information injected over 160k tokens ago, far beyond what a fixed context window or plain MemoryLLM can access.
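The RAM/disk analogy can be made concrete with a short Python sketch. This is a toy model under assumed mechanics: the class name, capacities, and first-in-first-out eviction order are illustrative, not the authors' implementation.

```python
from collections import deque

class MPlusMemoryHierarchy:
    """Toy two-level memory: short-term pool theta and long-term store Theta."""

    def __init__(self, short_capacity, long_capacity):
        self.theta = deque()  # fast, bounded short-term pool (the "RAM")
        self.Theta = deque()  # large, CPU-resident long-term store (the "disk")
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity  # plays the role of the cap M

    def ingest(self, latent_tokens):
        for tok in latent_tokens:
            self.theta.append(tok)
            if len(self.theta) > self.short_capacity:
                # Tokens evicted from theta move into Theta
                # instead of being discarded outright.
                self.Theta.append(self.theta.popleft())
            if len(self.Theta) > self.long_capacity:
                # Theta itself is bounded: oldest entries leave first.
                self.Theta.popleft()

mem = MPlusMemoryHierarchy(short_capacity=10, long_capacity=15)
mem.ingest(range(30))
# theta now holds the 10 newest tokens; Theta holds the 15 next-oldest.
```

In the real system the capacities are far larger (the paper cites N=10240 slots for θ and M=150k for Θ), and the stored items are per-layer hidden states rather than integers.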

DIAGRAM

Update and generation flow across short-term and long-term memory

This diagram shows how M+ updates θ and Θ during ingestion and retrieves K0 tokens from Θ during generation at each transformer layer.

DIAGRAM

Three-stage training curriculum and ablation design

This diagram shows how M+ is trained in three stages and how MemoryLLM-8B, MemoryLLM-8B-Long, and M+ are compared in ablations.

PROCESS

How M+ Handles a Long-Context Question Answering Task

  1.

    Update Process

    During the Update Process, M+ feeds each chunk through ϕ, updates short-term memory θ, and evicts K tokens per layer, retaining them rather than discarding them.

  2.

    Equipping MemoryLLM with Long-Term Memory

    In Equipping MemoryLLM with Long-Term Memory, M+ stores dropped tokens into long-term memory Θ with ages and enforces a maximum size M=150k.

  3.

    Retriever Design and Training

    In Retriever Design and Training, M+ learns fq and fk so query hidden states can retrieve relevant Θ tokens via dot products against stored keys.

  4.

    Training with long-term memory

    During Training with long-term memory (Stage 3), M+ uses SlimPajama long documents so ϕ learns to interpret retrieved Θ tokens alongside θ during generation.
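Steps 2–3 above can be sketched with NumPy. Here `f_q` and `f_k` stand in for the learned projections (random matrices, not trained weights), and all dimensions are illustrative rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_proj, k_retrieve = 64, 8, 4  # hidden dim, projection dim, tokens to fetch

# Stand-ins for the learned projections f_q and f_k (random, untrained).
f_q = rng.standard_normal((d_proj, d)) / np.sqrt(d)
f_k = rng.standard_normal((d_proj, d)) / np.sqrt(d)

# Long-term memory Theta: stored latent tokens, each tagged with an age.
long_term_tokens = rng.standard_normal((1000, d))
ages = np.arange(1000)[::-1]  # earlier-stored entries carry larger ages

# Keys can be projected once at storage time...
stored_keys = long_term_tokens @ f_k.T  # shape (1000, d_proj)

# ...so generation-time retrieval is one low-dim matmul plus a top-k.
def retrieve(query_hidden, k):
    scores = stored_keys @ (f_q @ query_hidden)  # dot products against keys
    top = np.argsort(scores)[-k:][::-1]          # highest-scoring entries
    return long_term_tokens[top], ages[top]

tokens, token_ages = retrieve(rng.standard_normal(d), k_retrieve)
```

During training, f_q and f_k would be optimized so that queries score their genuinely relevant memories highest; here they are frozen random maps purely to show the data flow.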

KEY CONTRIBUTIONS

Key Contributions

  •

    Equipping MemoryLLM with Long-Term Memory

    M+ augments short-term memory θ with long-term memory Θ stored on CPU, using K=256, N=10240, and M=150k to retain dropped tokens instead of discarding them.

  •

    Co-trained retriever for efficient memory retrieval

    M+ introduces a retriever with fq and fk that projects hidden states to dimension dproj=d/20 and retrieves only once per layer for all heads, reducing latency.

  •

    Three-stage long-context training curriculum

    M+ uses continual training on fineweb-edu, then SlimPajama documents of 4k–64k tokens, then training with long-term memory (Stage 3), progressively improving long-context loss and retention beyond 160k tokens.
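A back-of-the-envelope calculation shows why the shared low-dimensional retrieval matters for latency. The numbers below are illustrative assumptions (Llama-3.1-8B's hidden size and head count, plus the paper's dproj = d/20 and M = 150k), not measured figures:

```python
# Multiply-adds needed to score every stored memory token once.
d = 4096          # Llama-3.1-8B hidden size (assumed)
n_heads = 32      # attention heads (assumed)
d_proj = d // 20  # the retriever's projected dimension, d/20
M = 150_000       # long-term memory capacity

naive_per_head = n_heads * M * d  # full-dim retrieval repeated per head
shared_low_dim = M * d_proj       # M+: one low-dim retrieval per layer

speedup = naive_per_head / shared_low_dim
print(f"d_proj = {d_proj}, scoring-cost ratio ~ {speedup:.0f}x")
```

Under these assumptions, sharing one projected retrieval per layer cuts the key-scoring cost by two to three orders of magnitude relative to a naive per-head, full-dimension scan.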

RESULTS

By the Numbers

  • LongBench Avg (16k): Llama-3.1-8B 35.81 vs M+ 32.78 (+3.03 for Llama-3.1-8B)

  • LongBench Avg (8k): Llama-3.1-8B 33.39 vs M+ 31.00 (+2.39 for Llama-3.1-8B)

  • GPU Memory Cost: M+ 21177.76 MB vs Llama-3.1-8B-SnapKV 30422.70 MB (−9296.73 MB)

  • GPU Memory with Offloading: M+ 17973.34 MB vs Llama-3.1-8B-16k 19239.21 MB (−1451.87 MB)

On LongBench, Llama-3.1-8B reaches a 35.81 average F1 at 16k while M+ reaches 32.78, trading a 3.03-point drop for much longer retention. On GPU memory, M+ uses 21177.76 MB (17973.34 MB with offloading) versus 32574.49 MB for Llama-3.1-8B-SnapKV, showing that M+ extends retention without increasing its GPU footprint.

BENCHMARK

GPU Memory Cost Comparison across Methods

Maximum GPU memory allocated (MB) during inference across long-context benchmarks.

KEY INSIGHT

The Counterintuitive Finding

Even with a 48k prompt, Llama-3.1-8B-SnapKV “struggles to recall information injected more than 30k tokens earlier,” while M+ retains beyond 160k.

This is surprising because larger key-value caches were expected to help, but M+ shows that explicitly stored latent memory with a trained retriever is more effective than attention-based cache selection.

WHY IT MATTERS

What this unlocks for the field

M+ unlocks practical long-term knowledge retention, recalling information injected over 160k tokens earlier using Θ and a co-trained retriever on commodity GPUs.

Builders can now design applications like book-scale QA, lifelong agents, and multi-session reasoning that were impractical with fixed context windows or pure KV-cache tricks.


