M+: Extending MemoryLLM with Scalable Long-Term Memory

Authors: Yu Wang, Dmitry Krotov, Yuanzhe Hu et al.

2025

TL;DR

M+ adds a co-trained long-term latent memory on top of MemoryLLM’s short-term pool, extending knowledge retention from under 20k to over 160k tokens at a similar GPU memory footprint.



THE PROBLEM

Latent-memory LLMs forget information beyond 20k tokens

MemoryLLM compresses history into a 1B-parameter latent memory pool but “struggles to retain knowledge beyond 20k tokens.”

This limitation hurts long-context tasks such as LongBook-QA and knowledge retention, where important facts can appear over 100k tokens earlier, causing answer accuracy to collapse.

HOW IT WORKS

M+: Long-term latent memory with a co-trained retriever

M+ combines short-term memory θ, long-term memory Θ, a retriever with fq and fk, and a Multi-LoRA design on top of Llama-3.1-8B.

You can think of θ as fast RAM and Θ as slower but much larger disk, with the retriever acting like an index that pulls only relevant blocks.

This architecture lets M+ recall and reuse information injected over 160k tokens ago, far beyond what a fixed context window or plain MemoryLLM can access.
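The RAM/disk analogy can be made concrete with a short Python sketch. This is a toy model under assumed mechanics: the class name, capacities, and first-in-first-out eviction order are illustrative, not the authors' implementation.

```python
from collections import deque

class MPlusMemoryHierarchy:
    """Toy two-level memory: short-term pool theta and long-term store Theta."""

    def __init__(self, short_capacity, long_capacity):
        self.theta = deque()  # fast, bounded short-term pool (the "RAM")
        self.Theta = deque()  # large, CPU-resident long-term store (the "disk")
        self.short_capacity = short_capacity
        self.long_capacity = long_capacity  # plays the role of the cap M

    def ingest(self, latent_tokens):
        for tok in latent_tokens:
            self.theta.append(tok)
            if len(self.theta) > self.short_capacity:
                # Tokens evicted from theta move into Theta
                # instead of being discarded outright.
                self.Theta.append(self.theta.popleft())
            if len(self.Theta) > self.long_capacity:
                # Theta itself is bounded: oldest entries leave first.
                self.Theta.popleft()

mem = MPlusMemoryHierarchy(short_capacity=10, long_capacity=15)
mem.ingest(range(30))
# theta now holds the 10 newest tokens; Theta holds the 15 next-oldest.
```

In the real system the capacities are far larger (the paper cites N=10240 slots for θ and M=150k for Θ), and the stored items are per-layer hidden states rather than integers.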

DIAGRAM

Update and generation flow across short-term and long-term memory

This diagram shows how M+ updates θ and Θ during ingestion and retrieves K0 tokens from Θ during generation at each transformer layer.

DIAGRAM

Three-stage training curriculum and ablation design

This diagram shows how M+ is trained in three stages and how MemoryLLM-8B, MemoryLLM-8B-Long, and M+ are compared in ablations.

PROCESS

How M+ Handles a Long-Context Question Answering Task

  1.

    Update Process

    During the Update Process, M+ feeds each chunk through ϕ, updates short-term memory θ, and evicts K tokens per layer, retaining them rather than discarding them.

  2.

    Equipping MemoryLLM with Long-Term Memory

    In Equipping MemoryLLM with Long-Term Memory, M+ stores dropped tokens into long-term memory Θ with ages and enforces a maximum size M=150k.

  3.

    Retriever Design and Training

    In Retriever Design and Training, M+ learns fq and fk so query hidden states can retrieve relevant Θ tokens via dot products against stored keys.

  4.

    Training with long-term memory

    During Training with long-term memory (Stage 3), M+ uses SlimPajama long documents so ϕ learns to interpret retrieved Θ tokens alongside θ during generation.
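Steps 2–3 above can be sketched with NumPy. Here `f_q` and `f_k` stand in for the learned projections (random matrices, not trained weights), and all dimensions are illustrative rather than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_proj, k_retrieve = 64, 8, 4  # hidden dim, projection dim, tokens to fetch

# Stand-ins for the learned projections f_q and f_k (random, untrained).
f_q = rng.standard_normal((d_proj, d)) / np.sqrt(d)
f_k = rng.standard_normal((d_proj, d)) / np.sqrt(d)

# Long-term memory Theta: stored latent tokens, each tagged with an age.
long_term_tokens = rng.standard_normal((1000, d))
ages = np.arange(1000)[::-1]  # earlier-stored entries carry larger ages

# Keys can be projected once at storage time...
stored_keys = long_term_tokens @ f_k.T  # shape (1000, d_proj)

# ...so generation-time retrieval is one low-dim matmul plus a top-k.
def retrieve(query_hidden, k):
    scores = stored_keys @ (f_q @ query_hidden)  # dot products against keys
    top = np.argsort(scores)[-k:][::-1]          # highest-scoring entries
    return long_term_tokens[top], ages[top]

tokens, token_ages = retrieve(rng.standard_normal(d), k_retrieve)
```

During training, f_q and f_k would be optimized so that queries score their genuinely relevant memories highest; here they are frozen random maps purely to show the data flow.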

KEY CONTRIBUTIONS

Key Contributions

  •

    Equipping MemoryLLM with Long-Term Memory

    M+ augments short-term memory θ with long-term memory Θ stored on CPU, using K=256, N=10240, and M=150k to retain dropped tokens instead of discarding them.

  •

    Co-trained retriever for efficient memory retrieval

    M+ introduces a retriever with fq and fk that projects hidden states to dimension dproj=d/20 and retrieves only once per layer for all heads, reducing latency.

  •

    Three-stage long-context training curriculum

    M+ uses continual training on fineweb-edu, then SlimPajama documents of 4k–64k tokens, then training with long-term memory (Stage 3), progressively improving long-context loss and retention beyond 160k tokens.
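A back-of-the-envelope calculation shows why the shared low-dimensional retrieval matters for latency. The numbers below are illustrative assumptions (Llama-3.1-8B's hidden size and head count, plus the paper's dproj = d/20 and M = 150k), not measured figures:

```python
# Multiply-adds needed to score every stored memory token once.
d = 4096          # Llama-3.1-8B hidden size (assumed)
n_heads = 32      # attention heads (assumed)
d_proj = d // 20  # the retriever's projected dimension, d/20
M = 150_000       # long-term memory capacity

naive_per_head = n_heads * M * d  # full-dim retrieval repeated per head
shared_low_dim = M * d_proj       # M+: one low-dim retrieval per layer

speedup = naive_per_head / shared_low_dim
print(f"d_proj = {d_proj}, scoring-cost ratio ~ {speedup:.0f}x")
```

Under these assumptions, sharing one projected retrieval per layer cuts the key-scoring cost by two to three orders of magnitude relative to a naive per-head, full-dimension scan.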

RESULTS

By the Numbers

  • LongBench Avg (16k): Llama-3.1-8B 35.81 vs M+ 32.78 (+3.03 for Llama-3.1-8B)

  • LongBench Avg (8k): Llama-3.1-8B 33.39 vs M+ 31.00 (+2.39 for Llama-3.1-8B)

  • GPU Memory Cost: M+ 21177.76 MB vs Llama-3.1-8B-SnapKV 30422.70 MB (−9296.73 MB)

  • GPU Memory with Offloading: M+ 17973.34 MB vs Llama-3.1-8B-16k 19239.21 MB (−1451.87 MB)

On LongBench, Llama-3.1-8B reaches a 35.81 average F1 at 16k while M+ reaches 32.78, trading a 3.03-point drop for much longer retention. On GPU memory, M+ uses 21177.76 MB (17973.34 MB with offloading) versus 32574.49 MB for Llama-3.1-8B-SnapKV, showing that M+ extends retention without increasing its GPU footprint.

BENCHMARK

GPU Memory Cost Comparison across Methods

Maximum GPU memory allocated (MB) during inference across long-context benchmarks.

KEY INSIGHT

The Counterintuitive Finding

Even with a 48k prompt, Llama-3.1-8B-SnapKV “struggles to recall information injected more than 30k tokens earlier,” while M+ retains beyond 160k.

This is surprising because larger key-value caches were expected to help, but M+ shows that explicitly stored latent memory with a trained retriever is more effective than attention-based cache selection.

WHY IT MATTERS

What this unlocks for the field

M+ unlocks practical long-term knowledge retention, recalling information injected over 160k tokens earlier using Θ and a co-trained retriever on commodity GPUs.

Builders can now design applications like book-scale QA, lifelong agents, and multi-session reasoning that were impractical with fixed context windows or pure KV-cache tricks.


