MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu et al.

2025

TL;DR

MEM1 trains a single evolving internal state with masked RL trajectories to keep context near-constant, reaching EM 1.97 on 16-objective QA, roughly 3.5× the 0.567 of Qwen2.5-14B-Instruct.



THE PROBLEM

Long-horizon agents degrade as context grows without bound (baselines use 3.7× more peak tokens on 16-objective QA)

Most long-horizon agents append all past turns, causing unbounded context growth and degraded reasoning on out-of-distribution input lengths.

On a 16-objective multi-hop QA task, Qwen2.5-14B-Instruct needs 38.4×10² peak tokens and still drops to EM 0.567, wasting memory, compute, and accuracy.

HOW IT WORKS

MEM1: learning 1-step integrated reasoning and consolidation

MEM1 centers on a compact Internal State (IS), Masked Trajectory for Policy Optimization, 2D Attention Mask, and Multi-Objective Task Design to fuse memory and reasoning.

Think of MEM1 like a human using a single evolving notebook page: each turn rewrites the page with only the essentials, discarding clutter.

This unified consolidation lets MEM1 maintain near-constant context while still chaining many environment interactions, something a plain context window cannot sustain.
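The consolidation loop above can be contrasted with plain context appending in a few lines. This is an illustrative sketch, not the paper's API: `generate` stands in for any LLM call, and the tag format is an assumption.

```python
# Sketch: plain context appending vs. MEM1-style per-turn consolidation.
# `generate` stands in for any LLM call; all names are illustrative.

def append_agent_turn(context: list[str], info: str, generate) -> list[str]:
    """Baseline: keep every past turn, so the prompt grows without bound."""
    context = context + [info]
    context.append(generate(context))  # reason over the whole history
    return context

def mem1_agent_turn(internal_state: str, info: str, generate) -> str:
    """MEM1-style: each turn rewrites a single internal state <IS_t> that
    merges the previous state with the newest observation, so the prompt
    fed to the model stays roughly constant in size."""
    prompt = f"<IS>{internal_state}</IS>\n<info>{info}</info>"
    return generate(prompt)  # the new consolidated state <IS_{t+1}>
```

The baseline's context gains at least two entries per turn, while the MEM1-style turn always feeds the model exactly one internal state plus the newest observation.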

DIAGRAM

MEM1 rollout and masked trajectory during RL training

This diagram shows how MEM1 rolls out multi-turn interactions, prunes context each turn, and reconstructs a masked trajectory for PPO updates.

DIAGRAM

MEM1 evaluation pipeline on multi-objective QA and WebShop

This diagram shows how MEM1 is trained on 2-objective QA and WebShop, then evaluated on longer-horizon QA and web navigation benchmarks.

PROCESS

How MEM1 Handles a Multi-Objective Multi-Hop QA Session

  1. Memory as Part of Reasoning

    MEM1 initializes an Internal State (IS) and, at each turn, generates a new <IS_t> that summarizes past information and plans the next actions.

  2. Multi-Objective Task Design

    MEM1 receives composite questions built from interleaved HotpotQA and Natural Questions items, forcing multiple queries and integrated reasoning across objectives.

  3. Masked Trajectory for Policy Optimization

    MEM1 reconstructs a stitched trajectory of <IS_t>, <query_t>, <info_t> tuples and applies a 2D Attention Mask so attention respects the consolidated memory at each token.

  4. Reward Assignment and Policy Update

    MEM1 computes rewards from exact match or the environment reward, applies a KL penalty, and runs PPO updates on the masked trajectories for both actor and critic.
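The masked-trajectory idea in step 3 can be sketched as a block-wise causal mask. This is a toy, not the paper's exact layout: each segment stands for one turn's stitched tokens, and the key property shown is that attention to already-consolidated history is blocked while attention within the turn stays causal.

```python
import numpy as np

def build_2d_mask(segment_lengths: list[int]) -> np.ndarray:
    """Toy 2D attention mask over a stitched trajectory.

    Each segment represents one turn's (<IS_t>, <query_t>, <info_t>) tokens.
    Tokens attend causally within their own segment only, mirroring the
    pruned context the policy actually saw at rollout time; attention to
    stale, already-consolidated tokens in earlier segments is blocked.
    """
    total = sum(segment_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in segment_lengths:
        end = start + length
        # causal (lower-triangular) attention inside the segment
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        start = end
    return mask
```

Stitching all turns into one sequence with such a mask lets a single forward pass score every turn's tokens for PPO while reproducing the per-turn pruned context.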

KEY CONTRIBUTIONS


  • Learning to Synergize Memory and Reasoning

    MEM1 trains Internal State (IS) updates so reasoning traces double as working memory, enabling near-constant context even on 16-objective tasks with 10.4×10² peak tokens.

  • Masked Trajectory for Policy Optimization

    MEM1 introduces a 2D Attention Mask over stitched trajectories, ensuring PPO advantages and KL penalties respect the per-turn consolidated memory.

  • Multi-Objective Task Design for Long Horizons

    MEM1 composes existing QA datasets into N-objective tasks, training on 2-objective data yet generalizing to 16-objective QA with EM 1.97.
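The N-objective composition can be sketched as follows. The function name, field names, and prompt wording are illustrative assumptions, not the paper's code; the point is that N single QA items (e.g. from HotpotQA and Natural Questions) are bundled into one composite task.

```python
import random

def compose_multi_objective(pool: list[dict], n: int, seed: int = 0) -> dict:
    """Compose an N-objective task by sampling N single QA items from a
    pool and joining them into one composite prompt. The agent must
    answer all N objectives, forcing multiple queries per episode.
    (Illustrative sketch; field names are assumptions.)"""
    rng = random.Random(seed)
    picked = rng.sample(pool, n)
    prompt = "Answer all of the following questions:\n" + "\n".join(
        f"{i + 1}. {item['question']}" for i, item in enumerate(picked)
    )
    answers = [item["answer"] for item in picked]
    return {"prompt": prompt, "answers": answers}
```

Training uses n=2; at evaluation time the same recipe with n up to 16 produces the longer-horizon tasks, which is what makes the 2-to-16-objective generalization measurable.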

RESULTS

By the Numbers

EM on 16-objective QA

1.97

+1.403 over Qwen2.5-14B-Instruct (0.567)

Peak Token Usage (16-objective)

10.4×10² tokens

27.1% of Qwen2.5-14B-Instruct (38.4×10²)

Inference Time (16-objective)

8.70 s

29.3% of Qwen2.5-14B-Instruct (29.7 s)

WebShop Avg Final Reward

70.87

+7.27 over AgentLM-7B (63.60) with 2.8× lower peak tokens

On multi-objective multi-hop QA built from HotpotQA and Natural Questions, MEM1-7B is trained on 2-objective tasks and evaluated on up to 16 objectives. The 16-objective EM of 1.97 with 10.4×10² peak tokens shows MEM1 can scale horizon length while keeping memory and latency low.


BENCHMARK

Multi-objective 16-Objective QA: Exact Match Comparison

Exact Match (EM) on 16-objective multi-hop QA compositions.

KEY INSIGHT

The Counterintuitive Finding

MEM1-7B, trained only on 2-objective QA, reaches EM 1.97 on 16-objective QA, beating Qwen2.5-14B-Instruct at EM 0.567.

This is surprising because longer horizons usually hurt smaller models, yet MEM1’s constant-memory Internal State lets a 7B model surpass a 14B baseline as objectives increase.

WHY IT MATTERS

What this unlocks for the field

MEM1 shows that long-horizon agents can learn to compress interaction history into a single evolving Internal State without external memory modules.

Builders can now design research or web agents that handle 16-objective workflows with roughly constant context size, reducing GPU memory and inference time while retaining accuracy.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
