MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

Authors: Zijian Zhou, Ao Qu, Zhaoxuan Wu et al.

2025

TL;DR

MEM1 trains a single evolving internal state with masked RL trajectories to keep context near-constant, reaching EM 1.97 on 16-objective QA, roughly 3.5× the 0.567 of Qwen2.5-14B-Instruct.



THE PROBLEM

Long-horizon agents degrade as context grows without bound (baselines use 3.7× more peak tokens on 16-objective QA)

Most long-horizon agents append all past turns, causing unbounded context growth and degraded reasoning on out-of-distribution input lengths.

On a 16-objective multi-hop QA task, Qwen2.5-14B-Instruct needs 38.4×10² peak tokens and still drops to EM 0.567, wasting memory, compute, and accuracy.

HOW IT WORKS

MEM1: learning 1-step integrated reasoning and consolidation

MEM1 centers on a compact Internal State (IS), Masked Trajectory for Policy Optimization, 2D Attention Mask, and Multi-Objective Task Design to fuse memory and reasoning.

Think of MEM1 like a human using a single evolving notebook page: each turn rewrites the page with only the essentials, discarding clutter.

This unified consolidation lets MEM1 maintain near-constant context while still chaining many environment interactions, something a plain context window cannot sustain.
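The consolidation loop above can be contrasted with plain context appending in a few lines. This is an illustrative sketch, not the paper's API: `generate` stands in for any LLM call, and the tag format is an assumption.

```python
# Sketch: plain context appending vs. MEM1-style per-turn consolidation.
# `generate` stands in for any LLM call; all names are illustrative.

def append_agent_turn(context: list[str], info: str, generate) -> list[str]:
    """Baseline: keep every past turn, so the prompt grows without bound."""
    context = context + [info]
    context.append(generate(context))  # reason over the whole history
    return context

def mem1_agent_turn(internal_state: str, info: str, generate) -> str:
    """MEM1-style: each turn rewrites a single internal state <IS_t> that
    merges the previous state with the newest observation, so the prompt
    fed to the model stays roughly constant in size."""
    prompt = f"<IS>{internal_state}</IS>\n<info>{info}</info>"
    return generate(prompt)  # the new consolidated state <IS_{t+1}>
```

The baseline's context gains at least two entries per turn, while the MEM1-style turn always feeds the model exactly one internal state plus the newest observation.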

DIAGRAM

MEM1 rollout and masked trajectory during RL training

This diagram shows how MEM1 rolls out multi-turn interactions, prunes context each turn, and reconstructs a masked trajectory for PPO updates.

DIAGRAM

MEM1 evaluation pipeline on multi-objective QA and WebShop

This diagram shows how MEM1 is trained on 2-objective QA and WebShop, then evaluated on longer-horizon QA and web navigation benchmarks.

PROCESS

How MEM1 Handles a Multi-Objective Multi-Hop QA Session

  1. Memory as Part of Reasoning

    MEM1 initializes an Internal State (IS) and, at each turn, generates a new <IS_t> that summarizes past information and plans the next actions.

  2. Multi-Objective Task Design

    MEM1 receives composite questions built from interleaved HotpotQA and Natural Questions items, forcing multiple queries and integrated reasoning across objectives.

  3. Masked Trajectory for Policy Optimization

    MEM1 reconstructs a stitched trajectory of <IS_t>, <query_t>, <info_t> tuples and applies a 2D Attention Mask so attention respects the consolidated memory at each token.

  4. Reward Assignment and Policy Update

    MEM1 computes rewards from exact match or the environment reward, applies a KL penalty, and runs PPO updates on the masked trajectories for both actor and critic.
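The masked-trajectory idea in step 3 can be sketched as a block-wise causal mask. This is a toy, not the paper's exact layout: each segment stands for one turn's stitched tokens, and the key property shown is that attention to already-consolidated history is blocked while attention within the turn stays causal.

```python
import numpy as np

def build_2d_mask(segment_lengths: list[int]) -> np.ndarray:
    """Toy 2D attention mask over a stitched trajectory.

    Each segment represents one turn's (<IS_t>, <query_t>, <info_t>) tokens.
    Tokens attend causally within their own segment only, mirroring the
    pruned context the policy actually saw at rollout time; attention to
    stale, already-consolidated tokens in earlier segments is blocked.
    """
    total = sum(segment_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in segment_lengths:
        end = start + length
        # causal (lower-triangular) attention inside the segment
        mask[start:end, start:end] = np.tril(np.ones((length, length), dtype=bool))
        start = end
    return mask
```

Stitching all turns into one sequence with such a mask lets a single forward pass score every turn's tokens for PPO while reproducing the per-turn pruned context.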

KEY CONTRIBUTIONS


  • Learning to Synergize Memory and Reasoning

    MEM1 trains Internal State (IS) updates so reasoning traces double as working memory, enabling near-constant context even on 16-objective tasks with 10.4×10² peak tokens.

  • Masked Trajectory for Policy Optimization

    MEM1 introduces a 2D Attention Mask over stitched trajectories, ensuring PPO advantages and KL penalties respect the per-turn consolidated memory.

  • Multi-Objective Task Design for Long Horizons

    MEM1 composes existing QA datasets into N-objective tasks, training on 2-objective data yet generalizing to 16-objective QA with EM 1.97.
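The N-objective composition can be sketched as follows. The function name, field names, and prompt wording are illustrative assumptions, not the paper's code; the point is that N single QA items (e.g. from HotpotQA and Natural Questions) are bundled into one composite task.

```python
import random

def compose_multi_objective(pool: list[dict], n: int, seed: int = 0) -> dict:
    """Compose an N-objective task by sampling N single QA items from a
    pool and joining them into one composite prompt. The agent must
    answer all N objectives, forcing multiple queries per episode.
    (Illustrative sketch; field names are assumptions.)"""
    rng = random.Random(seed)
    picked = rng.sample(pool, n)
    prompt = "Answer all of the following questions:\n" + "\n".join(
        f"{i + 1}. {item['question']}" for i, item in enumerate(picked)
    )
    answers = [item["answer"] for item in picked]
    return {"prompt": prompt, "answers": answers}
```

Training uses n=2; at evaluation time the same recipe with n up to 16 produces the longer-horizon tasks, which is what makes the 2-to-16-objective generalization measurable.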

RESULTS

By the Numbers

EM on 16-objective QA

1.97

+1.403 over Qwen2.5-14B-Instruct (0.567)

Peak Token Usage (16-objective)

10.4×10² tokens

27.1% of Qwen2.5-14B-Instruct (38.4×10²)

Inference Time (16-objective)

8.70 s

29.3% of Qwen2.5-14B-Instruct (29.7 s)

WebShop Avg Final Reward

70.87

+7.27 over AgentLM-7B (63.60) with 2.8× lower peak tokens

On multi-objective multi-hop QA built from HotpotQA and Natural Questions, MEM1-7B is trained on 2-objective tasks and evaluated on up to 16 objectives. The 16-objective EM of 1.97 with 10.4×10² peak tokens shows MEM1 can scale horizon length while keeping memory and latency low.


BENCHMARK

Multi-objective 16-Objective QA: Exact Match Comparison

Exact Match (EM) on 16-objective multi-hop QA compositions.

KEY INSIGHT

The Counterintuitive Finding

MEM1-7B, trained only on 2-objective QA, reaches EM 1.97 on 16-objective QA, beating Qwen2.5-14B-Instruct at EM 0.567.

This is surprising because longer horizons usually hurt smaller models, yet MEM1’s constant-memory Internal State lets a 7B model surpass a 14B baseline as objectives increase.

WHY IT MATTERS

What this unlocks for the field

MEM1 shows that long-horizon agents can learn to compress interaction history into a single evolving Internal State without external memory modules.

Builders can now design research or web agents that handle 16-objective workflows with roughly constant context size, reducing GPU memory and inference time while retaining accuracy.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
