Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

Authors: Yanchen Wu, Tenghui Lin, Yingli Zhou et al.

2026

TL;DR

Memory in the LLM Era uses a four-stage modular memory framework plus a new hierarchical tree–tier design to reach 38.79 F1 on LONGMEMEVAL, +1.87 over MemTree.



THE PROBLEM

Long-horizon agents lose critical facts in overflowing contexts

LLM agents face context overflow, where naive long-context prompting becomes token-intensive, high-latency, and unreliable for long-term conversations.

When LOCOMO conversations average 588.2 dialogue turns and LONGMEMEVAL histories reach about 115,000 tokens, agents fail at multi-session reasoning and temporal consistency.

HOW IT WORKS

A unified four-stage memory framework

Memory in the LLM Era centers on four modules: Information Extraction, Memory Management, Memory Storage, and Information Retrieval. Together, these four stages are expressive enough to describe the surveyed agent memory methods.

Think of Information Extraction and Memory Management as a cognitive front-end, while Memory Storage and Information Retrieval act like long-term disk plus an intelligent index.

This modular design lets Memory in the LLM Era mix and match components, enabling hierarchical, tree-based, and rule-driven memory behaviors beyond a plain context window.
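The four-stage split can be sketched as plug-in components. The class names below are hypothetical stand-ins (a verbatim-archiving extractor, a dedup "manager", a flat store, and a keyword retriever), not the paper's implementation; each slot could equally hold a graph-based extractor, a tree store, or an LLM-assisted retriever:

```python
# Minimal sketch of the four-stage modular memory loop.
# All class names here are illustrative, not the paper's.
from dataclasses import dataclass, field

@dataclass
class MemoryItem:
    text: str
    keywords: set = field(default_factory=set)

class DirectArchiveExtractor:
    """Information Extraction: archive each turn verbatim as one memory."""
    def extract(self, turns):
        return [MemoryItem(t, set(t.lower().split())) for t in turns]

class DedupManager:
    """Memory Management: drop exact duplicates (a stand-in for consolidation)."""
    def consolidate(self, items):
        seen, kept = set(), []
        for m in items:
            if m.text not in seen:
                seen.add(m.text)
                kept.append(m)
        return kept

class FlatStore:
    """Memory Storage: a flat list; vector, graph, or tree stores slot in here."""
    def __init__(self):
        self.items = []
    def add(self, items):
        self.items.extend(items)

class LexicalRetriever:
    """Information Retrieval: rank stored memories by keyword overlap."""
    def retrieve(self, store, query, k=2):
        q = set(query.lower().split())
        return sorted(store.items, key=lambda m: -len(m.keywords & q))[:k]

class MemoryAgent:
    """Wires the four interchangeable stages into one write/read loop."""
    def __init__(self, extractor, manager, store, retriever):
        self.extractor, self.manager = extractor, manager
        self.store, self.retriever = store, retriever
    def observe(self, turns):
        self.store.add(self.manager.consolidate(self.extractor.extract(turns)))
    def context_for(self, query, k=2):
        return [m.text for m in self.retriever.retrieve(self.store, query, k)]
```

Swapping any one class for another implementation of the same stage changes the memory behavior without touching the other three, which is the point of the modular framing.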

DIAGRAM

Memory in the LLM Era query-time retrieval pipeline

This diagram shows how Memory in the LLM Era retrieves and injects relevant memories when a new query arrives.
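A minimal version of that injection step might look like the following sketch, which packs ranked memories under a token budget before prepending them to the query. The function name is hypothetical, and whitespace tokens stand in for a real tokenizer:

```python
# Hypothetical sketch of query-time context injection: retrieved memories
# are packed greedily under a token budget and prepended to the query.
def inject_memories(memories: list, query: str, budget: int = 450) -> str:
    picked, used = [], 0
    for m in memories:                  # assumed already ranked by relevance
        cost = len(m.split())           # whitespace tokens as a crude proxy
        if used + cost > budget:
            break
        picked.append(m)
        used += cost
    memory_block = "\n".join(f"- {m}" for m in picked)
    return f"Relevant memories:\n{memory_block}\n\nQuestion: {query}"
```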

DIAGRAM

Evaluation pipeline across LOCOMO and LONGMEMEVAL

This diagram shows how Memory in the LLM Era evaluates modular memory methods on LOCOMO and LONGMEMEVAL with shared settings.

PROCESS

How Memory in the LLM Era Handles a Long-term Conversation Session

  1. Information Extraction

    Memory in the LLM Era first applies Information Extraction, using direct archiving, summarization-based extraction, or graph-based extraction to convert messages into structured memories.

  2. Memory Management

    Then Memory in the LLM Era runs Memory Management to connect, integrate, transform, update, and filter memories, mirroring human-like consolidation and forgetting.

  3. Memory Storage

    Next Memory in the LLM Era organizes processed memories into flat or hierarchical Memory Storage using vector-based, graph-based, or tree-based structures.

  4. Information Retrieval

    Finally, Memory in the LLM Era uses Information Retrieval with lexical-based, vector-based, structure-based, or LLM-assisted retrieval to assemble context for the LLM.
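For the storage and retrieval stages, a hierarchical tree layout (the family the paper's new method belongs to) can be sketched as leaves holding raw memories and internal nodes holding summaries, with retrieval descending greedily toward the best-matching branch. The summarizer here is a toy string join and all names are hypothetical:

```python
# Toy sketch of tree-based memory storage and top-down retrieval.
class TreeNode:
    def __init__(self, text, children=None):
        self.text = text
        self.children = children or []

def summarize(texts):
    # Stand-in for an LLM summarizer: join the children and truncate.
    return " / ".join(texts)[:80]

def build_tree(memories, fanout=2):
    """Build the tree bottom-up: group nodes and summarize each group."""
    nodes = [TreeNode(m) for m in memories]
    while len(nodes) > 1:
        nodes = [TreeNode(summarize([c.text for c in group]), list(group))
                 for group in (nodes[i:i + fanout]
                               for i in range(0, len(nodes), fanout))]
    return nodes[0]

def descend(root, query_words):
    """Greedy structure-based retrieval: follow the best-matching child."""
    node = root
    while node.children:
        node = max(node.children,
                   key=lambda c: len(set(c.text.lower().split()) & query_words))
    return node.text
```

The greedy descent reads only one path of summaries instead of every stored memory, which is how tree indices keep query-time token costs low.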

KEY CONTRIBUTIONS

Key Contributions

  • Unified modular framework for agent memory

    Memory in the LLM Era formalizes agent memory as four components—Information Extraction, Memory Management, Memory Storage, and Information Retrieval—covering 10 methods like MemGPT, MemoryOS, and MemTree.

  • Comprehensive experimental study on LOCOMO and LONGMEMEVAL

    Memory in the LLM Era reimplements 10 memory methods and evaluates them on LOCOMO and LONGMEMEVAL, analyzing F1, BLEU-1, token costs, context scalability, and position sensitivity.

  • New agent memory method with state-of-the-art performance

    Memory in the LLM Era designs a new tree plus three-tier memory variant that reaches 38.79 F1 on LONGMEMEVAL and 43.87 F1 on LOCOMO with Qwen2.5-72B.

RESULTS

By the Numbers

Overall F1 LONGMEMEVAL

38.79

+1.87 over MemTree

Overall F1 LOCOMO Qwen2.5-72B

43.87

+1.08 over MemOS

Information Extraction assistant F1

69.34

+11.55 over MemTree

Average token costs per dialogue

<450 tokens

Lower than MemTree and MemOS in Figure 10

Memory in the LLM Era is evaluated on LONGMEMEVAL and LOCOMO, which test long-term conversational memory, multi-session reasoning, and temporal reasoning. The 38.79 F1 and 43.87 F1 results show that the modular tree–tier design improves accuracy while keeping token usage low.


BENCHMARK

Overall F1 on LONGMEMEVAL with Qwen2.5-7B-Instruct

Overall F1 scores comparing Memory in the LLM Era and strong baselines on LONGMEMEVAL.

BENCHMARK

Overall F1 on LOCOMO with Qwen2.5-72B-Instruct

Overall F1 scores comparing Memory in the LLM Era and strong baselines on LOCOMO.

KEY INSIGHT

The Counterintuitive Finding

Memory in the LLM Era shows that coarser-grained extraction, like MemoryOS segment summaries, can reduce token costs without hurting F1 on LONGMEMEVAL.

This is surprising because many assume finer-grained turn-level memories are always better, but the results show that carefully chosen granularity plus strong LLM reasoning can be more efficient.
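The trade-off can be illustrated with a toy counter: turn-level extraction stores one memory per turn, while segment-level extraction stores one short summary per segment, shrinking the tokens later injected at query time. The numbers below are illustrative, not the paper's measurements:

```python
# Toy illustration of the extraction-granularity trade-off.
def turn_level(turns):
    return list(turns)                  # one memory per turn

def segment_level(turns, seg=4):
    # One summary per segment; the "summarizer" here just keeps
    # the segment's first 8 words as a stand-in for an LLM summary.
    return [" ".join(" ".join(turns[i:i + seg]).split()[:8])
            for i in range(0, len(turns), seg)]

def token_cost(memories):
    return sum(len(m.split()) for m in memories)
```

For eight five-word turns, the turn-level store costs 40 tokens while the segment-level store costs 16, so coarser granularity pays off whenever the summaries still answer the downstream questions.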

WHY IT MATTERS

What this unlocks for the field

Memory in the LLM Era enables practitioners to reason about agent memory as interchangeable modules, not monolithic designs tied to a single system.

Builders can now systematically combine extraction, management, storage, and retrieval choices—like tree indices with three-tier storage—to design memory-augmented agents tuned for accuracy, cost, and robustness.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

· 2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

· 2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
