LightMem: Lightweight and Efficient Memory-Augmented Generation

Authors: Jizhan Fang, Xinle Deng, Haoming Xu, et al.

2025

TL;DR

LightMem uses sensory pre-compression, topic-aware STM, and sleep-time LTM updates to cut LongMemEval token usage by up to 38× while boosting accuracy by up to 7.67% over A-MEM.



THE PROBLEM

LLM agents waste tokens and miss context in long interactions

LightMem targets scenarios where existing memory systems inflate token usage without proportional gains, e.g., A-MEM consuming up to 2,903.78k tokens on LongMemEval-S.

In such long multi-turn dialogues, GPT-4o-mini with A-MEM reaches 62.60% accuracy (ACC) but requires 986.55 API calls and 5,132.06 seconds of runtime, hurting latency, efficiency, and coherence.

HOW IT WORKS

LightMem — sensory compression, topic-aware STM, and sleep-time LTM

LightMem combines Light1: Cognitive-Inspired Sensory Memory, Topic Segmentation Submodule, Light2: Topic-Aware Short-Term Memory, and Light3: Long-Term Memory with Sleep-Time Update into a three-stage memory pipeline.

You can think of LightMem like a brain: Light1 is sensory filtering, Light2 is working memory grouping by topic, and Light3 is deep consolidation during sleep.

This design lets LightMem compress and reorganize history beyond a plain context window, enabling topic-structured summaries and parallel offline updates that conventional sequential f_update pipelines cannot achieve.

DIAGRAM

Online interaction and memory update flow in LightMem

This diagram shows how LightMem processes each dialogue turn online, performs soft LTM inserts, and later runs offline sleep-time updates.

DIAGRAM

Evaluation setup for LightMem on LongMemEval and LoCoMo

This diagram shows how LightMem is evaluated across datasets, backbones, and baselines, and where summary and update costs are measured.

PROCESS

How LightMem Handles Dialogue Turns Fed Incrementally Within a Session

  1. Pre-Compressing Submodule

    LightMem uses the Pre-Compressing Submodule in Light1 with LLMLingua-2 to retain only high-information tokens, scored by retention probabilities and cross-entropy.

  2. Topic Segmentation Submodule

    LightMem applies the Topic Segmentation Submodule, combining attention-based boundaries B_1 and similarity-based boundaries B_2 to form topic segments B in the sensory buffer.

  3. Light2: Topic-Aware Short-Term Memory

    LightMem groups topic segments into the STM buffer; when the token capacity t_h is reached, it calls f_sum to summarize each segment into a summary sum_i with embedding e_i.

  4. Light3: Long-Term Memory with Sleep-Time Update

    LightMem inserts new entries softly into LTM at test time, then later builds update queues Q(e_i) and runs offline, parallel f_update operations during sleep time.
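The four steps above can be sketched as a toy pipeline. Everything here is an illustrative assumption, not the paper's implementation: the "information" scorer fakes LLMLingua-2 with token length, topic boundaries use word overlap instead of attention-based B_1 and similarity-based B_2, and f_sum/f_update are stand-in string operations.

```python
from dataclasses import dataclass, field

def compress(turn: str, keep_ratio: float = 0.5) -> str:
    """Light1 stand-in: keep the highest-'information' tokens.
    Token length fakes the retention probability LLMLingua-2 would predict."""
    tokens = turn.split()
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: -len(tokens[i]))
    keep = sorted(ranked[:k])            # preserve original token order
    return " ".join(tokens[i] for i in keep)

def same_topic(prev: str, cur: str) -> bool:
    """Topic-segmentation stand-in: Jaccard word overlap instead of the
    attention-based (B_1) and embedding-similarity (B_2) boundaries."""
    a, b = set(prev.lower().split()), set(cur.lower().split())
    return len(a & b) / max(1, len(a | b)) > 0.2

@dataclass
class LightMemToy:
    th: int = 40                               # STM token capacity t_h
    stm: list = field(default_factory=list)    # current topic segment
    ltm: list = field(default_factory=list)    # soft-inserted entries
    queue: list = field(default_factory=list)  # sleep-time update queue

    def ingest(self, turn: str) -> None:
        t = compress(turn)                     # step 1: pre-compress
        if self.stm and not same_topic(self.stm[-1], t):
            self._flush()                      # step 2: topic boundary
        self.stm.append(t)
        if sum(len(x.split()) for x in self.stm) >= self.th:
            self._flush()                      # step 3: capacity reached

    def _flush(self) -> None:
        if self.stm:
            summary = " | ".join(self.stm)     # f_sum stand-in
            self.ltm.append(summary)           # step 4: soft online insert
            self.queue.append(summary)         # deferred consolidation work
            self.stm = []

    def sleep_time_update(self) -> None:
        """Offline f_update stand-in: drain the queue in batch; the real
        system merges and rewrites LTM entries here, in parallel."""
        while self.queue:
            self.queue.pop()

mem = LightMemToy()
for turn in ["book a flight to tokyo next friday",
             "the flight should depart in the morning",
             "also recommend some good ramen restaurants"]:
    mem.ingest(turn)
mem.sleep_time_update()
```

The key structural point the sketch preserves is that nothing expensive happens per turn: summaries fire only at topic boundaries or capacity, and consolidation is deferred entirely to `sleep_time_update`.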

KEY CONTRIBUTIONS

Key Contributions

  • Light1: Cognitive-Inspired Sensory Memory

    LightMem introduces the Light1 Sensory Memory Module with a Pre-Compressing Submodule and a Topic Segmentation Submodule, using LLMLingua-2 to cut prompt tokens by up to 50–60% without hurting QA accuracy.

  • Light2: Topic-Aware Short-Term Memory

    LightMem designs Light2 STM to group turns into topic segments, reducing summarization calls from O(N) to O(N · r · T / t_h) and keeping segmentation accuracy above 80% across compression ratios.

  • Light3: Long-Term Memory with Sleep-Time Update

    LightMem proposes Light3 with soft online inserts and offline parallel updates, achieving up to a 105.9× online token reduction and 159.4× fewer API calls on LongMemEval-S with GPT-4o-mini.
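The O(N) → O(N · r · T / t_h) reduction in Light2 is easy to sanity-check with a back-of-envelope calculation. The numbers below (session length N, tokens per turn T, retention ratio r, STM capacity t_h) are illustrative assumptions, not the paper's measured values:

```python
# Summarization-call reduction from topic-aware batching (toy numbers).
N = 500       # dialogue turns in a session          (assumed)
T = 200       # average tokens per turn              (assumed)
r = 0.4       # retention ratio after pre-compression (assumed)
th = 2000     # STM token capacity t_h before f_sum fires (assumed)

calls_naive = N                     # O(N): summarize after every turn
calls_topic_aware = N * r * T / th  # O(N*r*T/th): only when STM fills

print(calls_naive, calls_topic_aware)  # 500 vs 20.0, a 25x reduction
```

With these numbers the STM fills once every 25 compressed turns, so summarization cost drops by the same factor; stronger pre-compression (smaller r) or a larger t_h widens the gap further.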

RESULTS

By the Numbers

  • ACC (%) on LongMemEval-S (Qwen): 70.20% (+5.00 over A-MEM)

  • Total Tokens (k) on LongMemEval-S (GPT): 28.25k (−1,577.56k vs A-MEM)

  • API Calls on LongMemEval-S (GPT): 18.43 (−968.12 vs A-MEM)

  • ACC (%) on LoCoMo (GPT): 72.99% (+8.83 over A-MEM)

On LongMemEval-S, which tests long-horizon conversational QA, LightMem with Qwen3-30B-A3B-Instruct-2507 reaches 70.20% ACC versus 65.20% for A-MEM while using far fewer tokens and API calls. On LoCoMo, which stresses long-context memory and reasoning, LightMem with GPT-4o-mini attains 72.99% ACC compared to 64.16% for A-MEM, confirming that LightMem’s lightweight memory pipeline improves both effectiveness and efficiency.


BENCHMARK

Effectiveness comparison on LongMemEval-S with GPT-4o-mini

ACC (%) on LongMemEval-S for LightMem and key memory baselines using GPT-4o-mini.

BENCHMARK

Effectiveness comparison on LoCoMo with GPT-4o-mini

ACC (%) on LoCoMo for LightMem and key memory baselines using GPT-4o-mini.

KEY INSIGHT

The Counterintuitive Finding

LightMem shows that compressing prompts by 50–80% using LLMLingua-2 keeps QA accuracy comparable to uncompressed prompts on LongMemEval.

This is surprising because many practitioners assume aggressive token removal will cripple reasoning, yet LightMem’s Pre-Compressing Submodule preserves enough semantic signal for stable performance.
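To make the retention-probability mechanism concrete, here is a minimal, self-contained sketch of probability-based token pruning. The frequency table and the surprisal-style scoring are fabricated for illustration; LLMLingua-2 instead uses a trained classifier to predict each token's keep probability.

```python
import math

# Toy corpus frequencies: common function words score high (droppable),
# rare content words score low (informative). All values are assumptions.
FREQ = {"the": 0.9, "a": 0.85, "to": 0.8, "please": 0.6, "on": 0.5,
        "flight": 0.05, "tokyo": 0.02, "friday": 0.04, "morning": 0.07}

def keep_prob(token: str) -> float:
    """Rarer tokens carry more information: keep-probability ~ surprisal."""
    p = FREQ.get(token.lower(), 0.1)
    return min(1.0, -math.log(p) / 4.0)

def compress(prompt: str, rate: float = 0.5) -> str:
    """Keep the top `rate` fraction of tokens by keep-probability,
    preserving their original order."""
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * rate))
    ranked = sorted(range(len(tokens)), key=lambda i: -keep_prob(tokens[i]))
    kept = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in kept)

out = compress("please book a flight to tokyo on friday morning", rate=0.5)
print(out)  # -> "flight tokyo friday morning"
```

Even at 50% compression the surviving tokens are exactly the ones a QA model needs to answer "where and when is the flight?", which is the intuition behind the finding: function words and filler dominate the pruned half.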

WHY IT MATTERS

What this unlocks for the field

LightMem enables LLM agents to maintain rich, structured long-term memories while cutting online token and API costs by over two orders of magnitude in some settings.

Builders can now deploy long-horizon conversational agents that stay coherent over many sessions without prohibitive latency or cost, and can run heavy consolidation asynchronously during sleep-time windows.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
