LightMem: Lightweight and Efficient Memory-Augmented Generation

Authors: Jizhan Fang, Xinle Deng, Haoming Xu, et al.

2025

TL;DR

LightMem uses sensory pre-compression, topic-aware STM, and sleep-time LTM updates to cut LongMemEval token usage by up to 38× while boosting accuracy by up to 7.67% over A-MEM.



THE PROBLEM

LLM agents waste tokens and miss context in long interactions

LightMem targets scenarios where existing memory systems inflate token usage without proportional gains, e.g., A-MEM consuming up to 2,903.78k tokens on LongMemEval-S.

In such long multi-turn dialogues, GPT-4o-mini with A-MEM reaches 62.60% accuracy (ACC) but requires 986.55 API calls and 5,132.06 seconds of runtime, hurting latency, efficiency, and coherence.

HOW IT WORKS

LightMem — sensory compression, topic-aware STM, and sleep-time LTM

LightMem combines Light1: Cognitive-Inspired Sensory Memory, Topic Segmentation Submodule, Light2: Topic-Aware Short-Term Memory, and Light3: Long-Term Memory with Sleep-Time Update into a three-stage memory pipeline.

You can think of LightMem like a brain: Light1 is sensory filtering, Light2 is working memory grouping by topic, and Light3 is deep consolidation during sleep.

This design lets LightMem compress and reorganize history beyond a plain context window, enabling topic-structured summaries and parallel offline updates that conventional sequential f_update pipelines cannot achieve.

DIAGRAM

Online interaction and memory update flow in LightMem

This diagram shows how LightMem processes each dialogue turn online, performs soft LTM inserts, and later runs offline sleep-time updates.

DIAGRAM

Evaluation setup for LightMem on LongMemEval and LoCoMo

This diagram shows how LightMem is evaluated across datasets, backbones, and baselines, and where summary and update costs are measured.

PROCESS

How LightMem Handles Dialogue Turns Fed Incrementally Within a Session

  1. Pre-Compressing Submodule

    LightMem uses the Pre-Compressing Submodule in Light1 with LLMLingua-2 to retain only high-information tokens, scored by retention probabilities and cross-entropy.

  2. Topic Segmentation Submodule

    LightMem applies the Topic Segmentation Submodule, combining attention-based boundaries B_1 and similarity-based boundaries B_2 to form topic segments B in the sensory buffer.

  3. Light2: Topic-Aware Short-Term Memory

    LightMem groups topic segments into the STM buffer; when the token capacity t_h is reached, it calls f_sum to summarize each segment into a summary sum_i with embedding e_i.

  4. Light3: Long-Term Memory with Sleep-Time Update

    LightMem inserts new entries softly into LTM at test time, then later builds update queues Q(e_i) and runs offline, parallel f_update operations during sleep time.
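The four steps above can be sketched as a toy pipeline. Everything here is an illustrative assumption, not the paper's implementation: the "information" scorer fakes LLMLingua-2 with token length, topic boundaries use word overlap instead of attention-based B_1 and similarity-based B_2, and f_sum/f_update are stand-in string operations.

```python
from dataclasses import dataclass, field

def compress(turn: str, keep_ratio: float = 0.5) -> str:
    """Light1 stand-in: keep the highest-'information' tokens.
    Token length fakes the retention probability LLMLingua-2 would predict."""
    tokens = turn.split()
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: -len(tokens[i]))
    keep = sorted(ranked[:k])            # preserve original token order
    return " ".join(tokens[i] for i in keep)

def same_topic(prev: str, cur: str) -> bool:
    """Topic-segmentation stand-in: Jaccard word overlap instead of the
    attention-based (B_1) and embedding-similarity (B_2) boundaries."""
    a, b = set(prev.lower().split()), set(cur.lower().split())
    return len(a & b) / max(1, len(a | b)) > 0.2

@dataclass
class LightMemToy:
    th: int = 40                               # STM token capacity t_h
    stm: list = field(default_factory=list)    # current topic segment
    ltm: list = field(default_factory=list)    # soft-inserted entries
    queue: list = field(default_factory=list)  # sleep-time update queue

    def ingest(self, turn: str) -> None:
        t = compress(turn)                     # step 1: pre-compress
        if self.stm and not same_topic(self.stm[-1], t):
            self._flush()                      # step 2: topic boundary
        self.stm.append(t)
        if sum(len(x.split()) for x in self.stm) >= self.th:
            self._flush()                      # step 3: capacity reached

    def _flush(self) -> None:
        if self.stm:
            summary = " | ".join(self.stm)     # f_sum stand-in
            self.ltm.append(summary)           # step 4: soft online insert
            self.queue.append(summary)         # deferred consolidation work
            self.stm = []

    def sleep_time_update(self) -> None:
        """Offline f_update stand-in: drain the queue in batch; the real
        system merges and rewrites LTM entries here, in parallel."""
        while self.queue:
            self.queue.pop()

mem = LightMemToy()
for turn in ["book a flight to tokyo next friday",
             "the flight should depart in the morning",
             "also recommend some good ramen restaurants"]:
    mem.ingest(turn)
mem.sleep_time_update()
```

The key structural point the sketch preserves is that nothing expensive happens per turn: summaries fire only at topic boundaries or capacity, and consolidation is deferred entirely to `sleep_time_update`.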

KEY CONTRIBUTIONS

Key Contributions

  • Light1: Cognitive-Inspired Sensory Memory

    LightMem introduces the Light1 Sensory Memory Module with a Pre-Compressing Submodule and a Topic Segmentation Submodule, using LLMLingua-2 to cut prompt tokens by up to 50–60% without hurting QA accuracy.

  • Light2: Topic-Aware Short-Term Memory

    LightMem designs Light2 STM to group turns into topic segments, reducing summarization calls from O(N) to O(N · r · T / t_h) and keeping segmentation accuracy above 80% across compression ratios.

  • Light3: Long-Term Memory with Sleep-Time Update

    LightMem proposes Light3 with soft online inserts and offline parallel updates, achieving up to a 105.9× online token reduction and 159.4× fewer API calls on LongMemEval-S with GPT-4o-mini.
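The O(N) → O(N · r · T / t_h) reduction in Light2 is easy to sanity-check with a back-of-envelope calculation. The numbers below (session length N, tokens per turn T, retention ratio r, STM capacity t_h) are illustrative assumptions, not the paper's measured values:

```python
# Summarization-call reduction from topic-aware batching (toy numbers).
N = 500       # dialogue turns in a session          (assumed)
T = 200       # average tokens per turn              (assumed)
r = 0.4       # retention ratio after pre-compression (assumed)
th = 2000     # STM token capacity t_h before f_sum fires (assumed)

calls_naive = N                     # O(N): summarize after every turn
calls_topic_aware = N * r * T / th  # O(N*r*T/th): only when STM fills

print(calls_naive, calls_topic_aware)  # 500 vs 20.0, a 25x reduction
```

With these numbers the STM fills once every 25 compressed turns, so summarization cost drops by the same factor; stronger pre-compression (smaller r) or a larger t_h widens the gap further.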

RESULTS

By the Numbers

  • ACC (%) on LongMemEval-S (Qwen): 70.20% (+5.00 over A-MEM)

  • Total Tokens (k) on LongMemEval-S (GPT): 28.25k (−1,577.56k vs A-MEM)

  • API Calls on LongMemEval-S (GPT): 18.43 (−968.12 vs A-MEM)

  • ACC (%) on LoCoMo (GPT): 72.99% (+8.83 over A-MEM)

On LongMemEval-S, which tests long-horizon conversational QA, LightMem with Qwen3-30B-A3B-Instruct-2507 reaches 70.20% ACC versus 65.20% for A-MEM while using far fewer tokens and API calls. On LoCoMo, which stresses long-context memory and reasoning, LightMem with GPT-4o-mini attains 72.99% ACC compared to 64.16% for A-MEM, confirming that LightMem’s lightweight memory pipeline improves both effectiveness and efficiency.


BENCHMARK

Effectiveness comparison on LongMemEval-S with GPT-4o-mini

ACC (%) on LongMemEval-S for LightMem and key memory baselines using GPT-4o-mini.

BENCHMARK

Effectiveness comparison on LoCoMo with GPT-4o-mini

ACC (%) on LoCoMo for LightMem and key memory baselines using GPT-4o-mini.

KEY INSIGHT

The Counterintuitive Finding

LightMem shows that compressing prompts by 50–80% using LLMLingua-2 keeps QA accuracy comparable to uncompressed prompts on LongMemEval.

This is surprising because many practitioners assume aggressive token removal will cripple reasoning, yet LightMem’s Pre-Compressing Submodule preserves enough semantic signal for stable performance.
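To make the retention-probability mechanism concrete, here is a minimal, self-contained sketch of probability-based token pruning. The frequency table and the surprisal-style scoring are fabricated for illustration; LLMLingua-2 instead uses a trained classifier to predict each token's keep probability.

```python
import math

# Toy corpus frequencies: common function words score high (droppable),
# rare content words score low (informative). All values are assumptions.
FREQ = {"the": 0.9, "a": 0.85, "to": 0.8, "please": 0.6, "on": 0.5,
        "flight": 0.05, "tokyo": 0.02, "friday": 0.04, "morning": 0.07}

def keep_prob(token: str) -> float:
    """Rarer tokens carry more information: keep-probability ~ surprisal."""
    p = FREQ.get(token.lower(), 0.1)
    return min(1.0, -math.log(p) / 4.0)

def compress(prompt: str, rate: float = 0.5) -> str:
    """Keep the top `rate` fraction of tokens by keep-probability,
    preserving their original order."""
    tokens = prompt.split()
    budget = max(1, int(len(tokens) * rate))
    ranked = sorted(range(len(tokens)), key=lambda i: -keep_prob(tokens[i]))
    kept = sorted(ranked[:budget])
    return " ".join(tokens[i] for i in kept)

out = compress("please book a flight to tokyo on friday morning", rate=0.5)
print(out)  # -> "flight tokyo friday morning"
```

Even at 50% compression the surviving tokens are exactly the ones a QA model needs to answer "where and when is the flight?", which is the intuition behind the finding: function words and filler dominate the pruned half.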

WHY IT MATTERS

What this unlocks for the field

LightMem enables LLM agents to maintain rich, structured long-term memories while cutting online token and API costs by over two orders of magnitude in some settings.

Builders can now deploy long-horizon conversational agents that stay coherent over many sessions without prohibitive latency or cost, and can run heavy consolidation asynchronously during sleep-time windows.


Related papers

Survey · Agent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv · 2026

Anatomy of Agentic Memory organizes agentic memory into four structures using components like Lightweight Semantic Memory, Entity-Centric and Personalized Memory, Episodic and Reflective Memory, and Structured and Hierarchical Memory. Anatomy of Agentic Memory then reports comparative results such as Nemori’s 0.781 semantic judge score on LoCoMo versus SimpleMem’s 0.298, and latency differences like 1.129s for Nemori versus 32.372s for MemoryOS.

Survey

Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Zexue He, Yu Wang et al.

2026

MEMORYARENA orchestrates Memory-Agent-Environment Loops, Multi-Session Working Flow, Bundled Web Shopping, Group Travel Planning, and Progressive Web Search to stress-test how agents store and reuse information across sessions. MEMORYARENA’s main result is that agents with near-saturated scores on long-context benchmarks like LoCoMo still obtain Task Success Rates as low as 0.00–0.12 across its four environments.

Memory Architecture · Survey

Multi-Agent Memory from a Computer Architecture Perspective: Visions and Challenges Ahead

Zhongming Yu, Naicheng Yu et al.

arXiv · 2026

Multi-Agent Memory Architecture organizes Agent IO Layer, Agent Cache Layer, Agent Memory Layer, Agent Cache Sharing, and Agent Memory Access Protocol into a computer-architecture-style design for LLM agents. Multi-Agent Memory Architecture’s main result is a conceptual unification of shared and distributed memory plus a research agenda for multi-agent memory consistency instead of benchmark gains.
