Lightweight LLM Agent Memory with Small Language Models

Authors: Jiaquan Zhang, Chaoning Zhang, Shuxu Chen et al.

2026

TL;DR

LightMem uses SLM-based two-stage retrieval and STM/MTM/LTM stores to gain about +2.5 average F1 on LoCoMo at 83 ms median retrieval latency.



THE PROBLEM

LLM agents trade accuracy for latency in long-term memory

Retrieval-based external memory has low online overhead but suffers from unstable accuracy due to limited query construction and candidate filtering.

LLM-driven memory operations improve answer correctness but require repeated large-model calls, accumulating latency over long interactions and harming user experience.

HOW IT WORKS

LightMem — SLM-driven online memory with STM, MTM, and LTM

LightMem uses an SLM-1 Controller, SLM-2 Selector, SLM-3 Writer, and structured STM/MTM/LTM stores to decouple online querying from offline consolidation.

You can think of LightMem like RAM and disk: STM is working RAM, MTM is a user-specific cache, and LTM is a compact, shared knowledge disk.

This design lets LightMem run fixed-budget, semantically verified retrieval and incremental consolidation, something plain context-window replay cannot provide efficiently over long horizons.
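The RAM/disk analogy can be sketched as a minimal three-tier store. The class and method names here are hypothetical illustrations, not LightMem's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryTiers:
    """Illustrative three-tier store: STM (working RAM),
    MTM (per-user cache), LTM (shared knowledge disk)."""
    stm: list = field(default_factory=list)   # current-session turns
    mtm: dict = field(default_factory=dict)   # user_id -> compact summaries
    ltm: dict = field(default_factory=dict)   # de-identified shared knowledge

    def write_turn(self, turn: str) -> None:
        # Online: append raw interaction to working memory; a writer
        # model would later compress it into an MTM entry.
        self.stm.append(turn)

    def consolidate(self, user_id: str, summary: str) -> None:
        # Offline: move a compact summary into the user's MTM cache.
        self.mtm.setdefault(user_id, []).append(summary)

tiers = MemoryTiers()
tiers.write_turn("User prefers morning meetings.")
tiers.consolidate("u1", "u1: prefers mornings")
```

The point of the split is that the hot online path only touches STM and MTM, while the heavier LTM updates happen off the request path.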

DIAGRAM

Online query time retrieval and writing flow in LightMem

This diagram shows how LightMem processes a single user turn using SLM-1, SLM-2, and SLM-3 together with the STM, MTM, and LTM stores.

DIAGRAM

LightMem evaluation and ablation pipeline

This diagram shows how LightMem is evaluated on LoCoMo and DialSim with baselines and ablations.

PROCESS

How LightMem Handles a Multi-turn Dialogue Session

  1. Intent Modeling and Retrieval Control

    LightMem uses the SLM-1 Controller to infer intent attributes, generate hypothetical queries, and output metadata constraints and Top-K budgets.

  2. Two-Stage Retrieval

    LightMem runs metadata-constrained coarse vector retrieval to 2K candidates; the SLM-2 Selector then performs semantic-consistency reranking and compresses them to K memories.

  3. Memory Writing and Update

    After response generation, the SLM-3 Writer summarizes the interaction into compact MTM entries, merging repetitive items and resolving conflicts with temporal cues.

  4. Offline Consolidation

    A long-context LLM periodically abstracts high-value MTM episodes into de-identified knowledge, incrementally updating the graph-structured LTM with forgetting.
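The online steps above can be sketched end to end. All function names, the metadata constraint, and the toy token-overlap reranker are illustrative stand-ins for the SLM components, not the paper's implementation:

```python
def controller(query: str) -> dict:
    # SLM-1 stand-in: emit a fixed Top-K budget plus a metadata constraint.
    return {"top_k": 2, "must_contain": "meeting"}

def coarse_retrieve(store: list, plan: dict) -> list:
    # Stage 1: cheap metadata-constrained filter, capped at 2K candidates.
    hits = [m for m in store if plan["must_contain"] in m]
    return hits[: 2 * plan["top_k"]]

def select(candidates: list, query: str, k: int) -> list:
    # SLM-2 stand-in: rerank by token overlap with the query, keep K.
    qtokens = set(query.lower().split())
    scored = sorted(candidates,
                    key=lambda m: len(qtokens & set(m.lower().split())),
                    reverse=True)
    return scored[:k]

def write(store: list, entry: str) -> None:
    # SLM-3 stand-in: deduplicate before appending a compact entry.
    if entry not in store:
        store.append(entry)

store = ["meeting moved to 9am", "likes tea", "meeting room is B2",
         "meeting notes shared", "project deadline Friday"]
plan = controller("when is the meeting")
memories = select(coarse_retrieve(store, plan),
                  "when is the meeting", plan["top_k"])
write(store, "asked about meeting time")
```

The key structural property is that the expensive semantic check only ever sees a bounded 2K-candidate slice, so per-turn cost stays fixed regardless of store size.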

KEY CONTRIBUTIONS

Key Contributions

  • LightMem: an SLM-driven memory system

    LightMem introduces a specialized SLM-1 Controller, SLM-2 Selector, and SLM-3 Writer to handle online query construction, retrieval, and writing under fixed Top-K budgets.

  • Two-stage memory querying

    LightMem first performs fast vector-based coarse retrieval to 2K candidates, then uses semantic-consistency verification in the SLM-2 Selector to keep the K truly relevant memories.

  • STM/MTM/LTM organization with user isolation

    LightMem organizes STM, MTM, and a graph-structured LTM keyed by user identifiers, achieving about +2.5 average F1 on LoCoMo at 83 ms median retrieval latency.
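The user-isolation idea in the third contribution can be sketched as a per-user MTM map beside a shared, de-identified LTM adjacency map. All names here are hypothetical:

```python
mtm = {}   # user_id -> private episodic entries (isolated per user)
ltm = {}   # subject -> set of abstract facts, with no user identifiers

def write_mtm(user_id: str, entry: str) -> None:
    # Private write: episodes stay keyed to their owner.
    mtm.setdefault(user_id, []).append(entry)

def consolidate(subject: str, fact: str) -> None:
    # Offline: abstract a high-value episode into the shared LTM graph,
    # dropping the user identifier (de-identification).
    ltm.setdefault(subject, set()).add(fact)

def forget(subject: str, fact: str) -> None:
    # Incremental forgetting: prune an edge that is no longer valid.
    ltm.get(subject, set()).discard(fact)

write_mtm("u1", "u1 moved the standup to 9am")
consolidate("standup", "starts at 9am")
forget("standup", "starts at 10am")   # no-op if the edge is absent
```

The separation means a shared LTM query can never leak which user contributed an episode, while each user's MTM remains fully personal.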

RESULTS

By the Numbers

F1, multi-hop (GPT-4o)

34.50

+1.64 over A-MEM

F1, adversarial (GPT-4o mini)

54.50

+4.47 over A-MEM

SBERT similarity (DialSim)

23.40

+3.89 over A-MEM

Retrieval latency (P50)

83 ms

vs 856 ms for A-MEM

On LoCoMo, which tests single-hop, multi-hop, temporal, open-domain, and adversarial reasoning, LightMem improves F1 while shortening the effective context. On DialSim, LightMem raises SBERT similarity from 19.51 to 23.40, showing stronger semantic consistency in long-term dialogue.

BENCHMARK

Main results on LoCoMo multi-hop with GPT-4o

F1 on LoCoMo multi-hop questions, using GPT-4o as the response generator.

BENCHMARK

Latency comparison with GPT-4o mini

Median (P50) retrieval latency in milliseconds for memory systems with GPT-4o mini.

KEY INSIGHT

The Counterintuitive Finding

LightMem keeps median retrieval latency at 83 ms while still improving LoCoMo F1 by about 2.5 points on average across model scales.

This is surprising because adding SLM-based controllers and reranking looks like extra overhead, yet LightMem is far faster than A-MEM's 856 ms retrieval latency.
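One way to see why this holds: the online path replaces a large-model call with one vector search plus one small-model rerank over a bounded candidate set. The per-call costs below are hypothetical round numbers for illustration, not measurements from the paper:

```python
# Hypothetical per-operation latency budget (milliseconds).
t_vector = 10       # coarse vector search over the index
t_slm_rerank = 70   # one small-model rerank over 2K candidates
t_llm_call = 850    # one large-model memory operation

# LightMem-style online path: two cheap operations in sequence.
lightmem = t_vector + t_slm_rerank   # 80 ms, near the reported 83 ms P50

# LLM-driven path: a single large call already dominates the budget.
llm_driven = t_llm_call

assert lightmem < llm_driven
```

The small models add work, but they are cheap enough that the sum still undercuts a single large-model call by an order of magnitude.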

WHY IT MATTERS

What this unlocks for the field

LightMem enables LLM agents to maintain long-horizon, user-specific memory with fixed retrieval budgets and stable accuracy across diverse backbones.

Builders can now deploy multi-session agents that personalize over time without replaying 16K-token histories or paying repeated large-model costs for every memory operation.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Agent Memory

AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang, Dongyu Ru et al.

2026

AMemGym combines Structured Data Generation, On-Policy Interaction, Evaluation Metrics, and Meta-Evaluation to script user state trajectories, drive LLM-simulated role-play, and score write–read–utilization behavior. On AMemGym’s base configuration, AWE-(2,4,30) reaches a 0.291 normalized memory score on interactive evaluation, while native gpt-4.1-mini only achieves 0.203, exposing substantial gaps between memory agents and plain long-context LLMs.

Agent Memory

AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

Emmanuel Bamidele

2026

AMV-L manages agent memory using a Memory Value Model, Tiered Lifecycle, Bounded Retrieval Path, and Lifecycle Manager to decouple retention from retrieval eligibility. Under a 70k-request long-running workload, AMV-L improves throughput from 9.027 to 36.977 req/s over TTL and reduces p99 latency from 5398.167 ms to 1233.430 ms while matching LRU’s retrieval quality.
