Category

Working Memory

Working memory and short-term context management in LLMs — context windows, compression, and scratchpad mechanisms.

24 papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Benchmark · Long-Term Memory

AgenticAI-DialogGen: Topic-Guided Conversation Generation for Fine-Tuning and Evaluating Short- and Long-Term Memories of LLMs

Manoj Madushanka Perera, Adnan Mahmood et al.

2026

AgenticAI-DialogGen chains ChatPreprocessor, KnowledgeExtractor, TopicAnalyzer, KnowledgeGraphBuilder, PersonaGenerator, DuelingChat Agent, ConversationValidator, ConversationRefiner, QAGeneration, and PostProcessing to turn raw multi-session chats into topic-guided, persona-grounded conversations with explicit short- and long-term memories. On the TGC / KG memory QA benchmark, Mistral-7B fine-tuned within AgenticAI-DialogGen achieves 87.36 F1, compared to GPT-4’s 83.77 F1 in a zero-shot setting on the same task.

Benchmark · Agent Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026 · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Benchmark · Agent Memory

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

Mofasshara Rafique, Laurent Bindschaedler

2026

ClawVM manages agent state as typed pages via the SessionPageTable, RepresentationSelector, FaultObserver, WritebackJournal, and ClawVMEngine inside the agent harness. Across four OpenClaw-derived workloads and six token budgets, ClawVM cuts explicit faults from 67.8 (retrieval baseline) and 1.5 (Compaction-Hybrid) to 0.0 while adding a median policy-engine overhead of under 50 μs per turn.

Benchmark · Agent Memory · Long-Term Memory · Memory Architecture

Lightweight LLM Agent Memory with Small Language Models

Jiaquan Zhang, Chaoning Zhang et al.

2026

LightMem orchestrates an SLM-1 Controller, SLM-2 Selector, SLM-3 Writer, and STM, MTM, and LTM stores to modularize retrieval, writing, and offline consolidation. On LoCoMo, LightMem reaches 34.50 F1 on multi-hop questions with GPT-4o, +1.64 over A-MEM, while keeping median retrieval latency at 83 ms.

RAG · Benchmark · Memory Architecture

Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang et al.

2025

HGMEM represents working memory as a hypergraph with Hypergraph-based Memory Storage, Adaptive Memory-based Evidence Retrieval, and Dynamic Memory Evolving to build high-order correlations across entities and facts. On Prelude long-narrative understanding, HGMEM with GPT-4o achieves 73.81% accuracy compared to 72.22% for HippoRAG v2, while also reaching 69.74 comprehensiveness on LongBench generative sense-making QA.

Benchmark · Long-Term Memory · Memory Architecture

LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng et al.

2025

LightMem pipelines a Cognitive-Inspired Sensory Memory, Topic Segmentation Submodule, Topic-Aware Short-Term Memory, and Long-Term Memory with Sleep-Time Update to filter, group, summarize, and asynchronously consolidate dialogue history. On LongMemEval-S with Qwen3-30B-A3B-Instruct-2507, LightMem reaches 70.20% ACC vs 65.20% for A-MEM (+5.00 points) while reducing total token usage by up to 21.8× and API calls by up to 17.1×.

RAG · Benchmark · Memory Architecture

Memory-Augmented Log Analysis with Phi-4-mini: Enhancing Threat Detection in Structured Security Logs

Anbi Guo, Mahfuza Farooque

2025

DM-RAG augments Phi-4-mini with a Short-Term Memory (STM) buffer, Long-Term Memory (LTM) FAISS store, Bayesian fusion, and a logistic regression confidence model for structured log analysis. On UNSW-NB15, DM-RAG reaches 98.70% recall and 69.59% F1, beating the Phi-4 + RAG (MITRE) baseline in F1 by 17.89 points.

RAG · Benchmark · Agent Memory

Memory in the Age of AI Agents

Yuyang Hu, Shichun Liu et al.

2025

Memory in the Age of AI Agents formalizes agent memory with Memory Formation, Memory Evolution, and Memory Retrieval operators, and classifies memories into token-level, parametric, and latent forms plus factual, experiential, and working functions. The survey’s main result is a unified Forms–Functions–Dynamics framework that consolidates fragmented LLM agent memory work, benchmarks, and open-source frameworks into a coherent taxonomy.

Benchmark · Memory Architecture

MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Stefano Zeppieri

2025

MMAG organizes conversational memory, long-term user memory, episodic and event-linked memories, sensory and context-aware memory, and short-term working memory under a modular memory controller integrated with Heero’s encrypted Firestore and S3 stores. MMAG delivers a 20% increase in user retention and a 30% increase in average conversation duration on the Heero language learning platform compared to its pre-memory deployment.

Benchmark

Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

Chupei Wang, Jiaqiu Vince Sun

2025

PI-LLM uses a synthetic key–value retrieval task, Interference Endurance Score (IES), per-key forget prompts, and a mock QA reset to stress-test working-memory-like behavior under proactive interference. PI-LLM finds a universal log-linear decay in retrieval accuracy across 0.6B–637B-parameter LLMs as interference grows, revealing that parameter size, not context window length, predicts interference robustness.
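The kind of synthetic key–value task described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's actual generator: the function name `make_update_stream`, the integer value range, and the shuffling scheme are all hypothetical. Each key is overwritten several times, and the ground truth for a query is the most recent value, so earlier values act as proactive interference.

```python
import random


def make_update_stream(keys, updates_per_key, seed=0):
    """Build a shuffled stream of (key, value) updates plus the ground truth.

    Hypothetical helper: every key receives `updates_per_key` values, and
    only the last value seen for a key counts as the correct answer.
    """
    rng = random.Random(seed)
    stream = [
        (key, rng.randrange(1000))
        for key in keys
        for _ in range(updates_per_key)
    ]
    rng.shuffle(stream)

    # The final assignment for each key overwrites all earlier ones.
    latest = {}
    for key, value in stream:
        latest[key] = value
    return stream, latest


stream, latest = make_update_stream(["alpha", "beta", "gamma"], updates_per_key=4)
prompt_lines = [f"{key}: {value}" for key, value in stream]
```

Raising `updates_per_key` increases the number of stale values per key, which is one way to scale the interference load while the queried answer stays unambiguous.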

Benchmark

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

Mengkang Hu, Tianxing Chen et al.

2024

HiAgent manages working memory using Subgoal-based Hierarchical Working Memory, Observation Summarization, and a Trajectory Retrieval module to chunk and selectively expand action-observation histories. On five AgentBoard long-horizon tasks, HiAgent reaches a 42.00% overall success rate versus 21.00% for STANDARD, with 3.80 fewer average steps.

Benchmark

Improving Factuality with Explicit Working Memory

Mingda Chen, Yang Li et al.

2024

Ewe augments Llama-3.1 with an explicit working memory, real-time feedback, fact-checking outcomes, and relevant knowledge memories that are refreshed via FIFO KV-cache updates during decoding. On the Biography dataset, Ewe reaches 49.7 VeriScore F1 versus 37.1 for Llama-3.1 70B (+12.6), while keeping AlpacaEval win rate around 50% against the same baseline.

Benchmark

MemoNav: Working Memory Model for Visual Navigation

Hongxin Li, Zeyu Wang et al.

2024

MemoNav composes Short-term memory, Selective forgetting module, Long-term memory, Working memory generation, and Transformer decoders to focus navigation on goal-relevant topological map nodes. On Gibson 1-goal, MemoNav reaches 74.7% SR vs 70.0% for VGM, and on Gibson 4-goal multi-goal tasks MemoNav achieves 28.9% PR vs 21.5% for VGM.

Benchmark

TransformerFAM: Feedback attention is working memory

Dongseong Hwang, Weiran Wang et al.

2024

TransformerFAM augments Block Sliding Window Attention (BSWA) with Feedback Attention Memory (FAM), where FAM tokens attend to and compress block activations while queries jointly attend to BSWA memory segments and past FAM. On PassKey retrieval, TransformerFAM maintains 100% accuracy up to 260k filler tokens, while TransformerBSWA with 12 memory segments collapses after 20k tokens.

Benchmark

Recurrent Neural Networks and Long Short-Term Memory Networks: Tutorial and Survey

Benyamin Ghojogh, Ali Ghodsi

arXiv 2023 · 2023

Recurrent Neural Networks and Long Short-Term Memory Networks explains how Backpropagation Through Time, LSTM gates and cells, Gated Recurrent Units, bidirectional RNNs, and ELMo fit into one dynamical-systems view. The paper mainly contributes a structured tutorial and survey rather than new benchmark numbers.

Benchmark

Working Memory Capacity of ChatGPT: An Empirical Study

Dongyu Gong, Xingchen Wan, Dingmin Wang

2023

Working Memory Capacity of ChatGPT probes ChatGPT with verbal and spatial n-back blocks, adding noise, feedback, and chain-of-thought variants to stress working memory. The study finds that d′ falls to around 1 at n=3 in most conditions and that GPT-4’s verbal n-back capacity far exceeds the Bloomz, ChatGLM, and Vicuna baselines.
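For readers unfamiliar with d′, the sensitivity index from signal detection theory is the difference between the z-transformed hit rate and false-alarm rate. The sketch below shows one common way to compute it from n-back response counts. The function name `d_prime` and the log-linear correction (adding 0.5 to each count) are assumptions chosen for illustration; the study may handle extreme rates differently.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumption: a log-linear correction keeps both rates strictly
    between 0 and 1 so the inverse normal CDF stays finite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Chance-level responding gives d' = 0; better discrimination gives d' > 0.
chance = d_prime(hits=5, misses=5, false_alarms=5, correct_rejections=5)
better = d_prime(hits=8, misses=2, false_alarms=2, correct_rejections=8)
```

A d′ near 1 therefore indicates that match and non-match trials are still discriminated above chance, but only weakly.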

Benchmark

Large Language Models with Controllable Working Memory

Daliang Li, Ankit Singh Rawat et al.

2022

Knowledge Aware Finetuning (KAFT) augments QA data with relevant context, irrelevant context, counterfactual context, and empty context built from SQuAD 2.0, T-REx, QASC, and TriviaQA, and uses the pretrained model’s answer to label irrelevant slices. KAFT achieves up to 24× higher controllability and up to 6× higher robustness than Noisy Finetuning on PaLM 540B while matching TriviaQA validation accuracy.

Benchmark

Working Memory Connections for LSTM

Federico Landi, Lorenzo Baraldi et al.

arXiv 2021 · 2021

Working Memory Connections for LSTM (LSTM-WM) augments LSTM gates, Working Memory Connections, peephole connections, and the memory cell so gates see a protected projection of the cell state. On PTB character-level language modeling, LSTM-WM achieves 1.299 BPC with T_PTB = 150 compared to 1.334 BPC for vanilla LSTM.

Benchmark

Working Memory Graphs

Ricky Loynd, Roland Fernandez et al.

arXiv 2019 · 2019

Working Memory Graphs combines a Core vector, Factor vectors, and persistent Memo vectors processed by a multi-layer Transformer to implement shortcut recurrence over past observations. On the Pathfinding task, Working Memory Graphs with Memos nearly matches a full-history, non-recurrent Working Memory Graphs baseline and exceeds a GRU agent by roughly 9.5 percentage points in zero-shot quiz accuracy on 24-step episodes.

Benchmark

A Dataset and Architecture for Visual Reasoning with a Working Memory

Guangyu Robert Yang, Igor Ganichev et al.

arXiv 2018 · 2018

The proposed COG network combines Visual processing, Semantic processing, Visual short-term memory, and a Controller so that it can parse instructions, attend over images, and maintain working memory. The network reaches 96.8% overall accuracy on CLEVR versus 95.5% for CNN+LSTM+RN, while achieving 93.7% on canonical COG and strong zero-shot task generalization.

Benchmark

Chinese Poetry Generation with a Working Memory Model

Xiaoyuan Yi, Maosong Sun et al.

arXiv 2018 · 2018

The Working Memory model combines a topic memory, history memory, local memory, genre embedding, and Topic Trace mechanism inside a GRU encoder–decoder to dynamically read and write salient poem context. On Chinese quatrains, the Working Memory model achieves BLEU 1.315 and perplexity 86 versus iPoet’s BLEU 0.425 and perplexity 138.