Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi et al.
arXiv · 2025
LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer with noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score on 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
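The three components can be sketched as a toy pipeline. This is an illustrative assumption, not the paper's implementation: class and method names (`LightStyleMemory`, `observe`, `note`, `retrieve`, `build_context`) are invented here, and word-overlap scoring stands in for whatever retriever LIGHT actually uses.

```python
# Hypothetical sketch of a LIGHT-style memory pipeline (not the paper's code).
# Components mirror the summary: retrieval from the conversation, a scratchpad,
# and a bounded working-memory buffer with naive noise filtering.
from collections import deque


class LightStyleMemory:
    def __init__(self, buffer_size: int = 4):
        self.history = []                          # full conversation log
        self.scratchpad = []                       # distilled notes
        self.working = deque(maxlen=buffer_size)   # working-memory buffer

    def observe(self, turn: str) -> None:
        """Log a conversation turn and keep it in working memory."""
        self.history.append(turn)
        self.working.append(turn)

    def note(self, fact: str) -> None:
        """Scratchpad formation: persist a distilled fact."""
        self.scratchpad.append(fact)

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        """Retrieval from the conversation: rank turns by word overlap."""
        q = set(query.lower().split())
        scored = [(len(q & set(t.lower().split())), t) for t in self.history]
        scored = [(s, t) for s, t in scored if s > 0]  # noise filter: drop zero-overlap turns
        scored.sort(key=lambda st: -st[0])
        return [t for _, t in scored[:k]]

    def build_context(self, query: str) -> str:
        """Assemble the context string that would be fed to the LLM."""
        parts = self.retrieve(query) + self.scratchpad + list(self.working)
        return "\n".join(dict.fromkeys(parts))  # dedupe while keeping order
```

A usage sketch: `mem.observe(...)` on each turn, `mem.note(...)` for salient facts, then `mem.build_context(question)` before prompting the model. Keeping the working-memory buffer bounded (`deque(maxlen=...)`) is what lets the context stay small even as the logged conversation grows toward millions of tokens.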