RGMem: Renormalization Group-inspired Memory Evolution for Language Agents

Authors: Ao Tian, Yunfeng Lu, Xinxin Fan et al.

2025

TL;DR

RGMem uses renormalization group–style multi-scale memory evolution with operators RK1–RK3 to achieve +8.98 points over Memory OS on PersonaMem.



THE PROBLEM

Long-Term Dialogue Personalization Fails Without Multi-Scale Memory (7.08- and 8.98-Point Gaps)

RGMem targets a gap in existing memory systems, which operate only at the fact level and therefore struggle to distill stable preferences and deep user traits.

On LOCOMO and PersonaMem, this limitation leaves cross-session continuity weak, while RGMem improves performance by 7.08 and 8.98 points over the best baselines.

HOW IT WORKS

RGMem — Renormalization Group–Inspired Multi-Scale Memory Evolution

RGMem centers on Microscopic Evidence Space DL0, Structured Knowledge Space G, and renormalization operators RK1, RK2, and RK3 to evolve user profiles across scales.

You can think of RGMem like a memory hierarchy: DL0 is fast RAM for episodic facts, G is a structured index, and RK1–RK3 act like scheduled compaction and rebalancing.

This renormalization-inspired design lets RGMem reorganize user traits via thresholded phase-transition-like updates, something a plain context window or flat RAG cannot provide.
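To make the two-space layout above concrete, here is a minimal sketch of the memory-hierarchy analogy, assuming illustrative names of our own (`EpisodicUnit`, `MemoryState`, `ingest`) rather than any API from the paper: DL0 holds fast-changing episodic evidence, while G is the slower structured index that the RK operators would later reorganize.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicUnit:
    """A microscopic evidence item in DL0: one synthesized dialogue episode."""
    session_id: int
    text: str

@dataclass
class MemoryState:
    # Microscopic Evidence Space DL0: fast-changing episodic facts (the "RAM" of the analogy).
    d_l0: list = field(default_factory=list)
    # Structured Knowledge Space G: slow-changing index of relations and traits,
    # rewritten only by scheduled RK1-RK3 passes (not shown here).
    g: dict = field(default_factory=dict)

    def ingest(self, session_id: int, text: str) -> None:
        """Append raw evidence at the microscopic scale; no reorganization yet."""
        self.d_l0.append(EpisodicUnit(session_id, text))

state = MemoryState()
state.ingest(1, "User mentions preferring vegetarian restaurants.")
print(len(state.d_l0))  # 1
```

The point of the split is that writes stay cheap (append-only at DL0), while the expensive restructuring of G happens in batched, compaction-like passes.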

DIAGRAM

Multi-Scale Retrieval and Response Flow in RGMem

This diagram shows how RGMem retrieves microscopic, mesoscopic, and macroscopic memory to answer a query via the L2 multi-scale retrieval layer.

DIAGRAM

Evaluation Pipeline on LOCOMO and PersonaMem

This diagram shows how RGMem is evaluated on LOCOMO and PersonaMem with different backbones and baselines.

PROCESS

How RGMem Handles a Multi-Session Conversation

  1. Construction of Memory State Space

    RGMem uses Microscopic Evidence Space DL0 to segment and synthesize raw dialogue into episodic units, and then builds Structured Knowledge Space G via the hierarchical extraction function f_extract.

  2. Instantiation of RG Operators

    RGMem applies Relation Inference Operator RK1 to aggregate relation evidence, Node-Level Abstraction Operator RK2 to form concept theories, and Hierarchical Flow Operator RK3 to propagate summaries upward.

  3. Dynamics and Multi-Scale Observations

    RGMem runs its thresholded updates, letting Σ and Δ stabilize or reorganize, and exposes these multi-scale states through the L2 multi-scale retrieval function f_retr.

  4. Context Aggregation and Output

    RGMem combines microscopic evidence, mesoscopic T_e, and macroscopic Σ and Δ into a query-specific context C(q), which the backbone LLM uses to generate the final response.
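The four steps above can be sketched end to end. This is a toy stand-in, not the paper's LLM-driven implementation: the function names (`f_extract`, `rk1`-`rk3`, `f_retr`) mirror the paper's symbols, but the bodies are simplified assumptions of ours using keyword matching and counts.

```python
def f_extract(dialogue):
    """Step 1: turn raw dialogue into (relation, evidence) pairs for G."""
    return [("likes", turn) for turn in dialogue if "likes" in turn]

def rk1(evidence):
    """RK1: aggregate relation evidence per relation."""
    agg = {}
    for rel, text in evidence:
        agg.setdefault(rel, []).append(text)
    return agg

def rk2(agg):
    """RK2: abstract each node's evidence into a concept-level theory (here, a support count)."""
    return {rel: len(texts) for rel, texts in agg.items()}

def rk3(theories, theta_inf=3):
    """RK3 with thresholded dynamics: propagate upward into the macroscopic
    profile Σ only traits whose support crosses the evolution threshold θ_inf."""
    return {rel: n for rel, n in theories.items() if n >= theta_inf}

def f_retr(sigma, evidence, query):
    """Steps 3-4: multi-scale retrieval into a query-specific context C(q),
    mixing macroscopic Σ with matching microscopic evidence."""
    micro = [text for _, text in evidence if query in text]
    return {"sigma": sigma, "micro": micro}

dialogue = ["user likes hiking"] * 3 + ["user likes jazz"]
ev = f_extract(dialogue)                 # Step 1: DL0 evidence
sigma = rk3(rk2(rk1(ev)))                # Step 2: RK operators build Σ
context = f_retr(sigma, ev, "jazz")      # Steps 3-4: C(q) for the backbone LLM
print(context["micro"])  # ['user likes jazz']
```

In the real system each operator is an LLM-mediated transformation over structured memory; the skeleton above only shows how the scales compose.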

KEY CONTRIBUTIONS

Key Contributions

  • Renormalization Group–Inspired Memory Evolution

    RGMem formalizes user profiling as a multi-scale effective theory T(M, s) and instantiates RK1, RK2, and RK3 over Microscopic Evidence Space DL0 and Structured Knowledge Space G, improving PersonaMem Avg. by 8.98 points.

  • Thresholded Phase-Transition Dynamics

    RGMem introduces evolution thresholds such as θ_inf that induce phase-transition-like behavior, with a critical point at θ_inf = 3 on both LOCOMO and PersonaMem.

  • Resolution of the Stability-Plasticity Dilemma

    RGMem separates fast variables in DL0 from slow variables Σ and Δ in G, breaking the baseline frontier on PersonaMem by jointly improving Recall Facts and Latest Preference scores.
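Contributions 02 and 03 can be illustrated together with a toy update rule, assuming a simplified counter-based dynamic of our own: conflicting observations accumulate as tension in Δ (the fast path), and the stable profile Σ reorganizes in one discrete jump only when support crosses θ_inf. The Σ/Δ/θ_inf symbols follow the paper; the rule itself is a sketch, not the paper's mechanism.

```python
THETA_INF = 3  # evolution threshold; the paper reports a critical point at θ_inf = 3

class Profile:
    def __init__(self):
        self.sigma = {}  # slow variables Σ: stable macroscopic traits
        self.delta = {}  # tension variables Δ: accumulated conflicting evidence

    def observe(self, trait: str, value: str) -> None:
        """Each observation lands in Δ; Σ stays frozen (stability) until the
        evidence count crosses θ_inf, then flips in one step (plasticity)."""
        key = (trait, value)
        self.delta[key] = self.delta.get(key, 0) + 1
        if self.delta[key] >= THETA_INF and self.sigma.get(trait) != value:
            self.sigma[trait] = value  # phase-transition-like reorganization
            # Resolved tension for this trait is cleared from Δ.
            self.delta = {k: v for k, v in self.delta.items() if k[0] != trait}

p = Profile()
for _ in range(2):
    p.observe("diet", "vegan")
print(p.sigma)  # {} : below threshold, Σ unchanged
p.observe("diet", "vegan")
print(p.sigma)  # {'diet': 'vegan'} : threshold crossed, Σ reorganizes
```

This separation is what lets the system score well on both Recall Facts (fast evidence is never discarded) and Latest Preference (Σ eventually flips when conflicts persist).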

RESULTS

By the Numbers

  • Avg.: 74.01% (+8.98 over Memory OS on PersonaMem)

  • Recall: 88.64% (+6.07 over A-Mem on PersonaMem)

  • Latest Pref.: 83.02% (+9.36 over Memory OS on PersonaMem)

  • Avg.: 78.92% (+3.78 over Zep on LOCOMO with gpt-4o-mini)

On PersonaMem with GPT-4.1, which tests dynamic persona evolution and conflicting evidence, RGMem reaches 74.01% Avg., 88.64% Recall, and 83.02% Latest Preference. On LOCOMO, which targets long-context reasoning and temporal consistency, RGMem achieves 78.92% Avg. with gpt-4o-mini, showing that multi-scale memory evolution yields stronger cross-session continuity than flat retrieval systems.


BENCHMARK

PersonaMem Benchmark Results with GPT-4.1 Backbone

Avg. score on PersonaMem across memory systems using GPT-4.1.

BENCHMARK

LOCOMO Benchmark Results with gpt-4o-mini

Avg. score on LOCOMO question types using gpt-4o-mini.

KEY INSIGHT

The Counterintuitive Finding

RGMem reaches peak performance when the evolution threshold θ_inf equals 3, with LOCOMO accuracy around 86 and PersonaMem Latest Preference around 84.

This is surprising because moving θ_inf away from 3 in either direction reduces accuracy, contradicting the intuition that more frequent (or rarer) updates should monotonically help.

WHY IT MATTERS

What this unlocks for the field

RGMem enables language agents to maintain macroscopic user traits Σ while flexibly encoding tensions Δ, even under long, conflicting interaction histories.

Builders can now design agents that both remember long-term goals and adapt to short-term states without retraining or unbounded context windows, using RGMem as a drop-in memory backend.
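As a hypothetical illustration of the "drop-in memory backend" framing, here is what an agent loop wrapping such a store might look like; `MemoryBackend`, `write`, and `context` are illustrative names of ours, not an interface from the paper, and the retrieval is a keyword-match stand-in.

```python
class MemoryBackend:
    """Toy stand-in for an RGMem-style store: cheap writes per turn,
    multi-scale evolution assumed to run separately."""

    def __init__(self):
        self.events = []

    def write(self, turn: str) -> None:
        self.events.append(turn)  # microscopic write only

    def context(self, query: str, k: int = 2) -> list:
        # Stand-in for multi-scale retrieval: last-k matching events.
        hits = [e for e in self.events if query in e]
        return hits[-k:]

memory = MemoryBackend()
for turn in ["goal: finish thesis", "mood: tired today", "goal: revise thesis draft"]:
    memory.write(turn)
# The agent would prepend this context to its prompt instead of the full history.
print(memory.context("goal"))  # ['goal: finish thesis', 'goal: revise thesis draft']
```

The design point is that the agent's prompt stays bounded: long-term goals survive in memory rather than in an ever-growing context window.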


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
