Pre-Storage Reasoning for Episodic Memory: Shifting Inference Burden to Memory for Personalized Dialogue

Authors: Sangyeop Kim, Yohan Lee, Sanghwa Kim et al.

2025

TL;DR

PREMem shifts cross-session reasoning into pre-storage memory construction using episodic extraction and schema-based evolution patterns, reaching a 71.4 LLM-as-a-judge score on LongMemEval with a gpt-4.1 base.


THE PROBLEM

Cross-session dialogue reasoning collapses when inference bears the full burden

PREMem targets conversational systems where response generation carries an excessive reasoning burden, making performance highly dependent on model size and prone to failure at temporal integration across sessions.

When multi-session personalized dialogue requires continuity, cross-session reasoning and temporal relationships break down, leading to inconsistent user modeling and degraded personalization.

HOW IT WORKS

PREMem: Pre-Storage Reasoning for Episodic Memory

PREMem’s core mechanism chains Episodic Memory Extraction, Pre-Storage Memory Reasoning, semantic clustering with a persistent memory pool, and an Inference Phase that retrieves from both the memory storage M and the reasoning storage R.

You can think of PREMem like a brain that consolidates experiences during sleep, turning raw conversations into structured schemas before they are shelved in long-term memory.

By shifting schema-based evolution reasoning into the storage phase, PREMem lets inference operate on pre-synthesized episodic fragments and evolution patterns that a plain context window would never expose explicitly.
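To make the moving pieces concrete, here is a minimal sketch of the storage objects implied above. The class and field names (EpisodicFragment, ReasoningFragment, kind, timestamp) are illustrative assumptions, not the authors' exact schema.

```python
# Minimal sketch of PREMem's storage objects, assuming Python dataclasses.
# Class and field names are illustrative, not the paper's exact schema.
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class EpisodicFragment:
    session_id: int                                   # which session S_i produced this fragment
    kind: Literal["factual", "experiential", "subjective"]
    text: str                                         # extracted memory content
    timestamp: str                                    # structured temporal context, e.g. "2024-03-02"


@dataclass
class ReasoningFragment:
    text: str                    # cross-session synthesis produced at storage time by LLM_reason
    source_sessions: List[int]   # sessions whose fragments were combined
    timestamp: str               # temporal anchor used for chronological ordering at inference


# Episodic fragments go into memory storage M; reasoning fragments go into
# reasoning storage R. The Inference Phase retrieves from both.
M: List[EpisodicFragment] = []
R: List[ReasoningFragment] = []
```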

DIAGRAM

Inference Flow over PREMem Memories

This diagram shows how PREMem retrieves and composes enriched episodic and reasoning memories during the Inference Phase for a new query.

DIAGRAM

PREMem Evaluation and Ablation Pipeline

This diagram shows how PREMem is evaluated on LongMemEval and LoCoMo with different models, baselines, and ablation settings.

PROCESS

How PREMem Handles a Multi-Session Personalized Query

  1. Step 1: Episodic Memory Extraction

    PREMem runs Episodic Memory Extraction to convert conversation sessions S1 through SN into factual, experiential, and subjective fragments with structured temporal context.

  2. Step 2: Pre-Storage Memory Reasoning

    PREMem applies Pre-Storage Memory Reasoning by clustering fragments, maintaining a persistent memory pool, and generating reasoning memories via LLM_reason over evolution patterns.

  3. Step 3: Clustering and Temporal Linking

    PREMem embeds fragments with f_emb, forms clusters C_i, maintains a persistent memory pool P_i, and links clusters across sessions using cosine similarity with threshold θ (see the first sketch after this list).

  4. Step 4: Inference Phase

    In the Inference Phase, PREMem retrieves the top-k items from memory storage M and reasoning storage R, orders them chronologically, and feeds them together with the query into LLM_response (see the second sketch after this list).
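The clustering and temporal-linking step (Step 3) can be sketched roughly as below. This is a hedged illustration, not the authors' implementation: the embedding function stands in for f_emb, the threshold value for θ is a placeholder, and the pool layout (a centroid plus member texts per cluster) is an assumption.

```python
# Hedged sketch of Step 3 (clustering and temporal linking). The embedding
# stand-in, threshold value, and pool structure are assumptions for illustration.
import numpy as np

THETA = 0.7  # placeholder value for the similarity threshold θ


def embed(texts):
    """Stand-in for f_emb; swap in any sentence-embedding model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def link_to_pool(new_fragments, pool):
    """Attach each new fragment to the most similar cluster in the persistent
    memory pool P_i if similarity exceeds θ, otherwise start a new cluster C_i."""
    for text, vec in zip(new_fragments, embed(new_fragments)):
        best, best_sim = None, -1.0
        for cluster in pool:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= THETA:
            best["members"].append(text)
            n = len(best["members"])
            best["centroid"] = (best["centroid"] * (n - 1) + vec) / n  # running mean
        else:
            pool.append({"centroid": vec, "members": [text]})
    return pool
```

The Inference Phase (Step 4) can be sketched in the same spirit, reusing the embed and cosine helpers above. Stored items are represented here as simple dicts, and call_llm is a stub standing in for whatever chat-completion backend serves LLM_response; none of these names come from the paper.

```python
# Hedged sketch of Step 4 (Inference Phase), reusing embed() and cosine() from
# the previous sketch. Stored items are dicts with "vec", "text", "timestamp".
def call_llm(prompt):
    """Placeholder for LLM_response; swap in any chat-completion client."""
    return "stub answer"


def retrieve_top_k(query_vec, store, k):
    """Rank stored items by cosine similarity to the query embedding."""
    return sorted(store, key=lambda item: -cosine(query_vec, item["vec"]))[:k]


def answer(query, M, R, k=5):
    q_vec = embed([query])[0]
    retrieved = retrieve_top_k(q_vec, M, k) + retrieve_top_k(q_vec, R, k)
    retrieved.sort(key=lambda item: item["timestamp"])  # chronological ordering
    context = "\n".join(f"[{item['timestamp']}] {item['text']}" for item in retrieved)
    prompt = f"Memories:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)
```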

KEY CONTRIBUTIONS

Key Contributions

  1. Cognitive-science-grounded memory framework

    PREMem structures episodic memories via Episodic Memory Extraction into factual, experiential, and subjective categories, and models five evolution patterns across sessions inspired by schema theory.

  2. Pre-storage reasoning for cross-session synthesis

    PREMem's Pre-Storage Memory Reasoning shifts complex cross-session synthesis to storage time, using clustering, a persistent memory pool, and LLM_reason to generate reasoning memory fragments (a minimal prompt sketch follows this list).

  3. Robust multi-benchmark validation and efficiency

    PREMem achieves up to a 71.4 LLM-as-a-judge score on LongMemEval and 67.7 on LoCoMo with a gpt-4.1 base, while maintaining strong performance under tight token budgets and with small models.
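As a rough illustration of the storage-time reasoning step, the sketch below shows how a reasoning memory could be generated for a cluster linked across sessions. The prompt wording and the build_reasoning_memory helper are hypothetical; the actual PREMem prompt enumerates the paper's five evolution patterns, which are not reproduced here, and call_llm is the same placeholder used in the earlier sketches.

```python
# Hypothetical sketch of generating a reasoning memory with LLM_reason for a
# cluster linked across sessions. The prompt is illustrative only; the real
# prompt would enumerate the paper's five evolution patterns explicitly.
REASON_PROMPT = """Earlier memories about this topic:
{old}

New memories from the latest session:
{new}

Describe how the new memories evolve the earlier ones (for example, extending,
revising, or contradicting them) and write one short reasoning memory that
synthesizes the cross-session change, keeping explicit temporal context."""


def build_reasoning_memory(old_members, new_members):
    """Produce a reasoning fragment for storage R from a linked cluster."""
    prompt = REASON_PROMPT.format(old="\n".join(old_members), new="\n".join(new_members))
    return call_llm(prompt)  # LLM_reason; placeholder backend
```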

RESULTS

By the Numbers

  • LLM-as-a-judge: 71.4 (+15.5 over HippoRAG-2 with gpt-4.1 base on LongMemEval)

  • ROUGE-1: 44.6 (+13.3 over HippoRAG-2 with gpt-4.1 base on LongMemEval)

  • LLM-as-a-judge: 67.7 (+10.4 over HippoRAG-2 with gpt-4.1 base on LoCoMo)

  • LLM-as-a-judge: 58.7 (+13.5 over Turn with gpt-4.1 on LongMemEval, with PREMem running gpt-4.1-nano)

On LongMemEval and LoCoMo, which stress long-term personalized QA with temporal and adversarial queries, PREMem's gains show that pre-storage reasoning can close or even reverse gaps between small and large models.


BENCHMARK

Performance comparison across different model sizes and memory frameworks

LLM-as-a-judge on LongMemEval with gpt-4.1 base as LLM_response.

BENCHMARK

Small models with PREMem vs. larger models with baselines

LLM-as-a-judge on LongMemEval, comparing small models running PREMem against larger models running baseline frameworks.

KEY INSIGHT

The Counterintuitive Finding

PREMem with gpt-4.1-nano reaches a 58.7 LLM-as-a-judge score on LongMemEval, beating the Turn baseline running full gpt-4.1 (40.7) by 18.0 points.

This is surprising because it overturns the usual assumption that only larger inference-time models can handle complex long-term reasoning, showing that storage-time reasoning can compensate for smaller model capacity.

WHY IT MATTERS

What this unlocks for the field

PREMem unlocks long-horizon, cross-session personalization in which temporal reasoning and preference evolution are encoded directly into memory rather than recomputed each turn.

Builders can now deploy smaller, cheaper models that still handle complex multi-session QA by investing compute in offline PREMem memory construction instead of scaling inference-time architectures.


