Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Authors: Benjamin Stern, Peter Nadel

2026

TL;DR

Drawing on Memory’s dual-trace encoding pairs fact traces with scene traces, lifting accuracy on LongMemEval-S from 53.5% to 73.7% (+20.2pp).



THE PROBLEM

Flat agent memories miss temporal structure and cross-session reasoning, leaving a +20.2pp accuracy gap

LLM agents typically store flat factual records, discarding when, where, and why information was learned, which cripples temporal reasoning and aggregation.

On LongMemEval-S, a strong fact-only baseline (C7-control) reaches only 53.5% accuracy, leaving complex temporal, update, and multi-session questions frequently unanswered or wrong.

HOW IT WORKS

Dual-trace memory encoding with evidence scoring and three-state retrieval

Drawing on Memory introduces dual-trace memory encoding, combining fact traces, scene traces, an evidence scoring gate, and a three-state retrieval protocol on top of Letta’s archival memory.

Think of fact traces as index cards and scene traces as richly illustrated postcards; the evidence scoring gate decides which postcards to keep, and the retrieval protocol knows when to trust them.

This dual-trace mechanism lets Drawing on Memory reconstruct temporal sequences, updates, and cross-session aggregates that a plain context window or flat vector store cannot support.
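As a concrete sketch (the class names and fields here are illustrative, not the paper’s actual schema), a dual-trace entry might pair a terse fact record with a richer scene record through a shared anchor:

```python
from dataclasses import dataclass

@dataclass
class FactTrace:
    """Terse, index-card-style factual record."""
    anchor: str      # shared id linking the pair
    fact: str        # e.g. "User adopted a cat named Miso"
    session_id: str

@dataclass
class SceneTrace:
    """Richer episodic record: when, where, and why the fact was learned."""
    anchor: str         # same id as the paired FactTrace
    narrative: str      # short scene description
    metadata_yaml: str  # e.g. "when: 2025-03-02\ntopic: pets"

def encode_pair(anchor, fact, narrative, metadata_yaml, session_id):
    """Create a linked fact/scene pair sharing one anchor."""
    return (FactTrace(anchor, fact, session_id),
            SceneTrace(anchor, narrative, metadata_yaml))
```

The shared anchor is what lets retrieval recover the episodic scene once the fact is found, which is the crux of the index-card/postcard analogy above.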

DIAGRAM

Three-state retrieval protocol for Drawing on Memory

This diagram shows how Drawing on Memory answers a recall question using its three-state retrieval protocol over fact and scene traces.

DIAGRAM

LongMemEval-S evaluation pipeline for Drawing on Memory

This diagram shows how Drawing on Memory ingests 4,575 sessions and is then evaluated on 100 LongMemEval-S questions, with GPT-4o as judge.

PROCESS

How Drawing on Memory Handles a LongMemEval-S Session

  1. Dual-trace memory encoding

    Drawing on Memory processes a conversation and creates paired fact traces and scene traces, tagging them with shared anchors and YAML metadata.

  2. Evidence scoring

Drawing on Memory scores each session’s relevance, specificity, and explicitness from 0–2, summing to a 0–6 total; totals of 0–2 route the session to DROP, while 3–6 trigger FULL encoding.

  3. Three-state retrieval protocol

    When queried, Drawing on Memory searches archival entries and classifies each question into State A, B, or C depending on fact and scene availability.

  4. LongMemEval-S recall evaluation

    Drawing on Memory answers 100 benchmark questions, with GPT-4o grading correctness across single-session, multi-session, update, temporal, and abstention categories.
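The evidence scoring gate in step 2 can be sketched as a small routing function. The dimension names, 0–2 scale, and DROP/FULL thresholds come from the article; the function itself is an illustrative sketch, not the paper’s implementation:

```python
def evidence_gate(relevance: int, specificity: int, explicitness: int) -> str:
    """Route a session based on its evidence score.

    Each dimension is scored 0-2, giving a 0-6 total.
    Totals of 0-2 are dropped; 3-6 get full dual-trace encoding.
    """
    for dim in (relevance, specificity, explicitness):
        if not 0 <= dim <= 2:
            raise ValueError("each dimension must be scored 0, 1, or 2")
    total = relevance + specificity + explicitness
    return "DROP" if total <= 2 else "FULL"
```

For example, a marginally relevant session scoring (1, 0, 1) totals 2 and is dropped, while (2, 1, 1) totals 4 and receives full encoding.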
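Step 3’s three-state classification can likewise be sketched. The article does not spell out which state is which, so the mapping below is an assumption: State A when both fact and scene traces are retrieved, State B when only fact traces are, State C when neither is (the natural abstention case):

```python
def classify_retrieval_state(fact_hits: list, scene_hits: list) -> str:
    """Assumed state mapping (not specified in the article):

    State A: fact and scene traces both retrieved -> answer with full episodic context
    State B: fact traces only -> answer from facts, without episodic detail
    State C: no usable traces -> abstain
    """
    if fact_hits and scene_hits:
        return "A"
    if fact_hits:
        return "B"
    return "C"
```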

KEY CONTRIBUTIONS

Key Contributions

  • Dual-trace memory encoding protocol

    Drawing on Memory defines a deployable dual-trace protocol with fact traces, scene traces, an evidence scoring gate, and a three-state retrieval protocol for Letta’s archival memory.

  • LongMemEval-S experimental evidence

    Drawing on Memory achieves 73.7% accuracy versus 53.5% for C7-control on 99 shared LongMemEval-S questions, a +20.2pp gain with p < 0.0001.

  • Coding agent adaptation design

    Drawing on Memory sketches a dual-trace adaptation for Letta Code, targeting debugging incidents, design decisions, and learning progressions with preliminary pilot validation.
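The reported p < 0.0001 on 99 paired questions is the kind of result a paired comparison such as an exact McNemar test produces. As a minimal sketch (the article does not name its test, and the discordant counts in the usage note are hypothetical, chosen only to illustrate the computation):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant question pairs.

    b = questions only the first system answered correctly,
    c = questions only the second system answered correctly.
    Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # one-sided tail probability of a split at least this lopsided
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, if one system won 20 discordant questions and the other none, the two-sided p-value would fall well below 0.0001, while an even 5–5 split would yield p = 1.0.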

RESULTS

By the Numbers

  • Overall accuracy: 73.7% (+20.2pp over C7-control)

  • Temporal reasoning: 65% (+40pp over C7-control)

  • Multi-session aggregation: 50% (+30pp over C7-control)

  • Knowledge-update tracking: 80% (+25pp over C7-control)

On LongMemEval-S, which spans 4,575 sessions and 100 recall questions, Drawing on Memory shows that dual-trace encoding materially improves cross-session temporal, aggregation, and update reasoning over a fact-only baseline. The +20.2 percentage point overall gain demonstrates that richer encoding, not just more storage, changes long-term recall behavior.


BENCHMARK

Accuracy on LongMemEval-S by condition

Overall accuracy (%) on LongMemEval-S across memory conditions.

KEY INSIGHT

The Counterintuitive Finding

Drawing on Memory shows zero gain on single-session fact retrieval, with both C6-draw and C7-control at 75% accuracy despite dual-trace encoding.

This is surprising because many assume richer memories help everywhere, but Drawing on Memory reveals that scene traces matter only when temporal or cross-session context is required.

WHY IT MATTERS

What this unlocks for the field

Drawing on Memory unlocks practical, token-efficient episodic memory for agents, enabling temporal reasoning, knowledge-update tracking, and multi-session aggregation at scale.

Builders can now design agents that remember evolving user histories and project narratives across thousands of sessions, without changing models or paying extra token costs.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
