Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

Authors: Benjamin Stern, Peter Nadel

2026

TL;DR

Drawing on Memory’s dual-trace encoding pairs fact traces with scene traces, lifting accuracy on LongMemEval-S from 53.5% to 73.7% (+20.2pp).



THE PROBLEM

Flat agent memories miss temporal structure and cross-session reasoning, leaving a +20.2pp accuracy gap

LLM agents typically store flat factual records, discarding when, where, and why information was learned, which cripples temporal reasoning and aggregation.

On LongMemEval-S, a strong fact-only baseline (C7-control) reaches only 53.5% accuracy, leaving complex temporal, update, and multi-session questions frequently unanswered or wrong.

HOW IT WORKS

Dual-trace memory encoding with evidence scoring and three-state retrieval

Drawing on Memory introduces dual-trace memory encoding, combining fact traces, scene traces, an evidence scoring gate, and a three-state retrieval protocol on top of Letta’s archival memory.

Think of fact traces as index cards and scene traces as richly illustrated postcards; the evidence scoring gate decides which postcards to keep, and the retrieval protocol knows when to trust them.

This dual-trace mechanism lets Drawing on Memory reconstruct temporal sequences, updates, and cross-session aggregates that a plain context window or flat vector store cannot support.
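As a concrete sketch (the class names and fields here are illustrative, not the paper’s actual schema), a dual-trace entry might pair a terse fact record with a richer scene record through a shared anchor:

```python
from dataclasses import dataclass

@dataclass
class FactTrace:
    """Terse, index-card-style factual record."""
    anchor: str      # shared id linking the pair
    fact: str        # e.g. "User adopted a cat named Miso"
    session_id: str

@dataclass
class SceneTrace:
    """Richer episodic record: when, where, and why the fact was learned."""
    anchor: str         # same id as the paired FactTrace
    narrative: str      # short scene description
    metadata_yaml: str  # e.g. "when: 2025-03-02\ntopic: pets"

def encode_pair(anchor, fact, narrative, metadata_yaml, session_id):
    """Create a linked fact/scene pair sharing one anchor."""
    return (FactTrace(anchor, fact, session_id),
            SceneTrace(anchor, narrative, metadata_yaml))
```

The shared anchor is what lets retrieval recover the episodic scene once the fact is found, which is the crux of the index-card/postcard analogy above.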

DIAGRAM

Three-state retrieval protocol for Drawing on Memory

This diagram shows how Drawing on Memory answers a recall question using its three-state retrieval protocol over fact and scene traces.

DIAGRAM

LongMemEval-S evaluation pipeline for Drawing on Memory

This diagram shows how Drawing on Memory ingests 4,575 sessions and is then evaluated on 100 LongMemEval-S questions, with GPT-4o as judge.

PROCESS

How Drawing on Memory Handles a LongMemEval-S Session

  1. Dual-trace memory encoding

    Drawing on Memory processes a conversation and creates paired fact traces and scene traces, tagging them with shared anchors and YAML metadata.

  2. Evidence scoring

Drawing on Memory scores each session’s relevance, specificity, and explicitness from 0–2, summing to a 0–6 total; totals of 0–2 route the session to DROP, while 3–6 trigger FULL encoding.

  3. Three-state retrieval protocol

    When queried, Drawing on Memory searches archival entries and classifies each question into State A, B, or C depending on fact and scene availability.

  4. LongMemEval-S recall evaluation

    Drawing on Memory answers 100 benchmark questions, with GPT-4o grading correctness across single-session, multi-session, update, temporal, and abstention categories.
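The evidence scoring gate in step 2 can be sketched as a small routing function. The dimension names, 0–2 scale, and DROP/FULL thresholds come from the article; the function itself is an illustrative sketch, not the paper’s implementation:

```python
def evidence_gate(relevance: int, specificity: int, explicitness: int) -> str:
    """Route a session based on its evidence score.

    Each dimension is scored 0-2, giving a 0-6 total.
    Totals of 0-2 are dropped; 3-6 get full dual-trace encoding.
    """
    for dim in (relevance, specificity, explicitness):
        if not 0 <= dim <= 2:
            raise ValueError("each dimension must be scored 0, 1, or 2")
    total = relevance + specificity + explicitness
    return "DROP" if total <= 2 else "FULL"
```

For example, a marginally relevant session scoring (1, 0, 1) totals 2 and is dropped, while (2, 1, 1) totals 4 and receives full encoding.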
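Step 3’s three-state classification can likewise be sketched. The article does not spell out which state is which, so the mapping below is an assumption: State A when both fact and scene traces are retrieved, State B when only fact traces are, State C when neither is (the natural abstention case):

```python
def classify_retrieval_state(fact_hits: list, scene_hits: list) -> str:
    """Assumed state mapping (not specified in the article):

    State A: fact and scene traces both retrieved -> answer with full episodic context
    State B: fact traces only -> answer from facts, without episodic detail
    State C: no usable traces -> abstain
    """
    if fact_hits and scene_hits:
        return "A"
    if fact_hits:
        return "B"
    return "C"
```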

KEY CONTRIBUTIONS

Key Contributions

  • Dual-trace memory encoding protocol

    Drawing on Memory defines a deployable dual-trace protocol with fact traces, scene traces, an evidence scoring gate, and a three-state retrieval protocol for Letta’s archival memory.

  • LongMemEval-S experimental evidence

    Drawing on Memory achieves 73.7% accuracy versus 53.5% for C7-control on 99 shared LongMemEval-S questions, a +20.2pp gain with p < 0.0001.

  • Coding agent adaptation design

    Drawing on Memory sketches a dual-trace adaptation for Letta Code, targeting debugging incidents, design decisions, and learning progressions with preliminary pilot validation.
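The reported p < 0.0001 on 99 paired questions is the kind of result a paired comparison such as an exact McNemar test produces. As a minimal sketch (the article does not name its test, and the discordant counts in the usage note are hypothetical, chosen only to illustrate the computation):

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant question pairs.

    b = questions only the first system answered correctly,
    c = questions only the second system answered correctly.
    Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    # one-sided tail probability of a split at least this lopsided
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For instance, if one system won 20 discordant questions and the other none, the two-sided p-value would fall well below 0.0001, while an even 5–5 split would yield p = 1.0.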

RESULTS

By the Numbers

  • Overall accuracy: 73.7% (+20.2pp over C7-control)

  • Temporal reasoning: 65% (+40pp over C7-control)

  • Multi-session aggregation: 50% (+30pp over C7-control)

  • Knowledge-update tracking: 80% (+25pp over C7-control)

On LongMemEval-S, which spans 4,575 sessions and 100 recall questions, Drawing on Memory shows that dual-trace encoding materially improves cross-session temporal, aggregation, and update reasoning over a fact-only baseline. The +20.2 percentage point overall gain demonstrates that richer encoding, not just more storage, changes long-term recall behavior.


BENCHMARK

Accuracy on LongMemEval-S by condition

Overall accuracy (%) on LongMemEval-S across memory conditions.

KEY INSIGHT

The Counterintuitive Finding

Drawing on Memory shows zero gain on single-session fact retrieval, with both C6-draw and C7-control at 75% accuracy despite dual-trace encoding.

This is surprising because many assume richer memories help everywhere, but Drawing on Memory reveals that scene traces matter only when temporal or cross-session context is required.

WHY IT MATTERS

What this unlocks for the field

Drawing on Memory unlocks practical, token-efficient episodic memory for agents, enabling temporal reasoning, knowledge-update tracking, and multi-session aggregation at scale.

Builders can now design agents that remember evolving user histories and project narratives across thousands of sessions, without changing models or paying extra token costs.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
