Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

Authors: Chris Latimer, Nicoló Boschi, Andrew Neeser, et al.

2025

TL;DR

HINDSIGHT combines TEMPR’s temporal entity memory graph with CARA’s preference-shaped reflection to reach 91.4% on LongMemEval, +31.2 points over full-context GPT-4o.


THE PROBLEM

Agents Blur Evidence and Belief in Long Conversations (39.0% vs 83.6%)

Existing long-horizon agents using full-context OSS-20B achieve only 39.0% overall accuracy on LongMemEval, despite seeing the entire conversation history.

In practice, stateless RAG-style systems lose epistemic clarity, mishandle temporal reasoning, and break preference consistency across hundreds of thousands of tokens.

HOW IT WORKS

HINDSIGHT — Four Networks, TEMPR, and CARA

HINDSIGHT’s core mechanism combines TEMPR for retain and recall with CARA for reflect over four networks: world, experience, opinion, and observation.

You can think of HINDSIGHT like a database-backed brain: TEMPR is the hippocampus storing a temporal entity graph, while CARA is the prefrontal cortex applying a configurable personality.

This explicit separation of world facts, experiences, observations, and opinions lets HINDSIGHT track beliefs with confidence scores and behavioral profiles, far beyond what a plain context window can support.

DIAGRAM

Retain, Recall, Reflect Flow in HINDSIGHT

This diagram shows how HINDSIGHT processes data through TEMPR’s retain and recall pipelines and CARA’s reflect loop for a single query.

DIAGRAM

LongMemEval and LoCoMo Evaluation Setup

This diagram shows how HINDSIGHT is evaluated on LongMemEval and LoCoMo with different backbone models and judge configurations.

PROCESS

How HINDSIGHT Handles a LongMemEval Question

  1. Retain with TEMPR

     TEMPR ingests conversational transcripts, performs LLM-based narrative fact extraction, and organizes facts into the world, experience, opinion, and observation networks.

  2. Build the Temporal Entity Memory Graph

     TEMPR constructs an entity-aware memory graph with temporal, semantic, entity, and causal links connecting memory units across the four networks.

  3. Recall with Multi-Strategy Retrieval

     TEMPR runs semantic, BM25, graph, and temporal retrieval in parallel, fuses the results with Reciprocal Rank Fusion, reranks them with a cross-encoder, and respects a token budget.

  4. Reflect with CARA

     CARA combines retrieved memories with a behavioral profile, generates a preference-conditioned response, and updates the opinion network via opinion formation and reinforcement.
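The recall step above can be sketched in a few lines. Reciprocal Rank Fusion is a standard formula (score(d) = Σ 1/(k + rank_i(d))); the four input lists stand in for the semantic, BM25, graph, and temporal retrievers, and k=60 is the conventional RRF constant, not necessarily the paper's value. The cross-encoder rerank is omitted here.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion:
    score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def recall(ranked_lists, token_counts, token_budget):
    """Take fused results in order, skipping any doc that would
    push total tokens past the budget."""
    selected, used = [], 0
    for doc_id in rrf_fuse(ranked_lists):
        cost = token_counts[doc_id]
        if used + cost > token_budget:
            continue
        selected.append(doc_id)
        used += cost
    return selected
```

RRF needs only ranks, not scores, which is why it fuses heterogeneous retrievers (dense vectors, BM25, graph walks, time filters) without any score calibration.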

KEY CONTRIBUTIONS

Key Contributions

  • Unified Memory Architecture for Agents

    HINDSIGHT introduces a four-network memory bank separating world, experience, observation, and opinion facts, enabling epistemic clarity and traceable belief updates across long horizons.

  • Retain, Recall, and Reflect Layers

    HINDSIGHT implements retain and recall via TEMPR's temporal entity memory graph and reflect via CARA's preference-aware reasoning, turning conversational streams into a structured, queryable memory bank.

  • Empirical Evaluation on Long-Horizon Benchmarks

    HINDSIGHT reaches 83.6% on LongMemEval with OSS-20B and 89.61% on LoCoMo with Gemini 3, lifting the OSS-20B full-context LongMemEval baseline from 39.0% to 83.6%.
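The reflect layer's opinion formation and reinforcement can be sketched as a simple confidence update. This is our illustration only: the rule (nudge confidence toward 1.0 on agreement and toward 0.0 on contradiction, with a learning rate `rate`, forming new opinions near neutral) is an assumed stand-in, not the paper's actual formula.

```python
def reinforce(confidence: float, agrees: bool, rate: float = 0.2) -> float:
    """Nudge an opinion's confidence toward supporting or refuting evidence."""
    target = 1.0 if agrees else 0.0
    return confidence + rate * (target - confidence)

def reflect(opinions: dict[str, float], evidence: dict[str, bool]) -> dict[str, float]:
    """Update held opinions against new evidence; form new ones near neutral."""
    updated = dict(opinions)
    for claim, agrees in evidence.items():
        if claim in updated:
            # reinforcement: existing opinion moves toward the evidence
            updated[claim] = reinforce(updated[claim], agrees)
        else:
            # opinion formation: new opinion starts just above or below neutral
            updated[claim] = 0.6 if agrees else 0.4
    return updated
```

Because each update is an explicit, bounded step, the resulting confidence trajectory is traceable: every belief change can be attributed to a specific piece of evidence.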

RESULTS

By the Numbers

Overall

83.6%

+44.6 over Full-context OSS-20B (39.0%)

multi-session

79.7%

+58.6 over Full-context OSS-20B (21.1%)

temporal-reasoning

79.7%

+48.1 over Full-context OSS-20B (31.6%)

LoCoMo Overall

89.61%

+13.83 over Memobase (75.78%)

On LongMemEval S, which tests information extraction, multi-session reasoning, temporal reasoning, knowledge update, and abstention, HINDSIGHT with OSS-20B lifts overall accuracy from 39.0% to 83.6%. On LoCoMo's multi-session human conversations, HINDSIGHT with Gemini 3 reaches 89.61% overall, surpassing the strongest prior open system, Memobase, at 75.78%.

BENCHMARK

LongMemEval S Overall Accuracy Comparison

Overall accuracy (%) on LongMemEval S across HINDSIGHT and baselines.

BENCHMARK

LoCoMo Overall Accuracy Comparison

Overall accuracy (%) on LoCoMo for HINDSIGHT and prior memory systems.

KEY INSIGHT

The Counterintuitive Finding

HINDSIGHT with an open-source 20B model reaches 83.6% on LongMemEval, beating full-context GPT-4o (60.2%) by 23.4 points.

This is surprising because a much smaller open backbone, when paired with structured memory and reflection, surpasses a frontier model that sees the entire context window.

WHY IT MATTERS

What this unlocks for the field

HINDSIGHT unlocks long-lived agents that track world facts, experiences, and evolving opinions with explicit confidence scores and behavioral profiles.

Builders can now deploy open-source agents that maintain consistent preferences, explain their reasoning, and scale to million-token histories without relying solely on massive frontier context windows.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.

Agent Memory

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun et al.

· 2026

ActMem transforms dialogue history into atomic facts via Memory Fact Extraction, groups them with Fact Clustering, links them through a Memory KG Construction module, and uses Counterfactual-based Retrieval and Reasoning for action-aware answers. On ActMemEval, ActMem reaches 76.52% QA accuracy with DeepSeek-V3, beating LightMem’s 63.97% by 12.55 points and NaiveRAG’s 61.54%.
