Learning from Supervision with Semantic and Episodic Memory: A Reflective Approach to Agent Adaptation

Authors: Jackson Hassell, Dan Zhang, Hannah Kim, et al.

2025

TL;DR

Learning from Supervision with Semantic and Episodic Memory uses label-grounded critiques, stored as episodic and semantic memory, to boost accuracy by up to 24.8 percentage points over EP_LABEL on Multi-Condition Ranking.



THE PROBLEM

Memory agents need supervision without fine-tuning: critiques add up to 24.8 points of accuracy

Learning from Supervision with Semantic and Episodic Memory targets settings where fine-tuning is costly and opaque and label-only RAG baselines cap performance, with critiques yielding up to a 24.8-point accuracy improvement.

On tasks like Multi-Condition Ranking and PubMed, EP_LABEL few-shot retrieval saturates, leaving agents brittle and unable to generalize supervision, limiting continual adaptation.

HOW IT WORKS

Learning from Supervision with Semantic and Episodic Memory — episodic critiques plus semantic summaries

Learning from Supervision with Semantic and Episodic Memory uses a performance agent, critic agent, episodic memory, semantic memory, and memory retriever to store label-grounded critiques as reusable supervision.

You can think of episodic memory as fast RAM holding specific question-answer-critique triples, while semantic memory acts like a distilled card catalog of global reflections and task rules.

This design lets Learning from Supervision with Semantic and Episodic Memory condition on structured critiques and distilled reflections, enabling behaviors that a plain context window with raw labels or few shot examples cannot support.
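The two stores can be sketched as simple containers. This is a minimal illustration of the episodic/semantic split described above, not the paper's implementation; all names here are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodicEntry:
    # One labeled example with its critique: a question-answer-critique triple.
    question: str
    answer: str
    critique: str

@dataclass
class Memory:
    # Episodic memory: instance-level entries, retrieved per query.
    episodic: list[EpisodicEntry] = field(default_factory=list)
    # Semantic memory: distilled, task-level reflections that apply globally.
    semantic: list[str] = field(default_factory=list)

mem = Memory()
mem.episodic.append(
    EpisodicEntry("Which item meets all three conditions?", "Item B",
                  "Item B is correct because it alone satisfies every condition.")
)
mem.semantic.append("Verify every stated condition before ranking candidates.")
print(len(mem.episodic), len(mem.semantic))  # 1 1
```

Episodic entries are looked up per question by the memory retriever, while the semantic list is small enough to prepend wholesale to the prompt.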

DIAGRAM

Label-driven critique generation and storage pipeline

This diagram shows how Learning from Supervision with Semantic and Episodic Memory generates assertion-rationale-reflection critiques from labeled data and stores them in episodic and semantic memory.

DIAGRAM

Evaluation pipeline across fact and preference datasets

This diagram shows how Learning from Supervision with Semantic and Episodic Memory evaluates zero_shot, EP_LABEL, EP_CRIT, SEM_CRIT, and EP+SEM_CRIT across fact-oriented and preference-based datasets.

PROCESS

How Learning from Supervision with Semantic and Episodic Memory Handles a Question

  1. Learning from Supervised Signals

    Learning from Supervision with Semantic and Episodic Memory uses the performance agent and critic agent on D_init_train to generate assertion-rationale-reflection critiques for each labeled question-answer pair.

  2. What to Remember

    Learning from Supervision with Semantic and Episodic Memory structures each critique into assertion, rationale, and reflection fields, explicitly restating the correct answer to reduce confirmation bias before storing anything.

  3. Incorporating Critiques into Memory

    Learning from Supervision with Semantic and Episodic Memory writes instance-level critiques into episodic memory and summarizes reflections into semantic memory, creating the EP_CRIT and SEM_CRIT representations.

  4. Combining Semantic and Episodic Memory

    At inference, Learning from Supervision with Semantic and Episodic Memory retrieves the top K=5 episodic entries via the memory retriever, concatenates semantic advice, and conditions the performance agent under EP+SEM_CRIT.
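The steps above can be sketched as a single inference routine. The critique fields follow the paper's assertion/rationale/reflection structure, but the retriever interface, prompt layout, and helper names are illustrative assumptions, not the authors' exact implementation.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    # Structured critique fields; the assertion restates the correct labeled
    # answer explicitly to reduce confirmation bias.
    assertion: str   # restates the correct answer
    rationale: str   # why that answer is correct
    reflection: str  # generalizable lesson for future questions

def answer_with_memory(question, episodic, semantic, retriever, llm, k=5):
    """EP+SEM_CRIT inference sketch: retrieve top-k episodic critiques,
    prepend distilled semantic advice, condition the frozen performance agent."""
    # 1. Retrieve the k entries most relevant to the new question.
    examples = retriever(question, episodic, k=k)
    # 2. Concatenate semantic (task-level) advice.
    advice = "\n".join(semantic)
    # 3. Build the prompt: advice, retrieved critiques, then the new question.
    prompt = (
        f"Task advice:\n{advice}\n\n"
        + "\n\n".join(
            f"Q: {q}\nA: {a}\nCritique: {c.assertion} {c.rationale} {c.reflection}"
            for q, a, c in examples
        )
        + f"\n\nQ: {question}\nA:"
    )
    return llm(prompt)

# Minimal demo with stub retriever and LLM (the real system uses embedding
# similarity and a frozen model such as Mixtral 8x22B).
episodic = [("Q1?", "A1",
             Critique("The correct answer is A1.",
                      "It satisfies every condition.",
                      "Verify each condition before answering."))]
semantic = ["Always verify every stated condition."]
prompt = answer_with_memory("Q2?", episodic, semantic,
                            retriever=lambda q, mem, k=5: mem[:k],
                            llm=lambda p: p)
print("Task advice" in prompt and "Q2?" in prompt)  # True
```

The EP_CRIT and SEM_CRIT ablations fall out naturally: pass an empty `semantic` list for EP_CRIT, or an empty `episodic` list for SEM_CRIT.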

KEY CONTRIBUTIONS

Key Contributions

  • Memory-augmented framework with critiques

    Learning from Supervision with Semantic and Episodic Memory introduces a pipeline of performance agent, critic agent, episodic memory, semantic memory, and memory retriever that learns classification functions without parameter updates, achieving up to a 24.8-point accuracy gain over EP_LABEL.

  • Semantic and episodic critique strategies

    Learning from Supervision with Semantic and Episodic Memory defines the EP_CRIT, SEM_CRIT, and EP+SEM_CRIT strategies, which store assertion-rationale-reflection critiques as either instance-level memories or distilled task-level advice.

  • Suggestibility metric

    Learning from Supervision with Semantic and Episodic Memory proposes suggestibility S, measuring how performance agents change predictions when given best-effort versus flipped-label critiques, revealing task-dependent receptivity patterns.
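One plausible reading of suggestibility is the rate at which the agent's prediction follows the label asserted by the critique, compared across best-effort and deliberately flipped critiques. The sketch below uses that interpretation; it is an assumption, not the paper's exact formula.

```python
def suggestibility(preds_with_critique, critique_labels):
    """Fraction (as a percent) of predictions that follow the label asserted
    by the accompanying critique. Comparing this rate under best-effort vs
    flipped-label critiques shows how receptive the agent is to supervision:
    100% means the agent always follows the critique, even a wrong one."""
    follow = sum(p == y for p, y in zip(preds_with_critique, critique_labels))
    return 100.0 * follow / len(critique_labels)

# Example: the agent follows a flipped-label critique on 3 of 4 items.
print(suggestibility(["B", "B", "A", "B"], ["B", "B", "B", "B"]))  # 75.0
```

Under this reading, the 100.0% Steam Pref figure for gpt-4o-mini below means the agent deferred to the critique on every preference item.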

RESULTS

By the Numbers

Accuracy: Multi-Condition Ranking

85.6%

+24.8 points over EP_LABEL's 60.8% (Mixtral 8x22B, EP+SEM_CRIT vs EP_LABEL)

Accuracy: NFCorpus

90.8%

vs EP_LABEL 70.0% (Mixtral 8x22B, EP+SEM_CRIT with gpt-4o-mini critic)

Accuracy: Steam Pref

64.7%

EP+SEM_CRIT with o4-mini critic vs EP_LABEL 61.3% for Llama 4 Scout

Suggestibility: Steam Pref

100.0%

gpt-4o-mini in the XY condition always follows supervision on preference data

Across Multi-Condition Ranking, NFCorpus, PubMed, and four preference datasets, Learning from Supervision with Semantic and Episodic Memory tests EP_LABEL, EP_CRIT, SEM_CRIT, and EP+SEM_CRIT. The 24.8-percentage-point gain on Multi-Condition Ranking with Mixtral 8x22B shows that critique-based episodic and semantic memory can substantially improve over label-only RAG baselines.


BENCHMARK

Performance agent accuracy on Multi-Condition Ranking for Mixtral 8x22B

Accuracy on Multi-Condition Ranking comparing zero_shot, EP_LABEL, EP_CRIT, SEM_CRIT, and EP+SEM_CRIT for Mixtral 8x22B.

KEY INSIGHT

The Counterintuitive Finding

On preference datasets with Mixtral 8x22B and Llama 3.1 8B, EP_LABEL often beats EP_CRIT, even though critiques improve fact-oriented tasks by up to 14.4%.

This is surprising because the framework's premise is that critiques should help across the board, yet some smaller open-source models are less suggestible to critiques than to direct labels.

WHY IT MATTERS

What this unlocks for the field

Learning from Supervision with Semantic and Episodic Memory enables agents to turn labeled supervision plus critiques into reusable episodic and semantic memories without any parameter updates.

Builders can now deploy frozen LLM agents that continuously adapt via structured critiques, tuning behavior per task or user without retraining or storing multiple fine-tuned checkpoints.


Related papers

RAG

A Dynamic Retrieval-Augmented Generation System with Selective Memory and Remembrance

Okan Bursa

· 2026

Adaptive RAG Memory (ARM) augments a standard retriever–generator stack with a Dynamic Embedding Layer and Remembrance Engine that track usage statistics and apply selective remembrance and decay to embeddings. On a lightweight retrieval benchmark, ARM achieves NDCG@5 ≈ 0.9401 and Recall@5 = 1.000 with 22M parameters, matching larger baselines like gte-small while providing the best efficiency among ultra-efficient models.

RAGLong-Term Memory

HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong, Yunfan Gao, Haofen Wang

· 2026

HingeMem combines Boundary Guided Long-Term Memory, Dialogue Boundary Extraction, Memory Construction, Query Adaptive Retrieval, Hyperedge Rerank, and Adaptive Stop to segment dialogues into element-indexed hyperedges and plan query-specific retrieval. On LOCOMO, HingeMem achieves 63.9 overall F1 and 75.1 LLM-as-a-Judge score, surpassing the best baseline Zep (56.9 F1) by 7.0 F1 without using category-specific QA formats.
