A-MBER: Affective Memory Benchmark for Emotion Recognition

Authors: Deliang Wen, Ke Sun, Yu Wang

2026

TL;DR

A-MBER uses a staged benchmark-construction pipeline with anchor-turn units to test affective memory, showing structured memory systems reach 0.69 judgment accuracy vs 0.34 with no memory.

THE PROBLEM

Emotion benchmarks ignore history-grounded affective memory

Existing resources separate emotion datasets from long-term memory benchmarks, leaving “limited support” for evaluating history-grounded affective interpretation.

In realistic tutoring or counseling, assistants misread present affect when they ignore earlier triggers, failed support attempts, and longer emotional trajectories across sessions.

HOW IT WORKS

A-MBER — staged construction for affective memory

A-MBER uses persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging to create anchor-turn centered evaluation items.

Think of A-MBER like a carefully scripted TV season: writers plan the emotional arc first, then shoot scenes, then annotate and cut them into precise test clips.

This staged pipeline lets A-MBER probe history-grounded affective reasoning that a plain context window cannot, by tying judgments, retrieval, and explanations to gold evidence at specific anchor turns.
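Concretely, an anchor-turn benchmark unit can be pictured as a small record tying the anchor utterance to its multi-session history and gold evidence. The sketch below is illustrative only: `BenchmarkUnit`, its field names, and the example content are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical anchor-turn benchmark unit; field names are illustrative
# assumptions, not taken from the A-MBER paper.
@dataclass
class BenchmarkUnit:
    anchor_turn: str          # the present-moment utterance to interpret
    history: list             # prior multi-session dialogue turns
    gold_evidence: list       # indices into history that ground the label
    judgment_label: str       # target affective interpretation
    explanation: str          # gold rationale tied to the evidence

    def evidence_turns(self):
        """Return the history turns a retrieval system should surface."""
        return [self.history[i] for i in self.gold_evidence]

unit = BenchmarkUnit(
    anchor_turn="I guess the exam went fine.",
    history=["I failed the mock test.",
             "I'm studying more this week.",
             "Feeling ready now."],
    gold_evidence=[0],
    judgment_label="masked anxiety",
    explanation="The earlier failure suggests the flat tone hides worry.",
)
print(unit.evidence_turns())  # → ['I failed the mock test.']
```

Packaging each item this way is what lets judgment, retrieval, and explanation tasks share one grounded unit of evaluation.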

DIAGRAM

Anchor-turn interaction and evidence flow

This diagram shows how A-MBER routes an anchor turn through judgment, retrieval, and explanation tasks grounded in multi-session history.

DIAGRAM

A-MBER evaluation conditions and memory levels

This diagram shows how A-MBER structures evaluation by context policy, memory dependency level, and robustness condition.
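The evaluation grid implied by these three axes can be sketched as a cross product. Only the axes themselves (context policy, memory dependency level, robustness condition) come from the text; the specific level names below are illustrative assumptions.

```python
from itertools import product

# Axis names follow the text; the individual values are assumed for
# illustration, not taken from the paper.
CONTEXT_POLICIES = ["local", "full_history"]
MEMORY_LEVELS = ["none", "within_session", "cross_session"]
ROBUSTNESS = ["clean", "distractor"]

conditions = [
    {"context": c, "memory": m, "robustness": r}
    for c, m, r in product(CONTEXT_POLICIES, MEMORY_LEVELS, ROBUSTNESS)
]
print(len(conditions))  # → 12
```

Crossing the axes explicitly makes it easy to report a score per cell rather than one aggregate number.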

PROCESS

How A-MBER Handles an Anchor Turn Evaluation

  1. Benchmark Target and Task Interface

     A-MBER defines an anchor turn within multi-session interaction and specifies judgment, retrieval, and explanation outputs tied to that moment.

  2. Scenario and Data Representation

     A-MBER instantiates a teacher or counselor with a student scenario, representing sessions as dialogue text plus structured delivery descriptions.

  3. Primary Construction Pipeline

     A-MBER runs the single-agent pipeline from persona specification and long-horizon planning through conversation generation, annotation, and question construction.

  4. Benchmark Tasks and Evaluation Design

     A-MBER packages benchmark units, assigns memory levels and reasoning structures, and evaluates systems under local and full-history context policies.
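The steps above can be sketched as a staged pipeline of stub functions. The stage names follow the text; every function body is a placeholder assumption, since the real construction logic lives in the paper.

```python
# Placeholder staged pipeline mirroring the construction steps in the text;
# each stage's real logic is in the paper, so the bodies here are stubs.
def specify_persona(scenario):            # persona specification
    return {"scenario": scenario, "persona": "counselor"}

def plan_long_horizon(state):             # long-horizon emotional-arc planning
    return {**state, "arc": ["trigger", "setback", "recovery"]}

def generate_conversations(state):        # multi-session dialogue generation
    return {**state, "sessions": [f"session about {e}" for e in state["arc"]]}

def annotate_and_build_questions(state):  # annotation + question construction
    return {**state, "questions": [f"Q on {s}" for s in state["sessions"]]}

def package_units(state):                 # benchmark-unit packaging
    return state["questions"]

STAGES = [specify_persona, plan_long_horizon, generate_conversations,
          annotate_and_build_questions, package_units]

result = "student exam stress"
for stage in STAGES:
    result = stage(result)
print(len(result))  # → 3
```

The point of the staged shape is that each intermediate representation (arc, sessions, questions) is explicit and inspectable before packaging.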

KEY CONTRIBUTIONS

Key Contributions

  • Affective memory benchmark target

    A-MBER formulates affective memory for emotion recognition as distinct from generic long-term factual memory and local emotion recognition, centered on anchor-turn interpretation.

  • Structured benchmark construction framework

    A-MBER introduces a staged pipeline with long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging using explicit intermediate representations.

  • Unified evaluation across memory levels

    A-MBER provides judgment, retrieval, and explanation tasks with memory levels, reasoning structures, and robustness conditions to study how memory supports affective interpretation over time.
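One way to picture unified evaluation across the three task types is a scorer per task. The metrics below (exact match for judgment, evidence-set F1 for retrieval, keyword coverage for explanation) are illustrative stand-ins, not the paper's actual definitions.

```python
# Hypothetical per-task scorers; the metric choices are assumptions made
# for illustration, not A-MBER's evaluation protocol.
def score_judgment(pred_label, gold_label):
    return float(pred_label == gold_label)

def score_retrieval(pred_ids, gold_ids):
    # Set-level F1 over retrieved evidence-turn indices.
    pred, gold = set(pred_ids), set(gold_ids)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def score_explanation(pred_text, gold_keywords):
    # Crude keyword coverage as a stand-in for an explanation metric.
    hits = sum(1 for k in gold_keywords if k in pred_text.lower())
    return hits / len(gold_keywords)

print(score_retrieval([0, 2], [0]))  # → 0.6666666666666666
```

Separate scorers per task make it possible to report the judgment/retrieval/explanation triple that the results section quotes.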

RESULTS

By the Numbers

Judgment

0.69

+0.35 over No-Memory Baseline

Retrieval

0.66

vs No-Memory Baseline at 0.29

Explanation

0.65

0.34 higher than No-Memory Baseline

Implicit affect subset

0.65

0.47 above No-Memory Baseline on implicit affect

On the A-MBER affective memory benchmark, the structured memory system achieves 0.69 judgment, 0.66 retrieval, and 0.65 explanation accuracy. These results show that A-MBER rewards history-grounded affective reasoning beyond local-context baselines.
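As a sanity check, the no-memory baselines implied by the stated scores and gains can be back-computed with simple arithmetic. The retrieval gain of 0.37 is inferred here from the stated 0.66 score versus the 0.29 baseline; everything else is quoted directly from this page.

```python
# Scores and gains as reported on this page; the retrieval gain (0.37)
# is inferred from the stated 0.66 vs. the 0.29 no-memory baseline.
scores = {"judgment": 0.69, "retrieval": 0.66,
          "explanation": 0.65, "implicit_affect": 0.65}
gains  = {"judgment": 0.35, "retrieval": 0.37,
          "explanation": 0.34, "implicit_affect": 0.47}

baselines = {k: round(scores[k] - gains[k], 2) for k in scores}
print(baselines)
# → {'judgment': 0.34, 'retrieval': 0.29, 'explanation': 0.31, 'implicit_affect': 0.18}
```

The implied judgment baseline of 0.34 matches the no-memory figure quoted in the TL;DR.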

BENCHMARK

Benchmark: Main comparison across system configurations on A-MBER

Average judgment accuracy on A-MBER core tasks.

KEY INSIGHT

The Counterintuitive Finding

Even with gold supporting evidence provided, A-MBER reports only 0.81 judgment accuracy, far from a trivial near-perfect ceiling.

This is surprising because many benchmarks become easy once evidence is given, but A-MBER still demands complex trajectory-based affective reasoning.

WHY IT MATTERS

What this unlocks for the field

A-MBER makes it possible to measure whether conversational agents use remembered interaction history to interpret present affect, not just recall facts.

With A-MBER-style anchor-turn units, builders can design memory systems that explicitly target long-range emotional trajectories, calibration under ambiguity, and grounded explanations.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu et al.

2026

AlpsBench combines Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over 2,500 WildChat dialogues with human-verified structured memories. AlpsBench shows, for example, that Gemini-3 Flash scores 51.67 on Task 1 Extraction while DeepSeek Reasoner reaches 0.9569 retrieval recall with 100 distractors on AlpsBench.
