A-MBER: Affective Memory Benchmark for Emotion Recognition

Authors: Deliang Wen, Ke Sun, Yu Wang

2026

TL;DR

A-MBER uses a staged benchmark-construction pipeline with anchor-turn units to test affective memory, showing structured memory systems reach 0.69 judgment accuracy vs 0.34 with no memory.

THE PROBLEM

Emotion benchmarks ignore history-grounded affective memory

Existing resources separate emotion datasets from long-term memory benchmarks, leaving “limited support” for evaluating history-grounded affective interpretation.

In realistic tutoring or counseling, assistants misread present affect when they ignore earlier triggers, failed support attempts, and longer emotional trajectories across sessions.

HOW IT WORKS

A-MBER — staged construction for affective memory

A-MBER uses persona specification, long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging to create anchor-turn centered evaluation items.

Think of A-MBER like a carefully scripted TV season: writers plan the emotional arc first, then shoot scenes, then annotate and cut them into precise test clips.

This staged pipeline lets A-MBER probe history-grounded affective reasoning that a plain context window cannot, by tying judgments, retrieval, and explanations to gold evidence at specific anchor turns.
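Concretely, an anchor-turn benchmark unit can be pictured as a small record tying the anchor utterance to its multi-session history and gold evidence. The sketch below is illustrative only: `BenchmarkUnit`, its field names, and the example content are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

# Hypothetical anchor-turn benchmark unit; field names are illustrative
# assumptions, not taken from the A-MBER paper.
@dataclass
class BenchmarkUnit:
    anchor_turn: str          # the present-moment utterance to interpret
    history: list             # prior multi-session dialogue turns
    gold_evidence: list       # indices into history that ground the label
    judgment_label: str       # target affective interpretation
    explanation: str          # gold rationale tied to the evidence

    def evidence_turns(self):
        """Return the history turns a retrieval system should surface."""
        return [self.history[i] for i in self.gold_evidence]

unit = BenchmarkUnit(
    anchor_turn="I guess the exam went fine.",
    history=["I failed the mock test.",
             "I'm studying more this week.",
             "Feeling ready now."],
    gold_evidence=[0],
    judgment_label="masked anxiety",
    explanation="The earlier failure suggests the flat tone hides worry.",
)
print(unit.evidence_turns())  # → ['I failed the mock test.']
```

Packaging each item this way is what lets judgment, retrieval, and explanation tasks share one grounded unit of evaluation.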

DIAGRAM

Anchor-turn interaction and evidence flow

This diagram shows how A-MBER routes an anchor turn through judgment, retrieval, and explanation tasks grounded in multi-session history.

DIAGRAM

A-MBER evaluation conditions and memory levels

This diagram shows how A-MBER structures evaluation by context policy, memory dependency level, and robustness condition.
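The evaluation grid implied by these three axes can be sketched as a cross product. Only the axes themselves (context policy, memory dependency level, robustness condition) come from the text; the specific level names below are illustrative assumptions.

```python
from itertools import product

# Axis names follow the text; the individual values are assumed for
# illustration, not taken from the paper.
CONTEXT_POLICIES = ["local", "full_history"]
MEMORY_LEVELS = ["none", "within_session", "cross_session"]
ROBUSTNESS = ["clean", "distractor"]

conditions = [
    {"context": c, "memory": m, "robustness": r}
    for c, m, r in product(CONTEXT_POLICIES, MEMORY_LEVELS, ROBUSTNESS)
]
print(len(conditions))  # → 12
```

Crossing the axes explicitly makes it easy to report a score per cell rather than one aggregate number.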

PROCESS

How A-MBER Handles an Anchor Turn Evaluation

  1. Benchmark Target and Task Interface

     A-MBER defines an anchor turn within multi-session interaction and specifies judgment, retrieval, and explanation outputs tied to that moment.

  2. Scenario and Data Representation

     A-MBER instantiates a teacher or counselor with a student scenario, representing sessions as dialogue text plus structured delivery descriptions.

  3. Primary Construction Pipeline

     A-MBER runs the single-agent pipeline from persona specification and long-horizon planning through conversation generation, annotation, and question construction.

  4. Benchmark Tasks and Evaluation Design

     A-MBER packages benchmark units, assigns memory levels and reasoning structures, and evaluates systems under local and full-history context policies.
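The steps above can be sketched as a staged pipeline of stub functions. The stage names follow the text; every function body is a placeholder assumption, since the real construction logic lives in the paper.

```python
# Placeholder staged pipeline mirroring the construction steps in the text;
# each stage's real logic is in the paper, so the bodies here are stubs.
def specify_persona(scenario):            # persona specification
    return {"scenario": scenario, "persona": "counselor"}

def plan_long_horizon(state):             # long-horizon emotional-arc planning
    return {**state, "arc": ["trigger", "setback", "recovery"]}

def generate_conversations(state):        # multi-session dialogue generation
    return {**state, "sessions": [f"session about {e}" for e in state["arc"]]}

def annotate_and_build_questions(state):  # annotation + question construction
    return {**state, "questions": [f"Q on {s}" for s in state["sessions"]]}

def package_units(state):                 # benchmark-unit packaging
    return state["questions"]

STAGES = [specify_persona, plan_long_horizon, generate_conversations,
          annotate_and_build_questions, package_units]

result = "student exam stress"
for stage in STAGES:
    result = stage(result)
print(len(result))  # → 3
```

The point of the staged shape is that each intermediate representation (arc, sessions, questions) is explicit and inspectable before packaging.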

KEY CONTRIBUTIONS

Key Contributions

  • Affective memory benchmark target

    A-MBER formulates affective memory for emotion recognition as distinct from generic long-term factual memory and local emotion recognition, centered on anchor-turn interpretation.

  • Structured benchmark construction framework

    A-MBER introduces a staged pipeline with long-horizon planning, conversation generation, annotation, question construction, and benchmark-unit packaging using explicit intermediate representations.

  • Unified evaluation across memory levels

    A-MBER provides judgment, retrieval, and explanation tasks with memory levels, reasoning structures, and robustness conditions to study how memory supports affective interpretation over time.
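One way to picture unified evaluation across the three task types is a scorer per task. The metrics below (exact match for judgment, evidence-set F1 for retrieval, keyword coverage for explanation) are illustrative stand-ins, not the paper's actual definitions.

```python
# Hypothetical per-task scorers; the metric choices are assumptions made
# for illustration, not A-MBER's evaluation protocol.
def score_judgment(pred_label, gold_label):
    return float(pred_label == gold_label)

def score_retrieval(pred_ids, gold_ids):
    # Set-level F1 over retrieved evidence-turn indices.
    pred, gold = set(pred_ids), set(gold_ids)
    if not pred or not gold:
        return 0.0
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def score_explanation(pred_text, gold_keywords):
    # Crude keyword coverage as a stand-in for an explanation metric.
    hits = sum(1 for k in gold_keywords if k in pred_text.lower())
    return hits / len(gold_keywords)

print(score_retrieval([0, 2], [0]))  # → 0.6666666666666666
```

Separate scorers per task make it possible to report the judgment/retrieval/explanation triple that the results section quotes.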

RESULTS

By the Numbers

Judgment

0.69

+0.35 over No-Memory Baseline

Retrieval

0.66

vs No-Memory Baseline at 0.29

Explanation

0.65

0.34 higher than No-Memory Baseline

Implicit affect subset

0.65

0.47 above No-Memory Baseline on implicit affect

On the A-MBER affective memory benchmark, the structured memory system achieves 0.69 judgment, 0.66 retrieval, and 0.65 explanation accuracy. These results show that A-MBER rewards history-grounded affective reasoning beyond local-context baselines.
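As a sanity check, the no-memory baselines implied by the stated scores and gains can be back-computed with simple arithmetic. The retrieval gain of 0.37 is inferred here from the stated 0.66 score versus the 0.29 baseline; everything else is quoted directly from this page.

```python
# Scores and gains as reported on this page; the retrieval gain (0.37)
# is inferred from the stated 0.66 vs. the 0.29 no-memory baseline.
scores = {"judgment": 0.69, "retrieval": 0.66,
          "explanation": 0.65, "implicit_affect": 0.65}
gains  = {"judgment": 0.35, "retrieval": 0.37,
          "explanation": 0.34, "implicit_affect": 0.47}

baselines = {k: round(scores[k] - gains[k], 2) for k in scores}
print(baselines)
# → {'judgment': 0.34, 'retrieval': 0.29, 'explanation': 0.31, 'implicit_affect': 0.18}
```

The implied judgment baseline of 0.34 matches the no-memory figure quoted in the TL;DR.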

BENCHMARK

Benchmark: Main comparison across system configurations on A-MBER

Average judgment accuracy on A-MBER core tasks.

KEY INSIGHT

The Counterintuitive Finding

Even with gold supporting evidence provided, A-MBER reports only 0.81 judgment accuracy, far from a trivial near-perfect ceiling.

This is surprising because many benchmarks become easy once evidence is given, but A-MBER still demands complex trajectory-based affective reasoning.

WHY IT MATTERS

What this unlocks for the field

A-MBER makes it possible to measure whether conversational agents use remembered interaction history to interpret present affect, not just recall facts.

With A-MBER-style anchor-turn units, builders can design memory systems that explicitly target long-range emotional trajectories, calibration under ambiguity, and grounded explanations.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Long-Term Memory

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu et al.

2026

AlpsBench combines Personalized Information Extraction, Personalized Information Update, Personalized Information Retrieval, and Personalized Information Utilization over 2,500 WildChat dialogues with human-verified structured memories. AlpsBench shows, for example, that Gemini-3 Flash scores 51.67 on Task 1 Extraction while DeepSeek Reasoner reaches 0.9569 retrieval recall with 100 distractors on AlpsBench.
