Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Authors: Hyungho Na, Yunkyeong Seo, Il-chul Moon

2024

TL;DR

EMU uses a deterministic conditional autoencoder plus an episodic incentive on desirable trajectories to accelerate cooperative MARL on SMAC and GRF over QPLEX and CDS baselines.

THE PROBLEM

Cooperative MARL gets stuck in local optima and learns too slowly

Existing cooperative MARL with episodic control often converges to local optima and needs long training times on complex benchmarks like SMAC and GRF.

Random-projection-based episodic memory recalls only nearly identical states, so these methods explore poorly and often fail to discover goal-reaching policies on such tasks.

HOW IT WORKS

Efficient episodic memory utilization with semantic embeddings and an episodic incentive

EMU’s core mechanism combines a semantic memory embedding, a deterministic conditional autoencoder (dCAE), an episodic incentive, and the episodic buffer DE, all on top of value-factorization MARL.

You can think of EMU as giving agents a structured hippocampus plus a bonus system: the dCAE organizes memories, and the episodic incentive rewards revisiting promising paths.

This mechanism lets EMU explore semantically nearby high-return states in embedding space and selectively boost desirable transitions, which naive episodic control based on near-identical state matches cannot.
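
To make this concrete, here is a minimal PyTorch sketch of the dCAE idea: the encoder fϕ embeds the global state, a small head predicts the highest return H, and the decoder fψ, conditioned on H, reconstructs s. Layer sizes, the conditioning scheme, and the equal loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DCAE(nn.Module):
    """Sketch of a deterministic conditional autoencoder (dCAE).

    The encoder f_phi maps a global state s to an embedding x; the
    embedding is trained both to predict the highest return H and to
    reconstruct s through the decoder f_psi (here conditioned on H).
    All sizes are illustrative assumptions.
    """

    def __init__(self, state_dim: int, emb_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(               # f_phi: s -> x
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )
        self.return_head = nn.Linear(emb_dim, 1)    # x -> predicted H
        self.decoder = nn.Sequential(               # f_psi: (x, H) -> s_hat
            nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def loss(self, s: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        x = self.encoder(s)
        pred_H = self.return_head(x).squeeze(-1)
        s_hat = self.decoder(torch.cat([x, H.unsqueeze(-1)], dim=-1))
        # Return prediction shapes the embedding by value; reconstruction
        # keeps it faithful to the state, giving a return-aware space.
        return nn.functional.mse_loss(pred_H, H) + nn.functional.mse_loss(s_hat, s)
```

Training the embedding against both targets is what makes states with similar returns land near each other, so recall can generalize beyond exact matches.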

DIAGRAM

Semantic memory embedding and recall pipeline in EMU

This diagram shows how EMU encodes states with dCAE, updates the episodic buffer, and recalls semantically similar memories for training.
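
A rough Python sketch of this pipeline under simplifying assumptions: the buffer DE keeps one entry per embedded state, updates the stored highest return H on revisit, and recall returns the best H among stored embeddings within a distance threshold of the query. The threshold name delta and the brute-force search are illustrative choices, not the paper's implementation.

```python
import numpy as np

class EpisodicBuffer:
    """Sketch of the episodic buffer D_E keyed by dCAE embeddings x = f_phi(s)."""

    def __init__(self, delta: float = 0.1):
        self.delta = delta   # recall radius in embedding space (assumed)
        self.xs = []         # embeddings
        self.Hs = []         # highest return observed from each state
        self.xis = []        # desirability flags

    def update(self, x: np.ndarray, ret: float, desirable: bool) -> None:
        # Merge into an existing entry if the state is (near-)identical,
        # keeping the best return seen so far; otherwise add a new entry.
        for i, xi in enumerate(self.xs):
            if np.linalg.norm(xi - x) < 1e-6:
                self.Hs[i] = max(self.Hs[i], ret)
                self.xis[i] = self.xis[i] or desirable
                return
        self.xs.append(x); self.Hs.append(ret); self.xis.append(desirable)

    def recall(self, x: np.ndarray):
        # Semantic recall: best remembered return among neighbors within
        # delta, so similar-but-not-identical states can share memories.
        best = None
        for xi, H in zip(self.xs, self.Hs):
            if np.linalg.norm(xi - x) <= self.delta:
                best = H if best is None else max(best, H)
        return best
```

The radius delta trades off recall breadth against precision: too small reduces to exact-match episodic control, too large mixes memories from unrelated states.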

DIAGRAM

Training loop with episodic incentive and desirability in EMU

This diagram shows how EMU labels desirable trajectories, computes the episodic incentive rp, and updates Qtot during training.
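
A hedged sketch of the incentive computation: η̂(s′) is estimated here as the fraction of visits to s′ that occurred on desirable (goal-reaching) trajectories, one plausible count-based estimator consistent with the description below (the paper's estimator also involves H in DE, omitted here for brevity), and rp = γ η̂(s′). All names are illustrative.

```python
def episodic_incentive(next_key, counts, desirable_counts, gamma: float = 0.99) -> float:
    """Sketch: r^p = gamma * eta_hat(s'), with eta_hat estimated from visit
    counts in the episodic buffer. next_key identifies s' (e.g. a discretized
    embedding); the fraction-of-desirable-visits estimator is an assumption.
    """
    n = counts.get(next_key, 0)
    if n == 0:
        return 0.0                     # unseen state: no incentive
    eta_hat = desirable_counts.get(next_key, 0) / n
    return gamma * eta_hat             # zero unless s' was ever desirable

# Usage: after each transition (s, a, r, s'), update the counts, then add
# the incentive to the environment reward before the TD update.
counts, desirable_counts = {}, {}
key = ("s_prime",)                                         # placeholder key for s'
counts[key] = counts.get(key, 0) + 1
desirable_counts[key] = desirable_counts.get(key, 0) + 1   # trajectory hit the goal
r_p = episodic_incentive(key, counts, desirable_counts)
```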

PROCESS

How EMU Handles a Cooperative MARL Episode

  1. Episodic Memory Construction

    EMU collects transitions into the episodic buffer DE, storing the global state s, the highest return H, the embedding x = fϕ(s), and the desirability ξ for each timestep.

  2. Semantic Memory Embedding

    EMU trains the deterministic conditional autoencoder dCAE with encoder fϕ and decoder fψ to predict H and reconstruct s, shaping a smooth, return-aware embedding space.

  3. Episodic Incentive Generation

    EMU uses the desirability ξ, visit counts, and H in DE to estimate η̂ and compute the episodic incentive rp = γ η̂(s′) for desirable transitions only.

  4. Value Factorization Learning

    EMU plugs rp into the Q-learning loss for Qtot within the value-factorization framework, jointly training the individual Qi and the mixing network on SMAC and GRF tasks.
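
To make step 4 concrete, here is a minimal PyTorch-style sketch of how rp enters the TD target for Qtot. The tensors stand in for the outputs of any value-factorization method such as QMIX or QPLEX; all names and shapes are illustrative assumptions.

```python
import torch

def qtot_loss(q_tot, q_tot_target, reward, r_p, done, gamma: float = 0.99):
    """Sketch of the TD loss for Q_tot with the episodic incentive r^p
    added to the environment reward; all tensors have shape [batch].
    """
    # y = r + r^p + gamma * max_a' Q_tot(s', a'), with bootstrapping
    # cut off at terminal states.
    y = reward + r_p + gamma * (1.0 - done) * q_tot_target
    return torch.mean((q_tot - y.detach()) ** 2)

# Usage with dummy tensors standing in for mixing-network outputs:
B = 32
q_tot = torch.randn(B, requires_grad=True)   # online Q_tot(s, a)
q_tot_target = torch.randn(B)                # target-net max_a' Q_tot(s', a')
r_p = 0.1 * torch.rand(B)                    # episodic incentive per transition
loss = qtot_loss(q_tot, q_tot_target, torch.rand(B), r_p, torch.zeros(B))
loss.backward()
```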

KEY CONTRIBUTIONS

Key Contributions

  • Efficient memory embedding

    EMU introduces a trainable state embedding fϕ, learned with the dCAE, that predicts the highest return H and reconstructs s, yielding a semantically clustered episodic memory for better recall.

  • Episodic incentive generation

    EMU defines desirability ξ and an episodic incentive rp = γ η̂(s′) that selectively rewards desirable transitions and provably converges to the optimal gradient signal.

  • Improved cooperative MARL on SMAC and GRF

    EMU, instantiated as EMU QPLEX and EMU CDS, accelerates convergence and increases win rates on hard and super-hard SMAC maps and GRF scenarios compared to QMIX, QPLEX, CDS, and EMC.

RESULTS

By the Numbers

  • Test win rate: higher on 3s_vs_5z (SMAC). EMU QPLEX exceeds QPLEX and EMC on this map.

  • Test win rate: higher on MMM2 (SMAC). EMU QPLEX and EMU CDS beat CDS and QPLEX.

  • Test win rate: higher on 6h_vs_8z (SMAC). EMU variants converge faster than EMC.

  • Goal scoring rate: higher on CA_hard (GRF). EMU finds scoring policies earlier than QPLEX and CDS.

The benchmarks are StarCraft II Multi-agent Challenge (SMAC) maps and Google Research Football (GRF) scenarios, testing cooperative coordination under partial observability. The main result is that EMU’s semantic memory and episodic incentive improve both learning speed and final performance over value-factorization baselines with conventional episodic control.

BENCHMARK

Performance comparison of EMU against baseline algorithms on SMAC and GRF

Relative test win rate and scoring performance of EMU versus QMIX, QPLEX, CDS, and EMC on hard and super-hard cooperative tasks.

KEY INSIGHT

The Counterintuitive Finding

EMU’s episodic incentive can be applied without manual scaling across task difficulty, unlike conventional episodic control, which must be nearly disabled on super-hard SMAC maps.

This is surprising because episodic bonuses are usually tuned per environment, but EMU’s desirability-based rp automatically avoids overemphasizing early local optima.

WHY IT MATTERS

What this unlocks for the field

EMU unlocks semantically aware episodic memory for cooperative MARL, enabling agents to explore promising neighborhoods in state space instead of replaying identical states.

Builders can now bolt EMU onto value-factorization methods like QPLEX or CDS to get faster, more reliable convergence on complex multi-agent tasks without fragile episodic-control tuning.
