Efficient Episodic Memory Utilization of Cooperative Multi-Agent Reinforcement Learning

Authors: Hyungho Na, Yunkyeong Seo, Il-chul Moon

2024

TL;DR

EMU uses a deterministic conditional autoencoder plus an episodic incentive on desirable trajectories to accelerate cooperative MARL on SMAC and GRF over QPLEX and CDS baselines.

THE PROBLEM

Cooperative MARL gets stuck in local optima and learns too slowly

Existing cooperative MARL with episodic control often converges to local optima and needs long training times on complex benchmarks like SMAC and GRF.

Random-projection-based episodic memory recalls only nearly identical states, so these methods explore poorly and often fail to discover goal-reaching policies on such tasks.

HOW IT WORKS

Efficient episodic memory utilization with semantic embeddings and an episodic incentive

EMU’s core mechanism combines a semantic memory embedding, a deterministic conditional autoencoder (dCAE), an episodic incentive, and the episodic buffer DE, all on top of value-factorization MARL.

You can think of EMU as giving agents a structured hippocampus plus a bonus system: the dCAE organizes memories, and the episodic incentive rewards revisiting promising paths.

This mechanism lets EMU explore semantically nearby high-return states in embedding space and selectively boost desirable transitions, which naive episodic control based on near-identical state matches cannot.
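
To make this concrete, here is a minimal PyTorch sketch of the dCAE idea: the encoder fϕ embeds the global state, a small head predicts the highest return H, and the decoder fψ, conditioned on H, reconstructs s. Layer sizes, the conditioning scheme, and the equal loss weighting are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DCAE(nn.Module):
    """Sketch of a deterministic conditional autoencoder (dCAE).

    The encoder f_phi maps a global state s to an embedding x; the
    embedding is trained both to predict the highest return H and to
    reconstruct s through the decoder f_psi (here conditioned on H).
    All sizes are illustrative assumptions.
    """

    def __init__(self, state_dim: int, emb_dim: int = 4, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(               # f_phi: s -> x
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, emb_dim),
        )
        self.return_head = nn.Linear(emb_dim, 1)    # x -> predicted H
        self.decoder = nn.Sequential(               # f_psi: (x, H) -> s_hat
            nn.Linear(emb_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def loss(self, s: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        x = self.encoder(s)
        pred_H = self.return_head(x).squeeze(-1)
        s_hat = self.decoder(torch.cat([x, H.unsqueeze(-1)], dim=-1))
        # Return prediction shapes the embedding by value; reconstruction
        # keeps it faithful to the state, giving a return-aware space.
        return nn.functional.mse_loss(pred_H, H) + nn.functional.mse_loss(s_hat, s)
```

Training the embedding against both targets is what makes states with similar returns land near each other, so recall can generalize beyond exact matches.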

DIAGRAM

Semantic memory embedding and recall pipeline in EMU

This diagram shows how EMU encodes states with dCAE, updates the episodic buffer, and recalls semantically similar memories for training.
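
A rough Python sketch of this pipeline under simplifying assumptions: the buffer DE keeps one entry per embedded state, updates the stored highest return H on revisit, and recall returns the best H among stored embeddings within a distance threshold of the query. The threshold name delta and the brute-force search are illustrative choices, not the paper's implementation.

```python
import numpy as np

class EpisodicBuffer:
    """Sketch of the episodic buffer D_E keyed by dCAE embeddings x = f_phi(s)."""

    def __init__(self, delta: float = 0.1):
        self.delta = delta   # recall radius in embedding space (assumed)
        self.xs = []         # embeddings
        self.Hs = []         # highest return observed from each state
        self.xis = []        # desirability flags

    def update(self, x: np.ndarray, ret: float, desirable: bool) -> None:
        # Merge into an existing entry if the state is (near-)identical,
        # keeping the best return seen so far; otherwise add a new entry.
        for i, xi in enumerate(self.xs):
            if np.linalg.norm(xi - x) < 1e-6:
                self.Hs[i] = max(self.Hs[i], ret)
                self.xis[i] = self.xis[i] or desirable
                return
        self.xs.append(x); self.Hs.append(ret); self.xis.append(desirable)

    def recall(self, x: np.ndarray):
        # Semantic recall: best remembered return among neighbors within
        # delta, so similar-but-not-identical states can share memories.
        best = None
        for xi, H in zip(self.xs, self.Hs):
            if np.linalg.norm(xi - x) <= self.delta:
                best = H if best is None else max(best, H)
        return best
```

The radius delta trades off recall breadth against precision: too small reduces to exact-match episodic control, too large mixes memories from unrelated states.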

DIAGRAM

Training loop with episodic incentive and desirability in EMU

This diagram shows how EMU labels desirable trajectories, computes the episodic incentive rp, and updates Qtot during training.
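
A hedged sketch of the incentive computation: η̂(s′) is estimated here as the fraction of visits to s′ that occurred on desirable (goal-reaching) trajectories, one plausible count-based estimator consistent with the description below (the paper's estimator also involves H in DE, omitted here for brevity), and rp = γ η̂(s′). All names are illustrative.

```python
def episodic_incentive(next_key, counts, desirable_counts, gamma: float = 0.99) -> float:
    """Sketch: r^p = gamma * eta_hat(s'), with eta_hat estimated from visit
    counts in the episodic buffer. next_key identifies s' (e.g. a discretized
    embedding); the fraction-of-desirable-visits estimator is an assumption.
    """
    n = counts.get(next_key, 0)
    if n == 0:
        return 0.0                     # unseen state: no incentive
    eta_hat = desirable_counts.get(next_key, 0) / n
    return gamma * eta_hat             # zero unless s' was ever desirable

# Usage: after each transition (s, a, r, s'), update the counts, then add
# the incentive to the environment reward before the TD update.
counts, desirable_counts = {}, {}
key = ("s_prime",)                                         # placeholder key for s'
counts[key] = counts.get(key, 0) + 1
desirable_counts[key] = desirable_counts.get(key, 0) + 1   # trajectory hit the goal
r_p = episodic_incentive(key, counts, desirable_counts)
```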

PROCESS

How EMU Handles a Cooperative MARL Episode

  1. Episodic Memory Construction

    EMU collects transitions into the episodic buffer DE, storing the global state s, the highest return H, the embedding x = fϕ(s), and the desirability ξ for each timestep.

  2. Semantic Memory Embedding

    EMU trains the deterministic conditional autoencoder dCAE with encoder fϕ and decoder fψ to predict H and reconstruct s, shaping a smooth, return-aware embedding space.

  3. Episodic Incentive Generation

    EMU uses the desirability ξ, visit counts, and H in DE to estimate η̂ and compute the episodic incentive rp = γ η̂(s′) for desirable transitions only.

  4. Value Factorization Learning

    EMU plugs rp into the Q-learning loss for Qtot within the value-factorization framework, jointly training the individual Qi and the mixing network on SMAC and GRF tasks.
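
To make step 4 concrete, here is a minimal PyTorch-style sketch of how rp enters the TD target for Qtot. The tensors stand in for the outputs of any value-factorization method such as QMIX or QPLEX; all names and shapes are illustrative assumptions.

```python
import torch

def qtot_loss(q_tot, q_tot_target, reward, r_p, done, gamma: float = 0.99):
    """Sketch of the TD loss for Q_tot with the episodic incentive r^p
    added to the environment reward; all tensors have shape [batch].
    """
    # y = r + r^p + gamma * max_a' Q_tot(s', a'), with bootstrapping
    # cut off at terminal states.
    y = reward + r_p + gamma * (1.0 - done) * q_tot_target
    return torch.mean((q_tot - y.detach()) ** 2)

# Usage with dummy tensors standing in for mixing-network outputs:
B = 32
q_tot = torch.randn(B, requires_grad=True)   # online Q_tot(s, a)
q_tot_target = torch.randn(B)                # target-net max_a' Q_tot(s', a')
r_p = 0.1 * torch.rand(B)                    # episodic incentive per transition
loss = qtot_loss(q_tot, q_tot_target, torch.rand(B), r_p, torch.zeros(B))
loss.backward()
```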

KEY CONTRIBUTIONS

Key Contributions

  • Efficient memory embedding

    EMU introduces a trainable state embedding fϕ, learned with the dCAE, that predicts the highest return H and reconstructs s, yielding a semantically clustered episodic memory for better recall.

  • Episodic incentive generation

    EMU defines desirability ξ and an episodic incentive rp = γ η̂(s′) that selectively rewards desirable transitions and provably converges to the optimal gradient signal.

  • Improved cooperative MARL on SMAC and GRF

    EMU, instantiated as EMU QPLEX and EMU CDS, accelerates convergence and increases win rates on hard and super-hard SMAC maps and GRF scenarios compared to QMIX, QPLEX, CDS, and EMC.

RESULTS

By the Numbers

  • Test win rate: higher on 3s_vs_5z (SMAC). EMU QPLEX exceeds QPLEX and EMC on this map.

  • Test win rate: higher on MMM2 (SMAC). EMU QPLEX and EMU CDS beat CDS and QPLEX.

  • Test win rate: higher on 6h_vs_8z (SMAC). EMU variants converge faster than EMC.

  • Goal scoring rate: higher on CA_hard (GRF). EMU finds scoring policies earlier than QPLEX and CDS.

The benchmarks are StarCraft II Multi-agent Challenge (SMAC) maps and Google Research Football (GRF) scenarios, testing cooperative coordination under partial observability. The main result is that EMU’s semantic memory and episodic incentive improve both learning speed and final performance over value-factorization baselines with conventional episodic control.

BENCHMARK

Performance comparison of EMU against baseline algorithms on SMAC and GRF

Relative test win rate and scoring performance of EMU versus QMIX, QPLEX, CDS, and EMC on hard and super-hard cooperative tasks.

KEY INSIGHT

The Counterintuitive Finding

EMU’s episodic incentive can be applied without manual scaling across task difficulty, unlike conventional episodic control, which must be nearly disabled on super-hard SMAC maps.

This is surprising because episodic bonuses are usually tuned per environment, but EMU’s desirability-based rp automatically avoids overemphasizing early local optima.

WHY IT MATTERS

What this unlocks for the field

EMU unlocks semantically aware episodic memory for cooperative MARL, enabling agents to explore promising neighborhoods in state space instead of replaying identical states.

Builders can now bolt EMU onto value-factorization methods like QPLEX or CDS to get faster, more reliable convergence on complex multi-agent tasks without fragile episodic-control tuning.
