With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Authors: Manuele Barraco, Sara Sarto, Marcella Cornia et al.

arXiv 2023

TL;DR

PMA-Net uses Prototypical Memory Attention over clustered past activations to boost COCO CIDEr from 127.8 to 131.5 (+3.7) over a Transformer baseline.

THE PROBLEM

Captioners Ignore Knowledge Beyond the Current Sample, Costing 3.7 CIDEr on COCO

Standard Transformer captioners only attend within a single image-caption pair, ignoring semantic cues from other training samples.

On COCO, this limitation keeps a strong Transformer at 127.8 CIDEr, while PMA-Net reaches 131.5 CIDEr by exploiting past activations and prototypical memories.

HOW IT WORKS

Prototypical Memory Attention for Image Captioning

PMA-Net introduces memory banks, prototype keys, prototype values, and segment embeddings inside each decoder self-attention layer to attend over past activations.

You can think of PMA-Net like a captioner with a rolling RAM cache that clusters its own past thoughts into compact, reusable prototypes.

This prototypical memory lets PMA-Net retrieve and combine experience from other images, something a plain context window over a single sample cannot express.
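
To make this concrete, here is a minimal NumPy sketch of memory-augmented self-attention as described above. It is an illustration, not the authors' implementation: the names memory_augmented_attention, proto_K, proto_V, seg_mem, and seg_cur are ours, and multi-head structure and causal masking are omitted for brevity.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def memory_augmented_attention(Q, K, V, proto_K, proto_V, seg_mem, seg_cur):
        """Q, K, V: (n, d) activations from the current sample; proto_K, proto_V:
        (m, d) prototypes distilled from past activations; seg_mem, seg_cur: (d,)
        segment embeddings that mark memory entries vs. input tokens."""
        K_all = np.concatenate([proto_K + seg_mem, K + seg_cur], axis=0)  # (m+n, d)
        V_all = np.concatenate([proto_V + seg_mem, V + seg_cur], axis=0)
        attn = softmax(Q @ K_all.T / np.sqrt(Q.shape[-1]))                # (n, m+n)
        return attn @ V_all                                               # (n, d)

    # Toy usage: 5 current tokens, 8 prototypes, feature dimension 16.
    rng = np.random.default_rng(0)
    d = 16
    Q, K, V = (rng.normal(size=(5, d)) for _ in range(3))
    proto_K, proto_V = rng.normal(size=(8, d)), rng.normal(size=(8, d))
    out = memory_augmented_attention(Q, K, V, proto_K, proto_V,
                                     rng.normal(size=d), rng.normal(size=d))
    print(out.shape)  # (5, 16)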

DIAGRAM

Memory Bank Update and Prototype Generation Pipeline

This diagram shows how PMA-Net collects past keys and values into memory banks and periodically generates prototype keys and values using K-Means and k-NN.
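
As a hedged sketch of that prototype-generation step, the snippet below uses scikit-learn's standard KMeans as a stand-in for the paper's fast K-Means and an unweighted average of the k nearest stored values as the interpolation; build_prototypes, m, and k are illustrative names and settings, not the paper's.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_prototypes(B_K, B_V, m=64, k=8):
        """B_K, B_V: (N, d) banks of past keys and values. Returns (m, d)
        prototype keys (cluster centers) and prototype values (k-NN averages)."""
        proto_K = KMeans(n_clusters=m, n_init=10, random_state=0).fit(B_K).cluster_centers_
        # For each prototype key, average the values paired with its k nearest stored keys.
        dists = np.linalg.norm(B_K[None, :, :] - proto_K[:, None, :], axis=-1)  # (m, N)
        nn_idx = np.argsort(dists, axis=1)[:, :k]                               # (m, k)
        proto_V = B_V[nn_idx].mean(axis=1)                                      # (m, d)
        return proto_K, proto_V

    # Toy usage on a random bank of 1,000 past activations with dimension 16.
    rng = np.random.default_rng(0)
    B_K, B_V = rng.normal(size=(1000, 16)), rng.normal(size=(1000, 16))
    proto_K, proto_V = build_prototypes(B_K, B_V)
    print(proto_K.shape, proto_V.shape)  # (64, 16) (64, 16)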

DIAGRAM

Training Loop and Ablation Design for PMA-Net

This diagram shows how PMA-Net's training loop integrates memory updates, prototype generation, and ablation variants on COCO.
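
The training loop itself can be sketched schematically as follows. Everything here is a stand-in (StubCaptioner, rebuild_prototypes, and the refresh schedule are assumptions for illustration, not the released code): each iteration banks the keys and values it produced, the banks cover only the last T iterations, and prototypes are rebuilt on a fixed schedule such as twice per epoch.

    import numpy as np
    from collections import deque

    class StubCaptioner:
        """Stands in for the real captioning model; returns fake keys/values per step."""
        def train_step(self, iteration, proto_K, proto_V):
            rng = np.random.default_rng(iteration)
            return rng.normal(size=(32, 16)), rng.normal(size=(32, 16))

    def rebuild_prototypes(bank_K, bank_V, m=8):
        """Toy stand-in for the K-Means / k-NN prototype generation sketched earlier."""
        K, V = np.concatenate(list(bank_K)), np.concatenate(list(bank_V))
        idx = np.linspace(0, len(K) - 1, m, dtype=int)  # crude subsample, not real clustering
        return K[idx], V[idx]

    def train_epoch(model, num_iters=100, T=20, refreshes_per_epoch=2):
        bank_K, bank_V = deque(maxlen=T), deque(maxlen=T)  # window over the last T iterations
        proto_K = proto_V = None
        refresh_every = max(1, num_iters // refreshes_per_epoch)
        for it in range(num_iters):
            keys, values = model.train_step(it, proto_K, proto_V)  # forward/backward pass
            bank_K.append(keys); bank_V.append(values)             # memory bank update
            if (it + 1) % refresh_every == 0:                      # periodic regeneration
                proto_K, proto_V = rebuild_prototypes(bank_K, bank_V)
        return proto_K, proto_V

    proto_K, proto_V = train_epoch(StubCaptioner())
    print(proto_K.shape)  # (8, 16)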

PROCESS

How PMA-Net Handles Image Captioning

  1. 01

    Memories as banks of past activations

    PMA-Net processes mini-batches and stores decoder self-attention keys and values into memory banks BK and BV over a temporal window T.

  2. 02

    Building memory prototypes

    PMA-Net runs K-Means over BK to obtain prototype keys and uses k-NN over BK and BV to interpolate prototype values.

  3. 03

    Memory-augmented attention

    PMA-Net concatenates prototype keys and values with current keys and values, using segment embeddings to distinguish memory from input tokens.

  4. 04

    Memory bank update

    Using a strided sliding window, PMA-Net refreshes BK and BV twice per epoch, regenerating prototypes while keeping training stable; a sketch of this strided bank follows this list.
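
As referenced in step 04, here is one plausible reading of a strided sliding-window bank, sketched as a small data structure. The class name StridedBank and its exact semantics (keep only every stride-th iteration, so a window of T iterations fits in T/stride slots) are our assumptions, not details taken from the paper.

    import numpy as np
    from collections import deque

    class StridedBank:
        """Keeps activations from every `stride`-th iteration within a window of
        `window` iterations, covering a long training horizon at reduced cost."""
        def __init__(self, window=1000, stride=4):
            self.buf = deque(maxlen=window // stride)
            self.stride = stride
            self.step = 0
        def push(self, activations):
            if self.step % self.stride == 0:
                self.buf.append(activations)
            self.step += 1
        def as_array(self):
            return np.concatenate(list(self.buf)) if self.buf else np.empty((0, 0))

    # Toy usage: 100 iterations, 8 vectors of dimension 16 per iteration.
    bank = StridedBank(window=1000, stride=4)
    for it in range(100):
        bank.push(np.random.default_rng(it).normal(size=(8, 16)))
    print(bank.as_array().shape)  # (200, 16): 25 stored iterations x 8 vectors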

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Prototypical Memory Attention

    PMA-Net integrates Prototypical Memory Attention into decoder self-attention, letting queries attend to prototype keys and values derived from past activations instead of static learnable vectors.

  • 02

    Memories as banks of past activations

    PMA-Net defines memory banks BK and BV that store keys and values from recent training iterations, modeling the manifold of past decoder activations.

  • 03

    Building memory prototypes

    PMA-Net builds prototype keys with a fast K-Means and prototype values with k-NN interpolation, yielding up to +3.7 CIDEr over a Transformer baseline on the COCO Karpathy test split.

RESULTS

By the Numbers

CIDEr

131.5

+3.7 over Transformer† (127.8) under cross-entropy on COCO Karpathy test

BLEU-4

39.5

+2.1 over Transformer† (37.4) under cross-entropy on COCO Karpathy test

CIDEr SCST

144.1

+3.8 over Transformer† (140.3) under CIDEr optimization on COCO Karpathy test

COCO Test CIDEr c40

143.4

+3.4 over CaMEL (140.0) on COCO online test server with 40 references

On the COCO Karpathy test split, PMA-Net improves CIDEr from 127.8 to 131.5 over a re-trained Transformer† and reaches 144.1 CIDEr after CIDEr optimization. On the COCO online test server, PMA-Net attains 143.4 CIDEr c40, surpassing CaMEL and COS-Net while using fixed CLIP ViT-L/14 features.

BENCHMARK

COCO Karpathy Test under Cross-Entropy Training (CIDEr)

CIDEr scores on the COCO Karpathy test split with cross-entropy training and CLIP ViT-L/14 features.

BENCHMARK

Ablation on Memory Size m and Bank Size T (CIDEr)

CIDEr scores for PMA-Net ablations on COCO Karpathy validation with different prototype counts m and memory bank sizes T.

KEY INSIGHT

The Counterintuitive Finding

Despite adding extra attention keys, PMA-Net reduces hallucination on the COCO robust split, cutting CHi from 2.8 to 2.6 compared to Transformer†.

You might expect more stored experiences to amplify hallucinations, but PMA-Net's prototypes instead regularize attention, slightly lowering CHs from 4.6 to 4.3 while increasing CIDEr from 119.6 to 122.0.

WHY IT MATTERS

What this unlocks for the field

PMA-Net shows that Transformer captioners can treat their own past activations as a compressed, queryable experience base via prototypical memory.

Builders can now design vision-language systems that reuse clustered training-time activations at inference, gaining +3.7 to +3.8 CIDEr without external databases or finetuning visual backbones.

Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG uses uncertainty-based routing to flag low-confidence steps, filters them through guarded acceptance with rollback, selects between rule and exemplar memory banks, and prunes entries via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines, while a compute-matched Retry baseline stays flat.
