With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Authors: Manuele Barraco, Sara Sarto, Marcella Cornia et al.

arXiv 2023

TL;DR

PMA-Net uses Prototypical Memory Attention over clustered past activations to boost COCO CIDEr from 127.8 to 131.5 (+3.7) over a Transformer baseline.

THE PROBLEM

Captioners Ignore Knowledge Beyond the Current Sample, Costing 3.7 CIDEr on COCO

Standard Transformer captioners only attend within a single image-caption pair, ignoring semantic cues from other training samples.

On COCO, this limitation keeps a strong Transformer at 127.8 CIDEr, while PMA-Net reaches 131.5 CIDEr by exploiting past activations and prototypical memories.

HOW IT WORKS

Prototypical Memory Attention for Image Captioning

PMA-Net introduces memory banks, prototype keys, prototype values, and segment embeddings inside each decoder self-attention layer to attend over past activations.

You can think of PMA-Net like a captioner with a rolling RAM cache that clusters its own past thoughts into compact, reusable prototypes.

This prototypical memory lets PMA-Net retrieve and combine experience from other images, something a plain context window over a single sample cannot express.
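
To make this concrete, here is a minimal NumPy sketch of memory-augmented self-attention as described above. It is an illustration, not the authors' implementation: the names memory_augmented_attention, proto_K, proto_V, seg_mem, and seg_cur are ours, and multi-head structure and causal masking are omitted for brevity.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def memory_augmented_attention(Q, K, V, proto_K, proto_V, seg_mem, seg_cur):
        """Q, K, V: (n, d) activations from the current sample; proto_K, proto_V:
        (m, d) prototypes distilled from past activations; seg_mem, seg_cur: (d,)
        segment embeddings that mark memory entries vs. input tokens."""
        K_all = np.concatenate([proto_K + seg_mem, K + seg_cur], axis=0)  # (m+n, d)
        V_all = np.concatenate([proto_V + seg_mem, V + seg_cur], axis=0)
        attn = softmax(Q @ K_all.T / np.sqrt(Q.shape[-1]))                # (n, m+n)
        return attn @ V_all                                               # (n, d)

    # Toy usage: 5 current tokens, 8 prototypes, feature dimension 16.
    rng = np.random.default_rng(0)
    d = 16
    Q, K, V = (rng.normal(size=(5, d)) for _ in range(3))
    proto_K, proto_V = rng.normal(size=(8, d)), rng.normal(size=(8, d))
    out = memory_augmented_attention(Q, K, V, proto_K, proto_V,
                                     rng.normal(size=d), rng.normal(size=d))
    print(out.shape)  # (5, 16)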

DIAGRAM

Memory Bank Update and Prototype Generation Pipeline

This diagram shows how PMA-Net collects past keys and values into memory banks and periodically generates prototype keys and values using K-Means and k-NN.
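
As a hedged sketch of that prototype-generation step, the snippet below uses scikit-learn's standard KMeans as a stand-in for the paper's fast K-Means and an unweighted average of the k nearest stored values as the interpolation; build_prototypes, m, and k are illustrative names and settings, not the paper's.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_prototypes(B_K, B_V, m=64, k=8):
        """B_K, B_V: (N, d) banks of past keys and values. Returns (m, d)
        prototype keys (cluster centers) and prototype values (k-NN averages)."""
        proto_K = KMeans(n_clusters=m, n_init=10, random_state=0).fit(B_K).cluster_centers_
        # For each prototype key, average the values paired with its k nearest stored keys.
        dists = np.linalg.norm(B_K[None, :, :] - proto_K[:, None, :], axis=-1)  # (m, N)
        nn_idx = np.argsort(dists, axis=1)[:, :k]                               # (m, k)
        proto_V = B_V[nn_idx].mean(axis=1)                                      # (m, d)
        return proto_K, proto_V

    # Toy usage on a random bank of 1,000 past activations with dimension 16.
    rng = np.random.default_rng(0)
    B_K, B_V = rng.normal(size=(1000, 16)), rng.normal(size=(1000, 16))
    proto_K, proto_V = build_prototypes(B_K, B_V)
    print(proto_K.shape, proto_V.shape)  # (64, 16) (64, 16)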

DIAGRAM

Training Loop and Ablation Design for PMA-Net

This diagram shows how PMA-Net's training loop integrates memory updates, prototype generation, and ablation variants on COCO.
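
The training loop itself can be sketched schematically as follows. Everything here is a stand-in (StubCaptioner, rebuild_prototypes, and the refresh schedule are assumptions for illustration, not the released code): each iteration banks the keys and values it produced, the banks cover only the last T iterations, and prototypes are rebuilt on a fixed schedule such as twice per epoch.

    import numpy as np
    from collections import deque

    class StubCaptioner:
        """Stands in for the real captioning model; returns fake keys/values per step."""
        def train_step(self, iteration, proto_K, proto_V):
            rng = np.random.default_rng(iteration)
            return rng.normal(size=(32, 16)), rng.normal(size=(32, 16))

    def rebuild_prototypes(bank_K, bank_V, m=8):
        """Toy stand-in for the K-Means / k-NN prototype generation sketched earlier."""
        K, V = np.concatenate(list(bank_K)), np.concatenate(list(bank_V))
        idx = np.linspace(0, len(K) - 1, m, dtype=int)  # crude subsample, not real clustering
        return K[idx], V[idx]

    def train_epoch(model, num_iters=100, T=20, refreshes_per_epoch=2):
        bank_K, bank_V = deque(maxlen=T), deque(maxlen=T)  # window over the last T iterations
        proto_K = proto_V = None
        refresh_every = max(1, num_iters // refreshes_per_epoch)
        for it in range(num_iters):
            keys, values = model.train_step(it, proto_K, proto_V)  # forward/backward pass
            bank_K.append(keys); bank_V.append(values)             # memory bank update
            if (it + 1) % refresh_every == 0:                      # periodic regeneration
                proto_K, proto_V = rebuild_prototypes(bank_K, bank_V)
        return proto_K, proto_V

    proto_K, proto_V = train_epoch(StubCaptioner())
    print(proto_K.shape)  # (8, 16)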

PROCESS

How PMA-Net Handles Image Captioning

  1. 01

    Memories as banks of past activations

    PMA-Net processes mini-batches and stores decoder self-attention keys and values into memory banks BK and BV over a temporal window T.

  2. 02

    Building memory prototypes

    PMA-Net runs K-Means over BK to obtain prototype keys and uses k-NN over BK and BV to interpolate prototype values.

  3. 03

    Memory-augmented attention

    PMA-Net concatenates prototype keys and values with current keys and values, using segment embeddings to distinguish memory from input tokens.

  4. 04

    Memory bank update

    Using a strided sliding window, PMA-Net refreshes BK and BV twice per epoch, regenerating prototypes while keeping training stable; a sketch of this strided bank follows this list.
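
As referenced in step 04, here is one plausible reading of a strided sliding-window bank, sketched as a small data structure. The class name StridedBank and its exact semantics (keep only every stride-th iteration, so a window of T iterations fits in T/stride slots) are our assumptions, not details taken from the paper.

    import numpy as np
    from collections import deque

    class StridedBank:
        """Keeps activations from every `stride`-th iteration within a window of
        `window` iterations, covering a long training horizon at reduced cost."""
        def __init__(self, window=1000, stride=4):
            self.buf = deque(maxlen=window // stride)
            self.stride = stride
            self.step = 0
        def push(self, activations):
            if self.step % self.stride == 0:
                self.buf.append(activations)
            self.step += 1
        def as_array(self):
            return np.concatenate(list(self.buf)) if self.buf else np.empty((0, 0))

    # Toy usage: 100 iterations, 8 vectors of dimension 16 per iteration.
    bank = StridedBank(window=1000, stride=4)
    for it in range(100):
        bank.push(np.random.default_rng(it).normal(size=(8, 16)))
    print(bank.as_array().shape)  # (200, 16): 25 stored iterations x 8 vectors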

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Prototypical Memory Attention

    PMA-Net integrates Prototypical Memory Attention into decoder self-attention, letting queries attend to prototype keys and values derived from past activations instead of static learnable vectors.

  • 02

    Memories as banks of past activations

    PMA-Net defines memory banks BK and BV that store keys and values from recent training iterations, modeling the manifold of past decoder activations.

  • 03

    Building memory prototypes

    PMA-Net builds prototype keys with a fast K-Means and prototype values with k-NN interpolation, yielding up to +3.7 CIDEr over a Transformer baseline on the COCO Karpathy test split.

RESULTS

By the Numbers

CIDEr

131.5

+3.7 over Transformer† (127.8) under cross-entropy on COCO Karpathy test

BLEU-4

39.5

+2.1 over Transformer† (37.4) under cross-entropy on COCO Karpathy test

CIDEr SCST

144.1

+3.8 over Transformer† (140.3) under CIDEr optimization on COCO Karpathy test

COCO Test CIDEr c40

143.4

+3.4 over CaMEL (140.0) on COCO online test server with 40 references

On the COCO Karpathy test split, PMA-Net improves CIDEr from 127.8 to 131.5 over a re-trained Transformer† and reaches 144.1 CIDEr after CIDEr optimization. On the COCO online test server, PMA-Net attains 143.4 CIDEr c40, surpassing CaMEL and COS-Net while using fixed CLIP ViT-L/14 features.

BENCHMARK

COCO Karpathy Test under Cross-Entropy Training (CIDEr)

CIDEr scores on the COCO Karpathy test split with cross-entropy training and CLIP ViT-L/14 features.

BENCHMARK

Ablation on Memory Size m and Bank Size T (CIDEr)

CIDEr scores for PMA-Net ablations on COCO Karpathy validation with different prototype counts m and memory bank sizes T.

KEY INSIGHT

The Counterintuitive Finding

Despite adding extra attention keys, PMA-Net reduces hallucination on the COCO robust split, cutting CHi from 2.8 to 2.6 compared to Transformer†.

You might expect more stored experiences to amplify hallucinations, but PMA-Net's prototypes instead regularize attention, slightly lowering CHs from 4.6 to 4.3 while increasing CIDEr from 119.6 to 122.0.

WHY IT MATTERS

What this unlocks for the field

PMA-Net shows that Transformer captioners can treat their own past activations as a compressed, queryable experience base via prototypical memory.

Builders can now design vision-language systems that reuse clustered training-time activations at inference, gaining +3.7 to +3.8 CIDEr without external databases or finetuning visual backbones.

Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG uses uncertainty-based routing to flag low-confidence steps, filters them through guarded acceptance with rollback, selects between rule and exemplar memory banks, and prunes entries via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines, while a compute-matched Retry baseline stays flat.
