Learning to Learn Variational Semantic Memory

Authors: Xiantong Zhen, Yingjun Du, Huan Xiong et al.

arXiv 2020

TL;DR

Variational Semantic Memory uses latent-memory variational inference over prototypes to reach 65.72% 1-shot accuracy on miniImageNet, +0.90 points over Tian et al. 2020.



THE PROBLEM

Few-shot learners lack long-term semantic memory for prototypes

Existing few-shot systems typically store only current-task support sets, using short-term memory that is wiped between episodes and cannot retain long-term knowledge.

This limits prototype quality when only one or few examples are available, causing under-representative class concepts and weaker recognition on benchmarks like miniImageNet and tieredImageNet.

HOW IT WORKS

Variational semantic memory for probabilistic prototypes

Variational Semantic Memory introduces variational prototype inference, variational semantic memory, latent memory m, and an attention-based memory update to build distributional prototypes conditioned on long-term semantics.

You can think of Variational Semantic Memory as a brain-like semantic store plus a probabilistic decoder, where the external memory is cortex and the latent memory m is a hippocampus-style bottleneck.

This hierarchical variational design lets Variational Semantic Memory recall and adapt past semantic knowledge per task, something a fixed context window or deterministic prototype cannot achieve.
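A minimal sketch of the hierarchical objective this design implies, assuming the factorization described above (latent memory m recalled from external memory M and the support set S, prototype z conditioned on m and S); the notation follows this page, not necessarily the paper's exact symbols:

```latex
\log p(y \mid x, S, M) \;\ge\;
\mathbb{E}_{q(m \mid M, S)}\,\mathbb{E}_{q(z \mid m, S)}
  \big[\log p(y \mid x, z)\big]
\;-\; \mathbb{E}_{q(m \mid M, S)}\,
  \mathrm{KL}\!\big(q(z \mid m, S)\,\|\,p(z \mid m)\big)
\;-\; \mathrm{KL}\!\big(q(m \mid M, S)\,\|\,p(m)\big)
```

Both expectations are estimated with Monte Carlo samples during training, which is what "optimizes the hierarchical ELBO" refers to in the process steps below.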

DIAGRAM

Hierarchical variational inference and memory recall pipeline

This diagram shows how Variational Semantic Memory performs hierarchical variational inference over prototypes and latent memory m during few-shot episodes.

DIAGRAM

Training and evaluation pipeline for few shot benchmarks

This diagram shows how Variational Semantic Memory is trained episodically and evaluated on miniImageNet, tieredImageNet, and CIFAR-FS.

PROCESS

How Variational Semantic Memory Handles a Few-Shot Episode

  1. Variational prototype inference

    Variational Semantic Memory encodes the support set and uses variational prototype inference to parameterize a Gaussian distribution q(z|S) for each class.

  2. Variational semantic memory

    Variational Semantic Memory addresses the external semantic memory M with support features, forming q(m|M,S) as a mixture over memory entries using learned similarities.

  3. Memory recall and inference

    Variational Semantic Memory samples latent memory m, conditions q(z|m,S), and optimizes the hierarchical ELBO with Monte Carlo estimates for prototypes and latent memory.

  4. Memory update and consolidation

    Variational Semantic Memory updates each memory entry using an attention-based graph over M_c and the class samples, then consolidates with M_c ← αM_c + (1−α)M̄_c, where M̄_c is the newly aggregated class content.
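The four steps above can be sketched end to end. This is an illustrative numpy walkthrough, not the paper's implementation: the encoder, similarity function, conditioning rule, and fixed variance are all stand-ins, and the memory update uses a plain mean where the paper uses attention.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16            # feature dimension (illustrative)
ALPHA = 0.9       # consolidation rate in M_c <- a*M_c + (1-a)*M_c_bar

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# External semantic memory M: one slot per previously seen class (hypothetical contents).
M = rng.normal(size=(5, D))

# Step 1: encode the support set and infer a Gaussian prototype q(z|S).
support = rng.normal(size=(3, D))          # three shots of one class
mu_z = support.mean(axis=0)                # mean head (linear stand-in)
log_var_z = np.full(D, -2.0)               # fixed variance for the sketch

# Step 2: address memory with support features -> mixture weights for q(m|M,S).
scores = M @ mu_z                           # dot-product similarity (learned in the paper)
weights = softmax(scores)

# Step 3: recall -- sample latent memory m from the mixture, condition the prototype,
# and draw a Monte Carlo prototype sample z from q(z|m,S).
slot = rng.choice(len(M), p=weights)
m = M[slot] + 0.1 * rng.normal(size=D)      # sample around the chosen slot
mu_z_cond = 0.5 * (mu_z + m)                # toy conditioning q(z|m,S)
z = mu_z_cond + np.exp(0.5 * log_var_z) * rng.normal(size=D)

# Step 4: consolidate -- blend new class content into the addressed slot.
M_c_bar = support.mean(axis=0)              # mean aggregate; the paper attends over a graph
M[slot] = ALPHA * M[slot] + (1 - ALPHA) * M_c_bar

print(z.shape, round(float(weights.sum()), 6))
```

In training, the sampled z would parameterize the class conditional p(y|x,z) and the whole episode would be backpropagated through the ELBO; here the point is only the data flow between the four steps.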

KEY CONTRIBUTIONS

Key Contributions

  • Variational semantic memory

    Variational Semantic Memory introduces a long-term semantic memory module M plus a latent memory m, enabling hierarchical variational prototype inference across tasks with bounded memory size.

  • Variational prototype network

    Variational Semantic Memory extends prototypical networks with variational prototype inference, modeling each prototype as a Gaussian distribution q(z|S) and improving miniImageNet 1-shot accuracy from 47.40% to 52.11%.

  • Attentional memory update

    Variational Semantic Memory uses an attention-based memory update over graphs H_c, yielding 54.73% versus 53.97% 1-shot accuracy on miniImageNet compared with mean-based updates.
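The contrast between the attentional and mean-based updates can be sketched as follows. This is an assumed reconstruction: the graph construction, affinity function, and function names (`attention_update`, `mean_update`) are illustrative stand-ins for the paper's H_c mechanism, not its actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8

def attention_update(M_c, samples, alpha=0.9):
    """Attention-based update (sketch): form a graph over the stored slot M_c
    and the new class samples, let M_c attend over the samples via dot-product
    affinities, aggregate them by attention weight, then consolidate."""
    nodes = np.vstack([M_c[None, :], samples])   # graph nodes: {M_c} plus samples
    aff = nodes @ nodes.T                        # dense pairwise affinities (edges)
    w = np.exp(aff[0, 1:] - aff[0, 1:].max())
    w = w / w.sum()                              # attention of M_c over the samples
    M_c_bar = w @ samples                        # attention-weighted aggregate
    return alpha * M_c + (1 - alpha) * M_c_bar

def mean_update(M_c, samples, alpha=0.9):
    """Baseline: consolidate with a plain mean over the new samples."""
    return alpha * M_c + (1 - alpha) * samples.mean(axis=0)

M_c = rng.normal(size=D)
samples = rng.normal(size=(4, D))
print(attention_update(M_c, samples).shape, mean_update(M_c, samples).shape)
```

The intuition behind the reported gain (54.73% vs 53.97%) is that attention lets the slot weight informative samples more heavily instead of averaging them uniformly.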

RESULTS

By the Numbers

miniImageNet 1-shot: 65.72% (+0.90 over Tian et al. 2020)

miniImageNet 5-shot: 82.73% (+0.59 over Tian et al. 2020)

tieredImageNet 1-shot: 72.01% (+0.49 over Tian et al. 2020)

tieredImageNet 5-shot: 86.77% (+0.74 over Tian et al. 2020)

These accuracies are reported on 5-way few-shot classification for miniImageNet and tieredImageNet, demonstrating that Variational Semantic Memory scales from shallow to deep backbones while improving over strong metric-learning baselines.


BENCHMARK

miniImageNet 5-way 1-shot with a shallow feature extractor

Accuracy (%) comparison of Variational Semantic Memory against ProtoNet, VERSA, and MAML on miniImageNet 5-way 1-shot.

BENCHMARK

Effect of variational semantic memory on miniImageNet 1-shot

Accuracy (%) for Variational Semantic Memory versus rote and transformed memory variants on miniImageNet 5-way 1-shot.

KEY INSIGHT

The Counterintuitive Finding

On miniImageNet 5-way 1-shot, Variational Semantic Memory with a shallow encoder reaches 54.73%, beating VERSA at 53.31% despite using simple prototypes.

This is surprising because VERSA already amortizes inference over tasks, yet Variational Semantic Memory gains +1.42 points mainly by adding a probabilistic semantic memory module.

WHY IT MATTERS

What this unlocks for the field

Variational Semantic Memory shows that long-term semantic memory plus latent memory inference can systematically improve prototype quality under extreme data scarcity.

Builders can now design few shot systems that accumulate reusable semantic knowledge across tasks, rather than relearning prototypes from scratch for every new episode.


