Adaptive Posterior Learning: few-shot learning with a surprise-based memory module

Authors: Tiago Ramalho, Marta Garnelo

arXiv 2019

TL;DR

Adaptive Posterior Learning uses a surprise-based memory controller with relational decoders to reach 99.9% 5-way, 5-shot Omniglot accuracy while storing under 2 examples per class.


THE PROBLEM

Few-shot learners need too much context and memory

Meta-learning systems often assume a fixed context size, forcing them to store many redundant examples and limiting scalability to thousands of classes.

On Omniglot 20-way classification, Adaptive Posterior Learning reaches 98.5% accuracy with only 44 stored items, while fixed 5-shot baselines require 100 items for comparable accuracy, wasting memory and compute.

HOW IT WORKS

Adaptive Posterior Learning: surprise-based memory with relational decoding

Adaptive Posterior Learning combines an Encoder, Memory store, Memory controller, and Decoder to approximate posteriors from a sparse set of surprising observations.

You can think of Adaptive Posterior Learning as a student with a notebook who only writes down the questions they get wrong, then uses a smart comparison engine to reason over those notes.

This surprise-driven writing plus relational decoding lets Adaptive Posterior Learning adapt online and perform reasoning-style generalization that a plain fixed-size context window cannot support.
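
To make the surprise-driven write rule concrete, here is a minimal sketch of one online step. It assumes generic `encoder` and `decoder` modules, a plain Python list as the memory, and an illustrative 0.9 coefficient on the ln(N) threshold; none of these names, shapes, or values come from the authors' released code.

```python
# Minimal sketch of one online APL step (illustrative, not the authors' code).
import math
import torch

def apl_episode_step(x_t, y_t, encoder, decoder, memory, k=5, num_classes=20):
    """Predict the label for x_t, then write (embedding, y_t) to memory only if surprised."""
    e_t = encoder(x_t)                                    # embed the query, shape (D,)

    if memory:
        embs = torch.stack([e for e, _ in memory])        # (M, D) stored embeddings
        labels = torch.tensor([l for _, l in memory])     # (M,) stored labels
        dists = torch.cdist(e_t.unsqueeze(0), embs)[0]    # Euclidean distances to memory rows
        idx = dists.topk(min(k, len(memory)), largest=False).indices
        logits = decoder(e_t, embs[idx], labels[idx])     # condition on retrieved neighbours
    else:
        logits = torch.zeros(num_classes)                 # uniform guess with empty memory

    probs = torch.softmax(logits, dim=-1)
    surprise = -torch.log(probs[y_t] + 1e-12)             # S = -ln p(true label | x_t, memory)
    sigma = 0.9 * math.log(num_classes)                   # threshold proportional to ln(N);
                                                          # the 0.9 coefficient is an assumption
    if surprise > sigma:                                  # write only surprising examples
        memory.append((e_t.detach(), int(y_t)))
    return logits
```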

DIAGRAM

Online inference and surprise-based writing flow

This diagram shows how Adaptive Posterior Learning processes a sequence of (x, y) pairs, updates predictions, and writes surprising items to memory during an episode.

DIAGRAM

Episode training loop and dataset setup

This diagram shows how Adaptive Posterior Learning is trained over Omniglot episodes with shuffled label mappings and per-step updates.
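
A rough sketch of that training setup, reusing the `apl_episode_step` sketch above; the episode format (a list of (image, label) pairs) and the label-remapping details are assumptions made for illustration, not the released code.

```python
# Sketch of per-step training without backpropagation through time (illustrative).
import random
import torch
import torch.nn.functional as F

def train_episode(episode, encoder, decoder, optimizer, num_classes=20, k=5):
    """episode: list of (x, y) pairs with integer labels; labels are remapped per episode.
    The optimizer is assumed to cover the encoder and decoder parameters."""
    memory = []                                            # memory starts empty each episode
    classes = sorted({y for _, y in episode})
    shuffled = random.sample(classes, len(classes))        # fresh label mapping per episode
    remap = {c: i for i, c in enumerate(shuffled)}

    for x_t, y_raw in episode:
        y_t = torch.tensor(remap[y_raw])
        logits = apl_episode_step(x_t, y_t, encoder, decoder, memory,
                                  k=k, num_classes=num_classes)
        if not logits.requires_grad:                       # empty-memory steps give no gradient
            continue
        loss = F.cross_entropy(logits.unsqueeze(0), y_t.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()                                    # gradient for this step only; stored
        optimizer.step()                                   # embeddings are detached, so nothing
                                                           # flows backward across steps
```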

PROCESS

How Adaptive Posterior Learning Handles a Few-Shot Episode

  1. Architecture

    Adaptive Posterior Learning wires an Encoder, Memory store, Memory controller, and Decoder so that predictions can be conditioned on retrieved neighbors.

  2. Memory store

    Adaptive Posterior Learning stores rows of embeddings and labels in the Memory store and retrieves the k nearest neighbors by Euclidean distance.

  3. Memory controller

    Adaptive Posterior Learning uses the Memory controller with surprise S = −ln(ŷt), where ŷt is the probability the model assigns to the true label, and a threshold σ ∝ ln(N) to decide which examples to write.

  4. Decoder

    Adaptive Posterior Learning runs the Decoder, such as the relational self-attention feed-forward module, to combine the query and its retrieved neighbors into class logits (a sketch of such a decoder follows this list).
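
As referenced in step 4, here is a minimal sketch of a relational self-attention feed-forward decoder over the query and its retrieved neighbours. The layer sizes, one-hot label encoding, and mean pooling are assumptions made for illustration, not the paper's exact architecture.

```python
# Illustrative relational decoder: self-attention over (query, neighbour, label) tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationalDecoder(nn.Module):
    def __init__(self, embed_dim=64, num_classes=20, num_heads=4):
        super().__init__()
        self.num_classes = num_classes
        # each token is [query ; neighbour embedding ; one-hot neighbour label]
        self.proj = nn.Linear(2 * embed_dim + num_classes, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                nn.Linear(embed_dim, embed_dim))
        self.out = nn.Linear(embed_dim, num_classes)

    def forward(self, query, neighbour_embs, neighbour_labels):
        # query: (D,), neighbour_embs: (k, D), neighbour_labels: (k,) int64
        k = neighbour_embs.shape[0]
        onehot = F.one_hot(neighbour_labels, self.num_classes).float()
        tokens = torch.cat([query.expand(k, -1), neighbour_embs, onehot], dim=-1)
        h = self.proj(tokens).unsqueeze(0)                 # (1, k, embed_dim)
        attended, _ = self.attn(h, h, h)                   # relate neighbours to each other
        h = attended + self.ff(attended)                   # residual feed-forward
        pooled = h.mean(dim=1).squeeze(0)                  # order-invariant pooling
        return self.out(pooled)                            # class logits
```

Because the module attends over the retrieved set and then pools, its output does not depend on the order of the neighbours or, structurally, on how many were retrieved.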

KEY CONTRIBUTIONS

Key Contributions

  • Surprise-based memory controller

    Adaptive Posterior Learning introduces a Memory controller that uses surprise S = −ln(ŷt) with a threshold σ ∝ ln(N) to store only surprising, informative items, achieving 98.5% accuracy with 44 Omniglot items (the rule is spelled out after this list).

  • Integrated external and working memory

    Adaptive Posterior Learning combines a sparse kNN Memory store with relational Decoder architectures, including relational working memory and LSTM decoders, for scalable reasoning.

  • Training without sequence backpropagation

    Adaptive Posterior Learning uses per-step cross-entropy updates over episodes, avoiding backpropagation through time while still learning an approximate posterior-update algorithm.
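
To make the first contribution's write rule explicit, it can be stated as a short piece of math in the notation used above; the proportionality constant c is treated here as a hyperparameter (an assumption, since the summary only says σ is proportional to ln(N)).

```latex
% Surprise-based write rule (c is a proportionality constant, a hyperparameter)
S_t = -\ln \hat{y}_t, \qquad
\text{write } (e_t, y_t) \text{ to memory} \iff S_t > \sigma, \qquad
\sigma = c \,\ln N
```

A uniform prediction over N classes has surprise exactly ln(N), so scaling σ with ln(N) keeps the write criterion calibrated to chance level as the number of classes grows.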

RESULTS

By the Numbers

  • 5-way, 5-shot accuracy: 99.9% (0.0pp vs. MAML and SNAIL)

  • 20-way, 1-shot accuracy: 97.2% (+1.4pp over Matching nets)

  • 423-way, 5-shot accuracy: 88.0% (423-way Omniglot with a fixed 5-shot context)

  • 1000-way, 5-shot accuracy: 78.9% (1000-way Omniglot with rotated pseudoclasses)

On Omniglot few-shot classification, which tests rapid generalization to unseen characters, Adaptive Posterior Learning matches or exceeds strong baselines like Matching nets and MAML. The 99.9% 5-way, 5-shot result while storing under 2 examples per class shows that Adaptive Posterior Learning compresses context aggressively without losing accuracy.


BENCHMARK

Omniglot few-shot classification accuracies

Test accuracy (%) on Omniglot for 20-way, 1-shot classification.

KEY INSIGHT

The Counterintuitive Finding

Adaptive Posterior Learning reaches 98.5% accuracy on 20-way Omniglot while storing only 44 items in memory, just over 2 examples per class on average.

This is surprising because fixed 5-shot methods like Matching nets use 100 context items for the same task, so Adaptive Posterior Learning achieves higher accuracy with less than half the stored data.

WHY IT MATTERS

What this unlocks for the field

Adaptive Posterior Learning shows that a simple surprise threshold plus kNN retrieval can support few-shot learning and even number-analogy reasoning with fewer than one example per class.

Builders can now design meta-learners that scale to hundreds or thousands of labels, like 1000-way Omniglot and 1000-way ImageNet label shuffles, without exploding memory footprints or needing sequence-level backpropagation.


