Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models

Authors: Yabin Zhang, Wenjie Zhu, Hui Tang et al.

arXiv 2024

TL;DR

Dual Memory Networks uses dynamic and static memory readout over CLIP features to reach 72.25% zero-shot accuracy on ImageNet vs 66.73% for CLIP (+5.52 points).


THE PROBLEM

Adaptation methods handle only one setting and ignore historical test data, leaving over 3% of zero-shot accuracy on the table

Most CLIP adaptation methods are tailored to only one or two paradigms and cannot handle zero-shot, few-shot, and training-free few-shot together.

Zero-shot methods that discard historical test samples plateau; by caching them, Dual Memory Networks scores over 3% higher than prior zero-shot baselines and even beats methods that rely on external training data.

HOW IT WORKS

Dual Memory Networks — dynamic and static memory with flexible readout

Dual Memory Networks centers on a Dynamic Memory Network, a Static Memory Network, a shared ReadOut function, and residual projection layers ω on top of frozen CLIP encoders.

You can think of the static memory as a read-only disk of labeled training features and the dynamic memory as RAM that logs and refines knowledge from historical test samples.

By casting memory interaction as a single attention-based readout, Dual Memory Networks builds sample-adaptive classifiers that go beyond the fixed text classifier of plain zero-shot CLIP.
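To make this concrete, below is a minimal PyTorch sketch of an attention-based readout in the spirit of the shared ReadOut module; the function name, tensor shapes, and softmax scaling are our assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def readout(v, memory, w_q=None, w_k=None, w_v=None, w_o=None):
    """Attention-based readout over a per-class memory (sketch).

    v:      (d,) L2-normalized CLIP image feature of the test sample.
    memory: (C, S, d) cached features, S slots for each of C classes.
    w_*:    optional residual projections ωq, ωk, ωv, ωo;
            None means identity, i.e. the training-free mode.
    """
    def proj(w, x):
        # Residual projection ω: identity when no layer is given.
        return x if w is None else x + w(x)

    q = proj(w_q, v)         # query from the test feature
    k = proj(w_k, memory)    # keys from memory slots
    val = proj(w_v, memory)  # values from memory slots
    attn = F.softmax(k @ q / q.shape[-1] ** 0.5, dim=-1)  # (C, S) weights
    cls = torch.einsum('cs,csd->cd', attn, val)           # (C, d) classifier
    return proj(w_o, cls)    # one adaptive weight vector per class
```

Passing None for all four projections reproduces the training-free behavior in which ω collapses to the identity; in few-shot mode the same call takes trainable layers.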

DIAGRAM

Test-time memory interaction and update flow

This diagram shows how Dual Memory Networks processes a test sample, updates dynamic memory, and reads from both memories to form sample-adaptive classifiers.

DIAGRAM

Evaluation pipeline across three adaptation settings

This diagram shows how Dual Memory Networks is configured and evaluated for zero-shot, training-free few-shot, and few-shot settings on 11 datasets.

PROCESS

How Dual Memory Networks Handles a Classification Sample

  1. Convert a new input x into the feature space

    Dual Memory Networks uses frozen CLIP encoders to map the image to feature v and class texts to the text classifier C, feeding both into the ReadOut module.

  2. Update the memory M with x

    Dual Memory Networks writes v into the category-split dynamic memory Md using pseudo label y and entropy-based slot replacement, while static memory Ms holds labeled training features when available.

  3. Read out an output given x and the current memory

    Dual Memory Networks applies the ReadOut cross-attention with projection layers ωq, ωk, ωv, ωo over Md and Ms to produce sample-adaptive classifiers Cd and Cs.

  4. Convert the output into the desired response

    Dual Memory Networks computes Pd and Ps via M2P, combines them with Pt as P_dmn = α1·Pt + α2·Pd + α3·Ps, and outputs the final class prediction; a sketch of this step follows the list.
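As promised above, here is a minimal PyTorch sketch of steps 2 and 4; the entropy bookkeeping, the temperature, and the α weights are illustrative assumptions rather than the paper's tuned hyperparameters.

```python
import torch
import torch.nn.functional as F

def entropy(p):
    # Shannon entropy of a probability vector; lower means more confident.
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

def update_dynamic_memory(Md, Md_ent, v, text_cls, temp=0.01):
    """Entropy-based slot replacement for the dynamic memory Md (sketch).

    Md:       (C, S, d) per-class feature slots.
    Md_ent:   (C, S) entropy of each stored slot; initialize to +inf so
              empty slots are filled first (our assumption).
    text_cls: (C, d) frozen CLIP text classifier used for pseudo labels.
    """
    p = F.softmax(v @ text_cls.T / temp, dim=-1)  # zero-shot probabilities Pt
    y = int(p.argmax())                           # pseudo label for v
    h = float(entropy(p))
    worst = int(Md_ent[y].argmax())               # least confident slot of class y
    if h < float(Md_ent[y, worst]):               # keep only more confident features
        Md[y, worst] = v
        Md_ent[y, worst] = h
    return p                                      # Pt, reused in the fusion below

def fuse(Pt, Pd, Ps, a1=1.0, a2=1.0, a3=1.0):
    # P_dmn = α1·P_t + α2·P_d + α3·P_s; the α values here are placeholders.
    return a1 * Pt + a2 * Pd + a3 * Ps
```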

KEY CONTRIBUTIONS

Key Contributions

  • Versatile Dual Memory Networks for three adaptation tasks

    Dual Memory Networks unifies zero-shot, few-shot, and training-free few-shot adaptation without external training data, using a Dynamic Memory Network and Static Memory Network plus shared ReadOut.

  • Flexible memory interaction strategy with projection layers

    Dual Memory Networks introduces residual projection layers ω that degenerate to identity in training-free mode and are trainable in few-shot mode, enabling efficient sample-adaptive classifiers; see the sketch after this list.

  • State-of-the-art results and robustness to distribution shifts

    Dual Memory Networks reaches 72.25% zero-shot ImageNet with ViT-B/16 and 63.71% mean over 11 datasets, and improves ImageNet-A accuracy from 47.87% to 58.28% under distribution shifts.
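The projection-layer design from the second contribution can be sketched as a residual block. Zero-initializing the linear branch is our assumption of one simple way to realize the "degenerates to identity" behavior the paper describes.

```python
import torch.nn as nn

class ResidualProjection(nn.Module):
    """Residual projection ω (sketch): starts as the identity map."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.linear.weight)  # ω(x) = x + 0·x = x at init

    def forward(self, x):
        return x + self.linear(x)
```

In training-free mode the layer is simply left at (or removed as) the identity; in few-shot mode the same module is trained on the labeled features.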

RESULTS

By the Numbers

ImageNet accuracy

72.25%

+5.52 over CLIP ViT-B/16

Mean accuracy

70.72%

+4.81 over CLIP ViT-B/16

ImageNet-A accuracy

58.28%

+10.41 over CLIP ViT-B/16

Zero-shot mean accuracy

63.71%

+4.26 over CALIP-RN50 on 11 datasets

On eleven zero-shot benchmarks including ImageNet, Flowers102, DTD, and EuroSAT, Dual Memory Networks consistently raises accuracy over CLIP and TPT, showing that dynamic and static memories materially improve CLIP adaptation without external data.

BENCHMARK

Zero-shot ImageNet accuracy with ViT-B/16 backbone

Top-1 accuracy (%) on ImageNet zero-shot classification with ViT-B/16 encoders.

KEY INSIGHT

The Counterintuitive Finding

Dual Memory Networks, without any external training data, reaches 72.25% zero-shot accuracy on ImageNet and beats CaFo, which relies on synthetic training images, by 1.48% in mean accuracy.

This is surprising because synthetic labeled images from powerful generators were expected to be more useful than unlabeled historical test samples cached online.

WHY IT MATTERS

What this unlocks for the field

Dual Memory Networks shows that simple attention-based memories over CLIP features can unify zero-shot, few-shot, and training-free few-shot adaptation in one framework.

Builders can now deploy a single CLIP-based system that improves over time from test streams, uses few-shot labels when available, and still works when no training data or external generators exist.


Related papers

Memory Architecture

A Control Architecture for Training-Free Memory Use

Yanzhen Lu, Muchen Jiang et al.

· 2026

TAG routes low-confidence steps via uncertainty-based routing, filters them with guarded acceptance and rollback, selects between rule and exemplar memory banks, and prunes entries via evidence-based retirement, all inside a unified control loop. On SVAMP and ASDiv, TAG reaches 81.0% and 85.2% accuracy, improving over the 74.0% and 77.5% no-memory baselines, while a compute-matched Retry baseline stays flat.
