A Dataset and Architecture for Visual Reasoning with a Working Memory

Authors: Guangyu Robert Yang, Igor Ganichev, Xiao-Jing Wang et al.

arXiv 2018

TL;DR

The COG dataset plus a recurrent controller with feature and spatial attention reaches near-human 96.8% accuracy on CLEVR while supporting zero-shot generalization across 44 compositional tasks.


THE PROBLEM

Visual agents lack working memory for time-varying reasoning

Existing VQA systems often exploit dataset biases instead of performing true reasoning, and their designs typically ignore time and memory altogether.

Without robust working memory, agents struggle on tasks that require tracking objects across frames, limiting reliable video understanding and sequential decision-making.

HOW IT WORKS

COG architecture for visual reasoning with working memory

SYS_NAME centers on a Visual processing CNN, a Semantic processing BiLSTM, a Visual short-term memory (vSTM) module, and a recurrent Controller that coordinates attention and gating.

You can think of SYS_NAME as a brain-inspired system: the CNN is the eye, the BiLSTM the language cortex, the vSTM parietal spatial memory, and the Controller the prefrontal cortex.

This coordination, plus iterative pondering, lets SYS_NAME chain multi-step operations over time, something a plain context window or a single-pass CNN+LSTM stack cannot achieve.
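The pondering loop can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the controller here is a plain tanh recurrence (the paper uses a gated recurrent controller), and the CNN/BiLSTM features are random placeholders with made-up dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def controller_step(state, visual_feat, semantic_feat, W):
    # One "pondering" step: the controller fuses visual and semantic
    # features with its previous state. A simple tanh recurrence stands
    # in for the gated recurrent unit used in the paper.
    x = np.concatenate([state, visual_feat, semantic_feat])
    return np.tanh(W @ x)

# Illustrative dimensions, not taken from the paper.
state_dim, vis_dim, sem_dim, ponder_steps = 8, 16, 12, 4
W = rng.normal(scale=0.1, size=(state_dim, state_dim + vis_dim + sem_dim))

state = np.zeros(state_dim)
for frame in range(3):                        # image sequence
    visual_feat = rng.normal(size=vis_dim)    # stand-in for CNN features
    semantic_feat = rng.normal(size=sem_dim)  # stand-in for BiLSTM features
    for _ in range(ponder_steps):             # several updates per frame
        state = controller_step(state, visual_feat, semantic_feat, W)

print(state.shape)  # (8,)
```

The key point is the inner loop: the controller updates its state several times per frame before the next image arrives, which is what lets it chain reasoning steps.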

DIAGRAM

Pondering-based inference flow over an image sequence

This diagram shows how SYS_NAME iteratively processes each image and instruction token using pondering steps, attention, and visual short-term memory.

DIAGRAM

COG dataset generation and evaluation pipeline

This diagram shows how SYS_NAME is trained and evaluated on COG using operator graphs, backward image generation, and sequence level supervision.

PROCESS

How SYS_NAME Handles a COG Task Instance

  1. 01

    Task execution

    During Task execution, SYS_NAME receives each image and instruction, with Visual processing and Semantic processing feeding features into the Controller at every pondering step.

  2. 02

    Image sequence generation

    In Image sequence generation, SYS_NAME is trained on COG sequences produced by operator graphs that specify object attributes, relations, and required Visual short-term memory usage.

  3. 03

    Forward pass through the graph

    A Forward pass through the graph lets SYS_NAME simulate operator compositions by iteratively updating Controller state, attention, and Visual short-term memory across frames.

  4. 04

    Backward pass through the graph

    The Backward pass through the graph is used by COG to generate minimally biased image sequences, and SYS_NAME learns to invert this logic via its recurrent Controller and attention mechanisms.
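The operator-graph idea behind these steps can be made concrete with a toy forward pass. The operator names and scene format below are illustrative stand-ins, not the paper's exact vocabulary: each operator is a function composed over a simple scene of attribute dictionaries.

```python
# Toy scene: a list of objects with attributes, standing in for one frame.
scene = [
    {"shape": "circle", "color": "red", "loc": (0.2, 0.7)},
    {"shape": "square", "color": "blue", "loc": (0.8, 0.3)},
]

def select(objs, **attrs):
    # Select objects matching all given attribute constraints.
    return [o for o in objs if all(o[k] == v for k, v in attrs.items())]

def exist(objs):
    # True if the selection is non-empty.
    return len(objs) > 0

def get_color(objs):
    # Read an attribute off the (first) selected object.
    return objs[0]["color"] if objs else None

# Forward pass through a small graph: "what is the color of the circle?"
answer = get_color(select(scene, shape="circle"))
print(answer)  # red

# "does a green square exist?"
print(exist(select(scene, shape="square", color="green")))  # False
```

In COG the direction is also reversed: the backward pass starts from a target answer and fills in object attributes so that the generated images admit that answer, which is how the dataset avoids trivially biased inputs.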

KEY CONTRIBUTIONS

Key Contributions

  • 01

    The COG dataset

    SYS_NAME introduces COG, a compositional dataset with 8 operators, 44 tasks, and more than 2 trillion possible task instances exercising working memory and visual reasoning.

  • 02

    Multi-modal recurrent architecture

    SYS_NAME defines a multi-modal architecture combining Visual processing, Semantic processing, Visual short-term memory, and a Controller with feature and spatial attention for temporal reasoning.

  • 03

    Zero-shot task generalization

    SYS_NAME demonstrates zero-shot generalization: networks trained on 43 tasks reach 85.4% average accuracy on the held-out task, far above the 26.7% chance level.
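The two attention types named in the contributions act on different axes of the CNN feature map. A minimal sketch, assuming a feature map of shape (channels, height, width) and random placeholder attention vectors where the paper's controller would produce them:

```python
import numpy as np

rng = np.random.default_rng(1)

# CNN feature map: (channels, height, width). Dimensions are illustrative.
C, H, W = 4, 5, 5
feat = rng.normal(size=(C, H, W))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Feature attention: a gain per channel (e.g. boost "red" detectors),
# broadcast across all spatial locations.
feat_gain = softmax(rng.normal(size=C))
feature_attended = feat * feat_gain[:, None, None]

# Spatial attention: a weight per location, normalized over the H*W grid,
# then used to pool the map down to one feature vector.
spat = softmax(rng.normal(size=(H, W)).ravel()).reshape(H, W)
spatially_pooled = (feature_attended * spat[None, :, :]).sum(axis=(1, 2))

print(spatially_pooled.shape)  # (4,)
```

Feature attention asks "which kind of thing", spatial attention asks "where", which is why the Key Insight below about their different roles on CLEVR versus COG is meaningful.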

RESULTS

By the Numbers

Overall

96.8%

+1.3 over CNN+LSTM+RN

Count

91.7%

+1.6 over CNN+LSTM+RN

Exist

99.0%

+1.2 over CNN+LSTM+RN

Compare Numbers

95.5%

+1.9 over CNN+LSTM+RN

These metrics come from the CLEVR test set, which probes compositional visual reasoning. MAIN_RESULT shows SYS_NAME matches specialized CLEVR models while also supporting working memory for COG.

BENCHMARK


CLEVR test accuracies for human, baseline, and top-performing models

Overall accuracy on CLEVR.

KEY INSIGHT

The Counterintuitive Finding

SYS_NAME relies heavily on feature attention for CLEVR but primarily on spatial attention for COG, even though both are visual reasoning benchmarks.

This is surprising because one might expect the same attention mechanism to dominate across datasets, but SYS_NAME instead adapts mechanisms to object combinatorics and temporal demands.

WHY IT MATTERS

What this unlocks for the field

SYS_NAME shows that a single architecture can handle static CLEVR style reasoning and temporally extended COG tasks with explicit working memory.

Builders can now prototype agents that follow language instructions over image sequences, track objects through time, and even generalize to unseen tasks without retraining.


