Working Memory Capacity of ChatGPT: An Empirical Study

Authors: Dongyu Gong, Xingchen Wan, Dingmin Wang

2023

TL;DR

Working Memory Capacity of ChatGPT uses verbal and spatial n-back tasks to show that ChatGPT’s detection sensitivity drops to d′ ≈ 1 at n = 3, mirroring the typical human capacity limit.



THE PROBLEM

LLMs show a human-like capacity drop in n-back tasks at n = 3

Working Memory Capacity of ChatGPT reports that typical human performance drops significantly when n = 3 in n-back tasks, defining an average human capacity limit.

Working Memory Capacity of ChatGPT shows ChatGPT’s detection sensitivity similarly drops to around d′ = 1 at n = 3, limiting reasoning and problem solving over longer stimulus streams.

HOW IT WORKS

Working Memory Capacity of ChatGPT — n-back tasks as a cognitive probe

Working Memory Capacity of ChatGPT builds verbal n-back experiments, spatial n-back experiments, and model comparison blocks to quantify working memory under noise, feedback, and chain-of-thought prompting.

You can think of Working Memory Capacity of ChatGPT as stress-testing ChatGPT’s RAM with controlled letter and grid streams, much like classic human cognitive psychology experiments.

This design lets Working Memory Capacity of ChatGPT expose a concrete capacity limit that a plain context window view cannot explain, linking LLM behavior to human working memory theories.
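The trial-by-trial probing idea can be sketched as a simple scoring loop. This is a minimal illustration, not the paper's actual prompting code: in the study each stimulus is sent to ChatGPT as a chat message, whereas `ask_model` below is a hypothetical stand-in that responds perfectly, so the loop's hit/false-alarm bookkeeping can be seen in isolation.

```python
def ask_model(history, stimulus, n):
    # Hypothetical stand-in for the LLM: a perfect n-back responder
    # that answers "m" on a match and "-" otherwise.
    return "m" if len(history) >= n and history[-n] == stimulus else "-"

def run_block(sequence, n):
    """Present stimuli one at a time and tally hits and false alarms."""
    history, hits, false_alarms = [], 0, 0
    for s in sequence:
        answer = ask_model(history, s, n)
        is_match = len(history) >= n and history[-n] == s
        if answer == "m" and is_match:
            hits += 1
        elif answer == "m" and not is_match:
            false_alarms += 1
        history.append(s)
    return hits, false_alarms

print(run_block(list("ababab"), 2))  # → (4, 0): every trial from the 3rd on is a 2-back match
```

Swapping `ask_model` for a real API call turns this into the study's interaction pattern: the model never sees the whole sequence at once, only one stimulus per turn.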

DIAGRAM

Trial-by-trial interaction in verbal and spatial n-back tasks

This diagram shows how Working Memory Capacity of ChatGPT presents each trial to ChatGPT and records responses across verbal and spatial n-back blocks.

DIAGRAM

Evaluation pipeline across verbal, spatial, and model comparison experiments

This diagram shows how Working Memory Capacity of ChatGPT structures verbal and spatial experiments and then compares multiple LLMs on the base verbal n-back task.

PROCESS

How Working Memory Capacity of ChatGPT Handles an n-back Experiment

  1. Verbal n-back experiments

    Working Memory Capacity of ChatGPT generates 50 blocks of 24 letter sequences with 8 match and 16 nonmatch trials for n = 1, 2, 3.

  2. Spatial n-back experiments

    Working Memory Capacity of ChatGPT constructs 3 × 3 ASCII grids with X positions, plus larger 4 × 4, 5 × 5, and 7 × 7 grids for spatial variants.

  3. Verbal n-back experiments with noise, feedback, and reasoning

    Working Memory Capacity of ChatGPT adds 3 to 6 noise symbols, trial feedback, and chain-of-thought instructions to probe robustness and reasoning.

  4. Model comparison

    Working Memory Capacity of ChatGPT runs the base verbal n-back on Bloomz, ChatGLM, Vicuna, GPT-3.5, and GPT-4 to relate capacity to general capability.
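The block construction in step 1 can be sketched as follows. This is a minimal sketch under stated assumptions: the letter pool, the sampling scheme, and the resampling rule for non-match trials are illustrative choices, not taken from the paper; only the counts (24 trials, 8 matches, 16 non-matches) come from the study.

```python
import random

LETTERS = "bcdfghjklmnpqrstvwxz"  # assumed letter pool, not the paper's

def generate_block(n, trials=24, matches=8, seed=0):
    """Build one n-back block: a letter sequence with exactly `matches`
    trials where the stimulus equals the one presented n steps back."""
    rng = random.Random(seed)
    # Only positions with at least n predecessors can be match trials.
    match_pos = set(rng.sample(range(n, trials), matches))
    seq = []
    for i in range(trials):
        if i in match_pos:
            seq.append(seq[i - n])          # deliberate n-back repeat
        else:
            c = rng.choice(LETTERS)
            # Resample to avoid accidental matches on non-match trials.
            while i >= n and c == seq[i - n]:
                c = rng.choice(LETTERS)
            seq.append(c)
    return seq, match_pos
```

Because non-match trials are resampled until they differ from the stimulus n steps back, each block contains exactly 8 match and 16 non-match trials by construction.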

KEY CONTRIBUTIONS

Key Contributions

  • Empirical analysis of the working memory capacity of ChatGPT

    Working Memory Capacity of ChatGPT shows ChatGPT’s d′ drops to around 1 at n = 3 in verbal tasks, matching typical human capacity limits.

  • Verbal and spatial n-back experiments with variants

    Working Memory Capacity of ChatGPT designs verbal and spatial n-back tasks with noise, feedback, CoT reasoning, abstract spatial rules, and varying grid sizes.

  • Model comparison using n-back as an intelligence index

    Working Memory Capacity of ChatGPT finds GPT-4 has working memory capacity far exceeding Bloomz-7B, Bloomz-7B1-mt, ChatGLM-6B, and Vicuna models on verbal n-back.

RESULTS

By the Numbers

Detection sensitivity d′

≈1 at n = 3

capacity limit for ChatGPT on verbal tasks, compared with higher d′ at n = 1 and n = 2

Blocks per condition

50 blocks

per n level for both verbal and spatial n-back tasks

Trials per block

24 trials

with 8 match and 16 nonmatch trials in each block

Noise symbols per trial

3 to 6 symbols

added in verbal with noise to prevent simple string matching

Working Memory Capacity of ChatGPT uses 50 blocks of 24 trials for each n-back level on verbal and spatial tasks, computing detection sensitivity d′. The drop to d′ ≈ 1 at n = 3 indicates a human-like working memory capacity limit in ChatGPT.
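The d′ statistic above comes from signal detection theory: d′ = z(hit rate) − z(false-alarm rate), where z is the inverse normal CDF. A minimal sketch using only the standard library is below; the log-linear correction for extreme rates is a common convention and an assumption here, since the paper's exact correction is not specified in this summary.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Detection sensitivity d' = z(hit rate) - z(false-alarm rate).
    Applies a log-linear correction (add 0.5 to counts, 1 to totals)
    so rates of exactly 0 or 1 never reach the inverse CDF."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return z(hit_rate) - z(fa_rate)
```

With 8 match and 16 non-match trials per block, a model answering at chance (equal hit and false-alarm rates) scores d′ ≈ 0, while a perfect block scores well above the d′ ≈ 1 threshold reported at n = 3.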

BENCHMARK

Verbal n-back task base version across LLMs

Relative working memory capacity patterns inferred from d′ curves on the verbal n-back base task.

KEY INSIGHT

The Counterintuitive Finding

Working Memory Capacity of ChatGPT finds that increasing grid size from 3 × 3 to 7 × 7 actually increases spatial working memory capacity in ChatGPT.

This is surprising because more possible spatial locations should increase task difficulty, yet Working Memory Capacity of ChatGPT observes less interference and better performance with larger grids.
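The spatial stimuli behind this finding are single-occupancy grids like those in step 2 of the process above. The sketch below shows one plausible ASCII rendering, parameterized by grid size; the exact glyphs and layout the paper feeds to ChatGPT are assumptions here, not reproduced from the study.

```python
def render_grid(size, pos):
    """Render a size x size ASCII grid with an 'X' at (row, col) `pos`
    and '-' everywhere else, one row per line."""
    return "\n".join(
        " ".join("X" if (r, c) == pos else "-" for c in range(size))
        for r in range(size)
    )

print(render_grid(3, (1, 1)))
```

Scaling `size` from 3 to 7 reproduces the manipulation behind the counterintuitive result: the X occupies one of more possible locations, yet the paper reports less interference rather than more.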

WHY IT MATTERS

What this unlocks for the field

Working Memory Capacity of ChatGPT gives LLM researchers a cognitively grounded, n-back based tool to quantify working memory capacity across models and prompts.

With Working Memory Capacity of ChatGPT, builders can systematically tune instructions, feedback, and reasoning prompts to push LLMs closer to human-like working memory profiles for complex reasoning tasks.


Related papers

Benchmark · Agent Memory

Active Context Compression: Autonomous Memory Management in LLM Agents

Nikhil Verma

· 2026

Focus Agent adds start_focus, complete_focus, a persistent Knowledge block, and an optimized Persistent Bash plus String-Replace Editor scaffold to actively compress context during long software-engineering tasks. On five hard SWE-bench Lite instances against a Baseline ReAct agent, Focus Agent achieves 22.7% token reduction (14.9M → 11.5M) while matching 3/5 = 60% task success.
