Symbolic Working Memory Enhances Language Models for Complex Rule Application

Authors: Siyuan Wang, Zhongyu Wei, Yejin Choi, Xiang Ren

2024

TL;DR

WM-Neurosymbolic uses symbolic rule grounding plus external working memory to boost GPT-4 accuracy on CLUTRR from 85.53% to 92.34%.



THE PROBLEM

LLMs fail on multi-step rule application with non-sequential rules (GPT-4 accuracy collapses beyond single-step)

LLMs handle single-step rule application well, but GPT-4 accuracy drops sharply as rule application steps increase, especially with non-sequential rule inputs.

On CLUTRR, GPT-4 scratchpad reasoning loses substantial accuracy when moving from one-step to multi-step problems, and degrades further when rules are shuffled or mixed with distractors.
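The failure mode here is chained rule application over kinship relations. A toy sketch of the kind of chaining CLUTRR requires (the facts and rule entries below are illustrative, not the benchmark's actual data):

```python
# Toy CLUTRR-style kinship reasoning: facts are (head, relation, tail)
# triples, and composition rules chain two relations into one.

facts = [
    ("Alice", "mother", "Bob"),   # Alice is Bob's mother
    ("Bob", "brother", "Carol"),  # Bob is Carol's brother
]

# (r1, r2) -> r means: X r1 Y and Y r2 Z implies X r Z.
rules = {
    ("mother", "brother"): "mother",  # mother of Y, Y brother of Z -> mother of Z
}

def forward_chain(facts, rules):
    """Apply composition rules until no new triple can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for a, r1, b in list(derived):
            for b2, r2, c in list(derived):
                r = rules.get((r1, r2))
                if b == b2 and r and (a, r, c) not in derived:
                    derived.add((a, r, c))
                    changed = True
    return derived

print(("Alice", "mother", "Carol") in forward_chain(facts, rules))  # True
```

Each derivation step is one rule application; CLUTRR stories require several such steps, and accuracy for plain scratchpad prompting drops as the chain grows.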

HOW IT WORKS

Working Memory based Neurosymbolic Framework for Rule Application

WM-Neurosymbolic introduces an External Working Memory with a Fact Base, Rule Base, and Memory Schema plus staged Working Memory Initialization and Symbolic Rule Grounding.

Think of WM-Neurosymbolic as a CPU paired with structured RAM: LLM calls act as the processor, while the external working memory serves as RAM indexed by a Prolog-style card catalog.

By separating symbolic rule grounding from LLM-based rule implementation, WM-Neurosymbolic performs precise multi-step rule application that a plain context window and scratchpad reasoning cannot sustain.
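As a concrete sketch, the working memory described above could be organized like this; the class and field names are our assumptions for illustration, not the paper's code:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    text: str    # natural-language form, consumed by the LLM
    symbol: str  # Prolog-style symbolic form, used for rule grounding

@dataclass
class WorkingMemory:
    fact_base: list = field(default_factory=list)  # Entry facts
    rule_base: list = field(default_factory=list)  # Entry rules
    schema: dict = field(default_factory=dict)     # predicate -> argument types

    def add_fact(self, text: str, symbol: str) -> None:
        self.fact_base.append(Entry(text, symbol))

wm = WorkingMemory()
wm.schema["mother"] = ("Person", "Person")
wm.add_fact("Alice is Bob's mother.", "mother(Alice, Bob)")
wm.rule_base.append(Entry(
    "If X is the mother of Y and Y is the brother of Z, X is the mother of Z.",
    "mother(X, Z) :- mother(X, Y), brother(Y, Z).",
))
```

Keeping both forms side by side is what lets symbolic grounding operate on the Prolog-style entries while the LLM reads and writes the natural-language ones.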

DIAGRAM

Multi-step Rule Application Cycle in WM-Neurosymbolic

This diagram shows how WM-Neurosymbolic iteratively grounds rules symbolically and calls an LLM to implement them until the query is solved.

DIAGRAM

Evaluation Pipeline and Ablation Design for WM-Neurosymbolic

This diagram shows how WM-Neurosymbolic is evaluated across four datasets and compared with CoT-based and symbolic baselines plus ablations.

PROCESS

How WM-Neurosymbolic Handles a Multi-step Rule Application Query

  1. Working Memory Initialization

    WM-Neurosymbolic decomposes the context into sentences and uses Working Memory Initialization to populate the Fact Base, Rule Base, and Memory Schema with natural language and symbolic entries.

  2. Symbolic Rule Grounding

    WM-Neurosymbolic runs Symbolic Rule Grounding to perform predicate matching and variable matching over the symbolic facts and rules in the External Working Memory.

  3. LLM-based Rule Implementation

    WM-Neurosymbolic invokes LLM-based Rule Implementation to infer new facts in both natural language and Prolog form, then writes them back into the Fact Base.

  4. Final Answer Prediction

    In Final Answer Prediction, WM-Neurosymbolic checks whether the inferred facts resolve the query; if not, it iterates further, falling back to scratchpad CoT when symbolic grounding stalls.
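The four steps above can be sketched as a single loop. Here `ground_rules` and `llm_implement_rule` are hypothetical stand-ins for Symbolic Rule Grounding and the LLM call, and the facts are simple propositional tokens to keep the sketch minimal:

```python
def ground_rules(facts, rules):
    """Symbolic grounding stand-in: return the rules whose premises
    are all present in the current fact base."""
    return [(prem, concl) for prem, concl in rules
            if all(p in facts for p in prem)]

def llm_implement_rule(premises, conclusion):
    """LLM stand-in: the real system asks an LLM to produce the new
    fact in natural-language and Prolog form; here we just emit it."""
    return conclusion

def solve(facts, rules, query, max_steps=10):
    facts = set(facts)
    for _ in range(max_steps):
        if query in facts:                      # Final Answer Prediction
            return True
        grounded = ground_rules(facts, rules)   # Symbolic Rule Grounding
        new = {llm_implement_rule(p, c) for p, c in grounded}
        if new <= facts:                        # fixpoint: nothing new inferred
            break
        facts |= new                            # write back to the Fact Base
    return query in facts

rules = [(("a", "b"), "c"), (("c",), "d")]
print(solve({"a", "b"}, rules, "d"))  # True
```

The key design choice mirrored here is that rule *selection* is symbolic and deterministic, while rule *execution* is delegated to the LLM one step at a time.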

KEY CONTRIBUTIONS

Key Contributions

  • External Working Memory for Rule Application

    WM-Neurosymbolic introduces an External Working Memory with a Fact Base, Rule Base, and Memory Schema that stores facts and rules in both natural language and symbolic forms for precise tracking.

  • Neurosymbolic Rule Grounding and Implementation

    WM-Neurosymbolic disentangles Symbolic Rule Grounding from LLM-based Rule Implementation: symbolic predicate and variable matching select which rule fires, and the LLM carries out each single-step application, together sustaining complex multi-step reasoning.

  • Robust Multi-dataset Improvements

    WM-Neurosymbolic achieves 92.34% on CLUTRR and 77.33% on ProofWriter with GPT-4, improving over Self-Consistency CoT by +6.81 and +15.33 accuracy points respectively.
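The predicate and variable matching in the second contribution can be illustrated with a toy unifier; terms beginning with `?` play the role of Prolog variables, and the function name is our own, not the paper's:

```python
def match(premise, fact, binding=None):
    """Match one rule premise against a ground fact, extending a
    variable binding. Atoms are tuples: (predicate, arg1, arg2, ...).
    Terms starting with '?' are variables. Returns the extended
    binding, or None if predicate, arity, or arguments conflict."""
    binding = dict(binding or {})
    if premise[0] != fact[0] or len(premise) != len(fact):
        return None                      # predicate matching failed
    for p, f in zip(premise[1:], fact[1:]):
        if p.startswith("?"):            # variable matching
            if binding.setdefault(p, f) != f:
                return None              # variable already bound elsewhere
        elif p != f:
            return None                  # constant mismatch
    return binding

print(match(("mother", "?X", "?Y"), ("mother", "Alice", "Bob")))
# {'?X': 'Alice', '?Y': 'Bob'}
```

Grounding a rule amounts to running this match over every premise against the symbolic fact base and keeping only bindings consistent across all premises.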

RESULTS

By the Numbers

CLUTRR accuracy %

92.34%

+6.81 over SC-CoT

ProofWriter accuracy %

77.33%

+11.66 over SymbCoT

AR-LSAT accuracy %

70.00%

+10.00 over SymbCoT

Boxes accuracy %

100%

+6.67 over SC-CoT

On CLUTRR, ProofWriter, AR-LSAT, and Boxes, which test multi-step logical reasoning, constraint satisfaction, and object state tracking, WM-Neurosymbolic consistently delivers higher accuracy than CoT and symbolic baselines. These headline numbers show WM-Neurosymbolic scales to longer rule chains and noisy rule settings where scratchpad reasoning degrades.


BENCHMARK

Overall results on CLUTRR with GPT-4

Accuracy % on CLUTRR for WM-Neurosymbolic and key GPT-4 baselines.

BENCHMARK

Overall results on ProofWriter with GPT-4

Accuracy % on ProofWriter for WM-Neurosymbolic and symbolic or CoT baselines.

KEY INSIGHT

The Counterintuitive Finding

WM-Neurosymbolic reaches 100% accuracy on Boxes with GPT-4, while Self-Consistency CoT already scores a strong 93.33% using the same backbone.

This is surprising because Boxes involves relatively short operation sequences, yet WM-Neurosymbolic still gains +6.67 points, contradicting the intuition that external working memory only helps on extremely long chains.

WHY IT MATTERS

What this unlocks for the field

WM-Neurosymbolic unlocks reliable multi-step rule application under shuffled and noisy rules by combining symbolic grounding with LLM-based implementation over an explicit working memory.

Builders can now design systems that execute long deductive chains, constraint satisfaction, and state tracking with transparent intermediate facts instead of brittle monolithic chain-of-thought prompts.
