Recurrent Neural Networks with External Memory for Language Understanding

Authors: Baolin Peng, Kaisheng Yao

arXiv 2015

TL;DR

RNN-EM uses an external memory of gated slots to store past hidden states and reaches 95.25% F1 on ATIS, +0.40 over LSTM.


THE PROBLEM

RNNs forget long-term dependencies due to vanishing and exploding gradients

Simple recurrent neural networks suffer from vanishing and exploding gradients, which limits their memory capacity because error signals cannot back-propagate far enough through time.

In language understanding, this means semantic taggers cannot reliably connect distant words to their labels, hurting slot filling accuracy on datasets like ATIS.

HOW IT WORKS

RNN-EM architecture with external memory

RNN-EM introduces an external memory Mt, a key vector kt, a forget gate ft, and an update gate ut to control read and write operations.

You can think of RNN-EM as a CPU with RAM, where the recurrent hidden layer ht is the processor and Mt is an addressable memory bank.

This design lets RNN-EM selectively retrieve and update past hidden activities beyond what a fixed recurrent state can hold, overcoming plain context window limitations.
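
To make the addressing concrete, here is a minimal NumPy sketch of content-based addressing over such a memory bank. The function name address_memory, the array shapes, and the small epsilon are our own illustrative choices, not code from the paper.

```python
import numpy as np

def address_memory(M, k, beta):
    """Content-based addressing over an external memory (illustrative sketch).

    M    : (d, n) memory bank, one d-dimensional vector per slot
    k    : (d,)   key emitted from the recurrent hidden state
    beta : float  non-negative sharpening scalar
    Returns a weight vector over the n slots that peaks on slots similar to k.
    """
    # cosine similarity between the key and every memory slot (columns of M)
    sims = (M.T @ k) / (np.linalg.norm(M, axis=0) * np.linalg.norm(k) + 1e-8)
    # exponentiate with beta and normalize; larger beta means sharper focus
    w = np.exp(beta * sims)
    return w / w.sum()
```

Reading is then just a weighted sum of the slots, c = M w, which is what lets the hidden layer retrieve activities stored many steps earlier.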

DIAGRAM

RNN-EM memory read and write over time

This diagram shows how RNN-EM reads from Mt−1 and writes to Mt at each time step using wt, ft, ut, ct, and vt.

DIAGRAM

ATIS evaluation and memory size ablation pipeline

This diagram shows how RNN-EM is trained and evaluated on ATIS while sweeping memory slot number n.

PROCESS

How RNN-EM Handles a Language Understanding Sentence

  1. 01

    Model input and output

    RNN-EM maps each word window to an embedding xt and computes hidden activity ht from Wih xt and Wc ct, where ct is the content read from the external memory.

  2. 02

    External memory read

    RNN-EM generates key kt and sharpening scalar βt from ht, builds weight vector wt via cosine similarity between kt and the slots of Mt−1, and reads content ct = Mt−1 wt−1.

  3. 03

    External memory update

    RNN-EM produces new content vt, computes forget gate ft and update gate ut from wt and erase vector et, and updates the memory to Mt with vt via Eq. (19).

  4. 04

    Output softmax prediction

    RNN-EM feeds ht into Who to produce softmax output yt, the semantic tag distribution for the current word in the ATIS sentence; a minimal end-to-end sketch of these four steps follows this list.
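
Putting the four steps together, here is a minimal single-time-step sketch of an RNN-EM-style cell in NumPy. The parameter names (W_ih, W_c, W_k, w_beta, W_v, W_e, W_ho), the softplus for βt, and the exact forget/update form are assumptions written from the description above; the paper's Eq. (19) may differ in detail.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_em_step(x, M_prev, w_prev, p):
    """One RNN-EM-style time step (illustrative; gating details are assumptions).

    x      : (dx,)   embedding of the current word window
    M_prev : (dm, n) external memory carried over from the previous step
    w_prev : (n,)    previous read/write weights over the n slots
    p      : dict of weight matrices (hypothetical names, see lead-in)
    """
    # 1) read content with the previous weights, then compute hidden activity
    c = M_prev @ w_prev                                  # c_t = M_{t-1} w_{t-1}
    h = sigmoid(p["W_ih"] @ x + p["W_c"] @ c)            # hidden state h_t
    # 2) emit a key and sharpening scalar, address memory by cosine similarity
    k = p["W_k"] @ h
    beta = np.log1p(np.exp(p["w_beta"] @ h))             # softplus keeps beta >= 0
    sims = (M_prev.T @ k) / (np.linalg.norm(M_prev, axis=0) * np.linalg.norm(k) + 1e-8)
    w = np.exp(beta * sims)
    w /= w.sum()                                         # new weights w_t
    # 3) write: new content v_t, erase vector e_t, per-cell forget, per-slot update
    v = p["W_v"] @ h
    e = sigmoid(p["W_e"] @ h)                            # erase vector in (0, 1)
    f = 1.0 - np.outer(e, w)                             # forget: how much of each cell to keep
    M = M_prev * f + np.outer(v, w)                      # erase, then add new content at w
    # 4) predict a semantic tag distribution for the current word
    y = softmax(p["W_ho"] @ h)
    return y, h, M, w
```

A hypothetical usage with 8 slots (dimensions chosen for illustration, not taken from the paper):

```python
rng = np.random.default_rng(0)
dx, dh, dm, n, dy = 100, 100, 40, 8, 127   # hypothetical sizes; dy = number of slot labels
shapes = {"W_ih": (dh, dx), "W_c": (dh, dm), "W_k": (dm, dh), "w_beta": (dh,),
          "W_v": (dm, dh), "W_e": (dm, dh), "W_ho": (dy, dh)}
p = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
y, h, M, w = rnn_em_step(rng.normal(size=dx), np.full((dm, n), 0.1), np.full(n, 1.0 / n), p)
```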

KEY CONTRIBUTIONS

Key Contributions

  • 01

    RNN-EM architecture with external memory

    RNN-EM augments simple RNNs with an external memory Mt, key vector kt, and gates ft and ut to store and retrieve past hidden activities across sentences.

  • 02

    State of the art on ATIS

    RNN-EM reaches 95.25% F1 on ATIS, surpassing LSTM at 94.85% and GRNN at 94.82% with comparable parameter counts (around 7.3×10^3).

  • 03

    Memory size analysis for RNN-EM

    The paper systematically varies the number of memory slots n from 1 to 512, showing the best F1 of 95.22% at n = 8 and revealing non-monotonic effects on training entropy (see the sketch below).
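
Because the addressing above is content-based, the slot count n only changes the size of the memory state, not the number of trainable weights (assuming the initial memory is not a learned parameter), which is what makes a sweep from 1 to 512 slots possible at roughly constant model size. A small sanity check against the sketch above, using the same hypothetical dimensions:

```python
def sketch_param_count(dx, dh, dm, dy):
    """Trainable weights in the rnn_em_step sketch above; note that n never appears."""
    return (dh * dx) + (dh * dm) + (dm * dh) + dh + (dm * dh) + (dm * dh) + (dy * dh)

for n in [1, 8, 64, 512]:
    # extra slots enlarge the memory state M (dm x n values), not the weight matrices
    print(f"n={n:3d}  params={sketch_param_count(100, 100, 40, 127)}  memory cells={40 * n}")
```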

RESULTS

By the Numbers

  • RNN-EM: 95.25% F1 (+0.40 over LSTM)

  • LSTM baseline on ATIS: 94.85% F1

  • GRNN baseline with gates: 94.82% F1

  • CNN baseline without recurrent memory: 94.35% F1

On the ATIS spoken language understanding benchmark, which tests slot filling accuracy, RNN-EM's 95.25% F1 demonstrates that external memory improves semantic tagging beyond gated recurrent baselines.

BENCHMARK

F1 scores on ATIS

F1 score on the ATIS language understanding task for RNN-EM and baseline models.

KEY INSIGHT

The Counterintuitive Finding

RNN-EM achieves its best F1 of 95.22% with only 8 memory slots, while larger memories up to 512 slots reduce F1 to as low as 94.53%.

This is surprising because more memory capacity is usually expected to help, but RNN-EM shows that too many slots can degrade training entropy and generalization.

WHY IT MATTERS

What this unlocks for the field

RNN-EM shows that a lightweight external memory with content-based addressing can boost recurrent language understanding without complex Neural Turing Machine controllers.

Builders can now design compact slot-filling systems that retain long-range context across sentences using gated external memories rather than ever-deeper or wider recurrent networks.
