Working Memory Connections for LSTM

Authors: Federico Landi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

arXiv 2021

TL;DR

Working Memory Connections for LSTM adds tanh‑projected cell‑to‑gate links so LSTM-WM reaches 1.299 BPC on PTB vs 1.334 for LSTM (−0.035).


Read our summary here, or open the publisher PDF in a new tab.

THE PROBLEM

LSTMs ignore their own cell state in gate decisions

LSTM-WM addresses a blind spot of the standard LSTM: the memory cell contains useful information that is never allowed to influence the gating mechanism directly.

When the gates ignore the cell state, training on long sequences, such as the T=400 adding task or pMNIST, can stall at trivial solutions or become unstable.

HOW IT WORKS

Working Memory Connections for LSTM

LSTM-WM introduces Working Memory Connections that add tanh‑projected memory cell signals into the input, forget, and output LSTM gates via weights Wic, Wfc, Woc.

You can think of LSTM-WM as giving the LSTM a small working memory scratchpad, like RAM that summarizes long-term disk contents before deciding the next operation.

This protected, tanh-bounded projection lets LSTM-WM adjust its gates using the cell's own contents, enabling behaviors that plain LSTMs and raw peephole connections cannot express without instability.
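
To make the mechanism concrete, here is a minimal NumPy sketch of a single LSTM-WM step following the summary above; the parameter names (W_i, U_i, W_ic, and so on) and the dictionary layout are ours, and details such as bias placement may differ from the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_wm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with Working Memory Connections (illustrative sketch).

    p holds the parameters: W_*/U_*/b_* are the usual input, recurrent,
    and bias weights; W_ic, W_fc, W_oc project the memory cell into the
    input, forget, and output gate pre-activations.
    """
    # Candidate cell content, unchanged from a vanilla LSTM.
    g_t = np.tanh(p["W_g"] @ x_t + p["U_g"] @ h_prev + p["b_g"])

    # Input and forget gates see a tanh-bounded projection of c_{t-1}.
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev
                  + np.tanh(p["W_ic"] @ c_prev) + p["b_i"])
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev
                  + np.tanh(p["W_fc"] @ c_prev) + p["b_f"])

    # Memory cell update.
    c_t = f_t * c_prev + i_t * g_t

    # Output gate sees a tanh-bounded projection of the updated cell c_t.
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev
                  + np.tanh(p["W_oc"] @ c_t) + p["b_o"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```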

DIAGRAM

Gate computation with Working Memory Connections

This diagram shows how LSTM-WM computes the gates it, ft, and ot from xt, ht-1, and a tanh-projected cell state (ct-1 for the input and forget gates, ct for the output gate).

DIAGRAM

Training and evaluation pipeline for LSTM-WM

This diagram shows how LSTM-WM is trained and evaluated across adding, copying, sMNIST, pMNIST, PTB, and COCO captioning tasks.
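
As one example of the benchmarks in this pipeline, the adding task is commonly generated as below; this is a generic sketch, and the paper's exact data generation (value range, marker placement, sequence lengths) may differ.

```python
import numpy as np

def adding_task_batch(batch_size=32, T=400, seed=0):
    """Generate a batch for the adding problem (common formulation).

    Each sequence has two channels: random values in [0, 1] and a binary
    marker selecting exactly two positions. The target is the sum of the
    two marked values, so the model must remember them across T steps.
    """
    rng = np.random.default_rng(seed)
    values = rng.uniform(0.0, 1.0, size=(batch_size, T))
    markers = np.zeros((batch_size, T))
    for b in range(batch_size):
        pos = rng.choice(T, size=2, replace=False)
        markers[b, pos] = 1.0
    x = np.stack([values, markers], axis=-1)            # (batch, T, 2)
    y = (values * markers).sum(axis=1, keepdims=True)   # (batch, 1)
    return x, y
```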

PROCESS

How LSTM-WM Handles a Long Sequence Task

  1. 01

    LSTM equations

    LSTM-WM starts from the standard LSTM equations for gt, it, ft, ct, ot, and ht, preserving the memory cell and gate structure (the full update is written out after this list).

  2. 02

    Working Memory Connections

    LSTM-WM injects Working Memory Connections by adding tanh(Wic ct-1), tanh(Wfc ct-1), and tanh(Woc ct) to the input, forget, and output gate pre-activations, respectively.

  3. 03

    Advantages of Working Memory Connections

    The paper analyzes the local gradients with respect to Wic, Wfc, and Woc to show that the cell state's influence on the gates is bounded and updates remain stable, in contrast to peephole connections.

  4. 04

    Experiments and Results

    LSTM-WM is trained on adding, copying, sMNIST, pMNIST, PTB, and COCO captioning to quantify gains over LSTM and LSTM-PH.
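
For reference, steps 01 and 02 combine into the following update, written here as a sketch in the notation used above (weight and bias naming may differ slightly from the paper):

```latex
\begin{aligned}
g_t &= \tanh\!\big(W_g x_t + U_g h_{t-1} + b_g\big) \\
i_t &= \sigma\!\big(W_i x_t + U_i h_{t-1} + \tanh(W_{ic}\, c_{t-1}) + b_i\big) \\
f_t &= \sigma\!\big(W_f x_t + U_f h_{t-1} + \tanh(W_{fc}\, c_{t-1}) + b_f\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma\!\big(W_o x_t + U_o h_{t-1} + \tanh(W_{oc}\, c_t) + b_o\big) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```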

KEY CONTRIBUTIONS

Key Contributions

  • 01

    Working Memory Connections

    LSTM-WM enriches LSTM gates with protected Working Memory Connections from the memory cell, using tanh projections Wic, Wfc, Woc into it, ft, and ot.

  • 02

    Analysis of peephole connections

    The authors formally show that unbounded peephole connections push gates into saturation and make gradients such as ∂it/∂Wic grow linearly with the cell state, destabilizing learning (see the sketch after this list).

  • 03

    Broad experimental evaluation

    LSTM-WM improves sMNIST accuracy to 98.63% and PTB BPC to 1.299, and adds up to 2.0 CIDEr on COCO captioning over vanilla LSTM.
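
To make the stability argument of contribution 02 concrete, compare the input gate with a raw peephole against its Working Memory counterpart, writing at = Wi xt + Ui ht-1 + bi for the rest of the pre-activation. This is a sketch of the idea, not the paper's exact derivation:

```latex
\begin{aligned}
\text{Peephole:} \quad
i_t &= \sigma\!\big(a_t + w_{ic} \odot c_{t-1}\big), &
\frac{\partial i_t}{\partial w_{ic}} &= \sigma'(\cdot)\, c_{t-1}
\quad \text{(grows with the unbounded cell state)} \\[4pt]
\text{Working Memory:} \quad
i_t &= \sigma\!\big(a_t + \tanh(W_{ic}\, c_{t-1})\big), &
\frac{\partial i_t}{\partial W_{ic}} &= \sigma'(\cdot)\,\big(1 - \tanh^{2}(W_{ic}\, c_{t-1})\big)\, c_{t-1}^{\top}
\quad \text{(damped as the projection saturates)}
\end{aligned}
```

In the Working Memory case the cell's contribution to the pre-activation is squashed into (−1, 1), so a large cell state cannot force the gate into saturation; this is the "protected" cell-to-gate access the summary refers to.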

RESULTS

By the Numbers

sMNIST accuracy

98.63%

+0.47 percentage points over LSTM (98.16%)

pMNIST accuracy

93.97%

+1.03 percentage points over LSTM (92.94%)

PTB BPC (TPTB = 150)

1.299

−0.035 BPC vs LSTM (1.334) at a fixed parameter count

COCO CIDEr (no attention)

94.0

+2.0 CIDEr vs LSTM (92.0) on Show and Tell with ResNet-152

On sequential MNIST, permuted MNIST, PTB character-level language modeling, and COCO captioning, LSTM-WM consistently improves accuracy or BPC over LSTM and LSTM-PH. These headline numbers show that exposing the cell state through Working Memory Connections yields both better sequence modeling and more stable training.

BENCHMARK

Sequential MNIST test accuracy comparison

Test accuracy (%) on sMNIST from Table 1.

BENCHMARK

PTB character-level BPC with TPTB = 150 and a fixed parameter count

Mean test bits per character (BPC) on PTB with truncated BPTT length 150 and ~2.2M parameters.

KEY INSIGHT

The Counterintuitive Finding

LSTM-WM with 128 hidden units beats LSTM and LSTM-PH with 256 units on pMNIST, reaching 93.97% accuracy despite having fewer than half as many parameters.

This is surprising because we usually expect larger LSTMs to win, but Working Memory Connections make a smaller LSTM-WM more effective than much bigger baselines.

WHY IT MATTERS

What this unlocks for the field

LSTM-WM shows that carefully protected cell-to-gate access lets recurrent networks exploit long-term cell information without exploding gradients or gate saturation.

Builders can now retrofit existing LSTM architectures with Working Memory Connections to get longer effective memory and faster convergence on long-sequence tasks without switching to heavier Transformers.
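
As an illustration of such a retrofit, a drop-in cell might look like the following PyTorch sketch. The class name, fused-projection layout, and default initialization are ours; it mirrors the interface of torch.nn.LSTMCell and packages the same update as the step sketched earlier, but it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class WorkingMemoryLSTMCell(nn.Module):
    """LSTM cell with Working Memory Connections (illustrative sketch).

    forward() takes the input at one time step and the (h, c) state and
    returns the new (h, c), loosely mirroring torch.nn.LSTMCell.
    """

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Fused input/recurrent projections for the g, i, f, o blocks.
        self.x2h = nn.Linear(input_size, 4 * hidden_size)
        self.h2h = nn.Linear(hidden_size, 4 * hidden_size, bias=False)
        # Cell-to-gate projections, squashed by tanh before entering the gates.
        self.c2i = nn.Linear(hidden_size, hidden_size, bias=False)
        self.c2f = nn.Linear(hidden_size, hidden_size, bias=False)
        self.c2o = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, state):
        h_prev, c_prev = state
        g, i, f, o = (self.x2h(x_t) + self.h2h(h_prev)).chunk(4, dim=-1)
        i_t = torch.sigmoid(i + torch.tanh(self.c2i(c_prev)))
        f_t = torch.sigmoid(f + torch.tanh(self.c2f(c_prev)))
        c_t = f_t * c_prev + i_t * torch.tanh(g)
        o_t = torch.sigmoid(o + torch.tanh(self.c2o(c_t)))
        h_t = o_t * torch.tanh(c_t)
        return h_t, c_t
```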


