On the Long-Term Memory of Deep Recurrent Networks

Authors: Yoav Levine, Or Sharir, Alon Ziv, Amnon Shashua

arXiv 2017

TL;DR

On the Long-Term Memory of Deep Recurrent Networks uses the Start-End separation rank of Recurrent Arithmetic Circuits to prove that depth gives a combinatorial increase in long-range dependency capacity.



THE PROBLEM

Recurrent networks lack a formal measure of long-term memory that accounts for depth

On the Long-Term Memory of Deep Recurrent Networks notes that a well-established measure of RNNs' long-term memory capacity is lacking, which limits any formal understanding of what depth contributes.

This gap means deep recurrent networks used in language modeling and speech recognition lack theory explaining how they correlate information across long sequences.

HOW IT WORKS

Recurrent Arithmetic Circuits and Start-End separation rank

On the Long-Term Memory of Deep Recurrent Networks introduces Recurrent Arithmetic Circuits, Start-End separation rank, and grid tensors to study temporal dependencies in deep recurrent networks.

You can think of this like measuring how many independent “wires” of information can flow from the start to the end of a sequence, analogous to parallel channels in computer RAM.

This Start-End separation rank lets On the Long-Term Memory of Deep Recurrent Networks formalize a notion of long-term dependency capacity that a plain finite context window or a shallow RNN fails to capture.
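To make this concrete, here is a minimal NumPy sketch, not taken from the paper, of how a Start-End separation rank can be lower-bounded in practice: evaluate a sequence function on a grid of template values, reshape the resulting grid tensor into a matrix whose rows index the first half of the inputs (Start) and whose columns index the second half (End), and compute the matrix rank. The toy functions and template values below are illustrative assumptions.

```python
import numpy as np
from itertools import product

def sep_rank_lower_bound(f, templates, T):
    """Lower-bound the Start-End separation rank of f via its grid tensor."""
    M = len(templates)
    # Grid tensor: evaluate f on every combination of template values.
    grid = np.array([f(x) for x in product(templates, repeat=T)])
    # Matricize: rows index the first T/2 inputs (Start), columns the last T/2 (End).
    mat = grid.reshape(M ** (T // 2), -1)
    # The rank of this matricization lower-bounds the separation rank of f.
    return np.linalg.matrix_rank(mat)

templates = np.array([0.5, 1.0, 2.0])   # illustrative template values
T = 4

separable = lambda x: np.prod(x[:2]) * np.sum(x[2:])   # g(Start) * h(End)
coupled   = lambda x: np.dot(x[:2], x[2:])             # x1*x3 + x2*x4 mixes Start and End

print(sep_rank_lower_bound(separable, templates, T))   # 1: Start and End factor apart
print(sep_rank_lower_bound(coupled, templates, T))     # 2: two independent Start-End "wires"
```

A separable function factors into g(Start) · h(End) and yields rank 1, while the coupled example needs two independent Start-End "wires" and yields rank 2.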

DIAGRAM

Temporal computation in deep Recurrent Arithmetic Circuits

This diagram shows how On the Long-Term Memory of Deep Recurrent Networks stacks RAC layers over time to mix hidden states and inputs.

DIAGRAM

Evaluation pipeline for Copying Memory and Start-End Similarity tasks

This diagram shows how On the Long-Term Memory of Deep Recurrent Networks trains EURNNs of different depths on synthetic long-term memory benchmarks.

PROCESS

How On the Long-Term Memory of Deep Recurrent Networks analyzes temporal expressivity

  1. Recurrent Arithmetic Circuits definition

    On the Long-Term Memory of Deep Recurrent Networks defines Recurrent Arithmetic Circuits with multiplicative integration g^RAC(a, b) = a ⊙ b (an elementwise product replacing the usual additive combination) and stacked hidden layers over time; a minimal forward-pass sketch appears after this list.

  2. Start-End separation rank

    On the Long-Term Memory of Deep Recurrent Networks introduces Start-End separation rank to quantify how far a sequence function is from being separable between early and late inputs.

  3. Grid tensors and Tensor Networks

    On the Long-Term Memory of Deep Recurrent Networks builds grid tensors and Tensor Network representations to relate separation rank to matrix ranks and graph min-cuts.

  4. Depth separation theorems and conjecture

    On the Long-Term Memory of Deep Recurrent Networks proves depth-2 RACs have combinatorially larger Start-End separation rank than depth-1, and conjectures combinatorial growth with depth L.
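As a sketch of steps 01 and 04 above, the following NumPy snippet runs a forward pass of a depth-L Recurrent Arithmetic Circuit, where each layer combines its recurrent state and its input multiplicatively, g(a, b) = a ⊙ b, instead of additively. Weight shapes, the all-ones initial state, and the linear readout are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def rac_forward(x_seq, W_in, W_h, w_out):
    """Forward pass of a deep Recurrent Arithmetic Circuit.

    x_seq -- array of shape (T, d_in), the input sequence
    W_in  -- list of L input matrices; layer l maps its input to R channels
    W_h   -- list of L recurrent matrices of shape (R, R)
    w_out -- readout vector of shape (R,), applied at the final time step
    """
    L = len(W_h)
    R = W_h[0].shape[0]
    # Hidden states start at ones so the multiplicative updates do not vanish.
    h = [np.ones(R) for _ in range(L)]
    for x_t in x_seq:
        layer_input = x_t
        for l in range(L):
            # Multiplicative integration: g(a, b) = a ⊙ b (elementwise product)
            h[l] = (W_in[l] @ layer_input) * (W_h[l] @ h[l])
            layer_input = h[l]                 # feed this layer's state upward
    return w_out @ h[-1]                       # scalar output after the last step

# Toy usage: depth L = 2, R = 4 channels, T = 6 steps of 3-dimensional input.
rng = np.random.default_rng(0)
T, d_in, R, L = 6, 3, 4, 2
dims = [d_in] + [R] * (L - 1)
W_in = [rng.standard_normal((R, dims[l])) for l in range(L)]
W_h = [rng.standard_normal((R, R)) for _ in range(L)]
w_out = rng.standard_normal(R)
print(rac_forward(rng.standard_normal((T, d_in)), W_in, W_h, w_out))
```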

KEY CONTRIBUTIONS

Key Contributions

  • Start-End separation rank for recurrent networks

    On the Long-Term Memory of Deep Recurrent Networks formalizes Start-End separation rank using grid tensors and Tensor Networks to measure long-term temporal dependencies in RACs.

  • Depth efficiency for Recurrent Arithmetic Circuits

    On the Long-Term Memory of Deep Recurrent Networks proves depth-2 RACs support Start-End separation ranks that are combinatorially higher than those of shallow depth-1 RACs with the same number of channels.

  • Connection to Tensor Train and quantum Tensor Networks

    On the Long-Term Memory of Deep Recurrent Networks links shallow RACs to Tensor Train decompositions and uses quantum Tensor Network min-cut arguments to analyze deep temporal expressivity (see the contraction sketch below).
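As an illustration of the Tensor Train connection, the following sketch contracts a generic Tensor Train (matrix product state) with mode size M and bond dimension R into its full grid tensor and checks that its Start-End matricization rank is capped by R, mirroring the min{R, M^{T/2}} bound quoted below for shallow RACs. This is a generic Tensor Train example under assumed shapes, not the paper's construction.

```python
import numpy as np

def tt_to_full(cores):
    """Contract Tensor Train cores of shape (R_left, M, R_right) into a full tensor."""
    full = cores[0]                                      # shape (1, M, R)
    for core in cores[1:]:
        # Sum over the shared bond index between neighbouring cores.
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.squeeze()                                # shape (M,) repeated T times

rng = np.random.default_rng(0)
T, M, R = 6, 3, 4
cores = ([rng.standard_normal((1, M, R))] +
         [rng.standard_normal((R, M, R)) for _ in range(T - 2)] +
         [rng.standard_normal((R, M, 1))])

grid = tt_to_full(cores)                                 # grid tensor of the TT function
mat = grid.reshape(M ** (T // 2), M ** (T // 2))         # Start-End matricization
print(np.linalg.matrix_rank(mat), "<=", min(R, M ** (T // 2)))
```

The cut through the middle bond has dimension R, so no matter how long the sequence is, such a shallow, Tensor Train-like model can route at most R independent "wires" from Start to End.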

RESULTS

By the Numbers

  • Start-End separation rank, shallow: min{R, M^{T/2}}, the upper bound for the depth-1 RAC in Theorem 4.1

  • Start-End separation rank, deep: (min{M,R} + T/2 − 1 choose T/2), the combinatorial lower bound by which the depth-2 RAC exceeds the shallow one

  • Copying Memory bits: 150 bits, with m = 30 and n = 32 in the Copying Memory Task setup

  • Training set size: 100,000 examples, synthetic sequences for the EURNN experiments

On the Long-Term Memory of Deep Recurrent Networks evaluates EURNNs on a 150-bit Copying Memory Task and a Start-End Similarity Task, showing that deeper networks handle longer delays and longer sequences than shallower ones under equal multiply-accumulate (MAC) budgets.
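A quick back-of-the-envelope computation, with illustrative parameter values that are not from the paper, shows how large the gap between the shallow cap and the deep lower bound above becomes:

```python
from math import comb

M, R, T = 32, 32, 20           # illustrative channel counts and sequence length

shallow_bound = min(R, M ** (T // 2))                 # depth-1 upper bound
deep_bound = comb(min(M, R) + T // 2 - 1, T // 2)     # depth-2 lower bound

print(shallow_bound)   # 32
print(deep_bound)      # 1121099408, roughly 1.1 billion: combinatorially larger
```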

BENCHMARK

Depth vs delay on 150-bit Copying Memory Task

Maximal delay time B solved with ≥99% data-accuracy on the 150-bit Copying Memory Task for different EURNN depths at similar MAC budgets.
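For context, here is a hedged sketch of how Copying Memory Task examples can be generated. The encoding below follows the common convention for this task (m data symbols drawn from an n-symbol alphabet, a delay of blanks, then a recall marker), which with m = 30 and n = 32 carries 30 × log2(32) = 150 bits; the exact symbol layout is an assumption, not the paper's code.

```python
import numpy as np

def copying_task_example(m=30, n=32, delay=100, rng=None):
    """One Copying Memory Task example.

    Input:  m random data symbols, `delay` blanks, a recall marker, m-1 blanks.
    Target: blanks everywhere except the last m positions, which repeat the data.
    Symbols 0..n-1 are data, n is the blank, n+1 is the recall marker.
    """
    rng = rng or np.random.default_rng()
    data = rng.integers(0, n, size=m)
    blank, marker = n, n + 1
    x = np.concatenate([data,
                        np.full(delay, blank),
                        [marker],
                        np.full(m - 1, blank)])
    y = np.concatenate([np.full(delay + m, blank), data])
    return x, y

x, y = copying_task_example()
print(len(x), len(y))       # both sequences have length 2*m + delay
```

Sweeping the delay parameter while holding m and n fixed reproduces the kind of depth-versus-delay comparison plotted in this benchmark.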

KEY INSIGHT

The Counterintuitive Finding

On the Long-Term Memory of Deep Recurrent Networks shows depth-2 RACs can reach Start-End separation rank on the order of (min{M,R} + T/2 − 1 choose T/2), while depth-1 RACs are capped at min{R, M^{T/2}}.

This is surprising because many practitioners assume that adding channels (width) is interchangeable with adding layers (depth), yet the theory shows that only depth yields a combinatorial increase in long-term dependency capacity.

WHY IT MATTERS

What this unlocks for the field

On the Long-Term Memory of Deep Recurrent Networks gives a concrete tool to reason about how recurrent depth scales long-term memory via Start-End separation rank.

Armed with this, builders can design deep recurrent architectures whose memory capacity grows with sequence length and depth, instead of relying on shallow RNNs whose capacity saturates regardless of how long the input is.

