Recurrent Neural Networks and Long Short-Term Memory Networks: Tutorial and Survey

Authors: Benyamin Ghojogh, Ali Ghodsi

arXiv, 2023

TL;DR

Recurrent Neural Networks and Long Short-Term Memory Networks uses gated cells and bidirectional variants to conceptually unify RNN, LSTM, GRU, and ELMo-style architectures.


THE PROBLEM

Gradient vanishing and explosion in long-term dependencies

Recurrent Neural Networks and Long Short-Term Memory Networks highlights that long-term dependencies cause vanishing or exploding gradients in RNNs (Bengio et al., 1993; 1994).

When this happens, sequence models for language and speech either forget distant context or become numerically unstable, harming downstream prediction quality.
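
A quick way to see the failure mode (a toy illustration, not a result from the paper): backpropagating through T steps repeatedly multiplies the gradient by the recurrent Jacobian, so its largest eigenvalue λ effectively gets raised to the power T.

```python
# Toy sketch (not from the paper): gradients flowing back through T similar
# steps scale roughly like lambda**T, where lambda is the largest eigenvalue
# of the recurrent Jacobian.
T = 100  # number of time steps backpropagated through

for lam in (0.9, 1.0, 1.1):
    grad_scale = lam ** T
    print(f"lambda = {lam:>3}: lambda**T = {grad_scale:.3e}")

# lambda = 0.9 -> ~2.7e-05  (gradient vanishes)
# lambda = 1.0 ->  1.0      (information preserved)
# lambda = 1.1 -> ~1.4e+04  (gradient explodes)
```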

HOW IT WORKS

Recurrent Neural Networks and Long Short-Term Memory Networks — the core mechanism

Recurrent Neural Networks and Long Short-Term Memory Networks organizes Backpropagation Through Time, LSTM gates and cells, Gated Recurrent Units, bidirectional RNN, and ELMo into a single dynamical-systems tutorial.

You can think of LSTM and GRU cells as programmable working memory: RAM-like storage with gates that decide what to overwrite, keep, or read. Bidirectional RNNs, in turn, act like reading a sentence both forwards and backwards.

This gated and bidirectional view in Recurrent Neural Networks and Long Short-Term Memory Networks explains how architectures overcome gradient issues and capture long-range context beyond a plain feedforward context window.

DIAGRAM

Information flow through an LSTM cell

This diagram shows how Recurrent Neural Networks and Long Short-Term Memory Networks defines the internal data flow of one LSTM cell using input, forget, output, and new memory gates.
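
As a textual companion to the diagram, here is a minimal NumPy sketch of one LSTM cell step with the four gates named above; the weight shapes and dictionary keys are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b are dicts keyed by gate: 'i', 'f', 'o', 'g'.
    Illustrative shapes: W[k] is (d_h, d_x), U[k] is (d_h, d_h), b[k] is (d_h,)."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate: how much new memory to write
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate: how much old cell state to keep
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate: how much cell state to expose
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # new memory (candidate) content
    c_t = f * c_prev + i * g          # cell state: keep part of the old, write part of the new
    h_t = o * np.tanh(c_t)            # hidden state read out through the output gate
    return h_t, c_t

# Usage with random illustrative weights:
d_x, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_h, d_x)) for k in 'ifog'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'ifog'}
b = {k: np.zeros(d_h) for k in 'ifog'}
h_t, c_t = lstm_cell_step(rng.normal(size=d_x), np.zeros(d_h), np.zeros(d_h), W, U, b)
```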

DIAGRAM

Training loop with Backpropagation Through Time

This diagram shows how Recurrent Neural Networks and Long Short-Term Memory Networks trains RNN parameters over T time steps using Backpropagation Through Time and gradient descent.
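
The same loop in code form, as a rough sketch under simplifying assumptions of our own (vanilla tanh RNN, squared-error loss, plain gradient descent); W, U, V follow the parameter names used in this summary.

```python
import numpy as np

def bptt_update(xs, ys, W, U, V, b, lr=0.01):
    """One gradient-descent step with Backpropagation Through Time (illustrative sketch).
    xs, ys: lists of T input and target vectors; W, U, V, b are shared across time."""
    T, d_h = len(xs), W.shape[0]
    hs = [np.zeros(d_h)]                              # h_0 = 0
    # Forward pass: unroll the dynamical system h_t = tanh(W h_{t-1} + U x_t + b).
    for t in range(T):
        hs.append(np.tanh(W @ hs[-1] + U @ xs[t] + b))
    # Backward pass: accumulate dL/dW, dL/dU, dL/dV, dL/db over all T steps.
    dW, dU, dV, db = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V), np.zeros_like(b)
    dh_next = np.zeros(d_h)                           # gradient arriving from step t+1
    for t in reversed(range(T)):
        dy = V @ hs[t + 1] - ys[t]                    # dL/dy for loss 0.5 * ||V h_t - y_t||^2
        dV += np.outer(dy, hs[t + 1])
        dh = V.T @ dy + dh_next                       # local gradient plus gradient from the future
        dz = (1.0 - hs[t + 1] ** 2) * dh              # back through the tanh nonlinearity
        dW += np.outer(dz, hs[t])
        dU += np.outer(dz, xs[t])
        db += dz
        dh_next = W.T @ dz                            # pass the gradient back to h_{t-1}
    # Gradient-descent update with parameters shared across time.
    return W - lr * dW, U - lr * dU, V - lr * dV, b - lr * db
```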

PROCESS

How Recurrent Neural Networks and Long Short-Term Memory Networks handles sequence modeling

  1. Dynamical System

    Recurrent Neural Networks and Long Short-Term Memory Networks first frames sequence processing as a dynamical system h_t = f_θ(h_{t−1}, x_t) with shared parameters across time.

  2. Backpropagation Through Time

    Recurrent Neural Networks and Long Short-Term Memory Networks derives Backpropagation Through Time to compute gradients ∂L/∂h_t, ∂L/∂W, ∂L/∂U, and ∂L/∂V over T previous steps.

  3. Long Short-Term Memory Network

    Recurrent Neural Networks and Long Short-Term Memory Networks introduces the Long Short-Term Memory Network with input, forget, output, and new memory gates controlling the cell state.

  4. Gated Recurrent Units

    Recurrent Neural Networks and Long Short-Term Memory Networks then presents Gated Recurrent Units with reset and update gates, and minimal gated units that merge them into a single forget gate (see the sketch after this list).
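
A minimal sketch of the GRU and minimal-gated-unit updates from step 4, under notational assumptions of our own (sign conventions for the update gate vary across references):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step with reset gate r and update gate z (illustrative shapes and naming)."""
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate: how much past state feeds the candidate
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])              # update gate: blend old state with the candidate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def mgu_step(x_t, h_prev, W, U, b):
    """Minimal gated unit: reset and update merged into a single forget gate f."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (f * h_prev) + b['h'])
    return (1.0 - f) * h_prev + f * h_tilde
```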

KEY CONTRIBUTIONS

Key Contributions

  • Backpropagation Through Time derivation

    Recurrent Neural Networks and Long Short-Term Memory Networks gives a detailed derivation of Backpropagation Through Time, including gradients for h_t, W, U, V, b_i, and b_y over T time steps.

  • Survey of LSTM variants

    Recurrent Neural Networks and Long Short-Term Memory Networks surveys the original LSTM, vanilla LSTM, peephole connections, full gate recurrence, and later variants including projection layers and dynamic cortex memory.

  • Unified view of GRU and bidirectional models

    Recurrent Neural Networks and Long Short-Term Memory Networks unifies GRU, minimal gated units, bidirectional RNN, bidirectional LSTM, and ELMo under a common gated dynamical-systems framework (a bidirectional sketch follows this list).
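
A rough sketch of the bidirectional idea named in the last contribution, using a simplified tanh RNN and parameter packing of our own; ELMo additionally stacks such layers and learns a task-specific weighting, which is omitted here.

```python
import numpy as np

def rnn_states(xs, W, U, b, reverse=False):
    """Run a simple tanh RNN over a sequence, optionally right to left (illustrative)."""
    seq = xs[::-1] if reverse else xs
    h, hs = np.zeros(W.shape[0]), []
    for x_t in seq:
        h = np.tanh(W @ h + U @ x_t + b)
        hs.append(h)
    return hs[::-1] if reverse else hs         # realign backward states with time order

def bidirectional_states(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every time step.
    fwd and bwd are (W, U, b) tuples for the two directions."""
    hs_f = rnn_states(xs, *fwd)                # reads the sequence left to right
    hs_b = rnn_states(xs, *bwd, reverse=True)  # reads the sequence right to left
    return [np.concatenate([hf, hb]) for hf, hb in zip(hs_f, hs_b)]
```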

RESULTS

By the Numbers

  • Time steps T — the loss sums over the T previous steps in Backpropagation Through Time.

  • Eigenvalue λ — λ ≲ 1: the largest eigenvalue is kept slightly less than one for a close-to-identity weight matrix.

  • Leaky unit τ_j — 1 ≤ τ_j < ∞: controls copying versus transforming each state dimension.

  • Echo state weights — fixed reservoir: only the output layer is trained, avoiding gradient vanishing in the recurrent weights.

Recurrent Neural Networks and Long Short-Term Memory Networks is a tutorial and survey, so it emphasizes analytical quantities like eigenvalues λ, time horizon T, and leaky-unit τ_j rather than benchmark scores. These values clarify when RNNs suffer gradient issues and how LSTM, GRU, and echo state networks mitigate them.
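
To make the leaky-unit and echo-state rows concrete, here is a short illustrative sketch in our own notation (the paper's exact equations may differ): 1/τ_j interpolates between copying dimension j and rewriting it, and an echo state network trains only a linear readout on top of a fixed random reservoir.

```python
import numpy as np

def leaky_unit_step(x_t, h_prev, W, U, b, tau):
    """Leaky units (illustrative): per-dimension time constants with 1 <= tau_j < inf.
    tau_j = 1 rewrites dimension j every step; large tau_j mostly copies it forward."""
    h_new = np.tanh(W @ h_prev + U @ x_t + b)      # candidate transformed state
    alpha = 1.0 / tau                              # per-dimension mixing weight in (0, 1]
    return (1.0 - alpha) * h_prev + alpha * h_new

def echo_state_readout(xs, ys, W_res, U_res):
    """Echo state idea (illustrative): reservoir weights W_res, U_res stay fixed;
    only the linear readout is fit, here by least squares, so no gradient has to
    flow through the recurrent weights."""
    h, states = np.zeros(W_res.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_res @ h + U_res @ x_t)
        states.append(h)
    H, Y = np.stack(states), np.stack(ys)          # (T, d_h) states, (T, d_y) targets
    V, *_ = np.linalg.lstsq(H, Y, rcond=None)      # readout weights (d_h, d_y)
    return V
```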

BENCHMARK

Key hyperparameters and structural quantities discussed

Representative scales for sequence length T, eigenvalue λ, and leaky unit τ_j in Recurrent Neural Networks and Long Short-Term Memory Networks.

KEY INSIGHT

The Counterintuitive Finding

Recurrent Neural Networks and Long Short-Term Memory Networks notes that slightly contractive dynamics with λ ≲ 1 can be preferable to exactly preserving all past information.

This is counterintuitive because many assume perfect memory is ideal, but the survey argues that controlled forgetting often improves learning and stability.
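
A toy numeric illustration of this point (ours, not the paper's): with a recurrent weight of exactly 1 the state keeps accumulating every past input, while a slightly contractive weight lets stale information fade and keeps the state bounded.

```python
# Toy sketch (not from the paper): feed the same small input for T steps into
# a one-dimensional recurrent state h <- lam * h + 0.1.
T = 1000
for lam in (1.0, 0.99):
    h = 0.0
    for _ in range(T):
        h = lam * h + 0.1
    print(f"lambda = {lam}: h after {T} steps = {h:.2f}")

# lambda = 1.0 : h = 100.00 -- perfect memory lets contributions pile up without bound
# lambda = 0.99: h ~= 10.00 -- slight forgetting; the state settles near 0.1 / (1 - lambda)
```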

WHY IT MATTERS

What this unlocks for the field

Recurrent Neural Networks and Long Short-Term Memory Networks gives practitioners a coherent recipe book for choosing between RNN, LSTM, GRU, bidirectional variants, and echo state networks.

Armed with this, builders can design sequence models that deliberately trade off memory length, stability, and complexity instead of treating gated architectures as black boxes.

