LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

Author: Keqin Xie

2026

TL;DR

LPC-SM uses Orthogonal Novelty Transport in a dual-timescale sparse memory block to keep 4096-token training stable with Stage-C LM loss 11.582.



THE PROBLEM

Attention-centric blocks entangle long-range state and control (Stage-A LM loss 12.630 with mHC vs 15.127 without)

Most long-context language models still rely on attention to handle both local interaction and long-range state, limiting alternative decompositions.

When LPC-SM removes mHC, Stage-A final LM loss jumps from 12.630 to 15.127, showing that monolithic attention-centered blocks can become brittle and hard to optimize.

HOW IT WORKS

LPC-SM block with dual-timescale memory and predictive correction

LPC-SM builds each block from local attention, dual-timescale memory, predictive correction, Orthogonal Novelty Transport, and multi-head-coupled residual router mHC.

You can think of local attention and fast memory as RAM, slow memory as a disk, and ONT as a smart deduplicating write-back cache.

By explicitly routing novelty through ONT and exposing mismatch via predictive correction, LPC-SM can regulate sparse computation and persistent state in ways a plain context window cannot.

DIAGRAM

Dual-timescale memory and ONT write flow

This diagram shows how LPC-SM updates fast and slow memory using chunk summaries and Orthogonal Novelty Transport at chunk boundaries.

DIAGRAM

Staged training and ablation design for LPC-SM

This diagram shows the three-stage training schedule and Stage-A ablations used to evaluate LPC-SM.

PROCESS

How LPC-SM Handles Autoregressive Long-Context Generation

  1. Block Structure

    LPC-SM first normalizes embeddings and routes them through local attention, dual-timescale memory, predictive correction, and optional mHC inside each block.

  2. Dual-Timescale Memory and ONT

    LPC-SM updates fast state every token, forms chunk summaries, and uses Orthogonal Novelty Transport to write novelty into slow memory at chunk boundaries.

  3. Correction and Stopping

    LPC-SM predicts hidden states from attention and memory, refines them via predictive correction, and uses a learned stop head to model EOS behavior.

  4. Training Objective

    LPC-SM optimizes LM loss plus auxiliary terms for predictive correction, sparsity, memory magnitude, and stopping to keep explicit mechanisms active.
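The four steps above can be sketched as one block pass over a chunk. Everything here is an illustrative assumption: the toy local attention, the EMA fast-memory update, the prediction matrix `W_pred`, and the auxiliary-loss weights are stand-ins chosen for readability, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk, d = 4, 16  # toy chunk length and hidden size

def layer_norm(x, eps=1e-5):
    # Step 1: normalize embeddings before routing them through the block.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def local_attention(x, window=4):
    # Toy stand-in for causal local attention: average over a recent window.
    return np.stack([x[max(0, t - window + 1): t + 1].mean(0) for t in range(len(x))])

def lpcsm_block_step(tokens, fast, W_pred):
    """One illustrative LPC-SM block pass over a chunk (hypothetical design)."""
    h = layer_norm(tokens)
    attn = local_attention(h)            # local interaction
    fast = 0.9 * fast + 0.1 * h.mean(0)  # step 2: fast state updated every step (EMA)
    summary = attn.mean(0)               # chunk summary destined for the slow write
    pred = h @ W_pred                    # step 3: predict the hidden state...
    mismatch = attn + fast - pred        # ...and expose the prediction mismatch
    h_out = pred + 0.5 * mismatch        # predictive correction refines the state
    return h_out, fast, summary, mismatch

tokens = rng.normal(size=(chunk, d))
h_out, fast, summary, mismatch = lpcsm_block_step(tokens, np.zeros(d), np.eye(d))

# Step 4: total objective = LM loss plus auxiliary terms (weights are assumptions).
lm_loss = 1.0  # placeholder for the cross-entropy term
aux = 0.1 * (mismatch ** 2).mean() + 0.01 * np.abs(fast).mean()
loss = lm_loss + aux
```

The key structural point survives even in this sketch: the mismatch signal is computed explicitly and both corrects the hidden state and feeds an auxiliary loss, which is what keeps the explicit mechanisms active during training.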

KEY CONTRIBUTIONS

Key Contributions

  • LPC-SM block decomposition

    LPC-SM cleanly separates local attention, dual-timescale memory, predictive correction, and mHC routing, enabling a broader division of labor than attention alone.

  • Orthogonal Novelty Transport

    LPC-SM introduces ONT for slow-memory writes, amplifying only orthogonal novelty so memory preserves aligned content while emphasizing genuinely new information.

  • Adaptive sparse control in continuation

    LPC-SM uses learned sparse control that improves Stage-B final LM loss from 12.137 to 10.787 over a fixed sparse controller on OpenWebMath-10k.
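The adaptive-vs-fixed distinction in the third contribution can be sketched in a few lines. A fixed controller always keeps the same fraction of positions; a learned controller gates each position by a score (for example, predictive-mismatch magnitude), so easy spans get sparser and hard spans get denser. The gate form, the score source, and all parameters below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_sparse_mask(scores, keep_frac=0.25):
    # Fixed controller: always keep the same fraction of positions (top-k).
    k = max(1, int(len(scores) * keep_frac))
    keep = np.argsort(scores)[-k:]
    mask = np.zeros_like(scores)
    mask[keep] = 1.0
    return mask

def adaptive_sparse_mask(scores, w=4.0, b=0.0):
    # Learned controller: a sigmoid gate in [0, 1] decides per-position compute,
    # with w and b trained (here they are fixed illustrative values).
    return 1.0 / (1.0 + np.exp(-(w * scores + b)))

scores = rng.normal(size=8)  # e.g. per-position mismatch magnitudes
fixed = fixed_sparse_mask(scores)
adaptive = adaptive_sparse_mask(scores)
```

The fixed mask spends the same compute budget regardless of input, while the adaptive gate reallocates compute with the score distribution, which is the behavior the Stage-B comparison isolates.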

RESULTS

By the Numbers

Stage-A final LM loss

12.630

+2.497 over w/o mHC

Stage-B final LM loss

10.787

+1.350 over fixed sparse control

Stage-C final LM loss

11.582

4096-token continuation stability

Delayed identifier CE

12.031

-2.365 after Stage C continuation

Stage A trains LPC-SM on Dolma3-base, Stage B on OpenWebMath-10k, and Stage C on LongMino continuation. The 1.350-point improvement in Stage-B final LM loss (12.137 → 10.787, roughly 11% relative to the fixed-control baseline) shows LPC-SM's adaptive sparse control materially reshapes computation under mathematical continuation.
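The headline deltas follow directly from the reported losses; a quick arithmetic check (using the common convention of measuring relative improvement against the baseline):

```python
# Reproducing the "By the Numbers" deltas from the reported losses.
stage_a_with_mhc, stage_a_without_mhc = 12.630, 15.127
stage_b_fixed, stage_b_adaptive = 12.137, 10.787

stage_a_delta = round(stage_a_without_mhc - stage_a_with_mhc, 3)  # mHC ablation gap
stage_b_delta = round(stage_b_fixed - stage_b_adaptive, 3)        # sparse-control gap
stage_b_rel = round(100 * stage_b_delta / stage_b_fixed, 1)       # relative to baseline
```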


BENCHMARK

Stage-A Ablations at 158M Parameters

Final LM loss for LPC-SM and key Stage-A ablations on Dolma3-base.

BENCHMARK

Stage-B Adaptive vs Fixed Sparse Control

Final LM loss for LPC-SM with adaptive sparse control versus fixed sparse control on OpenWebMath-10k.

KEY INSIGHT

The Counterintuitive Finding

Disabling ONT in Stage A reduces final LM loss from 12.630 to 11.781, even though ONT is designed to improve slow-memory writes.

This is counterintuitive because ONT explicitly preserves aligned content and amplifies novelty, yet short-budget LM loss suggests that this structured memory constraint can initially hurt optimization.

WHY IT MATTERS

What this unlocks for the field

LPC-SM shows that long-context autoregressive modeling can be organized around explicit local attention, dual-timescale memory, predictive correction, and internal control instead of a single attention mechanism.

Builders can now experiment with controllable sparsity, novelty-aware memory writes, and learned stopping inside one coherent block, probing behaviors beyond what extended context windows alone can offer.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
