LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

Author: Keqin Xie

2026

TL;DR

LPC-SM uses Orthogonal Novelty Transport in a dual-timescale sparse memory block to keep 4096-token training stable with Stage-C LM loss 11.582.



THE PROBLEM

Attention-centric blocks entangle long-range state and control (Stage-A LM loss 12.630 with mHC vs 15.127 without)

Most long-context language models still rely on attention to handle both local interaction and long-range state, limiting alternative decompositions.

When LPC-SM removes mHC, Stage-A final LM loss jumps from 12.630 to 15.127, showing that monolithic attention-centered blocks can become brittle and hard to optimize.

HOW IT WORKS

LPC-SM block with dual-timescale memory and predictive correction

LPC-SM builds each block from local attention, dual-timescale memory, predictive correction, Orthogonal Novelty Transport, and multi-head-coupled residual router mHC.

You can think of local attention and fast memory as RAM, slow memory as a disk, and ONT as a smart deduplicating write-back cache.

By explicitly routing novelty through ONT and exposing mismatch via predictive correction, LPC-SM can regulate sparse computation and persistent state in ways a plain context window cannot.

DIAGRAM

Dual-timescale memory and ONT write flow

This diagram shows how LPC-SM updates fast and slow memory using chunk summaries and Orthogonal Novelty Transport at chunk boundaries.

DIAGRAM

Staged training and ablation design for LPC-SM

This diagram shows the three-stage training schedule and Stage-A ablations used to evaluate LPC-SM.

PROCESS

How LPC-SM Handles Autoregressive Long-Context Generation

  1. Block Structure

    LPC-SM first normalizes embeddings and routes them through local attention, dual-timescale memory, predictive correction, and optional mHC inside each block.

  2. Dual-Timescale Memory and ONT

    LPC-SM updates fast state every token, forms chunk summaries, and uses Orthogonal Novelty Transport to write novelty into slow memory at chunk boundaries.

  3. Correction and Stopping

    LPC-SM predicts hidden states from attention and memory, refines them via predictive correction, and uses a learned stop head to model EOS behavior.

  4. Training Objective

    LPC-SM optimizes LM loss plus auxiliary terms for predictive correction, sparsity, memory magnitude, and stopping to keep explicit mechanisms active.
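The four steps above can be sketched as one block pass over a chunk. Everything here is an illustrative assumption: the toy local attention, the EMA fast-memory update, the prediction matrix `W_pred`, and the auxiliary-loss weights are stand-ins chosen for readability, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
chunk, d = 4, 16  # toy chunk length and hidden size

def layer_norm(x, eps=1e-5):
    # Step 1: normalize embeddings before routing them through the block.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def local_attention(x, window=4):
    # Toy stand-in for causal local attention: average over a recent window.
    return np.stack([x[max(0, t - window + 1): t + 1].mean(0) for t in range(len(x))])

def lpcsm_block_step(tokens, fast, W_pred):
    """One illustrative LPC-SM block pass over a chunk (hypothetical design)."""
    h = layer_norm(tokens)
    attn = local_attention(h)            # local interaction
    fast = 0.9 * fast + 0.1 * h.mean(0)  # step 2: fast state updated every step (EMA)
    summary = attn.mean(0)               # chunk summary destined for the slow write
    pred = h @ W_pred                    # step 3: predict the hidden state...
    mismatch = attn + fast - pred        # ...and expose the prediction mismatch
    h_out = pred + 0.5 * mismatch        # predictive correction refines the state
    return h_out, fast, summary, mismatch

tokens = rng.normal(size=(chunk, d))
h_out, fast, summary, mismatch = lpcsm_block_step(tokens, np.zeros(d), np.eye(d))

# Step 4: total objective = LM loss plus auxiliary terms (weights are assumptions).
lm_loss = 1.0  # placeholder for the cross-entropy term
aux = 0.1 * (mismatch ** 2).mean() + 0.01 * np.abs(fast).mean()
loss = lm_loss + aux
```

The key structural point survives even in this sketch: the mismatch signal is computed explicitly and both corrects the hidden state and feeds an auxiliary loss, which is what keeps the explicit mechanisms active during training.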

KEY CONTRIBUTIONS

Key Contributions

  • LPC-SM block decomposition

    LPC-SM cleanly separates local attention, dual-timescale memory, predictive correction, and mHC routing, enabling a broader division of labor than attention alone.

  • Orthogonal Novelty Transport

    LPC-SM introduces ONT for slow-memory writes, amplifying only orthogonal novelty so memory preserves aligned content while emphasizing genuinely new information.

  • Adaptive sparse control in continuation

    LPC-SM uses learned sparse control that improves Stage-B final LM loss from 12.137 to 10.787 over a fixed sparse controller on OpenWebMath-10k.
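The adaptive-vs-fixed distinction in the third contribution can be sketched in a few lines. A fixed controller always keeps the same fraction of positions; a learned controller gates each position by a score (for example, predictive-mismatch magnitude), so easy spans get sparser and hard spans get denser. The gate form, the score source, and all parameters below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def fixed_sparse_mask(scores, keep_frac=0.25):
    # Fixed controller: always keep the same fraction of positions (top-k).
    k = max(1, int(len(scores) * keep_frac))
    keep = np.argsort(scores)[-k:]
    mask = np.zeros_like(scores)
    mask[keep] = 1.0
    return mask

def adaptive_sparse_mask(scores, w=4.0, b=0.0):
    # Learned controller: a sigmoid gate in [0, 1] decides per-position compute,
    # with w and b trained (here they are fixed illustrative values).
    return 1.0 / (1.0 + np.exp(-(w * scores + b)))

scores = rng.normal(size=8)  # e.g. per-position mismatch magnitudes
fixed = fixed_sparse_mask(scores)
adaptive = adaptive_sparse_mask(scores)
```

The fixed mask spends the same compute budget regardless of input, while the adaptive gate reallocates compute with the score distribution, which is the behavior the Stage-B comparison isolates.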

RESULTS

By the Numbers

Stage-A final LM loss

12.630

+2.497 over w/o mHC

Stage-B final LM loss

10.787

+1.350 over fixed sparse control

Stage-C final LM loss

11.582

4096-token continuation stability

Delayed identifier CE

12.031

-2.365 after Stage C continuation

Stage A trains LPC-SM on Dolma3-base, Stage B on OpenWebMath-10k, and Stage C on LongMino continuation. The 1.350-point improvement in Stage-B final LM loss (12.137 → 10.787, roughly 11% relative to the fixed-control baseline) shows LPC-SM's adaptive sparse control materially reshapes computation under mathematical continuation.
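The headline deltas follow directly from the reported losses; a quick arithmetic check (using the common convention of measuring relative improvement against the baseline):

```python
# Reproducing the "By the Numbers" deltas from the reported losses.
stage_a_with_mhc, stage_a_without_mhc = 12.630, 15.127
stage_b_fixed, stage_b_adaptive = 12.137, 10.787

stage_a_delta = round(stage_a_without_mhc - stage_a_with_mhc, 3)  # mHC ablation gap
stage_b_delta = round(stage_b_fixed - stage_b_adaptive, 3)        # sparse-control gap
stage_b_rel = round(100 * stage_b_delta / stage_b_fixed, 1)       # relative to baseline
```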


BENCHMARK

Stage-A Ablations at 158M Parameters

Final LM loss for LPC-SM and key Stage-A ablations on Dolma3-base.

BENCHMARK

Stage-B Adaptive vs Fixed Sparse Control

Final LM loss for LPC-SM with adaptive sparse control versus fixed sparse control on OpenWebMath-10k.

KEY INSIGHT

The Counterintuitive Finding

Disabling ONT in Stage A reduces final LM loss from 12.630 to 11.781, even though ONT is designed to improve slow-memory writes.

This is counterintuitive because ONT explicitly preserves aligned content and amplifies novelty, yet short-budget LM loss suggests that this structured memory constraint can initially hurt optimization.

WHY IT MATTERS

What this unlocks for the field

LPC-SM shows that long-context autoregressive modeling can be organized around explicit local attention, dual-timescale memory, predictive correction, and internal control instead of a single attention mechanism.

Builders can now experiment with controllable sparsity, novelty-aware memory writes, and learned stopping inside one coherent block, probing behaviors beyond what extended context windows alone can offer.


Related papers

Long-Term Memory

Adaptive Memory Admission Control for LLM Agents

Guilin Zhang, Wei Jiang et al.

2026

A-MAC scores candidate memories using Utility, Confidence, Novelty, Recency, and Type Prior combined by a learned linear admission policy with Algorithm 1 A-MAC Memory Admission. On the LoCoMo benchmark, A-MAC achieves F1 0.583 and 2644 ms latency, improving F1 by 0.042 and reducing latency by 1187 ms compared to A-mem.

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
