Muon Outperforms Adam in Tail-End Associative Memory Learning

Authors: Shuche Wang, Fengzhuo Zhang, Jiaxiang Li et al.

2025

TL;DR

Muon Outperforms Adam in Tail-End Associative Memory Learning shows that Muon’s isotropic matrix updates let associative memories learn tail classes more evenly than Adam on heavy-tailed data.

THE PROBLEM

Optimizers Struggle With Tail-End Learning in Heavy-Tailed Data

Muon Outperforms Adam in Tail-End Associative Memory Learning highlights that real-world corpora are intrinsically heavy-tailed: a small set of head classes dominates while the many tail classes appear only rarely.

Under these class-imbalanced conditions, Adam can induce large disparities in learning errors across classes, leaving tail classes poorly learned despite good head performance.

HOW IT WORKS

Associative Memory View of Muon in Transformers

Muon Outperforms Adam in Tail-End Associative Memory Learning focuses on the VO attention weights, feed-forward networks (FFNs), and the language model head as the main associative memory parameters optimized by Muon.

You can think of these associative memories like a matrix card catalog, where each stored fact is an outer product card and Muon normalizes how strongly each card is updated.

This outer-product-aligned update lets Muon maintain more isotropic weight spectra than Adam, enabling balanced learning of rare facts that a vector-norm optimizer such as Adam tends to leave under-learned.
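To make the card-catalog picture concrete, here is a minimal, self-contained sketch (not code from the paper; the dimensions and names are illustrative) of a linear associative memory that stores facts as a sum of outer products and recalls them by matrix-vector multiplication.

```python
import numpy as np

def store(memory: np.ndarray, key: np.ndarray, value: np.ndarray) -> np.ndarray:
    """Add one 'card' to the memory: the outer product of a value and a key."""
    return memory + np.outer(value, key)

def recall(memory: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Read a stored value back out by applying the memory matrix to the key."""
    return memory @ key

rng = np.random.default_rng(0)
d_key, d_value, n_facts = 64, 32, 10

# Nearly orthogonal random keys keep the stored facts from interfering too much.
keys = rng.standard_normal((n_facts, d_key)) / np.sqrt(d_key)
values = rng.standard_normal((n_facts, d_value))

W = np.zeros((d_value, d_key))
for k, v in zip(keys, values):
    W = store(W, k, v)

# Recall of fact 0 approximately recovers its stored value (up to crosstalk).
err = np.linalg.norm(recall(W, keys[0]) - values[0]) / np.linalg.norm(values[0])
print(f"relative recall error for fact 0: {err:.2f}")
```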

DIAGRAM

Update Flow for Associative Memory Gradients Under Muon

This diagram shows how Muon transforms associative memory gradients into isotropic matrix updates via SVD-based normalization.
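A minimal sketch of the normalization the diagram depicts, under the assumption that the update direction is obtained by flattening the singular values of the matrix gradient to 1 (G = UΣVᵀ mapped to UVᵀ). In practice Muon approximates this orthogonalization with a Newton-Schulz iteration on a momentum buffer rather than an explicit SVD; everything below is illustrative.

```python
import numpy as np

def orthogonalized_update(grad: np.ndarray) -> np.ndarray:
    """Map a matrix gradient G = U S V^T to U V^T, flattening its singular spectrum."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
# A gradient dominated by a single direction, as frequent head classes tend to produce.
grad = 10.0 * np.outer(rng.standard_normal(128), rng.standard_normal(64))
grad += 0.1 * rng.standard_normal((128, 64))

update = orthogonalized_update(grad)
print("top gradient singular values:", np.linalg.svd(grad, compute_uv=False)[:3].round(1))
print("top update singular values:  ", np.linalg.svd(update, compute_uv=False)[:3].round(1))
# An SGD-style step with this direction would be W -= lr * update (lr illustrative).
```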

DIAGRAM

Ablation Design for Applying Muon to Transformer Blocks

This diagram shows how Muon Outperforms Adam in Tail-End Associative Memory Learning structures independent-block and combined-configuration ablations over QK, VO, and FFN parameters.
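One way to picture the ablation grid (a sketch of the design, not the paper's exact run list) is as a mapping from each weight block to the optimizer it receives in a given run:

```python
# Hypothetical encoding of the ablation grid: each run assigns an optimizer per block.
ablation_runs = {
    "all_adam":    {"QK": "adam", "VO": "adam", "FFN": "adam"},   # baseline
    "qk_muon":     {"QK": "muon", "VO": "adam", "FFN": "adam"},   # independent block
    "vo_muon":     {"QK": "adam", "VO": "muon", "FFN": "adam"},   # independent block
    "ffn_muon":    {"QK": "adam", "VO": "adam", "FFN": "muon"},   # independent block
    "vo_ffn_muon": {"QK": "adam", "VO": "muon", "FFN": "muon"},   # combined configuration
    "all_muon":    {"QK": "muon", "VO": "muon", "FFN": "muon"},   # full Muon
}
```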

PROCESS

How Muon Outperforms Adam in Tail-End Associative Memory Learning Handles Heavy-Tailed Next-Token Prediction

  1. Associative memory identification

    Muon Outperforms Adam in Tail-End Associative Memory Learning first identifies VO attention weights and FFN matrices as the associative memory parameters that store subject-relation-object facts.

  2. Independent-blocks ablation

    Muon Outperforms Adam in Tail-End Associative Memory Learning applies Muon separately to the QK, VO, and FFN blocks of a 160M NanoGPT to measure the impact on validation loss.

  3. Combined configurations on VO and FFN

    Muon Outperforms Adam in Tail-End Associative Memory Learning then jointly applies Muon to VO and FFN while keeping QK on Adam to approximate full-Muon performance.

  4. Heavy-tailed knowledge evaluation

    Muon Outperforms Adam in Tail-End Associative Memory Learning finally evaluates First Token Accuracy on a power-law biographical QA dataset to compare head- and tail-class learning; a rough sketch of this evaluation follows the list.
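As a rough sketch of the kind of evaluation step 4 describes (the sampling law, head/tail split, and accuracy numbers below are assumptions for illustration, not the paper's data):

```python
import numpy as np

def power_law_frequencies(num_classes: int, alpha: float = 1.5) -> np.ndarray:
    """Zipf-like class frequencies: class k is drawn with probability proportional to k^(-alpha)."""
    freqs = np.arange(1, num_classes + 1, dtype=float) ** (-alpha)
    return freqs / freqs.sum()

def head_tail_accuracy(per_class_acc: np.ndarray, head_fraction: float = 0.1):
    """Average first-token accuracy over head classes vs. the long tail (classes sorted by frequency)."""
    cutoff = max(1, int(head_fraction * len(per_class_acc)))
    return per_class_acc[:cutoff].mean(), per_class_acc[cutoff:].mean()

num_classes = 1000
class_probs = power_law_frequencies(num_classes)

# Placeholder per-class accuracies; in practice these come from decoding the first
# answer token of each biographical QA item and grouping results by class frequency.
rng = np.random.default_rng(0)
per_class_acc = np.clip(
    0.9 - 0.5 * np.linspace(0.0, 1.0, num_classes) + 0.05 * rng.standard_normal(num_classes),
    0.0, 1.0,
)

head_acc, tail_acc = head_tail_accuracy(per_class_acc)
print(f"head accuracy: {head_acc:.3f}, tail accuracy: {tail_acc:.3f}")
```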

KEY CONTRIBUTIONS

Key Contributions

  • Associative memory centric view of Muon

    Muon Outperforms Adam in Tail-End Associative Memory Learning shows that VO attention weights and FFN blocks are the main beneficiaries of Muon, with VO+FFN nearly recovering full-Muon validation loss.

  • Spectral isotropy analysis of Muon

    Muon Outperforms Adam in Tail-End Associative Memory Learning demonstrates that Muon consistently yields higher normalized SVD entropy and effective rank than Adam for the VO and W_out matrices across seeds; a sketch of these metrics follows the list.

  • Theoretical balanced learning guarantees

    Muon Outperforms Adam in Tail-End Associative Memory Learning proves that Muon maintains ϱ_ϵ^Muon ≥ 1 − ϵ(1 + O((log K)/K)) under class imbalance, while GD and Adam can have ϱ_ϵ scaling like O(ϵ^(−r(α,β)) K^(r(α,β)−1)).
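For the spectral claims in the second contribution, here is a small sketch of how the two metrics could be computed on a weight matrix, using one common convention (singular values normalized to a probability distribution; effective rank as the exponential of its entropy). The matrices below are synthetic stand-ins, not the paper's checkpoints.

```python
import numpy as np

def svd_spectrum_metrics(weight: np.ndarray) -> tuple[float, float]:
    """Return (normalized SVD entropy, effective rank) of a weight matrix.

    Convention assumed here: p_i = s_i / sum(s), H = -sum(p_i log p_i),
    normalized entropy = H / log(min(m, n)), effective rank = exp(H).
    """
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / s.sum()
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    return entropy / np.log(min(weight.shape)), float(np.exp(entropy))

rng = np.random.default_rng(0)
isotropic = rng.standard_normal((256, 256))                        # roughly flat spectrum
skewed = np.outer(rng.standard_normal(256), rng.standard_normal(256))
skewed += 0.05 * rng.standard_normal((256, 256))                   # spectrum dominated by one direction

print("isotropic-like matrix:", svd_spectrum_metrics(isotropic))
print("skewed matrix:        ", svd_spectrum_metrics(skewed))
```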

RESULTS

By the Numbers

Validation loss, All Muon (10000 steps)

3.5654

-0.3588 vs All Adam

Validation loss, VO+FFN Muon

3.5858

-0.3384 vs All Adam

Validation loss, QK Muon

3.8925

-0.0317 vs All Adam (independent-blocks ablation, non-gated FFN)

Validation loss, All Adam (baseline)

3.9242

NanoGPT 160M, FineWeb, non-gated FFN

On the FineWeb dataset with a 160M NanoGPT, the full-Muon configuration reaches validation loss 3.5654 versus 3.9242 for All Adam at 10000 steps. This shows that applying Muon to the associative memory parameters yields faster convergence, and that VO+FFN with Muon (3.5858) nearly matches full-Muon performance.

BENCHMARK

Validation loss at 10000 training steps on 160M NanoGPT (non-gated FFN)

Validation loss after 10000 steps on FineWeb for different optimizer configurations on attention and FFN blocks.

KEY INSIGHT

The Counterintuitive Finding

Muon Outperforms Adam in Tail-End Associative Memory Learning finds that applying Muon only to VO and FFN reaches validation loss 3.5858, extremely close to full Muon’s 3.5654.

This is surprising because QK and VO have the same parameter count, yet QK with Muon reaches only 3.8925, barely below the All-Adam baseline of 3.9242, breaking the assumption that all attention projections benefit equally from matrix-norm optimization.

WHY IT MATTERS

What this unlocks for the field

Muon Outperforms Adam in Tail-End Associative Memory Learning unlocks optimizers that explicitly align with associative memory outer products, yielding isotropic spectra and balanced tail-class learning.

This enables builders to selectively apply Muon to VO and FFN in large transformers, improving rare fact acquisition on heavy-tailed corpora without fully replacing Adam everywhere.
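A hedged PyTorch-style sketch of that selective setup: route the 2-D VO and FFN weight matrices to a Muon-style optimizer and keep everything else (QK projections, embeddings, norms, biases) on AdamW. The name-matching patterns and the `Muon` class are assumptions for illustration; substitute whatever Muon implementation and module names your codebase actually uses.

```python
import torch

def split_param_groups(model: torch.nn.Module):
    """Partition parameters: 2-D VO/FFN matrices go to a Muon-style optimizer, the rest to AdamW."""
    muon_params, adam_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Hypothetical naming convention; adjust to your model's module names.
        is_vo_or_ffn = any(tag in name for tag in ("v_proj", "out_proj", "mlp"))
        if param.ndim == 2 and is_vo_or_ffn:
            muon_params.append(param)
        else:
            adam_params.append(param)  # QK projections, embeddings, norms, biases
    return muon_params, adam_params

# muon_params, adam_params = split_param_groups(model)
# opt_muon = Muon(muon_params, lr=0.02)             # assumes a separate Muon implementation
# opt_adam = torch.optim.AdamW(adam_params, lr=3e-4)
# Each training step then calls opt_muon.step() and opt_adam.step().
```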
