Hierarchical Neural Memory Network for Low Latency Event Processing

Authors: Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, Ken Sakurada

arXiv 2023

TL;DR

Hierarchical Neural Memory Network (HMNet) uses multi-rate latent memories plus Event Sparse Cross Attention to cut latency by about 40–50% while matching or beating prior accuracy on event-based dense prediction.


THE PROBLEM

Event cameras offer microsecond latency, but frame-based backbones still take 30–50 ms per inference

Conventional dense prediction architectures encode entire scene contents at a fixed rate, ignoring motion speed and wasting computation on static regions.

For example, a vehicle traveling 80 km per hour moves 74 cm within one 30 fps RGB frame, so frame-based backbones miss fast motion and increase safety-critical latency despite event cameras’ microsecond resolution.
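As a quick sanity check on that figure, the displacement is just speed times observation interval. A minimal calculation follows; the 5 ms step used for comparison is the event-processing interval HMNet adopts later in this summary.

```python
# Distance a vehicle covers between consecutive observations.
speed_ms = 80.0 * 1000.0 / 3600.0             # 80 km/h ≈ 22.2 m/s

frame_interval = 1.0 / 30.0                   # one 30 fps RGB frame ≈ 33.3 ms
event_step = 0.005                            # HMNet's 5 ms processing step

print(f"per 30 fps frame: {speed_ms * frame_interval * 100:.0f} cm")   # ~74 cm
print(f"per 5 ms step:    {speed_ms * event_step * 100:.1f} cm")       # ~11 cm
```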

HOW IT WORKS

Hierarchical Neural Memory Network with multi-rate latent memories

Hierarchical Neural Memory Network (HMNet) builds a stack of latent memories z1–z3 with Event-write, Up-write, Down-write, Update, and Readout operations, plus Event Sparse Cross Attention (ESCA) to inject events.

You can think of z1 as fast RAM for dynamic, local motion, while z2 and z3 act like progressively slower but deeper caches that accumulate static, global context.

By running memories at different cycles (1, 3, 9 steps) and writing sparse events directly via ESCA, HMNet maintains long-term temporal context without re-running a full backbone every 5 ms.
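To make the scheduling concrete, here is a minimal sketch of which memories would update at each 5 ms step, assuming a level fires whenever the global step index is divisible by its cycle (the exact phase offsets in the paper's implementation may differ):

```python
# Which latent memories update at each 5 ms step, for cycles (1, 3, 9).
cycles = {"z1": 1, "z2": 3, "z3": 9}

for step in range(1, 10):
    active = [name for name, c in cycles.items() if step % c == 0]
    print(f"step {step} ({step * 5:3d} ms): update {', '.join(active)}")

# z1 refreshes every step, z2 every third step, and z3 only once per nine
# steps, so the deep global memory is recomputed far less often than the
# fast local one.
```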

DIAGRAM

Asynchronous multi-rate operation of HMNet memories over time

This diagram illustrates how HMNet runs z1, z2, and z3 at different cycles (1, 3, 9 time steps) and feeds a latent buffer for low-latency predictions.

DIAGRAM

Evaluation pipeline across DSEC-Semantic, GEN1, and MVSEC

This diagram shows how HMNet variants are configured and evaluated on semantic segmentation, object detection, and monocular depth estimation with latency measurement.

PROCESS

How HMNet Handles Low Latency Event Processing

  1. Event-write

    HMNet receives events every 5 ms, embeds them with Event Sparse Cross Attention, and writes them into the fastest latent memory z1 within local windows.

  2. Up-write

    HMNet periodically propagates features from z1 to z2 and from z2 to z3 using window-based multi-head cross-attention and strided convolutions.

  3. Down-write

    HMNet sends top-down messages from z3 and z2 back to z2 and z1, using cross-attention and bilinear upsampling to inject global context into dynamic memories.

  4. Update and Readout

    HMNet updates each memory with residual layers according to its cycle, then applies Readout to fill a latent buffer that the task head uses every time step (a runnable sketch of the full loop follows below).
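Putting the four operations together, here is a minimal, runnable sketch of one HMNet-style update loop. It is an illustration under simplifying assumptions, not the authors' implementation: the attention-based writes are replaced with plain convolutions and bilinear upsampling, all tensor sizes are made up, and event-write is reduced to a dense projection (the sparse ESCA variant is sketched separately under Key Contributions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W = 32, 64, 64                 # illustrative memory size
cycles = [1, 3, 9]                   # update cycles of z1, z2, z3 (in 5 ms steps)

# Latent memories z1 (full res), z2 (1/2 res), z3 (1/4 res), zero-initialized.
z = [torch.zeros(1, C, H // 2**k, W // 2**k) for k in range(3)]
latent_buffer = [m.clone() for m in z]                    # what the task head reads

event_proj = nn.Conv2d(2, C, 3, padding=1)                # dense stand-in for ESCA
up_writes = nn.ModuleList([nn.Conv2d(C, C, 3, stride=2, padding=1) for _ in range(2)])
down_writes = nn.ModuleList([nn.Conv2d(C, C, 1) for _ in range(2)])
updates = nn.ModuleList([
    nn.Sequential(nn.Conv2d(C, C, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(C, C, 3, padding=1)) for _ in range(3)])
readouts = nn.ModuleList([nn.Conv2d(C, C, 1) for _ in range(3)])

with torch.no_grad():
    for step in range(1, 19):                             # 18 steps of 5 ms each
        events = torch.randn(1, 2, H, W)                  # placeholder event frame

        # 1. Event-write: inject the newest events into the fastest memory z1.
        z[0] = z[0] + event_proj(events)

        # 2. Up-write: z1 -> z2 and z2 -> z3 on the slower memories' cycles.
        for k in (1, 2):
            if step % cycles[k] == 0:
                z[k] = z[k] + up_writes[k - 1](z[k - 1])

        # 3. Down-write: z3 -> z2 and z2 -> z1, pushing global context down.
        for k in (1, 0):
            if step % cycles[k + 1] == 0:
                top = F.interpolate(z[k + 1], size=z[k].shape[-2:],
                                    mode="bilinear", align_corners=False)
                z[k] = z[k] + down_writes[k](top)

        # 4. Update + Readout: residual refinement on each level's own cycle,
        #    then refresh that level's slot in the latent buffer. A task head
        #    would consume the multi-scale latent buffer every time step.
        for k in range(3):
            if step % cycles[k] == 0:
                z[k] = z[k] + updates[k](z[k])
                latent_buffer[k] = readouts[k](z[k])
```

The property this loop preserves is that only z1 does work on every 5 ms step, while the heavier, lower-resolution memories are touched only every 15 ms and 45 ms.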

KEY CONTRIBUTIONS

Key Contributions

  • Hierarchical Neural Memory Network

    HMNet introduces multi-level latent memories z1–z3 with cycles of 1, 3, and 9 time steps, reducing redundant computation while preserving long-term temporal dependencies.

  • Event Sparse Cross Attention

    HMNet proposes ESCA with an event gate to embed sparse events into memory cells (see the sketch after this list), enabling HMNet-B1 to match AED on GEN1 while cutting latency by 57%.

  • Low latency event and image fusion

    HMNet extends naturally to event–image fusion, achieving 57.4 mIoU on DSEC-Semantic with HMNet-L3 using left or right RGB despite viewpoint misalignment.
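Here is a much-simplified sketch of the ESCA idea, assuming memory cells act as attention queries, per-event embeddings act as keys and values, and the event gate limits the write to cells whose pixels actually received events; the paper's window partitioning and exact gating are omitted.

```python
import torch
import torch.nn as nn

C, H, W = 32, 16, 16
n_events = 200

memory = torch.zeros(H * W, C)         # flattened cells of one latent memory
events = torch.rand(n_events, 4)       # (x, y, t, polarity), normalized to [0, 1]

embed = nn.Linear(4, C)                # per-event embedding
attn = nn.MultiheadAttention(C, num_heads=4, batch_first=True)

with torch.no_grad():
    # Event gate: mark only the cells whose pixel received at least one event.
    px = (events[:, 0] * (W - 1)).long()
    py = (events[:, 1] * (H - 1)).long()
    gate = torch.zeros(H * W, dtype=torch.bool)
    gate[py * W + px] = True

    # Cross-attention: gated memory cells query the sparse event tokens,
    # and the result is written back as a residual update.
    queries = memory[gate].unsqueeze(0)          # (1, n_gated_cells, C)
    kv = embed(events).unsqueeze(0)              # (1, n_events, C)
    written, _ = attn(queries, kv, kv)
    memory[gate] = memory[gate] + written.squeeze(0)

print(f"wrote {int(gate.sum())} of {H * W} cells")
```

Because the write only touches cells that actually received events, its cost scales with event activity rather than full frame resolution, which is one source of the latency savings reported above.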

RESULTS

By the Numbers

mIoU

57.4 %

+3.3 over ResNet-50 baseline on DSEC-Semantic with fusion

Latency @ Tesla V100

5.0 ms

44% lower than RAMNet at 9.0 ms on MVSEC outdoor day1

mAP

44.7 %

competitive with AED on GEN1 while HMNet-B1 reduces latency by 57%

Time step size range

3–10 ms

HMNet maintains accuracy when inference step size varies between 3 ms and 10 ms

These numbers come from DSEC-Semantic for segmentation, GEN1 for object detection, and MVSEC for monocular depth estimation, showing that HMNet sustains accuracy while substantially reducing latency compared to strong CNN and recurrent baselines.

BENCHMARK

Semantic segmentation on DSEC-Semantic with event-only and fusion

mIoU on DSEC-Semantic for HMNet and ResNet-50 baselines under different input modalities.

KEY INSIGHT

The Counterintuitive Finding

HMNet-B1, with only a single latent memory and ESCA, performs competitively to AED on GEN1 while reducing latency by 57%.

This is surprising because AED uses sophisticated adaptive voxel grids, yet HMNet’s simpler attention-based event-write matches its accuracy with far less computation, even before the full memory hierarchy is added.

WHY IT MATTERS

What this unlocks for the field

HMNet enables dense prediction systems that react on 5 ms timescales while still reasoning over long temporal windows through its multi-rate memories.

Builders can now design event-based perception stacks for autonomous driving, AR, and robotics that meet strict latency budgets without sacrificing segmentation, detection, or depth accuracy.


