MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors

Authors: Tian-Xing Xu, Yuan-Chen Guo, Yu-Kun Lai, Song-Hai Zhang

arXiv 2023

TL;DR

MBPTrack uses a decoupling memory network plus box-prior localization to reach 70.3/87.9 Success/Precision on KITTI, +2.8/+2.6 over CXTrack.

THE PROBLEM

3D point cloud trackers ignore history and object size variation

Existing 3D single object tracking (SOT) methods propagate target cues only from the latest frame. They neglect the rich information in earlier frames and therefore struggle with occlusion.

Size and geometry differences between categories mean voxel-based heads work well for vehicles but degrade on pedestrians, where voxelization causes information loss and poor localization.

HOW IT WORKS

MBPTrack — memory-based tracking with box priors

MBPTrack uses a Decoupling Feature Propagation Module, BPLocNet, box-prior sampling, and point-to-reference feature transformation to fuse temporal memory with size-aware localization.

You can think of MBPTrack as combining a long-term RAM of past frames with a 3D grid that reshapes sparse points into a dense feature volume around the target.

This design lets MBPTrack exploit historical spatial context and box priors to handle occlusion and size variation, beyond what a single-frame Siamese context window can capture.

DIAGRAM

Coarse to fine localization with box-prior sampling

This diagram shows how MBPTrack’s BPLocNet predicts centers, samples box-prior reference points, aggregates features, and refines 3D bounding boxes.
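The box-prior sampling step can be illustrated with a minimal sketch. It assumes an axis-aligned box (no yaw rotation) and a grid of 3 points per axis. The function name `box_prior_reference_points`, the grid resolution, and the example box size are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def box_prior_reference_points(center, box_size, n=3):
    """Return n**3 reference points spread evenly inside a box.

    center:   (3,) predicted target center (x, y, z).
    box_size: (3,) size prior of the target along x, y, z (assumed known).
    n:        grid resolution per axis (hypothetical choice).
    """
    center = np.asarray(center, dtype=float)
    box_size = np.asarray(box_size, dtype=float)
    # Normalized offsets in [-0.5, 0.5] along one axis.
    ticks = np.linspace(-0.5, 0.5, n)
    # Cartesian product of the offsets, flattened to an (n**3, 3) array.
    grid = np.stack(np.meshgrid(ticks, ticks, ticks, indexing="ij"), axis=-1)
    grid = grid.reshape(-1, 3)
    # Scale the unit grid by the box size and shift it to the center.
    return center + grid * box_size

ref = box_prior_reference_points(center=[10.0, 2.0, -0.5],
                                 box_size=[1.8, 4.2, 1.6], n=3)
print(ref.shape)
```

Because the grid is scaled by the box size, a small target and a large target receive the same number of reference points, which is one way to read the claim that box priors make localization size-aware.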

DIAGRAM

Temporal memory usage and ablation on memory size

This diagram shows how MBPTrack varies the number of memory frames and how that affects tracking performance in the ablation study.

PROCESS

How MBPTrack Handles 3D Single Object Tracking

  1. Backbone feature extraction

    MBPTrack applies a shared backbone to each frame P_t to produce point features X_t that encode local geometric information for later propagation.

  2. Decoupling Feature Propagation Module

    The Decoupling Feature Propagation Module uses cross-attention and self-attention to propagate geometric features X and mask features Y from memory frames into the current frame.

  3. Box-Prior Localization Network

    BPLocNet predicts target centers and targetness masks from fused features F, then uses box-prior sampling and point-to-reference transformation to build dense 3D feature maps.

  4. Coarse-to-fine score prediction

    MBPTrack’s 3D CNN and quality score head jointly refine proposal-wise features to output bounding box parameters B_t and refined scores S for final tracking.
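The propagation in step 2 can be sketched as single-head cross-attention in which one attention map, computed from geometric features, is reused to pull both geometric and mask features out of memory. This is a simplified reading, not the authors' implementation: it omits learned projections, normalization, and the self-attention part, and the function name, feature sizes, and random data are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def propagate_from_memory(x_cur, x_mem, y_mem):
    """Cross-attention with one attention map reused for two value sets.

    x_cur: (N, C) geometric features of the current frame (queries).
    x_mem: (M, C) geometric features of all memory points (keys).
    y_mem: (M, D) mask features of the same memory points.
    """
    scale = 1.0 / np.sqrt(x_cur.shape[1])
    attn = softmax(x_cur @ x_mem.T * scale, axis=-1)  # (N, M), rows sum to 1
    x_prop = attn @ x_mem  # geometric cues gathered from memory
    y_prop = attn @ y_mem  # mask cues gathered with the same weights
    return x_prop, y_prop, attn

rng = np.random.default_rng(0)
x_cur = rng.normal(size=(128, 32))      # 128 points in the current frame
x_mem = rng.normal(size=(3 * 128, 32))  # 3 memory frames, 128 points each
y_mem = rng.normal(size=(3 * 128, 32))
x_prop, y_prop, attn = propagate_from_memory(x_cur, x_mem, y_mem)
print(x_prop.shape, y_prop.shape)
```

Reusing the attention map means the point matching is computed once and applied to both feature types, so mask features do not need a separate matching step.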

KEY CONTRIBUTIONS

Key Contributions

  • Memory-based 3D SOT with DeFPM

    The authors present MBPTrack as the first method to exploit both spatial and temporal contextual information in 3D SOT, using the Decoupling Feature Propagation Module (DeFPM) with shared attention maps across frames.

  • Box-Prior Localization Network

    MBPTrack introduces BPLocNet, a coarse-to-fine localization network that uses box-prior sampling and point-to-reference transformation to handle targets of different sizes.

  • State-of-the-art multi-dataset results

    MBPTrack reaches a mean Success/Precision of 70.3/87.9 on KITTI and 57.48/69.88 on NuScenes, surpassing M2-Track by +8.25/+7.15 on NuScenes.

RESULTS

By the Numbers

KITTI Mean Success

70.3

+2.8 over CXTrack

KITTI Mean Precision

87.9

+2.6 over CXTrack

NuScenes Mean Success

57.48

+8.25 over M2-Track

Waymo Mean Success

46.0

+1.4 over M2-Track

On KITTI, NuScenes, and Waymo Open Dataset, MBPTrack is evaluated with Success and Precision metrics, showing that memory plus box priors yield consistent gains over Siamese and motion-centric baselines.
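This summary does not define the two metrics. In the 3D SOT literature, Success is usually the area under the curve of the fraction of frames whose 3D IoU exceeds a threshold swept over [0, 1], and Precision is the area under the curve of the fraction of frames whose center error stays below a threshold swept over [0, 2] m. The sketch below follows that convention. The function names, the 21 thresholds, and the simple averaging used as an AUC approximation are assumptions, and official benchmark code may differ in detail.

```python
import numpy as np

def success_auc(ious, n_thresholds=21):
    """Approximate Success: mean over IoU thresholds in [0, 1], as a percentage."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    # Fraction of frames whose IoU reaches each threshold.
    curve = [(ious >= t).mean() for t in thresholds]
    return 100.0 * float(np.mean(curve))

def precision_auc(center_errors, max_error=2.0, n_thresholds=21):
    """Approximate Precision: mean over distance thresholds in [0, max_error] m."""
    errs = np.asarray(center_errors, dtype=float)
    thresholds = np.linspace(0.0, max_error, n_thresholds)
    # Fraction of frames whose center error stays within each threshold.
    curve = [(errs <= t).mean() for t in thresholds]
    return 100.0 * float(np.mean(curve))

# Toy per-frame values, made up for illustration only.
ious = [0.82, 0.75, 0.10, 0.64]
errors_m = [0.12, 0.20, 1.90, 0.35]
print(success_auc(ious), precision_auc(errors_m))
```

Both scores are averages over thresholds, so a few badly tracked frames lower them gradually rather than zeroing the result.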

BENCHMARK

KITTI 3D SOT: Mean Success comparison

Mean Success (%) on KITTI across all categories from Table 1.

BENCHMARK

NuScenes 3D SOT: Mean Success comparison

Mean Success (%) on NuScenes from Table 3.

KEY INSIGHT

The Counterintuitive Finding

Using three memory frames gives MBPTrack the best KITTI mean performance (70.3/87.9), while increasing memory to six frames drops it to 68.4/85.9.

It is surprising that more temporal context can hurt, contradicting the intuition that longer histories always help, and showing that stale or low-quality frames can dilute useful cues.
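A fixed-size memory of this kind can be sketched as a first-in, first-out buffer that keeps only the most recent frames. The summary does not say how MBPTrack selects or evicts memory frames, so the FIFO policy, the class name `FrameMemory`, and the string placeholders standing in for features are all assumptions.

```python
from collections import deque

class FrameMemory:
    """Keeps features of at most `max_frames` past frames (FIFO eviction, assumed)."""

    def __init__(self, max_frames=3):
        # A deque with maxlen drops the oldest entry automatically when full.
        self.frames = deque(maxlen=max_frames)

    def add(self, frame_id, features):
        self.frames.append((frame_id, features))

    def frame_ids(self):
        return [fid for fid, _ in self.frames]

memory = FrameMemory(max_frames=3)
for t in range(6):
    memory.add(t, features=f"feat_{t}")  # placeholder for real point features
print(memory.frame_ids())  # [3, 4, 5]
```

With `max_frames=3`, the tracker would attend only to the three newest frames, which corresponds to the best-performing setting reported above.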

WHY IT MATTERS

What this unlocks for the field

MBPTrack shows that combining decoupled temporal memory with box-prior guided dense feature maps enables robust 3D tracking under occlusion and across object sizes.

Builders can now plug MBPTrack-style memory and BPLocNet heads into existing 3D SOT frameworks to boost accuracy on cars and pedestrians without sacrificing real-time 50 FPS performance.
