Advancing Open-source World Models

Authors: Robbyant Team, Zelin Gao, Qiuyu Wang et al.

arXiv 2026

TL;DR

LingBot-World uses a hierarchical data engine plus a three-stage evolutionary training pipeline to reach a dynamic degree of 0.8857 on VBench, surpassing Yume-1.5 by 0.1245.



THE PROBLEM

From short video clips to persistent worlds with long generation horizons

Existing video generators produce only short, seconds-long clips and suffer from catastrophic forgetting over minute-long trajectories, breaking narrative and structural coherence.

This limits interactive world simulation, where agents need long-term memory, action-contingent dynamics, and real-time control for content creation, gaming, and robot learning.

HOW IT WORKS

LingBot-World — multi-stage evolution from video prior to world simulator

LingBot-World centers on a Data Engine, a Fundamental World Model, an Action-Conditioned World Model, and a Post-Training stage that together evolve a Wan2.2 prior into an interactive simulator.

Think of LingBot-World as starting from a powerful video "camera" and gradually wiring in a game engine, controller, and real-time renderer around it.

This staged evolution lets LingBot-World maintain minute-level consistency, follow explicit actions, and run at sub-second latency, which a plain diffusion context window cannot achieve.

DIAGRAM

LingBot-World interactive rollout and control flow

This diagram shows how LingBot-World consumes prompts, actions, and history to generate interactive video in real time.
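The control flow in this diagram can be sketched as a simple loop: each step conditions on the prompt, the accumulated frame history, and the latest user action. This is a minimal sketch with entirely hypothetical names (`WorldState`, `step`) — the paper's actual interfaces are not specified in this summary, and the "generation" here is a stand-in string.

```python
# Hedged sketch of an interactive rollout loop. All names are hypothetical;
# a real model would denoise the next video chunk where we fake a frame.
from dataclasses import dataclass, field

@dataclass
class WorldState:
    """Rolling context the simulator conditions on."""
    frames: list = field(default_factory=list)   # generated video frames
    events: list = field(default_factory=list)   # promptable world events

def step(state: WorldState, prompt: str, action: str) -> str:
    """One rollout step: condition on prompt, history, and the user action."""
    frame = f"frame[{len(state.frames)}] prompt={prompt!r} action={action!r}"
    state.frames.append(frame)
    return frame

state = WorldState()
for action in ["move_forward", "turn_left", "jump"]:
    step(state, "a foggy forest at dawn", action)
print(len(state.frames))  # 3 — one frame per action
```

The point of the sketch is the conditioning signature: prompt, history, and action enter every step, which is what distinguishes an interactive world model from a one-shot video generator.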

DIAGRAM

LingBot-World data engine and profiling pipeline

This diagram shows how LingBot-World constructs training data via acquisition, profiling, and hierarchical captioning.
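The hierarchical captioning stage can be pictured as one record per clip carrying the three caption levels named in the text. The field names below are illustrative, not the paper's schema, and the sample content is invented.

```python
# Hedged sketch of a hierarchical caption record, following the three caption
# levels described in the text: narrative, scene-static, dense temporal.
from dataclasses import dataclass

@dataclass
class HierarchicalCaption:
    narrative: str             # clip-level storyline
    scene_static: str          # persistent scene and layout description
    dense_temporal: list[str]  # per-segment action/motion captions

sample = HierarchicalCaption(
    narrative="A hiker crosses a rope bridge during a storm.",
    scene_static="Mountain gorge, wooden rope bridge, heavy rain.",
    dense_temporal=[
        "0-2s: hiker steps onto the bridge",
        "2-4s: bridge sways in the wind",
    ],
)
print(len(sample.dense_temporal))  # 2 temporal segments
```

Separating static scene description from dense temporal captions is what lets a model learn layout persistence and motion dynamics from the same clip.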

PROCESS

How LingBot-World Handles an Interactive World Simulation Session

  1. Pre-Training

    During Pre-Training, LingBot-World initializes from the 14B Wan2.2 image-to-video diffusion model to build a strong general video prior.

  2. Middle-Training

    In Middle-Training, LingBot-World trains the Fundamental World Model with MoE and curriculum learning, then finetunes the Action-Conditioned World Model using adaptive normalization.

  3. Post-Training

    In Post-Training, LingBot-World applies causal architecture adaptation and few-step distillation to create LingBot-World-Fast with real-time autoregressive rollout.

  4. Promptable World Events

    At inference, LingBot-World uses promptable world events and the action agent to steer global styles and local events while maintaining long-term consistency.
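The Post-Training step above combines two ideas — causal (past-only) conditioning and a distilled few-step denoiser — and their interaction can be sketched in a few lines. Everything here is a toy stand-in: the step count, the noise schedule, and the function names are illustrative assumptions, not the paper's method.

```python
# Hedged sketch: causal autoregressive rollout with a few-step denoiser,
# as Post-Training is described above. Step counts are illustrative.
def few_step_denoise(noise: float, steps: int = 4) -> float:
    """Stand-in for a distilled few-step denoiser (e.g. 4 steps vs. ~50)."""
    x = noise
    for _ in range(steps):
        x *= 0.5   # each step removes part of the noise
    return x

def rollout(num_chunks: int) -> list[float]:
    history = []
    for _ in range(num_chunks):
        # Causal: each chunk conditions only on past chunks, never future ones.
        chunk = few_step_denoise(noise=1.0)
        history.append(chunk)
    return history

frames = rollout(num_chunks=5)
print(len(frames), frames[0])  # 5 0.0625 — four halvings leave 1/16 of the noise
```

Distillation matters because latency per chunk scales with denoising steps; cutting ~50 steps to a handful is what makes a real-time autoregressive loop feasible at all.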

KEY CONTRIBUTIONS

Key Contributions

  • Data Engine with Hierarchical Semantics

    LingBot-World introduces a Data Engine that unifies general videos, game data, and Unreal Engine renders with hierarchical captioning spanning narrative, scene-static, and dense temporal captions.

  • Multi-Stage Evolutionary Training Pipeline

    LingBot-World proposes a three-stage pipeline of Pre-Training, Middle-Training, and Post-Training that evolves a Wan2.2 prior into a causal, real-time world simulator.

  • Versatile Applications for Embodied AI

    LingBot-World enables promptable world events, an action agent based on Qwen3-VL-2B, and 3D reconstruction, supporting ultra-long videos up to ten minutes with emergent spatial memory.

RESULTS

By the Numbers

  • Imaging Quality: 0.6683 (+0.085 vs. HY-World 1.5)

  • Aesthetic Quality: 0.5660 (+0.0473 vs. Yume-1.5)

  • Dynamic Degree: 0.8857 (+0.1245 vs. Yume-1.5)

  • Overall Consistency: 0.2178 (+0.0162 vs. HY-World 1.5)

On VBench, which evaluates imaging quality, aesthetics, motion, and consistency for videos over 30 seconds, LingBot-World achieves the highest dynamic degree and imaging quality, surpassing both Yume-1.5 and HY-World 1.5.
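The margins above implicitly give the baselines' absolute scores: subtracting each reported margin from LingBot-World's score recovers the comparison model's number. A quick check (the margins shown are vs. HY-World 1.5 for imaging quality and vs. Yume-1.5 for dynamic degree, as listed above):

```python
# Recover baseline scores from LingBot-World's scores and reported margins.
lingbot = {"imaging_quality": 0.6683, "dynamic_degree": 0.8857}
margins = {"imaging_quality": 0.085,  "dynamic_degree": 0.1245}

for metric in lingbot:
    baseline = round(lingbot[metric] - margins[metric], 4)
    print(metric, "baseline:", baseline)
# imaging_quality baseline: 0.5833
# dynamic_degree baseline: 0.7612
```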


BENCHMARK

Quantitative comparisons on VBench

Dynamic Degree scores on VBench for interactive world models.

KEY INSIGHT

The Counterintuitive Finding

LingBot-World maintains coherent environments for up to ten minutes while still running at 16 frames per second with sub-second latency.

This is surprising because diffusion-based video generators are typically too slow and unstable for real-time, minute-level interactive rollouts.
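The 16 fps figure translates directly into a per-frame compute budget, which makes the claim concrete: every frame must be produced in 1000/16 = 62.5 ms, orders of magnitude under the sub-second latency bound.

```python
# Per-frame budget implied by the 16 fps throughput figure above.
fps = 16
frame_budget_ms = 1000 / fps
print(frame_budget_ms)         # 62.5 ms per frame
print(frame_budget_ms < 1000)  # True — well within sub-second latency
```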

WHY IT MATTERS

What this unlocks for the field

LingBot-World unlocks open-source, real-time, action-conditioned world simulation with emergent long-term memory and implicit 3D consistency.

Builders can now prototype interactive games, simulators, and embodied-AI training worlds without relying on closed-source engines or handcrafted environments.


