Humanoid World Model

An open-source world model for humanoid robots.

Masked Transformer

Side-by-side video comparisons of generated vs. ground-truth rollouts (examples 1–4).

Flow Matching

Side-by-side video comparisons of generated vs. ground-truth rollouts (examples 1–4).

We present Humanoid World Models (HWM) 🤖🧠, a family of lightweight, open-source video prediction models for humanoid robots, showcased at the ICML 2025 Workshop on Physically Plausible World Models. The model predicts 8 future video frames conditioned on 9 past video frames and humanoid control tokens (e.g., joint angles, velocities).
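
As a shape-level sketch of this interface, the snippet below lays out the conditioning tensors. The frame resolution, control dimensionality, and model names are illustrative assumptions, not the released API.

```python
import torch

# Minimal shape sketch of the HWM conditioning interface.
# All sizes and names below are assumptions for illustration.
B = 1                      # batch size
T_PAST, T_FUTURE = 9, 8    # 9 context frames in, 8 predicted frames out
C, H, W = 3, 256, 256      # assumed frame resolution
CTRL_DIM = 25              # assumed per-step control vector (joint angles, velocities)

past_frames = torch.randn(B, T_PAST, C, H, W)
controls = torch.randn(B, T_PAST + T_FUTURE, CTRL_DIM)

# model = HWM(...)                       # hypothetical constructor
# future = model(past_frames, controls)  # -> (B, T_FUTURE, C, H, W)
```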

🔗 Full Paper

Overview

Motivation

World foundation models can predict high-fidelity, physically plausible future video conditioned on the past. Such world models can substantially accelerate the development of humanoid robots: they can evaluate humanoid policies, generate synthetic training data, learn general-purpose representations, and support long-horizon planning. However, many world foundation models are closed-source and/or require large amounts of compute to run (e.g., 8+ H100 GPUs).

Key Features:

  • Two model families (a training-objective sketch follows this list): Masked-HWM achieves higher visual fidelity and faster inference, while Flow-HWM offers the benefits of continuous latent modeling; our findings favor Masked-HWM for most practical deployments:
    • Masked-HWM: a non-autoregressive masked video transformer over discrete VQ-VAE latents.
    • Flow-HWM: a flow-matching model over a continuous VAE latent space.
  • Architecture Variants: We explored several attention styles (joint vs. split attention; split attention is sketched under Architecture Diagrams below) and weight-sharing strategies, yielding up to a 53% parameter reduction with minimal performance degradation.
  • Efficiency: Trained on 3 NVIDIA A6000 GPUs.
  • Data: Trained on 100+ hours of egocentric humanoid video and control traces from the 1xGPT dataset.
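
As a rough illustration of the two model families, the sketch below shows a standard masked-token loss over discrete VQ-VAE indices alongside a standard flow-matching (rectified-flow) loss over continuous latents. The function names, `MASK_ID`, and tensor shapes are assumptions; the paper's exact masking schedule and conditioning are not reproduced here.

```python
import torch
import torch.nn.functional as F

MASK_ID = 8192  # hypothetical [MASK] token id (assumes an 8192-entry VQ-VAE codebook)

def masked_hwm_loss(transformer, token_ids, mask_ratio=0.5):
    """Masked-token objective over discrete VQ-VAE latents (sketch).

    token_ids: (B, N) codebook indices for the frames to predict;
    `transformer` is a placeholder model returning (B, N, vocab) logits.
    """
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_ratio
    inputs = token_ids.masked_fill(mask, MASK_ID)  # hide a random subset of tokens
    logits = transformer(inputs)                   # predict all positions in parallel
    return F.cross_entropy(logits[mask], token_ids[mask])  # score masked positions only

def flow_hwm_loss(velocity_net, z1):
    """Flow-matching (rectified-flow) objective over continuous VAE latents (sketch).

    z1: (B, D) clean latents; the network regresses the velocity that
    transports noise z0 to data z1 along a straight path.
    """
    z0 = torch.randn_like(z1)        # noise endpoint
    t = torch.rand(z1.shape[0], 1)   # random time in [0, 1]
    zt = (1 - t) * z0 + t * z1       # point on the linear interpolation path
    target = z1 - z0                 # velocity is constant along this path
    return F.mse_loss(velocity_net(zt, t), target)
```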

Architecture Diagrams

Masked Video Transformer: block architecture diagram.
Flow-Matching Model: block architecture diagram.
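
To make the joint vs. split attention distinction from the feature list concrete: a joint block attends over all space-time tokens at once, while a split block factorizes attention into a spatial pass within each frame followed by a temporal pass across frames. The PyTorch sketch below illustrates the split variant; the dimensions, layer choices, and class name are our assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class SplitSpatioTemporalAttention(nn.Module):
    """Factorized (split) attention over video tokens (illustrative sketch).

    Joint attention would run one attention pass over the flattened
    (T*N)-token sequence; split attention runs a spatial pass within each
    frame, then a temporal pass across frames, which is cheaper.
    """

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, D) with T frames and N tokens per frame
        B, T, N, D = x.shape

        # Spatial attention: each frame attends over its own N tokens.
        s = x.reshape(B * T, N, D)
        s, _ = self.spatial(s, s, s)
        x = s.reshape(B, T, N, D)

        # Temporal attention: each spatial location attends across T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        t, _ = self.temporal(t, t, t)
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)
```

Tying the spatial and temporal projections, or sharing whole blocks across depth, is one natural route to parameter savings, in the spirit of the Full Sharing variants reported in the table below.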

Quantitative Performance

Model Variant               FID ↓    PSNR ↑     Params (B)   Samples/sec
Masked-HWM (Base)           10.13    29.02 dB   0.321        2.27
Masked-HWM (Full Sharing)   14.21    28.66 dB   0.195        2.36
Flow-HWM (Base)             111.59   20.42 dB   1.36         1.69
Flow-HWM (Full Sharing)     110.73   20.43 dB   0.648        1.91
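
For reference, the PSNR column is the standard peak signal-to-noise ratio in decibels; a minimal sketch of the metric, assuming frames normalized to [0, 1]:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; assumes inputs scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```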

Code

The code release is coming soon. If you need it urgently, please contact our team and we can share a dedicated repository with you. 🔗 GitHub Repository (coming soon)

Citation

If you use this work, please cite our ICML workshop paper:

Ali, Q., Sridhar, A., Matiana, S., Wong, A., & Al-Sharman, M. (2025). Humanoid World Models: Open World Foundation Models for Humanoid Robotics. ICML Workshop on Physically Plausible World Models.


For code, data, or collaboration inquiries, contact Qasim Ali.

References

1X Technologies. (2024). 1X World Model Challenge (Version 1.1) [Data set].