FLAM: Scaling Latent Action Models with Factorization
Abstract
Learning from unlabeled video has emerged as a powerful paradigm for training world models without action supervision. However, existing approaches often rely on monolithic inverse and forward dynamics models, which struggle to scale in settings where multiple entities act simultaneously. In this work, we propose FLAM, a factored dynamics framework that decomposes the latent state into independent factors, each with its own inverse and forward dynamics model. This structure enables more accurate modeling of complex, multi-entity dynamics and improves prediction quality in action-free video settings. Evaluated on the Multigrid, Procgen, nuPlan, and Sports datasets, FLAM consistently outperforms a monolithic dynamics model, demonstrating the benefits of factorization.