
Workshop: Causal Representation Learning

Learning Object Motion and Appearance Dynamics with Object-Centric Representations

Yeon-Ji Song · Hyunseo Kim · Suhyung Choi · Jin-Hwa Kim · Byoung-Tak Zhang

Keywords: [ cross-attention ] [ dynamics prediction ] [ Object-centric learning ] [ transformer ]


Human perception involves discerning objects based on attributes such as size, color, and texture, and predicting their movements from features such as weight and speed. This innate ability operates without conscious learning, allowing individuals to catch or avoid objects even when they are not consciously aware of doing so. Accordingly, a fundamental key to achieving higher-level cognition lies in the capability to break down intricate multi-object scenes into meaningful appearances. Object-centric representations have emerged as a promising tool for scene decomposition because they provide useful abstractions. In this paper, we propose a novel approach to unsupervised video prediction that leverages object-centric representations. Our method introduces a two-component model consisting of a slot encoder for object-centric disentanglement and a feature extraction module for masked patches. These components are integrated through a cross-attention mechanism, allowing for comprehensive spatio-temporal reasoning. Our model performs better on intricate scenes characterized by a wide range of object attributes and dynamic movements. Moreover, our approach scales across diverse synthetic environments, suggesting its potential for broader use in vision-related tasks.
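As a rough illustration of how slot representations and patch features could be combined through cross-attention, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not say which component supplies queries and which supplies keys and values, so the choice here (slots as queries, patch features as keys and values) is an assumption, as are the names `cross_attention`, `slots`, and `patches`, the random inputs, and the omission of learned projection matrices.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(slots, patches):
    """Scaled dot-product cross-attention (assumed roles, no learned weights).

    slots:   K vectors of length D, standing in for slot-encoder outputs
    patches: N vectors of length D, standing in for masked-patch features
    Returns the updated slots (K x D) and the attention weights (K x N).
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    updated, weights = [], []
    for q in slots:
        # Similarity of this slot (query) to every patch (key).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in patches]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of patch features (values) for this slot.
        updated.append(
            [sum(w[n] * patches[n][d] for n in range(len(patches))) for d in range(dim)]
        )
    return updated, weights


# Hypothetical sizes: 3 slots, 8 patches, feature dimension 4.
random.seed(0)
K, N, D = 3, 8, 4
slots = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
patches = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
updated, weights = cross_attention(slots, patches)
```

Each row of `weights` is a distribution over patches, so every slot receives a convex combination of patch features. A real model would add learned query, key, and value projections and would apply this across time steps.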
