Workshop: Causal Representation Learning

Inverted-Attention Transformers can Learn Object Representations: Insights from Slot Attention

Yi-Fu Wu · Klaus Greff · Gamaleldin Elsayed · Michael Mozer · Thomas Kipf · Sjoerd van Steenkiste

Keywords: [ slot attention ] [ Causality-inspired representation learning ] [ unsupervised object-centric learning ]


Visual reasoning is supported by a causal understanding of the physical world, and theories of human cognition posit that a necessary step toward causal understanding is the discovery and representation of high-level entities like objects. Slot Attention is a popular method aimed at object-centric learning, and its popularity has resulted in dozens of variants and extensions. To help understand the core assumptions that lead to successful object-centric learning, we take a step back and identify the minimal set of changes to a standard Transformer architecture needed to match the performance of specialized Slot Attention models. We systematically evaluate the performance and scaling behaviour of several "intermediate" architectures on seven image and video datasets from prior work. Our analysis reveals that by simply inverting the attention mechanism of Transformers, we obtain performance competitive with state-of-the-art Slot Attention in several domains.
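
The abstract does not spell out what "inverting" the attention mechanism means. The NumPy sketch below assumes it follows the published Slot Attention design: the softmax is normalized over the query (slot) axis rather than the key (input) axis, so queries compete for each input, and the weights are then renormalized per query to form a weighted mean. The function names, shapes, and `eps` value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values, inverted=False, eps=1e-8):
    """Single-head dot-product attention (hypothetical helper).

    queries: (num_queries, d), keys: (num_keys, d), values: (num_keys, d_v).
    inverted=False: standard attention, softmax over the key axis.
    inverted=True:  assumed Slot-Attention-style inversion, softmax over the
                    query axis followed by a per-query weighted mean.
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)  # (num_queries, num_keys)
    if inverted:
        # Each key distributes its unit of attention across the queries.
        weights = softmax(logits, axis=0)
        # Renormalize per query so the update is a weighted mean of values.
        weights = weights / (weights.sum(axis=-1, keepdims=True) + eps)
    else:
        # Each query distributes its unit of attention across the keys.
        weights = softmax(logits, axis=-1)
    return weights @ values, weights


rng = np.random.default_rng(0)
slots = rng.normal(size=(4, 8))    # 4 queries ("slots"), assumed sizes
inputs = rng.normal(size=(16, 8))  # 16 input tokens used as keys and values
updates, weights = attention(slots, inputs, inputs, inverted=True)
print(updates.shape, weights.shape)
```

Under this assumption the only change from standard attention is the softmax axis plus the renormalization step; the projections, residual connections, and feed-forward layers of the Transformer block are left out of the sketch.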