
Learning Physical Graph Representations from Visual Scenes
Daniel Bear · Chaofei Fan · Damian Mrowca · Yunzhu Li · Seth Alter · Aran Nayebi · Jeremy Schwartz · Li Fei-Fei · Jiajun Wu · Josh Tenenbaum · Daniel Yamins

Mon Dec 07 09:00 PM -- 11:00 PM (PST) @ Poster Session 0 #131

Convolutional Neural Networks (CNNs) have proved exceptional at learning representations for visual object categorization. However, CNNs do not explicitly encode objects, parts, and their physical properties, which has limited CNNs' success on tasks that require structured understanding of visual scenes. To overcome these limitations, we introduce the idea of "Physical Scene Graphs" (PSGs), which represent scenes as hierarchical graphs, with nodes in the hierarchy corresponding intuitively to object parts at different scales, and edges to physical connections between parts. Bound to each node is a vector of latent attributes that intuitively represent object properties such as surface shape and texture. We also describe PSGNet, a network architecture that learns to extract PSGs by reconstructing scenes through a PSG-structured bottleneck. PSGNet augments standard CNNs by including: recurrent feedback connections to combine low- and high-level image information; graph pooling and vectorization operations that convert spatially uniform feature maps into object-centric graph structures; and perceptual grouping principles to encourage the identification of meaningful scene elements. We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks, especially on complex real-world images, and generalizes well to unseen object types and scene arrangements. PSGNet is also able to learn from physical motion, enhancing scene estimates even for static images. We present a series of ablation studies illustrating the importance of each component of the PSGNet architecture, analyses showing that the learned latent attributes capture intuitive scene properties, and illustrations of the use of PSGs for compositional scene inference.
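The "graph pooling and vectorization" step described in the abstract can be pictured as follows: given a per-pixel segment assignment (a grouping of pixels into putative parts), each segment becomes a graph node whose attribute vector summarizes the CNN features of its pixels. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name and mean-pooling choice are assumptions for clarity.

```python
import numpy as np

def vectorize_feature_map(features, labels):
    """Pool a spatial CNN feature map into per-segment node attribute vectors.

    features: (H, W, C) float array of convolutional features.
    labels:   (H, W) integer array assigning each pixel to a segment.

    Returns (segment_ids, nodes): segment_ids is the sorted unique segment
    labels, and nodes is a (num_segments, C) array where row i is the mean
    feature vector of the pixels in segment segment_ids[i].
    (Illustrative only; PSGNet's actual pooling is learned, not fixed.)
    """
    H, W, C = features.shape
    flat_feats = features.reshape(-1, C)
    flat_labels = labels.reshape(-1)
    segment_ids = np.unique(flat_labels)
    # One node attribute vector per segment: average of member-pixel features.
    nodes = np.stack(
        [flat_feats[flat_labels == s].mean(axis=0) for s in segment_ids]
    )
    return segment_ids, nodes
```

Stacking this operation, with each level's segments grouped again into coarser segments, yields the kind of hierarchical scene graph the abstract describes.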

Author Information

Daniel Bear (Stanford University)
Chaofei Fan (Stanford)
Damian Mrowca (Stanford University)

Young children are excellent at play, an ability to explore and (re)structure their environment that allows them to develop a remarkable visual and physical representation of their world, setting them apart from even the most advanced robots. Damian Mrowca is studying (1) representations and architectures that allow machines to efficiently develop an intuitive physical understanding of their world and (2) mechanisms that allow agents to learn such representations in a self-supervised way. Damian is a 3rd-year PhD student co-advised by Prof. Fei-Fei Li and Prof. Daniel Yamins. He received his BSc (2012) and MSc (2015) in Electrical Engineering and Information Technology, both from the Technical University of Munich. During 2014-2015 he was a visiting student with Prof. Trevor Darrell at UC Berkeley. After a year in the start-up world, exploring business applications of his research, he joined the Stanford Vision Lab and NeuroAILab in September 2016.

Yunzhu Li (MIT)
Seth Alter (MIT)
Aran Nayebi (Stanford University)
Jeremy Schwartz (MIT)
Li Fei-Fei (Stanford University & Google)
Jiajun Wu (Stanford University)
Josh Tenenbaum (MIT)

Josh Tenenbaum is an Associate Professor of Computational Cognitive Science at MIT in the Department of Brain and Cognitive Sciences and the Computer Science and Artificial Intelligence Laboratory (CSAIL). He received his PhD from MIT in 1999, and was an Assistant Professor at Stanford University from 1999 to 2002. He studies learning and inference in humans and machines, with the twin goals of understanding human intelligence in computational terms and bringing computers closer to human capacities. He focuses on problems of inductive generalization from limited data -- learning concepts and word meanings, inferring causal relations or goals -- and learning abstract knowledge that supports these inductive leaps in the form of probabilistic generative models or 'intuitive theories'. He has also developed several novel machine learning methods inspired by human learning and perception, most notably Isomap, an approach to unsupervised learning of nonlinear manifolds in high-dimensional data. He has been Associate Editor for the journal Cognitive Science, has been active on program committees for the CogSci and NIPS conferences, and has co-organized a number of workshops, tutorials and summer schools in human and machine learning. Several of his papers have received outstanding paper awards or best student paper awards at the IEEE Computer Vision and Pattern Recognition (CVPR), NIPS, and Cognitive Science conferences. He is the recipient of the New Investigator Award from the Society for Mathematical Psychology (2005), the Early Investigator Award from the Society of Experimental Psychologists (2007), and the Distinguished Scientific Award for Early Career Contribution to Psychology (in the area of cognition and human learning) from the American Psychological Association (2008).

Daniel Yamins (Stanford University)
