Talk in Workshop: Representation Learning in Artificial and Biological Neural Networks

Sven Eberhardt - More Feedback, Less Depth: Approximating Human Vision with Deep Networks.

Abstract:

Recent advances in Deep Convolutional Networks (DCNs) supporting increasingly deep architectures have yielded significant gains in object recognition accuracy when trained on large labeled image databases. While a growing body of work indicates that this surge in DCN performance carries concomitant improvements in fitting both neural data from higher areas of the primate visual cortex and human psychophysical data during object recognition, key differences remain. To investigate these differences, we assess the correlation between computational models and human behavioral responses on a rapid animal vs. non-animal categorization task. We find that DCN recognition accuracy increases with higher stages of visual processing (higher-level stages indeed outperform human participants on the same task), but that human decisions agree best with predictions from intermediate stages. These results suggest that while DCNs properly model the visual features of intermediate complexity used by the human visual system, more advanced visual processing relies on mechanisms not captured by these models.

What kinds of features do humans and DCNs base their object decisions on? To test this, we introduce a competitive web-based game for discovering the features humans use for object recognition: one participant from a pair sequentially reveals parts of an object in an image until the other correctly identifies its category. Scoring image regions according to their proximity to correct recognition yields maps of visual feature importance for individual images. We find that these "realization" maps exhibit only weak correlation with relevance maps derived from DCNs or image salience algorithms. Cueing DCNs to attend to the features emphasized by these maps improves their object recognition accuracy. Our results thus suggest that realization maps identify visual features that humans deem important for object recognition but that are not adequately captured by DCNs.

Finally, we suggest a novel DCN training approach in which we base the representation on object and surface structure, rather than picture class labels, to build a more human-like visual representation.
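To make the layer-wise comparison concrete, here is a minimal sketch of the general technique (not the authors' code): fit a linear animal/non-animal readout at several stages of a pretrained network, then measure both task accuracy and agreement with per-image human choices. The model choice, pooling scheme, and data-loading conventions are all assumptions for illustration.

```python
# Sketch: layer-wise agreement between DCN readouts and human decisions on a
# rapid animal vs. non-animal task. Model (VGG-16), pooling, and data format
# are illustrative assumptions, not the talk's actual setup.
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = models.vgg16(weights="IMAGENET1K_V1").eval()

def layer_features(images, layer_idx):
    """Run images through the conv stack up to layer_idx, then pool to a vector."""
    with torch.no_grad():
        x = images
        for i, module in enumerate(model.features):
            x = module(x)
            if i == layer_idx:
                break
        pooled = torch.nn.functional.adaptive_avg_pool2d(x, 1)
        return torch.flatten(pooled, 1).numpy()

def agreement_by_layer(images, labels, human, layer_indices):
    """images: (N, 3, 224, 224) tensor; labels, human: 1-D arrays of 0/1
    (ground-truth animal label and per-image human choice, respectively)."""
    labels, human = np.asarray(labels), np.asarray(human)
    scores = {}
    for idx in layer_indices:
        feats = layer_features(images, idx)
        # Cross-validated linear readout from this processing stage.
        pred = cross_val_predict(LogisticRegression(max_iter=1000),
                                 feats, labels, cv=5)
        scores[idx] = {
            "accuracy": (pred == labels).mean(),        # task performance
            "human_agreement": (pred == human).mean(),  # match to human choices
        }
    return scores
```

Under the abstract's account, `accuracy` should rise with `layer_idx` while `human_agreement` peaks at intermediate stages.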
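Similarly, one simple way to cue a DCN with a realization map is to reweight the input image by the map before classification. The blending scheme below is an illustrative assumption, not the method from the talk.

```python
# Sketch: cueing a DCN with a "realization" map by reweighting the input.
# The map emphasizes regions humans needed for recognition; the blending
# strength alpha is an assumption.
import torch

def cue_with_map(image, realization_map, alpha=0.5):
    """Blend an image toward its map-weighted version.

    image:            (3, H, W) tensor in [0, 1]
    realization_map:  (H, W) tensor in [0, 1], 1 = most informative for humans
    alpha:            0 keeps the original image, 1 fully applies the map
    """
    m = realization_map.unsqueeze(0)   # broadcast over the channel dimension
    weighted = image * m               # suppress regions humans did not need
    return (1 - alpha) * image + alpha * weighted
```

The cued image, e.g. `cue_with_map(img, rmap)`, is then fed to the classifier in place of the original input.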
