
Workshop
Shared Visual Representations in Human and Machine Intelligence
Arturo Deza · Joshua Peterson · N Apurva Ratan Murty · Tom Griffiths

Mon Dec 13 06:45 AM - 03:00 PM (PST)

The goal of the 3rd Shared Visual Representations in Human and Machine Intelligence (SVRHM) workshop is to disseminate relevant, parallel findings in the fields of computational neuroscience, psychology, and cognitive science that may inform modern machine learning. In the past few years, machine learning methods, especially deep neural networks, have widely permeated the vision science, cognitive science, and neuroscience communities. As a result, scientific modeling in these fields has greatly benefited, producing a swath of potentially critical new insights into the human mind. Since human performance remains the gold standard for many tasks, these cross-disciplinary insights and analytical tools may point towards solutions to many of the current problems that machine learning researchers face (e.g., adversarial attacks, compression, continual learning, and self-supervised learning). Thus we propose to invite leading cognitive scientists with strong computational backgrounds to disseminate their findings to the machine learning community with the hope of closing the loop by nourishing new ideas and creating cross-disciplinary collaborations. In particular, this year's version of the workshop will have a heavy focus on testing new inductive biases on novel datasets as we work on tasks that go beyond object recognition.

 Mon 6:45 a.m. - 7:00 a.m. Opening Remarks (Remarks)    Live Opening Remarks for SVRHM 2021 🔗 Mon 7:00 a.m. - 7:10 a.m. Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders (Oral)  link »    Category-selectivity in the brain describes the observation that certain spatially localized areas of the cerebral cortex tend to respond robustly and selectively to stimuli from specific limited categories. One of the most well known examples of category-selectivity is the Fusiform Face Area (FFA), an area of the inferior temporal cortex in primates which responds preferentially to images of faces when compared with objects or other generic stimuli. In this work, we leverage the newly introduced Topographic Variational Autoencoder to model the emergence of such localized category-selectivity in an unsupervised manner. Experimentally, we demonstrate our model yields spatially dense neural clusters selective to faces, bodies, and places through visualized maps of Cohen's d metric. We compare our model with related supervised approaches, namely the TDANN, and discuss both theoretical and empirical similarities. Finally, we show preliminary results suggesting that our model yields a nested spatial hierarchy of increasingly abstract categories, analogous to observations from the human ventral temporal cortex. Link » T. Anderson Keller · Qinghe Gao · Max Welling 🔗 Mon 7:10 a.m. - 7:20 a.m. Learning to perceive objects by prediction (Oral)  link »    The representation of objects is the building block of higher-level concepts. Infants develop the notion of objects without supervision. The prediction error of future sensory input is likely the major teaching signal for infants. Inspired by this, we propose a new framework to extract object-centric representation from single 2D images by learning to predict future scenes in the presence of moving objects. 
We treat objects as latent causes of which the function for an agent is to facilitate efficient prediction of the coherent motion of their parts in visual input. Distinct from previous object-centric models, our model learns to explicitly infer objects' locations in a 3D environment in addition to segmenting objects. Further, the network learns a latent code space where objects with the same geometric shape and texture/color frequently group together. The model requires no supervision or pre-training of any part of the network. We created a new synthetic dataset with more complex textures on objects and background and found several previous models not based on predictive learning overly rely on clustering colors and lose specificity in object segmentation. Our work demonstrates a new approach for learning symbolic representation grounded in sensation and action. Link » Tushar Arora · Li Erran Li · Mingbo Cai 🔗 Mon 7:20 a.m. - 7:40 a.m. Yukiyasu Kamitani: "High-performance DNNs are not brain-like" (Invited Talk) 🔗 Mon 7:40 a.m. - 8:00 a.m. Roland Fleming: "Learning to See Stuff" (Invited Talk) 🔗 Mon 8:00 a.m. - 8:20 a.m. Gemma Roig: "Modeling the human brain from invariance and robustness to clutter towards multimodal, multi-task and continuous learning models" (Invited Talk) 🔗 Mon 8:20 a.m. - 8:40 a.m. Wieland Brendel: "How Well do Feature Visualizations Support Causal Understanding of CNN Activations?" (Invited Talk) 🔗 Mon 8:40 a.m. - 9:00 a.m. Stephane Deny: "Learning transformations from data via recurrent latent operators" (Invited Talk) 🔗 Mon 9:00 a.m. - 10:00 a.m. KDSalBox: A toolbox of efficient knowledge-distilled saliency models (Poster)  link » Dozens of saliency models have been designed over the last few decades, targeted at diverse applications ranging from image compression and retargeting to robot navigation, surveillance, and distractor detection. 
Barriers to their use include the different and often incompatible software environments that they rely on, as well as the computational inefficiency of older implementations. For application purposes, models are then frequently chosen based on convenience and efficiency, at the expense of optimizing for task performance. To facilitate the evaluation and selection of saliency models for different applications, we present KDSalBox, a toolbox of 10 knowledge-distilled saliency models. Using the original model implementations available in their native environments, we produce saliency training data for efficient MobileNet-based architectures that are identical in their architecture but differ in how they learn to distribute saliency over an image. The resulting toolbox allows these 10 models to be efficiently run, compared, and practically applied. Link » Ard Kastrati · Zoya Bylinskii · Eli Shechtman 🔗 Mon 9:00 a.m. - 10:00 a.m. Seeking the Building Blocks of Visual Imagery and Creativity in a Cognitively Inspired Neural Network (Poster)  link » How do we imagine visual objects, and combine them to create new forms? To answer this question, we need to explore the cognitive, computational and neural mechanisms underlying imagery and creativity. The body of research on deep learning models with creative behaviors is growing. However, in this paper we suggest that the complexity of such models and their training sets is an impediment to using them as tools to understand human aspects of creativity. We propose using simpler models, inspired by neural and cognitive mechanisms, that are trained with smaller data sets. We show that a standard deep learning architecture can demonstrate imagery by generating shape/color combinations using only symbolic codes as input. However, generating a new combination that was not experienced by the model was not possible. 
We discuss the limitations of such models, and explain how creativity could be embedded by incorporating mechanisms to transform the network’s output into new combinations and use that as new training data. Link » Shekoofeh Hedayati · Roger Beaty · Brad Wyble 🔗 Mon 9:00 a.m. - 10:00 a.m. V1 and IT representations are directly accessible to human visual perception (Poster)  link » Human visual recognition of complex patterns is supported by hierarchical representations in the ventral stream of visual cortex. However, it remains undetermined whether representations in early visual cortical areas, e.g. primary visual cortex (V1), are directly accessible by perception or are merely intermediates used only for the generation of more complex representations in higher-level visual areas, e.g. inferior temporal cortex (IT). Here, we constructed deep convolutional neural network (dCNN) based simulations of V1 and IT by linearly weighting dCNN features to maximize predictivity of electrophysiological responses. We used these cortical simulations to synthesize stimuli which linearly interpolate through either a V1- or IT-like feature space. In a visual discrimination task, we found that human observers are highly sensitive to variation through both V1 and IT representational spaces. We found that behavior on this task cannot be explained by an observer model that makes use of solely V1 features or IT features, but instead is best explained by a weighted combination of V1 and IT features. Our results thus provide evidence for the insufficiency of IT representations and the necessity of representations in both early and late regions of the ventral visual stream to support perception. Link » Akshay Jagadeesh · Justin Gardner 🔗 Mon 9:00 a.m. - 10:00 a.m. On the use of Cortical Magnification and Saccades as Biological Proxies for Data Augmentation (Poster)  link » Self-supervised learning is a strong way to learn useful representations from the bulk of natural data. 
It's suggested to be responsible for building the visual representation in humans, but the specific objective and algorithm are unknown. Currently, most self-supervised methods encourage the system to learn an invariant representation of different transformations of the same image in contrast to those of other images. However, such transformations are generally not biologically plausible, and often consist of contrived perceptual schemes such as random cropping and color jittering. In this paper, we attempt to reconfigure these augmentations to be more biologically or perceptually plausible while still conferring the same benefits for encouraging a good representation. Critically, we find that random cropping can be substituted by cortical magnification, and saccade-like sampling of the image could also assist the representation learning. The feasibility of these transformations suggests a potential way that biological visual systems could implement self-supervision. Further, they break the widely accepted spatially-uniform processing assumption used in many computer vision algorithms, suggesting a role for spatially-adaptive computation in humans and machines alike. Link » Binxu Wang · David Mayo · Arturo Deza · Andrei Barbu · Colin Conwell 🔗 Mon 9:00 a.m. - 10:00 a.m. Exploring the Structure of Human Adjective Representations (Poster)  link » Human semantic representations are both difficult to capture and hard to fully interpret. Similarity judgments of words are highly sensitive to context, and association norms may only capture coarse similarity. By contrast, feature norms are more interpretable, and the number of norms can be scaled without limit, but they often only exist for sets of nouns described with concrete norms. In this paper, we introduce a new large dataset of nouns normed by a set of continuous adjective ratings both concrete and abstract. 
We compare our dataset to other forms of representation and find that they capture rich, unique structure, which can be represented by a low-dimensional latent semantic space. We further relate our data to neural network representations from different modalities. Our dataset contributes to an increasingly detailed picture of one relatively sizable swath of human semantic representations, and can be used in a variety of modeling paradigms. Link » Karan Grewal · Joshua Peterson · Bill Thompson · Tom Griffiths 🔗 Mon 9:00 a.m. - 10:00 a.m. Combining Different V1 Brain Model Variants to Improve Robustness to Image Corruptions in CNNs (Poster)  link » While some convolutional neural networks (CNNs) have surpassed human visual abilities in object classification, they often struggle to recognize objects in images corrupted with different types of common noise patterns, highlighting a major limitation of this family of models. Recently, it has been shown that simulating a primary visual cortex (V1) at the front of CNNs leads to small improvements in robustness to these image perturbations. In this study, we start with the observation that different variants of the V1 model show gains for specific corruption types. We then build a new model using an ensembling technique, which combines multiple individual models with different V1 front-end variants. The model ensemble leverages the strengths of each individual model, leading to significant improvements in robustness across all corruption categories and outperforming the base model by 38% on average. Finally, we show that using distillation, it is possible to partially compress the knowledge in the ensemble model into a single model with a V1 front-end. 
While the ensembling and distillation techniques used here are hardly biologically-plausible, the results presented here demonstrate that by combining the specific strengths of different neuronal circuits in V1 it is possible to improve the robustness of CNNs for a wide range of perturbations. Link » Avinash Baidya · Joel Dapello · James J DiCarlo · Tiago Marques 🔗 Mon 9:00 a.m. - 10:00 a.m. Benchmarking human visual search computational models in natural scenes: models comparison and reference datasets (Poster)  link » Visual search is an essential part of almost any everyday human goal-directed interaction with the environment. Nowadays, several algorithms are able to predict gaze positions during simple observation, but few models attempt to simulate human behavior during visual search in natural scenes. Furthermore, these models vary widely in their design and exhibit differences in the datasets and metrics with which they were evaluated. Thus, there is a need for a reference point, on which each model can be tested and from where potential improvements can be derived. In the present work, we select publicly available state-of-the-art visual search models in natural scenes and evaluate them on different datasets, employing the same metrics to estimate their efficiency and similarity with human subjects. In particular, we propose an improvement to the Ideal Bayesian Searcher through a combination with a neural network-based visual search model, enabling it to generalize to other datasets. The present work sheds light on the limitations of current models and how potential improvements can be accomplished by combining approaches. Moreover, it moves forward on providing a solution for the urgent need for benchmarking data and metrics to support the development of more general human visual search computational models. Link » Fermín Travi · Gonzalo Ruarte · Gaston Bujia · Juan Esteban Kamienkowski 🔗 Mon 9:00 a.m. - 10:00 a.m. 
Shared Visual Representations of Drawing for Communication: How do different biases affect human interpretability and intent? (Poster)  link » We present an investigation into how representational losses can affect the drawings produced by artificial agents playing a communication game. Building upon recent advances, we show that a combination of powerful pretrained encoder networks, with appropriate inductive biases, can lead to agents that draw recognisable sketches, whilst still communicating well. Further, we start to develop an approach to help automatically analyse the semantic content being conveyed by a sketch and demonstrate that current approaches to inducing perceptual biases lead to a notion of objectness being a key feature despite the agent training being self-supervised. Link » Daniela Mihai · Jonathon Hare 🔗 Mon 9:00 a.m. - 10:00 a.m. Out-of-distribution robustness: Limited image exposure of a four-year-old is enough to outperform ResNet-50 (Poster)  link » Recent gains in model robustness towards out-of-distribution images are predominantly achieved through ever-increasing large-scale datasets. While this approach is very effective in achieving human-level distortion robustness, it raises the question of whether human robustness, too, requires massive amounts of experience. We therefore investigated the developmental trajectory of human object recognition robustness by comparing children aged 4–6, 7–9, 10–12, 13–15 against adults and against different neural networks. Assessing how recognition accuracy degrades when images are distorted by salt-and-pepper noise, we find that while overall performance improves with age, even the youngest children in the study showed remarkable robustness and outperformed standard CNNs on moderately distorted images. Using a back-of-the-envelope calculation, we then estimated the number of images that those young children had been exposed to during their lifetime. 
Even if we assume that a new image is seen every 2 seconds of wake time, children aged 4–6 only saw approximately 50M images, which is already lower than 90 epochs × 1.3M images during standard ImageNet training. This indicates that human out-of-distribution robustness develops very early and may not require seeing billions of different images during lifetime given the right choice of representation and information processing optimised during evolution. Link » Lukas Huber · Robert Geirhos · Felix A. Wichmann 🔗 Mon 9:00 a.m. - 10:00 a.m. Contrastive Learning Through Time (Poster)  link » Contrastive learning has emerged as a powerful form of unsupervised representation learning for images. The utility of learned representations for downstream tasks depends strongly on the chosen augmentation operations. Taking inspiration from biology, we here study contrastive learning through time (CLTT), which works completely without any augmentation operations. Instead, positive pairs of images are generated from temporally close video frames during extended naturalistic interaction with objects. We propose a family of CLTT algorithms based on state-of-the-art contrastive learning methods and test them on three data sets. We demonstrate that CLTT allows linear classification performance that approaches that of the fully supervised setting. We also consider temporal structure resulting from one object being seen systematically before or after another object. We show that this leads to increased representational similarity between these objects ("close in time, will align"), matching classic biological findings. The data sets and code for this paper can be downloaded at: (to be added for final version). Link » Felix Maximilian Schneider · Xia Xu · Markus Ernst · Zhengyang Yu · Jochen Triesch 🔗 Mon 9:00 a.m. - 10:01 a.m. 
Convolutional Networks are Inherently Foveated (Poster)  link » When convolutional layers apply no padding, central pixels have more ways to contribute to the convolution than peripheral pixels. Such discrepancy grows exponentially with the number of layers, leading to implicit foveation of the input pixels. We show that this discrepancy can persist even when padding is applied. In particular, with the commonly-used zero-padding, foveation effects are significantly reduced but not eliminated. We explore how different aspects of convolution arithmetic impact the spread and magnitude of foveation, and elaborate on which alternative padding techniques can mitigate it. Finally, we compare our findings with foveation in human vision, concluding that both effects are likely of the same nature and have similar implications. Link » Bilal Alsallakh · Vivek Miglani · Narine Kokhlikyan · David Adkins · Orion Reblitz-Richardson 🔗 Mon 9:00 a.m. - 10:00 a.m. Neural Structure Mapping For Learning Abstract Visual Analogies (Poster)  link » Building conceptual abstractions from sensory information and then reasoning about them is central to human intelligence. Abstract reasoning both relies on, and is facilitated by, our ability to make analogies about concepts from known domains to novel domains. Structure Mapping Theory of human analogical reasoning posits that analogical mappings rely on (higher-order) relations and not on the sensory content of the domain. This enables humans to reason systematically about novel domains, a problem with which machine learning (ML) models tend to struggle. We introduce a two-stage neural framework, which we label Neural Structure Mapping (NSM), to learn visual analogies from Raven's Progressive Matrices, an abstract visual reasoning test of fluid intelligence. 
Our framework uses (1) a multi-task visual relationship encoder to extract constituent concepts from raw visual input in the source domain, and (2) a neural module net analogy inference engine to reason compositionally about the inferred relation in the target domain. Our NSM approach (a) isolates the relational structure from the source domain with high accuracy, and (b) successfully utilizes this structure for analogical reasoning in the target domain. Link » Shashank Shekhar · Graham Taylor 🔗 Mon 9:00 a.m. - 10:00 a.m. Bio-inspired Min-Nets Improve the Performance and Robustness of Deep Networks (Poster)  link » Min-Nets are inspired by end-stopped cortical cells with units that output the minimum of two learned filters. We insert such min-units into state-of-the-art deep networks, such as the popular ResNet and DenseNet, and show that the resulting Min-Nets perform better on the CIFAR-10 benchmark. Moreover, we show that Min-Nets are more robust against JPEG compression artifacts. We argue that the minimum operation is the simplest way of implementing an AND operation on pairs of filters, and that such AND operations introduce a bias that is appropriate given the statistics of natural images. Link » Philipp Gruening · Erhardt Barth 🔗 Mon 9:00 a.m. - 10:00 a.m. Boxhead: A Dataset for Learning Hierarchical Representations (Poster)  link » Disentanglement is hypothesized to be beneficial towards a number of downstream tasks. However, a common assumption in learning disentangled representations is that the data generative factors are statistically independent. As current methods are almost solely evaluated on toy datasets where this ideal assumption holds, we investigate their performance in hierarchical settings, a relevant feature of real-world data. In this work, we introduce Boxhead, a dataset with hierarchically structured ground-truth generative factors. 
We use this novel dataset to evaluate the performance of state-of-the-art autoencoder-based disentanglement models and observe that hierarchical models generally outperform single-layer VAEs in terms of disentanglement of hierarchically arranged factors. Link » Yukun Chen · Andrea Dittadi · Frederik Träuble · Stefan Bauer · Bernhard Schölkopf 🔗 Mon 9:00 a.m. - 10:00 a.m. What Matters In Branch Specialization? Using a Toy Task to Make Predictions (Poster)  link » What motivates the brain to allocate tasks to different regions and what distinguishes multiple-demand brain regions and the tasks they perform from ones in highly specialized areas? Here we explore these neuroscientific questions using a purely computational framework and theoretical insights. In particular, we focus on how branches of a neural network learn representations contingent on their architecture and optimization task. We train branched neural networks on families of Gabor filters as the input training distribution and optimize them to perform combinations of angle, average color, and size approximation tasks. We find that networks predictably allocate tasks to the branches with appropriate inductive biases. However, this task-to-branch matching is not required for branch specialization, as even identical branches in a network tend to specialize. Finally, we show that branch specialization can be controlled by a curriculum in which tasks are alternated instead of jointly trained. Longer training between alternation corresponds to more even task distribution among branches, providing a possible model for multiple-demand regions in the brain. Link » Chenguang Li · Arturo Deza 🔗 Mon 9:00 a.m. - 10:00 a.m. Cyclic orthogonal convolutions for long-range integration of features (Poster)  link » In Convolutional Neural Networks (CNNs) information flows across a small neighbourhood of each pixel of an image, preventing long-range integration of features before reaching deep layers in the network. 
Inspired by the neurons of the human visual cortex responding to similar but distant visual features, we propose a novel architecture that allows efficient information flow between features $z$ and locations $(x,y)$ across the entire image with a small number of layers. This architecture uses a cycle of three orthogonal convolutions, not only in $(x,y)$ coordinates, but also in $(x,z)$ and $(y,z)$ coordinates. We stack a sequence of such cycles to obtain our deep network, named CycleNet. When compared to CNNs of similar size, our model obtains competitive results at image classification on CIFAR-10 and ImageNet datasets. We hypothesise that long-range integration favours recognition of objects by shape rather than texture, and we show that CycleNet transfers better than CNNs to stylised images. On the Pathfinder challenge, where integration of distant features is crucial, CycleNet outperforms CNNs by a large margin. Code has been made available at: https://github.com/netX21/Submission Link » Federica Freddi · Jezabel Garcia · Michael Bromberg · Sepehr Jalali · Da-shan Shiu · Alvin Chua · Alberto Bernacchia 🔗 Mon 9:00 a.m. - 10:00 a.m. Category-orthogonal object features guide information processing in recurrent neural networks trained for object categorization (Poster)  link » Recurrent neural networks (RNNs) have been shown to perform better than feedforward architectures in visual object categorization, especially in challenging conditions such as cluttered images. However, little is known about the role recurrent information flow plays in these computations. Here we test an RNN trained for object categorization on the hypothesis that recurrence iteratively aids object categorization via the communication of category-orthogonal auxiliary variables. 
Using diagnostic linear readouts, we find that: (a) information about auxiliary variables increases across time in all network layers, (b) this information is indeed present in the recurrent information flow, and (c) its manipulation affects task performance. These observations confirm the hypothesis that category-orthogonal auxiliary variable information is conveyed through recurrent connectivity and is used to optimize category judgements in cluttered environments. Link » Sushrut Thorat · Giacomo Aldegheri · Tim Kietzmann 🔗 Mon 9:00 a.m. - 10:00 a.m. Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders (Poster)  link » Category-selectivity in the brain describes the observation that certain spatially localized areas of the cerebral cortex tend to respond robustly and selectively to stimuli from specific limited categories. One of the most well known examples of category-selectivity is the Fusiform Face Area (FFA), an area of the inferior temporal cortex in primates which responds preferentially to images of faces when compared with objects or other generic stimuli. In this work, we leverage the newly introduced Topographic Variational Autoencoder to model the emergence of such localized category-selectivity in an unsupervised manner. Experimentally, we demonstrate our model yields spatially dense neural clusters selective to faces, bodies, and places through visualized maps of Cohen's d metric. We compare our model with related supervised approaches, namely the TDANN, and discuss both theoretical and empirical similarities. Finally, we show preliminary results suggesting that our model yields a nested spatial hierarchy of increasingly abstract categories, analogous to observations from the human ventral temporal cortex. Link » T. Anderson Keller · Qinghe Gao · Max Welling 🔗 Mon 9:00 a.m. - 10:00 a.m. 
Learning to perceive objects by prediction (Poster)  link » The representation of objects is the building block of higher-level concepts. Infants develop the notion of objects without supervision. The prediction error of future sensory input is likely the major teaching signal for infants. Inspired by this, we propose a new framework to extract object-centric representation from single 2D images by learning to predict future scenes in the presence of moving objects. We treat objects as latent causes of which the function for an agent is to facilitate efficient prediction of the coherent motion of their parts in visual input. Distinct from previous object-centric models, our model learns to explicitly infer objects' locations in a 3D environment in addition to segmenting objects. Further, the network learns a latent code space where objects with the same geometric shape and texture/color frequently group together. The model requires no supervision or pre-training of any part of the network. We created a new synthetic dataset with more complex textures on objects and background and found several previous models not based on predictive learning overly rely on clustering colors and lose specificity in object segmentation. Our work demonstrates a new approach for learning symbolic representation grounded in sensation and action. Link » Tushar Arora · Li Erran Li · Mingbo Cai 🔗 Mon 10:00 a.m. - 10:10 a.m. Controlled-rearing studies of newborn chicks and deep neural networks (Oral)  link »    Convolutional neural networks (CNNs) can now achieve human-level performance on challenging object recognition tasks. CNNs are also the leading quantitative models in terms of predicting neural and behavioral responses in visual recognition tasks. 
However, there is a widely accepted critique of CNN models: unlike newborn animals, which learn rapidly and efficiently, CNNs are thought to be “data hungry,” requiring massive amounts of training data to develop accurate models for object recognition. This critique challenges the promise of using CNNs as models of visual development. Here, we directly examined whether CNNs are more data hungry than newborn animals by performing parallel controlled-rearing experiments on newborn chicks and CNNs. We raised newborn chicks in strictly controlled visual environments, then simulated the training data available in that environment by constructing a virtual animal chamber in a video game engine. We recorded the visual images acquired by an agent moving through the virtual chamber and used those images to train CNNs. When CNNs were provided with similar visual training data as chicks, the CNNs successfully solved the same challenging view-invariant object recognition tasks as the chicks. Thus, the CNNs were not more data hungry than animals: both CNNs and chicks successfully developed robust object models from training data of a single object. Link » Donsuk Lee · Pranav Gujarathi · Justin Wood 🔗 Mon 10:10 a.m. - 10:20 a.m. Multimodal neural networks better explain multivoxel patterns in the hippocampus (Oral)  link »    The human hippocampus possesses "concept cells", neurons that fire when presented with stimuli belonging to a specific concept, regardless of the modality. Recently, similar concept cells were discovered in a multimodal network called CLIP [1]. Here, we ask whether CLIP can explain the fMRI activity of the human hippocampus better than a purely visual (or linguistic) model. We extend our analysis to a range of publicly available uni- and multi-modal models. We demonstrate that "multimodality" stands out as a key component when assessing the ability of a network to explain the multivoxel activity in the hippocampus. 
Link » Bhavin Choksi · Milad Mozafari · Rufin VanRullen · Leila Reddy 🔗 Mon 10:20 a.m. - 10:40 a.m. Michelle Greene: "What we don't see can hurt us: dataset bias and its implications" (Invited Talk) 🔗 Mon 10:40 a.m. - 11:00 a.m. Zoya Bylinskii: "Why does where people look matter? New trends & applications of visual attention modeling" (Invited Talk) 🔗 Mon 11:00 a.m. - 11:20 a.m. Maryam Vaziri-Pashkam: "Beyond labeling THINGS-In-3D: is one visual pathway enough?" (Invited Talk) 🔗 Mon 11:20 a.m. - 11:40 a.m. Xavier Boix: "Robustness to Transformations Across Categories: Is Robustness Driven by Invariant Neural Representations?" (Invited Talk) 🔗 Mon 12:00 p.m. - 1:00 p.m. Bio-inspired learnable divisive normalization for ANNs (Poster)  link » In this work we introduce DivNormEI, a novel bio-inspired convolutional network that performs divisive normalization, a canonical cortical computation, along with lateral inhibition and excitation that is tailored for integration into modern Artificial Neural Networks (ANNs). DivNormEI, an extension of prior computational models of divisive normalization in the primate primary visual cortex, is implemented as a modular layer that can be integrated in a straightforward manner into most commonly used modern ANNs. DivNormEI normalizes incoming activations via learned non-linear within-feature shunting inhibition along with across-feature linear lateral inhibition and excitation. In this work, we show how the integration of DivNormEI within a task-driven self-supervised encoder-decoder architecture encourages the emergence of the well-known contrast-invariant tuning property found to be exhibited by simple cells in the primate primary visual cortex. Additionally, the integration of DivNormEI into an ANN (VGG-9 network) trained to perform image classification on ImageNet-100 improves both sample efficiency and top-1 accuracy on a held-out validation set. 
We believe our findings from the bio-inspired DivNormEI model that simultaneously explains properties found in primate V1 neurons and outperforms the competing baseline architecture on large-scale object recognition will promote further investigation of this crucial cortical computation in the context of modern machine learning tasks and ANNs. Link » Vijay Veerabadran · Ritik Raina · Virginia de Sa 🔗 Mon 12:00 p.m. - 1:00 p.m. What can 5.17 billion regression fits tell us about artificial models of the human visual system? (Poster)  link » Rapid simultaneous advances in machine vision and cognitive neuroimaging present an unparalleled opportunity to assess the current state of artificial models of the human visual system. Here, we perform a large-scale benchmarking analysis of 72 modern deep neural network models to characterize with robust statistical power how differences in architecture and training task contribute to the prediction of human fMRI activity across 16 distinct regions of the human visual system. We find: one, that even stark architectural differences (e.g. the absence of convolution in transformers and MLP-mixers) have very little consequence in emergent fits to brain data; two, that differences in task have clear effects--with categorization and self-supervised models showing relatively stronger brain predictivity across the board; three, that feature reweighting leads to substantial improvements in brain predictivity, without overfitting -- yielding model-to-brain regression weights that generalize at the same level of predictivity to brain responses over 1000s of new images. Broadly, this work presents a lay-of-the-land for the emergent correspondences between the feature spaces of modern deep neural network models and the representational structure inherent to the human visual system. Link » Colin Conwell · Jacob Prince · George Alvarez · Talia Konkle 🔗 Mon 12:00 p.m. - 1:00 p.m. 
Unsupervised Representation Learning Facilitates Human-like Spatial Reasoning (Poster)  link » When judging the sameness of three-dimensional (3D) objects that differ by a rotation, response time typically increases with the angle of rotation. This increase is usually taken as evidence for mental rotation, but the extent to which low-level perceptual mechanisms contribute to this phenomenon is unclear. To investigate this, we built a neural model that breaks down this computation into two stages: a fast feedforward stage that extracts low-dimensional latent representations of the objects being compared, and a slow recurrent processing stage that compares those representations to arrive at a decision by accumulating evidence at a rate that is proportional to the proximity of the representations. We found that representation of 3D objects learned by a generic autoencoder was sufficient to emulate human response times using this model. We conclude that perceptual representations may play a key role in limiting the speed of spatial reasoning. We discuss our findings in the context of the mental rotation hypothesis and identify additional, as yet unverified representational constraints that must be satisfied by neural systems that perform mental rotation. Link » Kaushik Lakshminarasimhan · Colin Conwell 🔗 Mon 12:00 p.m. - 1:00 p.m. Finding Biological Plausibility for Adversarially Robust Features via Metameric Tasks (Poster)  link » Recent work suggests that feature constraints in the training datasets of deep neural networks (DNNs) drive robustness to adversarial noise (Ilyas et al., 2019). The representations learned by such adversarially robust networks have also been shown to be more human perceptually-aligned than non-robust networks via image manipulations (Santurkar et al., 2019, Engstrom et al., 2019). 
Despite appearing closer to human visual perception, it is unclear if the constraints in robust DNN representations match biological constraints found in human vision. Human vision seems to rely on texture-based/summary statistic representations in the periphery, which have been shown to explain phenomena such as crowding (Balas et al., 2009) and performance on visual search tasks (Rosenholtz et al., 2012). To understand how adversarially robust optimizations/representations compare to human vision, we performed a psychophysics experiment using a metamer task similar to Freeman & Simoncelli (2011), Wallis et al. (2016), and Deza et al. (2019), where we evaluated how well human observers could distinguish between images synthesized to match adversarially robust representations compared to non-robust representations and a texture synthesis model of peripheral vision (Texforms a la Long et al., 2018). We found that the discriminability of robust representation and texture model images decreased to near chance performance as stimuli were presented farther in the periphery. Moreover, performance on robust and texture-model images showed similar trends within participants, while performance on non-robust representations changed minimally across the visual field. These results together suggest that (1) adversarially robust representations capture peripheral computation better than non-robust representations and (2) robust representations capture peripheral computation similar to current state-of-the-art texture peripheral vision models. More broadly, our findings support the idea that localized texture summary statistic representations may drive human invariance to adversarial perturbations and that the incorporation of such representations in DNNs could give rise to useful properties like adversarial robustness. Link » Anne Harrington · Arturo Deza 🔗 Mon 12:00 p.m. - 1:00 p.m. 
Exploiting 3D Shape Bias towards Robust Vision (Poster)  link » Robustness research in machine vision faces a challenge. Many variants of ImageNet-scale robustness benchmarks have been proposed, only to reveal that current vision systems fail under distributional shifts. Although aiming for higher robustness accuracy on these benchmarks is important, we also observe that simply using larger models and larger training datasets may not lead to true robustness, demanding further innovation. To tackle the problem from a new perspective, we encourage closer collaboration between the robustness and 3D vision communities. This proposal is inspired by human vision, which is surprisingly robust to environmental variation, including both naturally occurring disturbances and artificial corruptions. We hypothesize that such robustness, at least in part, arises from our ability to infer 3D geometry from 2D retinal projections. In this work, we take a first step toward testing this hypothesis by viewing 3D reconstruction as a pretraining method for building more robust vision systems. We introduce a novel dataset called Geon3D, which is derived from objects that emphasize variation across shape features to which the human visual system is thought to be particularly sensitive. This dataset enables, for the first time, a controlled setting where we can isolate the effect of “3D shape bias” in robustifying neural networks, and informs new approaches for increasing robustness by exploiting 3D vision tasks. Using Geon3D, we find that CNNs pretrained on 3D reconstruction are more resilient to viewpoint change, rotation, and shift than regular CNNs. Further, when combined with adversarial training, 3D reconstruction pretrained models improve adversarial and common corruption robustness over vanilla adversarially-trained models. 
We hope that our findings and dataset will encourage exploitation of synergies between the robustness researchers, 3D computer vision community, and computational perception researchers in cognitive science, paving a way for achieving human-like robustness under complex, real-world stimuli conditions. Link » Yutaro Yamada · Yuval Kluger · Sahand Negahban · Ilker Yildirim 🔗 Mon 12:00 p.m. - 1:00 p.m. Evaluating the Adversarial Robustness of a Foveated Texture Transform Module in a CNN (Poster)  link » Biologically inspired mechanisms such as foveation and multiple fixation points have previously been shown to help alleviate adversarial examples (Reddy et al., 2020). By mimicking the effects of visual crowding present in human vision, foveated, texture-based computations may provide another route for increasing adversarial robustness. Previous statistical models of texture rendering (Portilla & Simoncelli, 2000; Gatys et al., 2015) paved the way for the development of a Foveated Texture Transform (FTT) module which utilizes localized texture synthesis in foveated receptive fields (Deza et al., 2017). The FTT module was added to a VGG-11 CNN architecture and ten random initializations were trained on 20-class subsets of the Places and EcoSet datasets for scene and object classification respectively. The trained networks were attacked using Projected Gradient Descent (PGD) and the adversarial accuracies were calculated at multiple epochs to evaluate changes in robustness as the networks trained. The results indicate that the FTT module significantly improved adversarial robustness for scene classification, especially when the validation loss was at a minimum. However, the FTT module did not provide a statistically significant increase in adversarial robustness for object classification. 
Furthermore, we do not find a trade-off between accuracy and robustness (Tsipras et al., 2018) for the FTT module, suggesting a benefit of using foveated, texture-based distortions in the latent space during learning compared to non-perturbed latent space representations. Finally, we investigate the nature of latent space distortions with additional controls that probe other directions in the latent space that are not texture-based. Link » Jonathan Gant · Andrzej Banburski · Arturo Deza 🔗 Mon 12:00 p.m. - 1:00 p.m. Are models trained on temporally-continuous data streams more adversarially robust? (Poster)  link » Task-optimized convolutional neural networks are the most quantitatively accurate models of the primate visual system. Unlike humans, however, these models can easily be fooled by modifying their inputs with human-imperceptible image perturbations, resulting in poor adversarial robustness. Prior work showed that modifying a model's training objective or its architecture can improve its adversarial robustness. Another ingredient in building computational models of sensory cortex is the training dataset and, to our knowledge, its effect on a model's adversarial robustness has not been investigated. Motivated by observations that chicks develop more invariant visual representations with more temporally-continuous visual experience, we here evaluate a model's adversarial robustness when it is trained on a more naturalistic dataset---a longitudinal video dataset collected from the perspective of infants (SAYCam; Sullivan et al., 2020). By evaluating the adversarial robustness of models on 26-way classification of a set of annotated video frames from this dataset, we find that models that have been pre-trained on SAYCam video frames are more robust than those that have been pre-trained on ImageNet. 
Our results suggest that to build models that are adversarially robust, additional efforts should be made in curating datasets that are more similar to the natural image sequences and the visual experience infants receive. Link » Nathan Kong · Anthony Norcia 🔗 Mon 12:00 p.m. - 1:00 p.m. In Silico Modelling of Neurodegeneration Using Deep Convolutional Neural Networks (Poster)  link » Although current research aims to use and improve deep learning networks by applying knowledge about the structure and function of the healthy human brain and vice versa, the potential of using such networks to model neurodegenerative diseases remains largely understudied. In this work, we present a novel feasibility study modeling dementia in silico with deep convolutional neural networks. Therefore, deep convolutional neural networks were fully trained to perform visual object recognition, and then progressively injured in two distinct ways. More precisely, damage was progressively inflicted mimicking neuronal as well as synaptic injury. Synaptic injury was applied by randomly deleting weights in the network, while neuronal injury was simulated by removing full nodes or filters in the network. After each iteration of injury, network object recognition accuracy was evaluated. Saliency maps were generated using the uninjured and injured networks and quantitatively compared using the structural similarity index measure for test set images to further investigate the loss of visual cognition. The quantitative evaluation revealed cognitive function of the network progressively decreased with increasing injury load. This effect was more pronounced for synaptic damage. As damage increased, the model focus shifted away from the main objects in the images and became more dispersed. This shift in attention was quantitatively evidenced by a decrease in the structural similarity index measure comparing the saliency maps of corresponding uninjured and injured models, as a function of injury. 
The results of this study provide a promising foundation to develop in silico models of neurodegenerative diseases using deep learning networks. The effects of neurodegeneration found for the in silico model are especially similar to the loss of visual cognition seen in patients with posterior cortical atrophy. Link » Jasmine Moore · Anup Tuladhar · Zahinoor Ismail · Nils Daniel Forkert 🔗 Mon 12:00 p.m. - 1:00 p.m. Signal Strength and Noise Drive Feature Preference in CNN Image Classifiers (Poster)  link » Feature preference in Convolutional Neural Network (CNN) image classifiers is integral to their decision making process, and while the topic has been well studied, it is still not understood at a fundamental level. We test a range of task relevant feature attributes (including shape, texture, and color) with varying degrees of signal and noise in highly controlled CNN image classification experiments using synthetic datasets to determine feature preferences. We find that CNNs will prefer features with stronger signal strength and lower noise irrespective of whether the feature is texture, shape, or color. This provides guidance for a predictive model for task relevant feature preferences, demonstrates pathways for bias in machine models that can be avoided with careful controls on experimental setup, and suggests that comparisons between how humans and machines prefer task relevant features in vision classification tasks should be revisited. Link » Max Wolff · Stuart Wolff 🔗 Mon 12:00 p.m. - 1:00 p.m. A finer mapping of convolutional neural network layers to the visual cortex (Poster)  link » There is increasing interest in understanding similarities and differences between convolutional neural networks (CNNs) and the visual cortex. A common approach is to use features extracted from intermediate CNN layers to fit brain encoding models. Each brain region is then typically associated with the best predicting layer. 
However, this winner-take-all mapping is non-robust, because consecutive CNN layers are strongly correlated and have similar prediction accuracies. Moreover, the winner-take-all approach ignores potential complementarities between layers to predict brain activity. To address this issue, we propose to fit a joint model on all layers simultaneously. The model is fit with banded ridge regression, grouping features by layer, and learning a separate regularization hyperparameter per feature space. By performing a selection over layers, this model effectively removes non-predictive or redundant layers and disentangles the contributions of each layer on each voxel. This model leads to increased prediction accuracy and to finer mappings of layer selectivity. Link » Tom Dupre la Tour · Michael Lu · Michael Eickenberg · Jack Gallant 🔗 Mon 12:00 p.m. - 1:00 p.m. Controlled-rearing studies of newborn chicks and deep neural networks (Poster)  link » Convolutional neural networks (CNNs) can now achieve human-level performance on challenging object recognition tasks. CNNs are also the leading quantitative models in terms of predicting neural and behavioral responses in visual recognition tasks. However, there is a widely accepted critique of CNN models: unlike newborn animals, which learn rapidly and efficiently, CNNs are thought to be “data hungry,” requiring massive amounts of training data to develop accurate models for object recognition. This critique challenges the promise of using CNNs as models of visual development. Here, we directly examined whether CNNs are more data hungry than newborn animals by performing parallel controlled-rearing experiments on newborn chicks and CNNs. We raised newborn chicks in strictly controlled visual environments, then simulated the training data available in that environment by constructing a virtual animal chamber in a video game engine. 
We recorded the visual images acquired by an agent moving through the virtual chamber and used those images to train CNNs. When CNNs were provided with similar visual training data as chicks, the CNNs successfully solved the same challenging view-invariant object recognition tasks as the chicks. Thus, the CNNs were not more data hungry than animals: both CNNs and chicks successfully developed robust object models from training data of a single object. Link » Donsuk Lee · Pranav Gujarathi · Justin Wood 🔗 Mon 12:00 p.m. - 1:00 p.m. Multimodal neural networks better explain multivoxel patterns in the hippocampus (Poster)  link » The human hippocampus possesses “concept cells”, neurons that fire when presented with stimuli belonging to a specific concept, regardless of the modality. Recently, similar concept cells were discovered in a multimodal network called CLIP [1]. Here, we ask whether CLIP can explain the fMRI activity of the human hippocampus better than a purely visual (or linguistic) model. We extend our analysis to a range of publicly available uni- and multi-modal models. We demonstrate that “multimodality” stands out as a key component when assessing the ability of a network to explain the multivoxel activity in the hippocampus. Link » Bhavin Choksi · Milad Mozafari · Rufin VanRullen · Leila Reddy 🔗 Mon 1:00 p.m. - 1:20 p.m. Tiago Marques: "From primary visual cortex to object recognition | The 2022 Brain-Score competition" (Invited Talk) 🔗 Mon 1:20 p.m. - 1:40 p.m. Kohitij Kar: "Role of recurrent computations in primate visual object recognition" (Invited Talk) 🔗 Mon 1:40 p.m. - 2:00 p.m. Yalda Mohsenzadeh: "Understanding, Predicting, and Manipulating Image Memorability with Representation Learning" (Invited Talk) 🔗 Mon 2:00 p.m. - 2:20 p.m. Ruben Coen-Cagli: "Measuring and modeling perceptual segmentation in natural vision" (Invited Talk) 🔗 Mon 2:20 p.m. - 2:40 p.m. 
Ruth Rosenholtz: "Understanding Peripheral Vision: Lessons Learned About Vision in General" (Invited Talk) 🔗 Mon 2:50 p.m. - 3:00 p.m. Closing Remarks + Award Presentation (Remarks)    SVRHM's 2021 Closing Remarks + Award Presentation 🔗