
Workshop
Shared Visual Representations in Human and Machine Intelligence (SVRHM)
Arturo Deza · Joshua Peterson · N Apurva Ratan Murty · Tom Griffiths

Fri Dec 02 06:45 AM -- 04:00 PM (PST) @ Room 394-395

Fri 6:45 a.m. - 7:00 a.m.  Welcome + Opening Remarks. N Apurva Ratan Murty
Fri 7:00 a.m. - 7:10 a.m.  The bandwidth of perceptual awareness is constrained by specific high-level visual features (Invited Oral). Michael Cohen
Fri 7:10 a.m. - 7:20 a.m.  Learning to Look by Self-Prediction (Invited Oral). Matthew Grimes
Fri 7:20 a.m. - 7:40 a.m.  Do deep neural networks explain how humans represent scenes? (Talk). Iris Groen
Fri 7:40 a.m. - 8:00 a.m.  Gestalt grouping cues for understanding complex scenes: evidence from psychophysics, neuroscience, and computer vision (Talk). Dirk Bernhardt-Walther
Fri 8:00 a.m. - 8:20 a.m.  Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples (Talk). Josue Ortega Caro
Fri 8:20 a.m. - 8:40 a.m.  Using machines to test ‘why’ human face perception works the way it does (Talk). Katharina Dobs
Fri 8:40 a.m. - 9:00 a.m.  Kate Storrs (Talk)
Fri 9:00 a.m. - 10:00 a.m.  Lunch Break + Poster Session 1
Fri 10:00 a.m. - 10:10 a.m.  Top-down effects in an early visual cortex inspired hierarchical Variational Autoencoder (Invited Oral). Balázs Meszéna
Fri 10:10 a.m. - 10:20 a.m.  Predictive Dynamics Improve Noise Robustness in a Deep Network Model of the Human Auditory System (Invited Oral). Ching Fang
Fri 10:20 a.m. - 10:40 a.m.  The Perceptual Primacy of Feeling: Affectless machine vision models explain the majority of variance in visually evoked affect and aesthetics (Talk). Colin Conwell
Fri 10:40 a.m. - 11:00 a.m.  Generative Collage and its Sticky Questions on Human-AI Co-Creativity (Talk). Piotr Mirowski
Fri 11:00 a.m. - 11:20 a.m.  Line Drawings as Communication (Talk). Phillip Isola
Fri 11:20 a.m. - 11:40 a.m.  Tangible Abstractions (Talk). Tom White
Fri 11:40 a.m. - 12:00 p.m.
A computational model predicts humans' aesthetic judgments based on deep neural network feature values (Talk). Aenne Brielmann
Fri 12:00 p.m. - 1:00 p.m.  Margaret Livingstone (Keynote)
Fri 1:00 p.m. - 2:00 p.m.  Break + Poster Session 2
Fri 2:00 p.m. - 2:30 p.m.  A report on recent experimental tests of two predictions of contemporary computable models of the biological deep neural network underlying primate visual intelligence (Quest Inaugural Talk). James J DiCarlo
Fri 2:30 p.m. - 2:40 p.m.  Topographic DCNNs trained on a single self-supervised task capture the functional organization of cortex into visual processing streams (Invited Oral). Dawn Finzi
Fri 2:40 p.m. - 2:50 p.m.  Distinguishing representational geometries with controversial stimuli: Bayesian experimental design and its application to face dissimilarity judgments (Invited Oral). Tal Golan · Wenxuan Guo
Fri 2:50 p.m. - 3:10 p.m.  Evaluating neural network models of vision and audition with model metamers (Talk). Jenelle Feather
Fri 3:10 p.m. - 3:30 p.m.  Complementary Learning Systems supporting 3D object perception (Talk). tyler bonnen
Fri 3:30 p.m. - 3:50 p.m.  What's the endgame of neuroAI? (Talk). Patrick Mineault
Fri 3:50 p.m. - 4:00 p.m.  Closing Remarks, Award Ceremony and Reception. Arturo Deza

- Exploring the perceptual straightness of adversarially robust and biologically-inspired visual representations (Poster)
Humans have been shown to use a “straightened” encoding to represent the natural visual world as it evolves in time (Hénaff et al., 2019). In the context of discrete video sequences, “straightened” means that changes between frames follow a more linear path in representation space at progressively deeper levels of processing. While deep convolutional networks are often proposed as models of human visual processing, many do not straighten natural videos.
In this paper, we explore the relationship between robustness, biologically-inspired filtering mechanisms, and representational straightness in neural networks responding to time-varying input, and identify curvature as a useful way of evaluating neural network representations. We find that (1) adversarial training leads to straighter representations in both CNN and transformer-based architectures, and (2) biologically-inspired elements increase straightness in the early stages of a network but do not guarantee increased straightness in downstream layers of CNNs. Our results suggest that constraints like adversarial robustness bring computer vision models closer to human vision, but that when incorporating biological mechanisms such as V1 filtering, additional modifications are needed to more fully align human and machine representations.
Anne Harrington · Vasha DuTell · Ayush Tewari · Mark Hamilton · Simon Stent · Ruth Rosenholtz · Bill Freeman

- Image-computable Bayesian model for 3D motion estimation with natural stimuli explains human biases (Poster)
Humans estimate the 3D motion of the self and of objects in the natural environment from the 2D images formed in the two eyes. To understand how this problem should be solved, we trained Bayesian observers on naturalistic binocular videos to solve two tasks: 3D speed estimation and 3D direction estimation. The resulting normative models leverage two cues used by humans, the changing disparity and the interocular velocity difference, and show that a quadratic combination of linear filter responses is an optimal computation for speed estimation but not for direction estimation. The models reproduce the psychophysical response patterns that characterize human performance in 3D motion estimation tasks, including biases, discrimination thresholds, and counter-intuitive towards-away confusions.
These results suggest that, rather than resulting from a haphazardly constructed system, the sometimes surprising performance patterns in human 3D motion perception result from optimal visual information processing.
Daniel Herrera-Esposito · Johannes Burge

- Deep tensor factorization models of first impressions (Poster)
Machine-vision representations of faces can be aligned to people’s first impressions of others (e.g., perceived trustworthiness) to create highly predictive models of biases in social perception. Here, we use deep tensor fusion to create a unified model of first impressions that combines information from three channels: (1) visual information from pretrained machine-vision models, (2) linguistic information from pretrained language models, and (3) demographic information from self-reported demographic variables. We test the ability of the model to generalize to held-out faces, traits, and participants, and measure its fidelity to a large dataset of people’s first impressions of others.
Yangyang Yu · Jordan Suchow

- Measuring the Alignment of ANNs and Primate V1 on Luminance and Contrast Response Characteristics (Poster)
Some artificial neural networks (ANNs) are the current state-of-the-art in modeling the primate ventral stream and object recognition behavior. However, how well they align with luminance and contrast processing in early visual areas is not known. Here, we compared luminance and contrast processing in ANN models of V1 and primate V1 at the single-neuron level. Model neurons have luminance and contrast response characteristics that differ from those observed in macaque V1 neurons. In particular, model neurons have responses only weakly modulated by changes in luminance and show non-saturating responses to high-contrast stimuli. While no model perfectly matches macaque V1, there is great variability in V1 alignment across models.
Variability in luminance and contrast scores is not correlated, suggesting that there are trade-offs in the model space of ANN V1 models.
Stephanie Olaiya · Tiago Marques · James J DiCarlo

- The emergence of visual simulation in task-optimized recurrent neural networks (Poster)
Primates display remarkable prowess in making rapid visual inferences even when sensory inputs are impoverished. One hypothesis about how they accomplish this is through a process called visual simulation, in which they imagine future states of their environment using a constructed mental model. Though a growing body of behavioral findings, in both humans and non-human primates, lends credence to this hypothesis, the computational mechanisms underlying this ability remain poorly understood. In this study, we probe the capability of feedforward and recurrent neural network models to solve the Planko task, parameterized to systematically control task variability. We demonstrate that visual simulation emerges as the optimal computational strategy in deep neural networks only when task variability is high. Moreover, we provide some of the first evidence that information about imaginary future states can be decoded from the model latent representations, despite no explicit supervision. Taken together, our work suggests that the optimality of visual simulation is task-specific and provides a framework to test its mechanistic basis.
Alekh Karkada Ashok · Lakshmi Narasimhan Govindarajan · Drew Linsley · David Sheinberg · Thomas Serre

- Predictive Dynamics Improve Noise Robustness in a Deep Network Model of the Human Auditory System (Oral)
Sensory systems are robust to many types of corrupting noise. However, the neural mechanisms that drive robustness are unclear.
Empirical evidence suggests that top-down predictions are important for processing noisy stimuli, and the substantial feedback connections in primate sensory cortices have been proposed to facilitate these predictions. Here, we implement predictive dynamics in a large-scale model of the human auditory system. Specifically, we augment a feedforward deep neural network trained on noisy speech classification with a recently introduced predictive feedback scheme. We find that predictive dynamics improve speech identification across several types of corrupting noise. These performance gains were associated with denoising of network representations and alterations in layer dimensionality. Finally, we find that the model captures brain data outside of the speech domain. Overall, this work demonstrates that predictive dynamics are a candidate mechanism for human auditory robustness and provides a testbed for hypotheses regarding the dynamics of auditory representations. Additionally, we discuss the potential for this framework to provide insight into robustness mechanisms across sensory modalities.
Ching Fang · Erica Shook · Justin Buck · Guillermo Horga

- Adapting Brain-Like Neural Networks for Modeling Cortical Visual Prostheses (Poster)
Cortical prostheses are devices implanted in the visual cortex that attempt to restore lost vision by electrically stimulating neurons. Currently, the vision provided by these devices is limited, and accurately predicting the visual percepts resulting from stimulation is an open challenge. We propose to address this challenge by utilizing ‘brain-like’ convolutional neural networks (CNNs), which have emerged as promising models of the visual system. To investigate the feasibility of adapting brain-like CNNs for modeling visual prostheses, we developed a proof-of-concept model to predict the percepts resulting from electrical stimulation.
We show that a neurologically-inspired decoding of CNN activations produces qualitatively accurate phosphenes, comparable to phosphenes reported by real patients. Overall, this is an essential first step towards building brain-like models of electrical stimulation, which may not just improve the quality of vision provided by cortical prostheses but could also further our understanding of the neural code of vision.
Jacob Granley · Alexander Riedel

- Implementing Divisive Normalization in CNNs Improves Robustness to Common Image Corruptions (Poster)
Some convolutional neural networks (CNNs) have achieved state-of-the-art performance in object classification. However, they often fail to generalize to images perturbed with different types of common corruptions, impairing their deployment in real-world scenarios. Recent studies have shown that more closely mimicking biological vision in early areas such as the primary visual cortex (V1) can lead to some improvements in robustness. Here, we extended this approach and introduced, at the V1 stage of a biologically-inspired CNN, a layer based on the neuroscientific model of divisive normalization, which has been widely used to model activity in early vision. Compared to the standard base model, this new model family, the VOneNetDN, maintained clean accuracy (relative accuracy of 99%) while greatly improving robustness to common image corruptions (relative gain of 18%). The VOneNetDN showed better alignment to primate V1 for some (contrast and surround modulation) but not all response properties when compared to the model without divisive normalization. These results serve as further evidence that neuroscience can still contribute to progress in computer vision.
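The divisive-normalization computation that this abstract builds on has a standard canonical form (each unit's response is divided by the pooled activity of a normalization pool). The sketch below illustrates that general form only; the exponent, semi-saturation constant, and pool used in VOneNetDN are not specified here and are assumptions of this illustration:

```python
import numpy as np

def divisive_normalization(x, sigma=1.0, n=2.0):
    """Canonical divisive normalization, sketched for illustration.

    x : (batch, channels) array of non-negative filter responses at one
        spatial location. Each channel's response is raised to the power n
        and divided by the pooled activity of all channels at that location
        plus a semi-saturation constant sigma**n.
    """
    xn = np.maximum(x, 0.0) ** n
    return xn / (sigma ** n + xn.sum(axis=1, keepdims=True))
```

With sigma = 1 and n = 2, a unit driven alone saturates toward 1 as its input grows, and raising a neighboring channel's drive suppresses the unit's normalized response, which is the qualitative contrast behavior the abstract compares against primate V1.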
Andrew Cirincione · Reginald Verrier · Artiom Bic · Stephanie Olaiya · James J DiCarlo · Lawrence Udeigwe · Tiago Marques

- Towards Disentangling the Roles of Vision & Language in Aesthetic Experience with Multimodal DNNs (Poster)
When we experience a visual stimulus as beautiful, how much of that response is the product of ineffable perceptual computations we cannot readily describe, and how much the product of semantic or conceptual knowledge we can easily translate into natural language? Disentangling perception from language in any experience (especially aesthetics) through behavior or neuroimaging is empirically laborious and prone to debate over precise definitions of terms. In this work, we attempt to bypass these difficulties by using the learned representations of deep neural network models trained exclusively on vision, exclusively on language, or a hybrid combination of the two to predict human ratings of beauty for a diverse set of naturalistic images by way of linear decoding. We first show that while the vast majority (~75%) of explainable variance in human beauty ratings can be explained with unimodal vision models (e.g. SEER), multimodal models that learn via language alignment (e.g. CLIP) do show meaningful gains (~10%) over their unimodal counterparts (even when controlling for dataset and architecture). We then show, however, that unimodal language models (e.g. GPT2) whose outputs are conditioned directly on visual representations provide no discernible improvement in prediction, and that machine-generated linguistic descriptions of the stimuli explain a far smaller fraction (~40%) of the explainable variance in ratings compared to vision alone. Taken together, these results showcase a general methodology for disambiguating perceptual and linguistic abstractions in aesthetic judgments using models that computationally separate one from the other.
Colin Conwell · Christopher Hamblin

- Fast temporal decoding from large-scale neural recordings in monkey visual cortex (Poster)
With new developments in electrode and nanoscale technology, a large-scale multi-electrode cortical neural prosthesis with thousands of stimulation and recording electrodes is becoming viable. Such a system will be useful as both a neuroscience tool and a neuroprosthesis. In the context of a visual neuroprosthesis, a rudimentary form of vision can be presented to the visually impaired by stimulating the electrodes to induce phosphene patterns. Additional feedback in a closed-loop system can be provided by rapid decoding of recorded responses from relevant brain areas. This work looks at temporal decoding results from a dataset of 1024-electrode recordings collected from the V1 and V4 areas of a primate performing a visual discrimination task. Applying deep learning models, the peak decoding accuracy from the V1 data can be obtained with a moving time window of 150 ms across the 800 ms phase of stimulus presentation. The peak accuracy from the V4 data is achieved at a larger latency and with a larger moving time window of 300 ms. Decoding using a running window of 30 ms on the V1 data showed only a 4% drop in peak accuracy. We also determined the robustness of the decoder to electrode failure by choosing a subset of important electrodes using a previously reported algorithm for scaling the importance of inputs to a network. Results show that the accuracy of 89.6% from a network trained on the selected subset of 256 electrodes is close to the accuracy of 91.1% from using all 1024 electrodes.
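The moving-time-window analysis in this abstract can be sketched in miniature: average activity inside a sliding window and decode the stimulus from each window position. The nearest-centroid readout and synthetic data below are simple stand-ins for the deep learning models and primate recordings used in the actual work:

```python
import numpy as np

def moving_window_accuracy(responses, labels, win, step=1):
    """Decode stimulus labels from responses averaged over a moving window.

    responses : (n_trials, n_electrodes, n_bins) binned activity
    labels    : (n_trials,) integer stimulus labels
    win       : window length in bins; step : window stride in bins
    Returns a list of (window_start, held-out accuracy) pairs, using a
    nearest-centroid readout fit on half the trials.
    """
    n_trials, _, n_bins = responses.shape
    classes = np.unique(labels)
    accs = []
    for start in range(0, n_bins - win + 1, step):
        # average activity inside the current window -> one vector per trial
        X = responses[:, :, start:start + win].mean(axis=2)
        # fit class centroids on even-indexed trials, test on the rest
        train = np.arange(n_trials) % 2 == 0
        cents = np.stack([X[train & (labels == c)].mean(axis=0) for c in classes])
        d = ((X[~train, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        pred = classes[d.argmin(axis=1)]
        accs.append((start, (pred == labels[~train]).mean()))
    return accs
```

Sweeping `win` in such a sketch is the analogue of the paper's comparison between 150 ms, 300 ms, and 30 ms windows.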
Jerome Hadorn · Zuowen Wang · Bodo Rueckauer · Xing Chen · Pieter Roelfsema · Shih-Chii Liu

- Local Geometry Constraints in V1 with Deep Recurrent Autoencoders (Poster)
The classical sparse coding model represents visual stimuli as a convex combination of a handful of learned basis functions that are Gabor-like when trained on natural image data. However, the Gabor-like filters learned by classical sparse coding far overpredict well-tuned simple cell receptive field (SCRF) profiles. The autoencoder that we use to address this problem, which maintains a natural hierarchical structure when paired with a discriminative loss, is evaluated with a weighted-ℓ1 (WL) penalty that encourages self-similarity of basis function usage. The weighted-ℓ1 constraint matches the spatial phase symmetry of recent contrastive objectives while maintaining core ideas of the sparse coding framework, and it also offers a promising path to describing the differentiation of receptive fields in terms of this discriminative hierarchy in future work.
Jonathan Huml · Demba Ba

- The bandwidth of perceptual awareness is constrained by specific high-level visual features (Oral)
When observers glance at a natural scene, which aspects of that scene ultimately reach perceptual awareness? To answer this question, we showed observers images of scenes that had been altered in numerous ways in the periphery (e.g., scrambling, rotating, filtering, etc.) and measured how often these different alterations were noticed in an inattentional blindness paradigm. Then, we screened a wide range of deep convolutional neural network architectures and asked which layers and features best predict the rates at which observers noticed these alterations. We found that features in the higher (but not earlier) layers predicted how often observers noticed different alterations with extremely high accuracy (at the estimated noise ceiling).
Surprisingly, the model prediction accuracy was driven by a very small fraction of features that were both necessary and sufficient to predict the observed behavior, and that we could easily visualize. Together, these results indicate that human perceptual awareness is limited by high-level visual features that we can estimate using computational methods.
Michael Cohen · Kirsten Lydic · N Apurva Ratan Murty

- How much human-like visual experience do current self-supervised learning algorithms need in order to achieve human-level object recognition? (Poster)
This paper addresses a fundamental question: how good are our current self-supervised visual representation learning algorithms relative to humans? More concretely, how much "human-like" natural visual experience would these algorithms need in order to reach human-level performance in a complex, realistic visual object recognition task such as ImageNet? Using a scaling experiment, here we estimate that the answer is several orders of magnitude longer than a human lifetime: typically on the order of a million to a billion years of natural visual experience (depending on the algorithm used). We obtain even larger estimates for achieving human-level performance on ImageNet-derived robustness benchmarks. The exact values of these estimates are sensitive to some underlying assumptions; however, even in the most optimistic scenarios they remain orders of magnitude larger than a human lifetime.
Emin Orhan

- How do humans and machine learning models track multiple objects through occlusion? (Poster)
Interacting with a complex environment often requires us to track multiple task-relevant objects, not all of which are continually visible. The cognitive literature has focused on tracking a subset of visible identical abstract objects (e.g. circles), isolating the tracking component from its context in real-world experience.
In the real world, object tracking is harder in that objects may not be continually visible, and easier in that objects differ in appearance, so their recognition can rely on both remembered position and current appearance. Here we introduce a generalized task that combines tracking and recognition of valued objects that move in complex trajectories and frequently disappear behind occluders. Humans and models (from the computer-vision literature on object tracking) performed tasks varying widely in the number of objects to be tracked, the number of distractors, the presence of an occluder, and the appearance similarity between targets and distractors. We replicated results from the human literature, including a deterioration of tracking performance with the number and similarity of targets and distractors. In addition, we find that increasing levels of occlusion reduce performance. All models tested here behaved in qualitatively different ways from human observers, showing superhuman performance for large numbers of targets and subhuman performance under conditions of occlusion. Our framework will enable future studies to connect the human behavioral and engineering literatures, so as to test image-computable multiple-object-tracking models as models of human performance and to investigate how tracking and recognition interact under natural conditions of dynamic motion and occlusion.
Benjamin Peters · Eivinas Butkus · Nikolaus Kriegeskorte

- Workshop version: How hard are computer vision datasets? Calibrating dataset difficulty to viewing time (Poster)
Humans outperform object recognizers despite the fact that models perform well on current datasets. Numerous efforts exist to make more challenging datasets by scaling up on the web, exploring distribution shift, or adding controls for biases.
The difficulty of each image in each dataset is not independently evaluated, nor is the concept of dataset difficulty as a whole currently well defined. We develop a new dataset difficulty metric based on how long humans must view an image in order to classify a target object. Images whose objects can be recognized in 17 ms are considered easier than those that require seconds of viewing time. Using 133,588 judgments on two major datasets, ImageNet and ObjectNet, we determine the distribution of image difficulties in those datasets, which we find varies wildly but significantly undersamples hard images. Rather than hoping that distribution shift will lead to hard datasets, we should explicitly measure their difficulty. Analyzing model performance guided by image difficulty reveals that models tend to have lower performance and a larger generalization gap on harder images. We release a dataset of difficulty judgments as a complementary metric to raw performance and other behavioral/neural metrics. Such experiments with humans allow us to create a metric for progress in object recognition datasets. This metric can be used both to test the biological validity of models in a novel way and to develop tools to fill out the missing class of hard examples as datasets are being gathered.
David Mayo · Jesse Cummings · Xinyu Lin · Dan Gutfreund · Boris Katz · Andrei Barbu

- Redundancy and dependency in brain activities (Poster)
How many signals in brain activity can be erased before the encoded information is lost? Surprisingly, we found that both reconstruction and classification of voxel activities can still achieve relatively good performance even after losing 80%-90% of the signals. This raises questions about how the brain encodes information in such a robust manner. This paper investigates the redundancy and dependency of brain signals using two deep learning models with minimal inductive bias (linear layers).
Furthermore, we explored the alignment between brain and semantic representations, how redundancy differs across stimuli and regions, and the dependency between brain voxels and regions.
Sikun Lin · Thomas Sprague · Ambuj K Singh

- System identification of neural systems: If we got it right, would we know? (Poster)
Various artificial neural networks developed by engineers are now proposed as models of parts of the brain, such as the ventral stream in the primate visual cortex. After being trained on large datasets, the network activations are compared to recordings of biological neurons. A key question is how much the ability to predict neural responses actually tells us. In particular, do these functional tests of neural activation allow us to distinguish between different model architectures? We benchmark the most common identification methods by replacing the brain recordings with recordings from a known ground-truth neural network. Even in the setting where the correct model is among the candidates, we find that system identification performance is quite variable, depending significantly on factors independent of the ground-truth architecture, such as the scoring function and the dataset. In addition, we show limitations of the current approaches in identifying higher-level architectural motifs, such as convolution and attention.
Yena Han · Tomaso Poggio · Brian Cheung

- Cultural alignment of machine-vision representations (Poster)
Deep neural network representations of visual entities have been used as inputs to computational models of human mental representations.
Though these models have been increasingly successful in predicting behavioral and physiological responses, the implicit notion of “human” that they rely upon often glosses over individual-level differences in subjective beliefs, attitudes, and associations, as well as group-level cultural constructs. Here, we align machine-vision representations to the consensus among a group of respondents by extending Cultural Consensus Theory to include latent constructs structured as fine-tuned pretrained machine-vision systems. We apply the model to a large-scale dataset of people’s first impressions of others. We show that our method creates a robust mapping between machine-vision representations and culturally constructed human representations.
Necdet Gurkan · Jordan Suchow

- Generalized Predictive Coding: Bayesian Inference in Static and Dynamic models (Poster)
Predictive coding networks (PCNs) have an inherent degree of biological plausibility and perform approximate backpropagation of error in supervised settings. It is less clear how predictive coding compares to state-of-the-art architectures, such as variational autoencoders (VAEs), in unsupervised and probabilistic settings. We propose a generalized PCN that, like its inspiration in neuroscience, parameterizes hierarchical latent distributions under the Laplace approximation and maximises model evidence via iterative inference using local precision-weighted error signals. Unlike its inspiration, it uses multi-layer networks with nonlinearities between latent distributions. We compare our model to VAE and VLAE baselines on three different image datasets and find that generalized predictive coding shows performance comparable to variational autoencoders with exact error backpropagation. Finally, we evaluate the possibility of learning temporal dynamics via static prediction by encoding sequences of states in generalized coordinates of motion.
André Ofner · Beren Millidge · Sebastian Stober

- Primate Inferotemporal Cortex Neurons Generalize Better to Novel Image Distributions Than Analogous Deep Neural Networks Units (Poster)
Humans can successfully recognize objects across a variety of image distributions. Today's artificial neural networks (ANNs), on the other hand, struggle to recognize objects in many image domains, especially those that differ from the training distribution. It is currently unclear which parts of ANNs could be improved in order to close this generalization gap. In this work, we used recordings from primate high-level visual cortex (IT) to isolate whether ANNs lag behind primate generalization capabilities because of their encoder (transformations up to the penultimate layer) or their decoder (linear transformation into class labels). Specifically, we fit a linear decoder on images from one domain and evaluate transfer performance on twelve held-out domains, comparing fitting on primate IT representations vs. representations in ANN penultimate layers. For a fair comparison, we scale the number of each ANN's units so that its in-domain performance matches that of the sampled IT population (i.e. 71 IT neural sites, 73% binary-choice accuracy). We find that the sampled primate population achieves, on average, 68% performance on the held-out domains. Comparably sampled populations of ANN model units generalize less well, maintaining on average 60%. This is independent of the number of sampled units: models' out-of-domain accuracies consistently lag behind primate IT. These results suggest that making ANN model units more like primate IT will improve the generalization performance of ANNs.
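The encoder/decoder test in this abstract rests on a simple protocol: fit a frozen linear readout on features from one domain, then evaluate it unchanged on another. A minimal sketch of that protocol follows, with a ridge readout and synthetic features standing in for IT recordings or ANN activations; all names and parameters here are illustrative:

```python
import numpy as np

def domain_transfer_accuracy(feats_train, y_train, feats_test, y_test, lam=1e-2):
    """Fit a linear (ridge) readout on one domain, evaluate it frozen on another.

    feats_* : (n_samples, n_features) representation of each image
    y_*     : (n_samples,) integer class labels
    Returns classification accuracy on the test domain.
    """
    classes = np.unique(y_train)
    Y = (y_train[:, None] == classes[None, :]).astype(float)  # one-hot targets
    X = feats_train
    # ridge-regularized least squares: W = (X^T X + lam*I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)
    pred = classes[(feats_test @ W).argmax(axis=1)]
    return (pred == y_test).mean()
```

Comparing this accuracy for an in-domain test set versus a shifted test set, across two candidate feature spaces, is the shape of the comparison the abstract draws between IT populations and ANN penultimate layers.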
Marliawaty I Gusti Bagus · Tiago Marques · Sachi Sanghavi · James J DiCarlo · Martin Schrimpf

- Teacher-generated pseudo human spatial-attention labels boost contrastive learning models (Poster)
Human spatial attention conveys information about the regions of scenes that are important for performing visual tasks. Prior work has shown that the spatial distribution of human attention can be leveraged to benefit various supervised vision tasks. Might providing this weak form of supervision be useful for self-supervised representation learning? One reason why this question has not been previously addressed is that self-supervised models require large datasets, and no large dataset exists with ground-truth human attentional labels. We therefore construct an auxiliary teacher model to predict human attention, trained on a relatively small labeled dataset. This human-attention model allows us to provide (pseudo) attention labels for every ImageNet image. We then train a model with a primary contrastive objective; to this standard configuration, we add a simple output head trained to predict the attentional map for each image. We measured the quality of the learned representations by evaluating classification performance from the frozen learned embeddings. We find that our approach improves the accuracy of contrastive models on ImageNet and that its attentional-map readout aligns better with human attention than that of vanilla contrastive learning models.
Yushi Yao · Chang Ye · Junfeng He · Gamaleldin Elsayed

- Model Stitching: Looking For Functional Similarity Between Representations (Poster)
Model stitching (Lenc & Vedaldi, 2015) is a compelling methodology for comparing different neural network representations because it allows us to measure the degree to which they may be interchanged.
We expand on previous work by Bansal, Nakkiran & Barak, which used model stitching to compare representations of the same shapes learned by differently seeded and/or trained neural networks of the same architecture. Our contribution enables us to compare the representations learned by layers with different shapes from neural networks with different architectures. We subsequently reveal unexpected behavior of model stitching. Namely, we find that convolution-based stitching of small ResNets can reach high accuracy when the stitched layer comes later in the first (sender) network than in the second (receiver), even if those layers are far apart. This leads us to hypothesize that stitches are not in fact learning to match the representations expected by receiver layers, but are instead finding different representations that nonetheless yield similar results. Thus, we believe that model stitching may not always be an accurate measure of similarity.
Adriano Hernandez · Rumen Dangovski · Peter Y. Lu

- Cortical Transformers: Robustness and Model Compression with Multi-Scale Connectivity Properties of the Neocortex (Poster)
Transformer architectures in deep learning are increasingly relied on across domains with impressive results, but the observed growth of model parameters may be unsustainable, and failures in robustness limit application. The tasks targeted across domains by transformers are enabled in biology by the mammalian neocortex, yet there is no clear understanding of the relationship between processing in the neocortex and the transformer architecture. While the relationship between convolutional neural networks (CNNs) and the cortex has been studied, transformers have more complex computations and multi-scale organization, offering a richer foundation for analysis and co-inspiration.
We introduce a framework that relates details of cortical connectivity at multiple organizational scales (micro, meso, and macro) to transformer processing, and investigate how cortical connectivity principles affect performance, using the CIFAR-10-C computer vision robustness benchmark. Overall, we demonstrate the efficacy of our framework and find that incorporating components of cortical connectivity at multiple scales can reduce the number of learnable attention parameters by over an order of magnitude, while being more robust against the most challenging examples in computer vision tasks. The cortical transformer framework and the design changes we investigate are generalizable across domains, may inform the development of more efficient and robust attention-based systems, and further our understanding of the relationship between cortical and transformer processing. Link » Brian Robinson · Nathan Drenkow 🔗 - Probing Representations of Numbers in Vision and Language Models (Poster)  link » The ability to represent and reason about numbers in different contexts is an important aspect of human and animal cognition. The literature on numerical cognition posits the existence of two number representation systems: one for representing small, exact numbers, which is largely based on visual processing, and another for representing larger, approximate quantities. In this work, we investigate number sense in vision and language models by examining learned representations and asking: What is the structure of the space representing numbers? Which modality contributes most to the representation of a number? While our analyses reveal that small numbers are processed differently from large numbers, as in biological systems, we also find a strong linguistic contribution to the structure of number representations in vision and language models, highlighting a difference between representations in biology and artificial systems.
Link » Ivana Kajic · Aida Nematzadeh 🔗 - Topographic DCNNs trained on a single self-supervised task capture the functional organization of cortex into visual processing streams (Oral)  link » A key organizing principle of visual cortex is functional specialization, whether locally in the context of category-selective patches, or on a broader scale in the case of visual processing streams. Primate visual cortex has traditionally been divided into two such processing streams, though recent research suggests that there may be at least three functionally and anatomically distinct streams, extending along the ventral, lateral, and parietal surfaces of the brain. While processing streams are typically thought of in terms of the downstream behaviors and tasks they support, we ask instead whether anatomical constraints may be sufficient to produce this differentiation, even within the context of a single task objective. Comparing directly to human fMRI responses, we show that a model trained on a single task, with novel anatomical constraints (Topographic DCNN), can capture not only the functional responses but also the segregation of visual cortex into distinct processing streams. The match to human data is stronger for a self-supervised than for a supervised objective, and when the anatomical constraint, which encourages local response correlations as a proxy for minimizing wiring length, is appropriately weighted. These results suggest that the broad-scale functional organization of visual cortex into parallel processing streams may be explained by the pressure to minimize biophysical costs such as wiring length, and that local spatial constraints can, surprisingly, percolate to create broad-scale structure.
Link » Dawn Finzi · Eshed Margalit · Kendrick Kay · Dan Yamins · Kalanit Grill-Spector 🔗 - Volumetric Neural Human for Robust Pose Optimization via Analysis-by-synthesis (Poster)  link » Regression-based approaches dominate the field of 3D human pose estimation because they fit the data distribution quickly in a data-driven way. However, in this work we find that regression-based methods lack robustness under out-of-distribution conditions such as partial occlusion, due to their heavy dependence on the quality of predicted 2D keypoints, which are sensitive to partial occlusion. Inspired by neural mesh models for object pose estimation, i.e. meshes combined with neural features, we introduce a human pose optimization approach based on render-and-compare over neural features. Volume rendering, in turn, provides a better representation with accurate gradients for reasoning about occlusion. We develop a volumetric human representation and a robust inference pipeline via volume rendering with gradient-based optimization, which synthesizes neural features during inference while gradually updating the human pose to maximize feature similarity. Experiments on 3DPW show that our approach is more robust to partial occlusion while achieving competitive performance on unoccluded cases. Link » Pengliang Ji · Angtian Wang · Yi Zhang · Adam Kortylewski · Alan Yuille 🔗 - What does an Adversarial Color look like? (Poster)  link » The short answer: it depends! The long answer is that this dependence is modulated by several factors, including the architecture, dataset, optimizer, and initialization.
In general, this modulation is likely due to the fact that artificial perceptual systems are best suited for tasks that are aligned with their level of compositionality, so when these perceptual systems are optimized to perform a global task such as average color estimation instead of object recognition (which is compositional), different representations emerge in the optimized networks. In this paper, we first assess the novelty of our experiment and define what an adversarial example is in the context of the color estimation task. We then run controlled experiments in which we vary four neural network hyper-parameters in a highly controlled way: 1) the architecture, 2) the optimizer, 3) the dataset, and 4) the weight initialization. Generally, we find that a fully connected network's attack vector is sparser than a compositional CNN's, although the SGD optimizer modulates the attack vector to be less sparse regardless of the architecture. We also discover that the attack vector of a CNN is more consistent across varying datasets, and we confirm that the CNN is more robust to attacks of adversarial color. Altogether, this paper presents a first computational exploration of the qualitative assessment of the adversarial perception of color in simple neural network models, re-emphasizing that studies in adversarial robustness and vulnerability should extend beyond object recognition. Link » John Chin · Arturo Deza 🔗 - Reconstruction-guided attention improves the robustness and shape processing of neural networks (Poster)  link » Many visual phenomena suggest that humans use top-down generative or reconstructive processes to create visual percepts (e.g., imagery, object completion, pareidolia), but little is known about the role reconstruction plays in robust object recognition.
We built an iterative encoder-decoder network that generates an object reconstruction and used it as top-down attentional feedback to route the most relevant spatial and feature information to feed-forward object recognition processes. We tested this model on the challenging out-of-distribution digit recognition dataset MNIST-C, in which 15 different types of transformation and corruption are applied to handwritten digit images. Our model showed strong generalization performance against various image perturbations, on average outperforming all other models, including feedforward CNNs and adversarially trained networks. Our model is particularly robust to blur, noise, and occlusion corruptions, where shape perception plays an important role. Ablation studies further reveal two complementary roles of spatial and feature-based attention in robust object recognition, with the former largely consistent with spatial masking benefits in the attention literature (the reconstruction serves as a mask) and the latter mainly contributing to the model's inference speed (i.e., the number of time steps needed to reach a given confidence threshold) by reducing the space of possible object hypotheses. We also observed that the model sometimes hallucinates a non-existent pattern out of noise, leading to highly interpretable, human-like errors. Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism, which can help us understand the role of generative processes in human visual perception. Link » Seoyoung Ahn · Hossein Adeli · Greg Zelinsky 🔗 - Hierarchical Abstraction for Combinatorial Generalization in Object Rearrangement (Poster)  link » Object rearrangement is a challenge for embodied agents because solving these tasks requires generalizing across a combinatorially large set of underlying entities that take the value of object states. Worse, these entities are often unknown and must be inferred from sensory percepts.
We present a hierarchical abstraction approach to uncover these underlying entities and achieve combinatorial generalization from unstructured inputs. By constructing a factorized transition graph over clusters of object representations inferred from pixels, we show how to learn a correspondence between intervening on states of entities in the agent's model and acting on objects in the environment. We use this correspondence to develop a method for control that generalizes to different numbers and configurations of objects, and that outperforms current offline deep RL methods when evaluated on a set of simulated rearrangement and stacking tasks. Link » Michael Chang · Alyssa L Dayan · Franziska Meier · Tom Griffiths · Sergey Levine · Amy Zhang 🔗 - Exploring the role of image domain in self-supervised DNN models of rodent brains (Poster)  link » Biological visual systems have evolved around the efficient coding of natural image statistics in order to support recognition of complex visual patterns. Recent work has shown that deep neural networks are able to learn representations similar to those measured in visual areas in animals, suggesting they may serve as models for the brain. Varying network architectures and loss functions has been shown to modulate the biological similarity of learned representations; however, the extent to which this results from exposure to natural image statistics during training has not been fully characterized. Here, we use self-supervised learning to train neural network models across a range of data domains with different image statistics and evaluate the similarity of the learned representations to neural activity in the mouse visual cortex. We find that networks trained on different domains also exhibit different responses when shown held-out natural images. Furthermore, we find that the degree of biological similarity of the representations generally increases as a function of the naturalness of the data domain used for training.
Our results provide evidence for the idea that the training data domain is an important component when modeling the visual system using deep neural networks. Link » Aaditya Prasad · Uri Manor · Talmo Pereira 🔗 - Top-down effects in an early visual cortex inspired hierarchical Variational Autoencoder (Oral)  link » Interpreting computations in the visual cortex as learning and inference in a generative model of the environment has received wide support both in neuroscience and cognitive science. However, hierarchical computations, a hallmark of visual cortical processing, have remained out of reach for generative models because of the lack of adequate tools to address them. Here, we capitalize on advances in Variational Autoencoders (VAEs) to investigate the early visual cortex with sparse-coding two-layer hierarchical VAEs trained on natural images. We show that representations similar to those found in the primary and secondary visual cortices naturally emerge under mild inductive biases. In particular, the high-level latent space represents texture-like patterns reminiscent of the secondary visual cortex. We show that a neuroscience-inspired choice of the recognition model is important for learning noise correlations, performing image inpainting, and detecting illusory edges. We argue that top-down interactions, a key feature of biological vision, arise naturally from hierarchical inference. We also demonstrate that model predictions are in line with existing V1 measurements in macaques with regard to noise correlations and illusory contour stimuli. Link » Ferenc Csikor · Balázs Meszéna · Bence Szabó · Gergo Orban 🔗 - Comparing Intuitions about Agents' Goals, Preferences and Actions in Human Infants and Video Transformers (Poster)  link » Although AI has made large strides in recent years, state-of-the-art models still largely lack core components of social cognition which emerge early in infant development.
The Baby Intuitions Benchmark was explicitly designed to compare these "commonsense psychology" abilities in humans and machines. Recurrent neural network-based models previously applied to this dataset have been shown not to capture the desired knowledge. Here, we apply a different class of deep learning model, namely a video transformer, and show that it quantitatively matches infant intuitions more closely. However, qualitative error analyses show that the model is prone to exploiting particularities of the training data for its decisions. Link » Alice Hein · Klaus Diepold 🔗 - Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4 (Poster)  link » Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper, we provide evidence against the unexpected trend of Vision Transformers (ViTs) not being perceptually aligned with human visual representations by showing how a dual-stream Transformer, a CrossViT à la Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure, yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our Transformer-based model also achieves greater explainable variance for areas V4 and IT, and for Behaviour, than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et al., 2020). To assess the contribution of the optimization scheme relative to the CrossViT architecture, we perform several additional experiments on differently optimized CrossViTs, examining adversarial robustness, common corruption benchmarks, mid-ventral stimuli interpretation, and feature inversion.
Against our initial expectations, our family of results provides tentative support for an "all roads lead to Rome" argument, enforced via a joint optimization rule, even for models of vision that are not biologically motivated, such as Vision Transformers. Link » William Berrios · Arturo Deza 🔗 - Distinguishing representational geometries with controversial stimuli: Bayesian experimental design and its application to face dissimilarity judgments (Oral)  link » Comparing representations of complex stimuli in neural network layers to human brain representations or behavioral judgments can guide model development. However, even qualitatively distinct neural network models often predict similar representational geometries for typical stimulus sets. We propose a Bayesian experimental design approach to synthesizing stimulus sets for adjudicating among representational models. We apply our method to discriminate among alternative neural network models of behavioral face similarity judgments. Our results indicate that a neural network trained to invert a 3D-face-model graphics renderer is more human-aligned than the same architecture trained on identification, classification, or autoencoding. Our proposed stimulus synthesis objective is generally applicable to designing experiments to be analyzed by representational similarity analysis for model comparison. Link » Tal Golan · Wenxuan Guo · Heiko Schütt · Nikolaus Kriegeskorte 🔗 - Human alignment of neural network representations (Poster)  link » Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect alignment between the representations learned by neural networks and human concept representations.
Human representations are inferred from behavioral responses in an odd-one-out triplet task, in which humans were presented with three images and had to select the odd one out. We find that model scale and architecture have essentially no effect on alignment with human behavioral responses, whereas the training dataset and objective function have a much larger impact. Using a sparse Bayesian model of human conceptual representations, we partition triplets by the concept that distinguishes the two similar images from the odd one out, finding that some concepts, such as food and animals, are well represented in neural network representations, whereas others, such as royal or sports-related objects, are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans. Link » Lukas Muttenthaler · Lorenz Linhardt · Jonas Dippel · Robert Vandermeulen · Simon Kornblith 🔗 - Learning to Look by Self-Prediction (Oral)  link » We present a method for learning active vision skills, i.e. moving the camera to observe a robot's sensors from informative points of view, without external rewards or labels. We do this by jointly training a visual predictor network, which predicts future returns of the sensors from pixels, and a camera control agent, which we reward using the negative error of the predictor. The agent thus moves the camera to the points of view that are most predictive for a target sensor, which we select via a conditioning input to the agent. We show that despite this noisy learned reward function, the learned policies are competent and precisely frame the sensor at a specific location in the view, which we call an emergent fovea. We find that replacing the conventional camera with a foveal camera further increases the policies' precision.
Link » Matthew Grimes · Joseph Modayil · Piotr Mirowski · Dushyant Rao · Raia Hadsell 🔗 - Enforcing Object Permanence using Hierarchical Object-Centric Generative Models (Poster)  link » Object permanence is an important milestone in infant development, at which the infant understands that an object continues to exist even when it can no longer be seen. However, current machine learning methods devised to build a world model that predicts the future still fail at this task when dealing with longer time sequences and severe occlusions. In this paper, we compare current machine learning with infant learning and propose an object-centric approach to learning predictive models. This grounds object representations to an inferred location, effectively resolving the object permanence problem. We demonstrate performance on a novel object-permanence task in a simulated 3D environment. Link » Toon Van de Maele · Stefano Ferraro · Tim Verbelen · Bart Dhoedt 🔗 - Adding neuroplasticity to a CNN-based in-silico model of neurodegeneration (Poster)  link » The aim of this work was to enhance the biological plausibility of a deep convolutional neural network-based in-silico model of neurodegeneration of the visual system by adding neuroplasticity to it. To this end, deep convolutional networks were trained on object recognition tasks and progressively lesioned to simulate the onset of posterior cortical atrophy, a condition that affects the visual cortex in patients with Alzheimer's disease (AD). After each iteration of injury, the networks were retrained on the training set to simulate the continual plasticity of the human brain when affected by a neurodegenerative disease. More specifically, the injured parts of the network remained injured while we investigated how the added retraining steps were able to recover some of the model's baseline performance.
The results showed that with retraining, a model's object recognition abilities decline more smoothly with increasing injury levels than without retraining, and are therefore more similar to the longitudinal cognitive impairments of patients diagnosed with AD. Moreover, with retraining, the injured model exhibits internal activation patterns more similar to those of the healthy baseline model than the injured model without retraining does. In conclusion, adding retraining to the in-silico setup considerably improves the biological plausibility and could prove valuable for testing different rehabilitation approaches in silico. Link » Jasmine Moore · Matthias Wilms · Kayson Fakhar · Fatemeh Hadaeghi · Claus Hilgetag · Nils Daniel Forkert 🔗
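The lesion-then-retrain loop described in the last abstract can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: a linear model on random data stands in for the recognition network, a boolean mask stands in for the injury, and the injury schedule (10% of surviving weights per iteration) is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "recognition" task: linear regression on random data stands in
# for the object-recognition network in the abstract.
X = rng.normal(size=(200, 50))
w_true = rng.normal(size=50)
y = X @ w_true

w = np.zeros(50)                 # model weights
alive = np.ones(50, dtype=bool)  # mask of un-lesioned weights

def train(w, alive, steps=200, lr=0.01):
    """Retrain only the surviving weights; injured ones stay at zero."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
        w[~alive] = 0.0          # lesioned weights remain injured
    return w

w = train(w, alive)              # healthy baseline

# Progressive injury: lesion 10% of the surviving weights per iteration,
# retraining after each injury step (the "neuroplasticity" component).
for level in range(5):
    idx = rng.choice(np.flatnonzero(alive),
                     size=max(1, int(0.1 * alive.sum())), replace=False)
    alive[idx] = False
    w[idx] = 0.0
    w = train(w, alive)          # plasticity: recover some performance
    mse = float(np.mean((X @ w - y) ** 2))
    print(f"injury level {level + 1}: {alive.sum()} weights alive, mse={mse:.3f}")
```

With the retraining call inside the loop, the error grows gradually with injury level; commenting it out yields the abrupt decline the abstract attributes to the model without plasticity.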