Neural models tend to learn similar representations when subject to similar stimuli; this behavior has been observed both in biological and artificial settings. The emergence of these similar representations is igniting a growing interest in the fields of neuroscience and artificial intelligence. To gain a theoretical understanding of this phenomenon, promising directions include analyzing the learning dynamics and studying the problem of identifiability in the functional and parameter space. This has strong consequences for unlocking a plethora of applications in ML, from model fusion and model stitching to model reuse, and for improving the understanding of biological and artificial neural models. The objective of the workshop is to discuss theoretical findings, empirical evidence, and practical applications of this phenomenon, benefiting from the cross-pollination of different fields (ML, neuroscience, cognitive science) to foster the exchange of ideas and encourage collaborations. Overall, the questions we aim to investigate are when, why, and how the internal representations of distinct neural models can be unified into a common representation.
Fri 6:15 a.m. - 6:30 a.m.
Opening Remarks (Talk)

Characterizing pre-trained and task-adapted molecular representations (Poster)
Pre-trained deep learning models are emerging fast as a tool for enhancing scientific workflow and accelerating scientific discovery. Representation learning is a fundamental task to study the molecular structure–property relationship, which is then leveraged for predicting the molecular properties or designing new molecules with desired attributes. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. We propose an unsupervised method to characterize embeddings of pre-trained models through the lens of non-parametric group property-driven subset scanning (SS). We assess its detection capabilities with extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet) across predictive chemical language models (MoLFormer, ChemBERTa) and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve as a result of domain adaptation by finetuning or low-dimensional projection. Experiments reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning as well as projection techniques. For example, among the top-$120$ most-common elements in the embedding (out of $\approx 700$), only $11$ property-driven elements are shared between the three tasks (BACE, BBBP, and HIV), while $\approx 70$-$80$ of those are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task- and modality-agnostic.
Celia Cintas · Payel Das · Jarret Ross · Brian Belgodere · Girmaw Abebe Tadesse · Vijil Chenthamarakshan · Jannis Born · Skyler D. Speakman
Duality of Bures and Shape Distances with Implications for Comparing Neural Representations (Poster)
A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most (dis)similarity measures fall into one of two categories. First, measures such as linear regression, canonical correlation analysis (CCA), and shape distances all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics that are already invariant to such symmetries (e.g. by comparing stimulus-by-stimulus kernel matrices). Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and draw contrasts of these measures with CKA, a popular similarity measure in the deep learning literature.
Sarah Harvey · Brett Larsen · Alex Williams
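The stated duality is easy to probe numerically. Below is a minimal sketch of the summary-statistic side, computing NBS from two stimulus-by-neuron response matrices; the function name and shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def nbs(X, Y):
    """Normalized Bures similarity between two response matrices.

    X, Y: (num_stimuli, num_neurons) responses to the same stimuli.
    NBS compares the centered stimulus-by-stimulus kernel matrices,
    so it is invariant to rotations of the neural axes.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Kx, Ky = X @ X.T, Y @ Y.T
    root = sqrtm(Kx)
    fidelity = np.trace(np.real(sqrtm(root @ Ky @ root)))
    return fidelity / np.sqrt(np.trace(Kx) * np.trace(Ky))
```

Per the duality, this value should match the cosine of the Riemannian shape distance computed after aligning the two representations.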
WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability (Poster)
Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform is a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between the wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA), which facilitates attention learning in a learnable wavelet coefficient space: it replaces the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences onto multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in input space via a backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space, using either fixed or adaptive wavelets, can consistently improve Transformer's performance and also significantly outperform learning in Fourier space. We further show our method can enhance Transformer's reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task.
Yufan Zhuang · Zihan Wang · Fangbo Tao · Jingbo Shang
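The three-step recipe lends itself to a compact sketch. The following illustrative code (assuming PyWavelets and a toy identity-projection attention, not the paper's learnable transform) shows the forward-attend-reconstruct loop:

```python
import numpy as np
import pywt

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def toy_attention(x):
    # scaled dot-product self-attention with identity projections, x: (len, dim)
    return softmax(x @ x.T / np.sqrt(x.shape[-1])) @ x

def wavspa_layer(x, wavelet="db2", level=2):
    coeffs = pywt.wavedec(x, wavelet, level=level, axis=0)  # (1) forward transform
    coeffs = [toy_attention(c) for c in coeffs]             # (2) attend in coefficient space
    return pywt.waverec(coeffs, wavelet, axis=0)            # (3) reconstruct input space

out = wavspa_layer(np.random.randn(64, 16))  # (sequence length, embedding dim)
```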
NEUCORE: Neural Concept Reasoning for Composed Image Retrieval (Poster)
Composed image retrieval, which combines a reference image and a text modifier to identify the desired target image, is a challenging task that requires the model to comprehend both vision and language modalities and their interactions. Existing approaches focus on holistic multi-modal interaction modeling and ignore the composed and complementary properties of the reference image and text modifier. In order to better utilize the complementarity of multi-modal inputs for effective information fusion and retrieval, we move multi-modal understanding to fine granularity at the concept level, and learn multi-modal concept alignment to identify the visual location in the reference or target image corresponding to the text modifier. To this end, we propose a NEUral COncept REasoning (NEUCORE) model that incorporates multi-modal concept alignment and progressive multi-modal fusion over aligned concepts. Specifically, considering that the text modifier may refer to semantic concepts not present in the reference image that need to be added to the target image, we learn multi-modal concept alignment between the text modifier and the concatenation of reference and target images, under a multiple-instance learning framework with image- and sentence-level weak supervision. Furthermore, based on aligned concepts, to form discriminative fusion features of the input modalities for accurate target image retrieval, we propose a progressive fusion strategy with a unified execution architecture instantiated by the attended language semantic concepts. Our proposed approach is evaluated on three datasets and achieves state-of-the-art results.
Shu Zhao · Huijuan Xu
Grokking as Simplification: A Nonlinear Complexity Perspective (Poster)
We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. We define the linear mapping number (LMN) to measure network complexity, which is a generalized version of the linear region number for ReLU networks. LMN can nicely characterize neural network compression before generalization. Although the $L_2$ norm has been popular for characterizing model complexity, we argue in favor of LMN for a number of reasons: (1) LMN can be naturally interpreted as information/computation, while $L_2$ cannot. (2) In the compression phase, LMN has nice linear relations with test losses, while $L_2$ is correlated with test losses in a complicated nonlinear way. (3) LMN also reveals an intriguing phenomenon of the XOR network switching between two generalization solutions, while $L_2$ does not. Besides explaining grokking, we argue that LMN is a promising candidate for the neural network version of Kolmogorov complexity, since it explicitly considers local or conditioned linear computations aligned with the nature of modern artificial neural networks.
Ziming Liu · Ziqian Zhong · Max Tegmark
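LMN generalizes the linear region count of ReLU networks. As a rough illustration of the underlying quantity (a plain region count over sampled inputs, not the paper's full LMN), one can count distinct activation patterns of a ReLU MLP:

```python
import numpy as np

def count_linear_regions(weights, biases, X):
    """Count distinct ReLU activation patterns visited by inputs X.

    Two inputs lie in the same linear region iff every hidden unit
    has the same on/off state for both, so the number of unique
    patterns is a simple proxy for the network's effective complexity.
    """
    patterns, h = [], X
    for W, b in zip(weights, biases):
        pre = h @ W + b
        patterns.append(pre > 0)
        h = np.maximum(pre, 0)
    codes = np.concatenate(patterns, axis=1).astype(np.uint8)
    return len({tuple(row) for row in codes})

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(2, 32)), rng.normal(size=(32, 32))]
bs = [rng.normal(size=32), rng.normal(size=32)]
print(count_linear_regions(Ws, bs, rng.normal(size=(1000, 2))))
```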
Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference (Poster)
We propose a method to improve the efficiency and accuracy of amortized Bayesian inference (ABI) by leveraging universal symmetries in the probabilistic joint model $p(\theta, y)$ of parameters $\theta$ and data $y$. In a nutshell, we invert Bayes' theorem and estimate the marginal likelihood based on approximate representations of the joint model. Upon perfect approximation, the marginal likelihood is constant across all parameter values by definition. However, approximation error leads to undesirable variance in the marginal likelihood estimates across different parameter values. We formulate violations of this symmetry as a loss function to accelerate the learning dynamics of conditional neural density estimators. We apply our method to a bimodal toy problem with an explicit likelihood (likelihood-based) and a realistic model with an implicit likelihood (simulation-based).
Marvin Schmitt · Daniel Habermann · Paul-Christian Bürkner · Ullrich Köthe · Stefan Radev
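The core identity is easy to state: inverting Bayes' theorem gives log p(y) = log p(y|theta) + log p(theta) - log p(theta|y) for any theta, so the spread of this quantity across parameter draws measures approximation error. A hedged sketch with hypothetical callables:

```python
import torch

def self_consistency_loss(log_lik, log_prior, log_q, theta, y):
    """Variance of the implied log marginal likelihood across draws.

    log_lik, log_prior, log_q are hypothetical callables returning
    one value per draw in `theta`. For a perfect posterior
    approximation log_q, the implied log p(y) is constant in theta,
    so its variance is a trainable measure of self-inconsistency.
    """
    log_marginal = log_lik(y, theta) + log_prior(theta) - log_q(theta, y)
    return log_marginal.var()
```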
Efficient Multimodal Alignment: To Freeze or Not to Freeze? (Poster)
Language-image pretraining creates a joint representation space between the two modalities where images and texts with similar semantic information lie close to each other. Language-image models are often trained from scratch without taking advantage of unimodal pretrained models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet1K validation set, at two orders of magnitude lower training cost. In this work, we highlight the importance of unfreezing the CLS tokens of uni-modal transformer encoders to create a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy from 37.5% to 19.4% on the 38 evaluation benchmarks.
Till Aczel · Roger Wattenhofer
What Does Knowledge Distillation Distill? (Poster)
Knowledge distillation is an increasingly used compression method due to the popularity of large-scale models, but it is unclear if all the information a teacher model contains is distilled into the smaller student model. We aim to formalize the concept of 'knowledge' to investigate how knowledge is transferred during distillation, focusing on shared invariances to counterfactual changes to dataset latent variables (which we call mechanisms). We define good stand-in student models for the teacher as models that share the teacher's mechanisms, and find Jacobian matching and contrastive representation learning to be viable methods for achieving good students. While these methods do not result in perfect transfer of mechanisms, they are likely to improve student fidelity or mitigate simplicity bias (as measured by teacher-student KL divergence and accuracy on various out-of-distribution test datasets), especially on datasets with certain spurious statistical correlations.
Cindy Wu · Ekdeep S Lubana · Bruno Mlodozeniec · Robert Kirk · David Krueger
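Of the two viable methods named above, Jacobian matching is simple to sketch: penalize the mismatch between input-gradients of teacher and student. The summed-logit gradient below is a cheap stand-in for the full Jacobian; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def jacobian_matching_loss(teacher, student, x):
    """Match d(logits)/d(input) between teacher and student."""
    x = x.clone().requires_grad_(True)
    # teacher gradient is a fixed target (no graph kept)
    g_teacher = torch.autograd.grad(teacher(x).sum(), x)[0]
    # student gradient stays in the graph so the loss can be optimized
    g_student = torch.autograd.grad(student(x).sum(), x, create_graph=True)[0]
    return F.mse_loss(g_student, g_teacher)
```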
Comparing Representational and Functional Similarity in Small Transformer Language Models (Poster)
In many situations, it would be helpful to be able to characterize the solution learned by a neural network, including for answering scientific questions (e.g. how do architecture changes affect generalization) and addressing practical concerns (e.g. auditing for potentially unsafe behavior). One approach is to try to understand these models by studying the representations that they learn---for example, comparing whether two networks learn similar representations. However, it is not always clear how much representation-level analyses can tell us about how a model makes predictions. In this work, we explore this question in the context of small Transformer language models, which we train on a synthetic, hierarchical language task. We train models with different sizes and random initializations, evaluating performance over the course of training and on a variety of systematic generalization splits. We find that existing methods for measuring representation similarity are not always correlated with behavioral metrics---i.e. models with similar representations do not always make similar predictions---and the results vary depending on the choice of representation. Our results highlight the importance of understanding representations in terms of the role they play in the neural algorithm.
Dan Friedman · Andrew Lampinen · Lucas Dixon · Danqi Chen · Asma Ghandeharioun
Representational constraints underlying similarity between task-optimized neural systems (Poster)
In this study, we investigate the similarity of representations between biological and artificial visual systems that are optimized for object recognition. We propose that this similarity could be a result of constraints on the representations of task-optimized systems, which necessitate the development of an abstraction from the input stimuli. To measure this, we constructed a two-dimensional coordinate system in which we measured the distance of each neural representation from the pixel space and the class space. Our results show that proximity in this space predicts the similarity of neural representations between different visual systems. We observe that the trajectories of representations in any given task-optimized visual neural network start close to the pixel space and gradually move towards higher abstract representations such as categories. This suggests that the similarity between different task-optimized systems is due to constraints on representational trajectories, as revealed by the abstraction space. We present abstraction space as a simple yet effective analysis tool for drawing inferences about the representations of neural networks and for uncovering the constraints that lead to similar representations in different visual systems.
Tahereh Toosi
A Compact Representation for Bayesian Neural Networks By Removing Permutation Symmetry (Poster)
Bayesian neural networks (BNNs) are a principled approach to modeling predictive uncertainties in deep learning, which are important in safety-critical applications. Since exact Bayesian inference over the weights in a BNN is intractable, various approximate inference methods exist, among which sampling methods such as Hamiltonian Monte Carlo (HMC) are often considered the gold standard. While HMC provides high-quality samples, it lacks interpretable summary statistics because its sample mean and variance are meaningless in neural networks due to permutation symmetry. In this paper, we first show that the role of permutations can be meaningfully quantified by a number-of-transpositions metric. We then show that the recently proposed rebasin method allows us to summarize HMC samples into a compact representation that provides a meaningful explicit uncertainty estimate for each weight in a neural network, thus unifying sampling methods with variational inference. We show that this compact representation allows us to compare trained BNNs directly in weight space across sampling methods and variational inference, and to efficiently prune neural networks trained without explicit Bayesian frameworks by exploiting uncertainty estimates from HMC.
Tim Xiao · Weiyang Liu · Robert Bamler
ReWaRD: Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development (Poster)
Computational models trained on large amounts of natural images are the state of the art for studying human vision -- usually adult vision. Computational models of infant vision and its further development are gaining more and more attention in the community. In this work we aim at the very beginning of our visual experience -- pre- and post-natal retinal waves, which are thought to act as a pre-training mechanism for the human visual system at a very early stage of development. We see this approach as an instance of a biologically plausible, data-driven inductive bias through pre-training. We built a computational model that mimics this developmental mechanism by pre-training different artificial convolutional neural networks with simulated retinal wave images. The resulting features of this biologically plausible pre-training closely match the V1 features of the human visual system. We show that the performance gain from pre-training with retinal waves is similar to that of a state-of-the-art pre-training pipeline. Our framework contains the retinal wave generator, as well as a training strategy, which can be a first step in a curriculum-learning-based training diet for various models of development. We release code, data, and trained networks to build a basis for future work on visual development, and to support curriculum-learning approaches that include prenatal development for studying innate vs. learned properties of the human visual system. An additional benefit of our pre-trained networks for neuroscience or computer vision applications is the absence of biases inherited from datasets like ImageNet.
Benjamin Cappell · Andreas Stoll · Chukwudi Umah · Bernhard Egger
Multimodal decoding of human brain activity into images and text (Poster)
Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its enhanced flexibility compared to brain decoding into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We utilized the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employed the Generative Image-to-text Transformer (GIT) as our backbone for captioning and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporated depth maps from the ControlNet model to further guide the reconstruction process. We propose a multimodal approach that leverages similarities between neural and deep learning representations; by learning alignments between these spaces, we produce textual descriptions and image reconstructions from brain activity. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications based on a combination of high-level semantic information coming from text and low-level image shape information coming from depth maps and initial guess images.
Matteo Ferrante · Tommaso Boccato · Furkan Ozcelik · Rufin VanRullen · Nicola Toschi
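The decoding backbone described here is a set of regularized linear maps from voxels to model features. A minimal sketch with made-up shapes (RidgeCV chooses the regularization strength by cross-validation):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# hypothetical stand-ins: fMRI responses and target model features
fmri = np.random.randn(800, 5000)    # (n_stimuli, n_voxels)
feats = np.random.randn(800, 768)    # (n_stimuli, n_feature_dims)

# one regularized linear regression from brain activity to features
decoder = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(fmri, feats)
pred_feats = decoder.predict(fmri)   # fed downstream to the captioning/diffusion stage
```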
Predictive variational autoencoder for learning robust representations of time-series data (Poster)
Variational autoencoders (VAEs) have been used extensively to discover low-dimensional latent factors governing neural activity and animal behavior. However, without careful model selection, the uncovered latent factors may reflect noise in the data rather than true underlying features, rendering such representations unsuitable for scientific interpretation. Existing solutions to this problem involve introducing additional measured variables or data augmentations specific to a particular data type. We propose a VAE architecture that predicts the next point in time and show that it mitigates the learning of spurious features. In addition, we introduce a model selection metric based on smoothness over time in the latent space. We show that, together, these two constraints encouraging VAEs to be smooth over time produce robust latent representations and faithfully recover latent factors on synthetic datasets.
Julia Wang · Dexter Tsin · Tatiana Engel
Instruction-tuned LLMs with World Knowledge are More Aligned to the Human Brain (Poster)
Instruction-tuning is a widely adopted method of finetuning that enables large language models (LLMs) to generate output that more closely resembles human responses to natural language queries, in many cases leading to human-level performance on diverse testbeds. However, it remains unclear whether instruction-tuning truly makes LLMs more similar to how humans process language. We investigate the effect of instruction-tuning on brain alignment, the similarity of LLM internal representations to neural activity in the human language system. We assess 25 vanilla and instruction-tuned LLMs across three datasets involving humans reading naturalistic stories and sentences, and discover that instruction-tuning generally enhances brain alignment by an average of 6%. To identify the factors underlying LLM-brain alignment, we compute the correlation between the brain alignment of LLMs and various model properties, such as model size, performance on problem-solving benchmarks, and performance on benchmarks requiring world knowledge spanning various domains. Notably, we find a strong positive correlation between brain alignment and model size (r = 0.95), as well as performance on tasks requiring world knowledge (r = 0.81). Our results demonstrate that instruction-tuning LLMs improves both world knowledge representations and human brain alignment, suggesting that mechanisms that encode world knowledge in LLMs also improve representational alignment to the human brain.
Khai Loong Aw · Syrielle Montariol · Badr AlKhamissi · Martin Schrimpf · Antoine Bosselut
On Transferring Expert Knowledge from Tabular Data to Images (Poster)
Transferring knowledge across modalities has garnered significant attention in the field of machine learning as it enables the utilization of expert knowledge from diverse domains. In particular, the representation of expert knowledge in tabular form, commonly found in fields such as medicine, can greatly enhance the comprehensiveness and accuracy of image-based learning. However, the transfer of knowledge from tabular to image data presents unique challenges due to the distinct characteristics of these data types, making it challenging to determine "how to reuse" and "which subset to reuse". To address this, we propose a novel method called CHannel tAbulaR alignment with optiMal tranSport (CHARMS) that automatically and effectively transfers relevant tabular knowledge. Specifically, by maximizing the mutual information between a group of channels and tabular features, our method modifies the visual embedding and captures the semantics of tabular knowledge. The alignment between channels and attributes helps select the subset of tabular data whose knowledge transfers to images. Experimental results demonstrate that CHARMS effectively reuses tabular knowledge to improve the performance and interpretability of visual classifiers.
Jun-Peng Jiang · Han-Jia Ye · Leye Wang · Yang Yang · Yuan Jiang · De-Chuan Zhan
On the Robustness of Neural Collapse and the Neural Collapse of Robustness (Poster)
Neural Collapse refers to the curious phenomenon at the end of training of a neural network, where feature vectors and classification weights converge to a very simple geometrical arrangement (a simplex). While it has been observed empirically in various cases and has been theoretically motivated, its connection with crucial properties of neural networks, like their generalization and robustness, remains unclear. In this work, we study the stability properties of these simplices. We find that the simplex structure disappears under small adversarial attacks, and that perturbed examples "leap" between simplex vertices. We further analyze the geometry of networks that are optimized to be robust against adversarial perturbations of the input, and find that Neural Collapse is a pervasive phenomenon in these cases as well, with clean and perturbed representations forming aligned simplices, and giving rise to a robust simple nearest-neighbor classifier. By studying the propagation of the amount of collapse inside the network, we identify novel properties of both robust and non-robust machine learning models, and show that earlier, unlike later, layers maintain reliable simplices on perturbed data.
Jingtong Su · Ya Shi Zhang · Nikolaos Tsilivis · Julia Kempe
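Neural Collapse has a concrete numerical signature: centered class means become equinorm with pairwise cosine similarity -1/(C-1) (a simplex equiangular tight frame). A small sketch for checking this on extracted features (names and shapes are illustrative):

```python
import numpy as np

def simplex_etf_cosines(features, labels):
    """Pairwise cosines of centered class means.

    Under Neural Collapse with C classes, every off-diagonal entry
    should approach -1/(C-1).
    """
    classes = np.unique(labels)
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    means = means - means.mean(axis=0)              # center by the global mean
    means = means / np.linalg.norm(means, axis=1, keepdims=True)
    print("target off-diagonal cosine:", -1.0 / (len(classes) - 1))
    return means @ means.T
```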
MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation (Poster)
Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset.
Muhammad Osama Khan · Junbang Liang · Chun-Kai Wang · Shan Yang · Yu Lou
Linearly Structured World Representations in Maze-Solving Transformers (Poster)
The emergence of seemingly similar representations across tasks and neural architectures suggests that convergent properties may underlie sophisticated behavior. One form of representation that seems particularly fundamental to reasoning in many artificial (and perhaps natural) networks is the formation of world models, which decompose observed task structures into re-usable perceptual primitives and task-relevant relations. In this work, we show that auto-regressive transformers tasked with solving mazes learn to linearly represent the structure of mazes, and that the formation of these representations coincides with a sharp increase in generalization performance. Furthermore, we find preliminary evidence for Adjacency Heads which may play a role in computing valid paths through mazes.
Michael Ivanitskiy · Alexander Spies · Tilman Räuker · Guillaume Corlouer · Christopher Mathwin · Lucia Quirke · Can Rager · Rusheb Shah · Dan Valentine · Cecilia Diniz Behn · Katsumi Inoue · Samy Wu Fung
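A standard way to test for linear structure of this kind is a linear probe on frozen activations; the sketch below uses random stand-in data purely to show the recipe, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# hypothetical stand-ins: residual-stream activations at one layer and
# binary labels indicating whether two maze cells are connected
acts = rng.normal(size=(5000, 512))
labels = rng.integers(0, 2, size=5000)

# if maze structure is linearly represented, a logistic-regression
# probe on frozen activations should decode it well
probe = LogisticRegression(max_iter=1000).fit(acts, labels)
print("probe accuracy:", probe.score(acts, labels))
```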
A General Method for Testing Bayesian Models using Neural Data (Poster)
Bayesian models have been successful in explaining human and animal behavior, but the extent to which they can also explain neural activity is still an open question. A major obstacle to answering this question is that current methods for generating neural predictions require detailed and specific assumptions about the encoding of posterior beliefs in neural responses, with no consensus or decisive data about the nature of this encoding. Here, we present a new method that overcomes these challenges for a wide class of probabilistic encodings -- including the two major classes of neural sampling and distributed distributional codes -- and prove conditions for its validity. Our method tests whether the relationships between the model posteriors for different stimuli match the relationships between the corresponding neural responses -- akin to representational similarity analysis (RSA), a widely used method for nonprobabilistic models. Finally, we present a new model comparison diagnostic for our method, based not on the agreement of the model with the data directly, but on the alignment of the model and data when injecting noise in our neural prediction generation method. We illustrate our method using simulated V1 data and compare two Bayesian models that are practically indistinguishable using behavior alone. Our results show a powerful new way to rigorously test Bayesian models on neural data.
Gabor Lengyel · Sabyasachi Shivkumar · Ralf Haefner
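An RSA-style comparison of this kind can be sketched in a few lines, since representations enter only through their pairwise dissimilarity structure (names and shapes are illustrative, not the paper's implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_posteriors, neural_responses):
    """Correlate the dissimilarity structure of model and brain.

    model_posteriors: (n_stimuli, n_params) posterior summaries
    neural_responses: (n_stimuli, n_neurons) recorded responses
    """
    d_model = pdist(model_posteriors, metric="correlation")
    d_neural = pdist(neural_responses, metric="correlation")
    return spearmanr(d_model, d_neural).correlation
```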
ZipIt!: Multitask Model Merging without Training (Poster)
We tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model without any additional training. Prior work in model merging permutes one model to the space of the other and then averages them together. While this works for models trained on the same task, we find that it fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features within each model by defining a general "zip" operation. Second, we add support for partially zipping the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for a staggering 20-50% improvement over prior work.
George Stoica · Daniel Bolya · Jakob Bjorner · Pratik Ramesh · Taylor Hearn · Judy Hoffman
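The departure from permutation-based merging is that features may also be merged within a model. A toy sketch of that matching step (greedy correlation pairing; not the paper's exact algorithm):

```python
import numpy as np

def zip_features(A, B):
    """Greedily merge the most correlated feature pairs.

    A, B: (n_examples, n_features) activations of the same layer in
    two models. All features are pooled, so a feature may be merged
    with one from the other model or from its own model.
    """
    feats = np.concatenate([A, B], axis=1)
    corr = np.corrcoef(feats, rowvar=False)
    np.fill_diagonal(corr, -np.inf)
    merged = []
    for _ in range(feats.shape[1] // 2):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        merged.append((feats[:, i] + feats[:, j]) / 2)   # "zip" the pair
        corr[[i, j], :] = -np.inf
        corr[:, [i, j]] = -np.inf
    return np.stack(merged, axis=1)
```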
Degradation and plasticity in convolutional neural networks: An investigation of internal representations (Poster)
The architecture and information processing of convolutional neural networks were originally heavily inspired by the biological visual system. In this work, we make use of these similarities to create an in silico model of neurodegenerative diseases affecting the visual system. We examine layer-wise internal representations and accuracy levels of the model as it is subjected to synaptic decay and retraining, to investigate whether it is possible to capture a biologically realistic profile of visual cognitive decline. To this end, we progressively decay and freeze model synapses in a highly compressed model trained for object recognition. Between each iteration of progressive model degradation, we retrain the remaining unaffected synapses on subsets of the initial training data to simulate continual neuroplasticity. The results of this work show that even with high levels of synaptic decay and limited retraining data, the model is able to regain internal representations similar to those of the unaffected, healthy model. We also demonstrate that throughout a complete cycle of model degradation, the early layers of the model retain high levels of centered kernel alignment similarity, while later layers containing high-level information are much more likely to deviate from the healthy model.
Jasmine Moore · Vibujithan Vigneshwaran · Matthias Wilms · Nils Daniel Forkert
On the Direct Alignment of Latent Spaces (Poster)
With the wide adoption of deep learning and pre-trained models rises the question of how to effectively reuse existing latent spaces for new applications. One important question is how the geometry of the latent space changes between different training runs of the same architecture and between different architectures trained for the same task. Previous works proposed that the latent spaces for similar tasks are approximately isometric. However, in this work we show that methods restricted to this assumption perform worse than simply using a linear transformation to align the latent spaces. We propose directly computing a transformation between the latent codes of different architectures, which is more efficient than previous approaches and flexible with respect to the type of transformation used. Our experiments show that aligning the latent space with a linear transformation performs best while not needing more prior knowledge.
Zorah Lähner · Michael Moeller
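Directly computing the transformation amounts to one least-squares solve. A minimal sketch, with the orthogonal (isometry-assumption) variant shown for contrast; data shapes are made up:

```python
import numpy as np

# hypothetical latent codes of the same inputs under two encoders
Z1 = np.random.randn(2048, 256)   # encoder A
Z2 = np.random.randn(2048, 256)   # encoder B

# direct linear alignment: min_W ||Z1 W - Z2||_F
W, *_ = np.linalg.lstsq(Z1, Z2, rcond=None)
aligned = Z1 @ W

# the isometry assumption instead restricts the map to be orthogonal
# (orthogonal Procrustes solution)
U, _, Vt = np.linalg.svd(Z1.T @ Z2)
R = U @ Vt
```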
Reinforcement Learning with Augmentation Invariant Representation: A Non-contrastive Approach (Poster)
Data augmentation has proven to be an effective measure for improving generalization performance in reinforcement learning (RL). However, recent approaches directly use the augmented data to learn the value estimate or to regularize the estimation, often ignoring the core requirement that the model needs to learn that augmented data indeed represent the same state. In this work, we present RAIR: Reinforcement learning with Augmentation Invariant Representation, which disentangles the representation learning task from the RL task and aims to learn similar latent representations for the original observation and the augmented one. Our approach learns the representation of high-dimensional visual observations in a non-contrastive self-supervised way combined with the standard RL objective. In particular, RAIR gradually pushes the latent representation of an observation closer to the representations produced for the corresponding augmented observations. Thus, our agent is more resilient to changes in the environment. We evaluate RAIR on all sixteen environments from the RL generalization benchmark Procgen. The experimental results indicate that RAIR outperforms PPO and other data augmentation-based approaches under the standard evaluation protocol.
Nasik Muhammad Nafi · William Hsu
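The non-contrastive alignment can be sketched in a BYOL-like form, pulling the latent of the augmented view toward a stop-gradient target of the original; a plausible instantiation under these assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def augmentation_invariance_loss(encoder, predictor, obs, aug_obs):
    """Align latents of original and augmented observations.

    No negative pairs: the encoder of the augmented view is trained
    to predict the (stop-gradient) embedding of the original view.
    """
    with torch.no_grad():
        target = F.normalize(encoder(obs), dim=-1)
    online = F.normalize(predictor(encoder(aug_obs)), dim=-1)
    return (2 - 2 * (online * target).sum(dim=-1)).mean()
```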
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding (Poster)
The landscape of publicly available vision foundation models (VFMs), such as CLIP and SAM, is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pretraining objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe based on multi-task distillation to efficiently merge VFMs into a unified model that assimilates their expertise. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that amalgamates the strengths of SAM and CLIP into a single backbone, making it apt for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces synergistic functionalities, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8% and +5.9% mean IoU improvement on the Pascal-VOC and COCO-Stuff datasets, respectively.
Haoxiang Wang · Pavan Kumar Anasosalu Vasu · Fartash Faghri · Raviteja Vemulapalli · Mehrdad Farajtabar · Sachin Mehta · Mohammad Rastegari · Oncel Tuzel · Hadi Pouransari
Comparing neural models using their perceptual discriminability predictions (Poster)
A variety of methods have been developed to compare models of visual representation. However, internal representations are not uniquely identifiable from perceptual measurements: different representations can generate identical perceptual predictions, and dissimilar model representations (according to existing model comparison methods) do not guarantee dissimilar perceptual predictions. Here, we generalize a previous method ("eigendistortions" - Berardino et al., 2017) to compare models based on their metric tensors. Metric tensors characterize a model's sensitivity to stimulus perturbations, reflecting both the geometric and stochastic properties of the representation, and providing an explicit prediction of perceptual discriminability. Brute-force comparison of model-predicted metric tensors using human perceptual thresholds would require an impossibly large set of measurements, since one needs to perturb a stimulus in all possible orthogonal directions. To circumvent this "perceptual curse of dimensionality", we compute and measure discrimination capabilities for a small set of most-informative perturbations, reducing the measurement cost from thousands of hours (a conservative estimate) to a single trial. We show that this single measurement, made for a variety of different test stimuli, is sufficient to differentiate models, select models that better match human perception, or generate new models that combine the advantages of both. We demonstrate the power of this method by assessing two examples: 1) comparing models for color discrimination; 2) comparing autoencoders trained with different regularizers.
Jingyang Zhou · Chanwoo Chun · Ajay Subramanian · Eero Simoncelli
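For a deterministic response model f, the metric tensor at a stimulus is J^T J, with J the Jacobian of f there; the most- and least-informative perturbations are its extreme eigenvectors. A minimal sketch (function names are illustrative):

```python
import torch
from torch.autograd.functional import jacobian

def eigendistortions(model, stimulus):
    """Most/least discriminable perturbations of a stimulus."""
    x = stimulus.flatten()
    f = lambda v: model(v.reshape(stimulus.shape)).flatten()
    J = jacobian(f, x)                     # (n_outputs, n_inputs)
    M = J.T @ J                            # metric tensor at the stimulus
    eigvals, eigvecs = torch.linalg.eigh(M)
    return eigvecs[:, -1], eigvecs[:, 0]   # top and bottom eigenvectors
```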
Semi-Ensemble: A Simple Approach to Over-parameterized Model Interpolation (Poster)
We develop a unified framework for interpolating two models with various degrees of over-parameterization, having model merging and model ensembling as special cases. Instead of directly interpolating models in their original parameter space, the proposed Semi-Ensemble interpolates the over-parameterized versions of the models in a higher-dimensional joint parameter space. Here, the over-parameterizations recover each endpoint model when projected to some low-dimensional subspace spanned by a fraction of the bases. By carefully constructing the joint parameter space, the interpolated model can achieve a smooth tradeoff between the total number of parameters and the model accuracy, outperforming existing baselines. Intriguingly, we show that Semi-Ensembles can sometimes achieve better performance than vanilla ensembles, even with a slightly smaller number of parameters.
Jiwoon Lee · Jaeho Lee
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement (Poster)
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated web-scale image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification.
Mohammadreza (Reza) Salehi · Mehrdad Farajtabar · Maxwell Horton · Fartash Faghri · Hadi Pouransari · Raviteja Vemulapalli · Oncel Tuzel · Ali Farhadi · Mohammad Rastegari · Sachin Mehta
Understanding Learning Dynamics of Neural Representations via Feature Visualization at Scale (Poster)
How does feature learning happen during the training of a neural network? We developed an accelerated pipeline to synthesize maximally activating images ("prototypes") for hidden units in a parallel fashion. Through this, we were able to perform feature visualization at scale, and to track the emergence and development of visual features across the training of neural networks. Using this technique, we studied the 'developmental' process of features in a convolutional neural network trained from scratch using SimCLR, with or without color-jittering augmentation. After creating over one million prototypes with our method, tracking and comparing these visual signatures showed that the color-jittering augmentation led to constantly diversifying high-level features during training, while no color-jittering led to more diverse low-level features but less development of high-level features. These results illustrate how feature visualization can be used to understand training dynamics under different training objectives and data distributions.
Chandana Kuntala · Deepak Sharma · Carlos Ponce · Binxu Wang
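The prototypes referred to above are the classic feature-visualization objects: inputs optimized to maximally activate a unit. A bare-bones sketch (no parallelism or regularizers; `unit_activation` is a hypothetical callable):

```python
import torch

def synthesize_prototype(model, unit_activation, steps=256, lr=0.05):
    """Gradient-ascent feature visualization for one hidden unit."""
    x = torch.randn(1, 3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -unit_activation(model, x)   # ascend the unit's activation
        loss.backward()
        opt.step()
    return x.detach()
```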
Unsupervised learning on spontaneous retinal activity leads to efficient neural representation geometry (Poster)
Prior to the onset of vision, neurons in the developing mammalian retina spontaneously fire in correlated activity patterns known as retinal waves. Experimental evidence suggests that retinal waves strongly influence the emergence of sensory representations before visual experience. We aim to model this early stage of functional development by using movies of neurally active developing retinas as pre-training data for neural networks. Specifically, we pre-train a ResNet-18 with an unsupervised contrastive learning objective (SimCLR) on both simulated and experimentally-obtained movies of retinal waves, then evaluate its performance on image classification tasks. We find that pre-training on retinal waves significantly improves performance on tasks that test object invariance to spatial translation, while slightly improving performance on more complex tasks like image classification. Notably, these performance boosts are realized on held-out natural images even though the pre-training procedure does not include any natural image data. We then propose a geometrical explanation for the increase in network performance, namely that the spatiotemporal characteristics of retinal waves facilitate the formation of separable feature representations. In particular, we demonstrate that networks pre-trained on retinal waves are more effective at separating image manifolds than randomly initialized networks, especially for manifolds defined by sets of spatial translations. These findings indicate that the broad spatiotemporal properties of retinal waves prepare networks for higher order feature extraction.
Andrew Ligeralde · Yilun Kuang · Thomas Yerxa · Miah Pitcher · Marla Feller · SueYeon Chung
Estimating shape distances on neural representations with limited samples (Poster)
Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance—a measure of representational dissimilarity proposed by Williams et al. (2021). These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a novel method-of-moments estimator with a tunable bias-variance tradeoff parameterized by an upper bound on bias. We show that this estimator achieves superior performance to standard estimators in simulation and on neural data, particularly in high-dimensional settings. Our theoretical work and estimator thus respectively define and dramatically expand the scope of neural data for which geometric similarity can be accurately measured.
Dean Pospisil · Brett Larsen · Sarah Harvey · Alex Williams
Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks (Poster)
Recurrent neural networks (RNNs) trained on compositional tasks can exhibit functional modularity, in which neurons can be clustered by activity similarity and specialization on a shared computational subtask. Unlike brains, these RNNs do not exhibit anatomical modularity, in which functional clustering is correlated with strong recurrent coupling and spatial localization of functional clusters. Contrasting with functional modularity, which can be ephemerally dependent on the input, anatomically modular networks form a robust substrate for solving the same subtasks in the future. To examine whether it is possible to grow brain-like anatomical modularity, we apply a recent machine learning method, brain-inspired modular training (BIMT), to a network being trained to solve a set of compositional tasks. We find that functional and anatomical clustering emerge together, such that functionally similar neurons also become spatially localized and interconnected. Moreover, compared to standard $L_1$ regularization or no regularization settings, the model exhibits superior performance by optimally balancing task performance and network sparsity. In addition to achieving brain-like organization in RNNs, our findings also suggest that BIMT holds promise for applications in neuromorphic computing and enhancing the interpretability of neural network architectures.
Ziming Liu · Mikail Khona · Ila Fiete · Max Tegmark
Towards Measuring Representational Similarity of Large Language Models (Poster)
Understanding the similarity of the numerous large language models released has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions.
Max Klabunde · Mehdi Ben Amor · Michael Granitzer · Florian Lemmerich
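A common choice of representational similarity measure in this setting is linear CKA, which can be written in a few lines; a minimal sketch for activations of two models on the same inputs:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two representations.

    X: (n_examples, d1), Y: (n_examples, d2); invariant to rotations
    and isotropic scaling of either representation.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") *
                   np.linalg.norm(Y.T @ Y, "fro"))
```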
Bio-inspired parameter reuse: Exploiting inter-frame representation similarity with recurrence for accelerating temporal visual processing (Poster)
Feedforward neural networks are the dominant approach in current computer vision research. They typically do not incorporate recurrence, which is a prominent feature of biological vision brain circuitry. Inspired by biological findings, we introduce RecSlowFast, a recurrent slow-fast framework aimed at showing how recurrence can be useful for temporal visual processing. We perform a variable number of recurrent steps of certain layers in a network receiving input video frames, where each recurrent step is equivalent to a feedforward layer with weight reuse. By harnessing the hidden states extracted from the previous input frame, we reduce the computation cost by executing fewer recurrent steps on temporally correlated consecutive frames, while keeping good task accuracy. The early termination of the recurrence can be dynamically determined through newly introduced criteria based on the distance between hidden states, without using any auxiliary scheduler network. RecSlowFast reuses a single set of parameters, unlike previous work which requires one computationally heavy network and one light network, to achieve the speed versus accuracy trade-off. Using a new Temporal Pathfinder dataset proposed in this work, we evaluate RecSlowFast on a task to continuously detect the longest evolving contour in a video. The slow-fast inference mechanism speeds up the average frames per second by 279% on this dataset with comparable task accuracy using a desktop GPU. We further demonstrate a similar trend on CamVid, a video semantic segmentation dataset.
Zuowen Wang · Longbiao Cheng · Joachim Ott · Pehuen Moure · Shih-Chii Liu
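The early-termination idea can be sketched directly: iterate a weight-shared cell from the previous frame's hidden state and stop once successive states stop changing. A plausible minimal form of the criterion, not the paper's exact one:

```python
import torch

def recurrent_steps(cell, x, h_prev, max_steps=8, tol=1e-3):
    """Weight-shared recurrence with hidden-state-distance stopping."""
    h, used = h_prev, 0
    for _ in range(max_steps):
        h_next = cell(x, h)
        used += 1
        converged = torch.norm(h_next - h) / (torch.norm(h) + 1e-8) < tol
        h = h_next
        if converged:   # temporally correlated frames exit early
            break
    return h, used
```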
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification (Poster)
Multimodal Re-Identification (ReID) is a popular retrieval task that aims to re-identify objects across diverse data streams, prompting many researchers to integrate multiple modalities into a unified representation. While such fusion promises a holistic view, our investigations shed light on potential pitfalls. We uncover that prevailing late-fusion techniques often produce suboptimal latent representations when compared to methods that train modalities in isolation. We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion, which others have termed modality laziness. We present a nuanced point of view: this relaxation can lead to certain modalities failing to fully harness available task-relevant information, and yet it offers a protective veil to noisy modalities, preventing them from overfitting to task-irrelevant data. Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembling of unimodal backbones, when paired with best-known training techniques, exceed the current state-of-the-art performance across several multimodal ReID benchmarks. By unveiling the double-edged sword of "modality laziness", we motivate future research in balancing local modality strengths with global representations.
Jennifer Crawford · Haoli Yin · Luke McDermott · Daniel Cummings
Testing Assumptions Underlying a Unified Theory for the Origin of Grid Cells (Poster)
Representing and reasoning about physical space is fundamental to animal survival, and the mammalian lineage expresses a wealth of specialized neural representations that encode space. Grid cells, whose discovery earned a Nobel prize, are a striking example: a grid cell is a neuron that fires if and only if the animal is spatially located at the vertices of a regular triangular lattice that tiles all explored two-dimensional environments. Significant theoretical work has gone into understanding why mammals have learned these particular representations, and recent work has proposed a "unified theory for the computational and mechanistic origin of grid cells," claiming to answer why the mammalian lineage has learned grid cells. However, the Unified Theory makes a series of highly specific assumptions about the target readouts of grid cells - putatively place cells. In this work, we explicitly identify what these mathematical assumptions are, then test two of the critical assumptions using biological place cell data. At both the population and single-cell levels, we find evidence suggesting that neither of the assumptions is likely true in biological neural representations. These results call the Unified Theory into question, suggesting that biological grid cells likely have a different origin than those obtained in trained artificial neural networks.
Rylan Schaeffer · Mikail Khona · Adrian Bertagnoli · Sanmi Koyejo · Ila Fiete
Understanding Mode Connectivity via Parameter Space Symmetry (Poster)
It has been observed that the global minima of neural networks are connected by curves along which train and test loss are almost constant. This phenomenon, often referred to as mode connectivity, has inspired various applications such as model ensembling and fine-tuning. However, despite empirical evidence, a theoretical explanation is still lacking. We explore the connectedness of minima through a new approach, parameter space symmetry. By relating the topology of symmetry groups to the topology of minima, we provide the number of connected components of the minima of full-rank linear networks. In particular, we show that skip connections reduce the number of connected components. We then prove mode connectivity up to permutation for linear networks. We also provide explicit expressions for connecting curves in the minima induced by symmetry.
Bo Zhao · Nima Dehmamy · Robin Walters · Rose Yu
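The empirical observation motivating this theory is easy to reproduce: evaluate the loss along the straight line between two trained solutions. A sketch assuming a plain float-parameter model (e.g., an MLP):

```python
import copy
import torch

def linear_interpolation_losses(model_a, model_b, loss_fn, x, y, n=11):
    """Loss along theta(t) = (1 - t) * theta_a + t * theta_b."""
    probe = copy.deepcopy(model_a)
    sa, sb = model_a.state_dict(), model_b.state_dict()
    losses = []
    for t in torch.linspace(0, 1, n):
        probe.load_state_dict({k: (1 - t) * sa[k] + t * sb[k] for k in sa})
        with torch.no_grad():
            losses.append(loss_fn(probe(x), y).item())
    return losses   # a flat curve indicates (linearly) connected minima
```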
Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words (Poster)
Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors.
Yujia Bao · Srinivasan Sivanandan · Theofanis Karaletsos
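The token construction is the heart of the idea: each channel contributes its own patch tokens plus a learnable channel embedding, so an image yields C x 16 x 16 "words". A minimal sketch (shapes are illustrative):

```python
import torch

def channelvit_tokens(image, patch, proj, channel_emb):
    """Per-channel patch tokens with additive channel embeddings.

    image: (C, H, W); proj: Linear(patch*patch -> d);
    channel_emb: (C, d) learnable embeddings, one per channel.
    """
    C, H, W = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    patches = patches.reshape(C, -1, patch * patch)   # (C, n_patches, p*p)
    tokens = proj(patches) + channel_emb[:, None, :]  # tag each channel
    return tokens.reshape(-1, tokens.shape[-1])       # (C * n_patches, d)

proj = torch.nn.Linear(16 * 16, 64)
emb = torch.nn.Parameter(torch.randn(5, 64))
print(channelvit_tokens(torch.randn(5, 32, 32), 16, proj, emb).shape)
```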
DisCoV: Disentangling Time Series Representations via Contrastive based $l$-Variational Inference (Poster)
Learning disentangled representations is crucial for time series, offering benefits like feature derivation and improved interpretability, thereby enhancing task performance. We focus on disentangled representation learning for home appliance electricity usage, enabling users to understand and optimize their consumption for a reduced carbon footprint. Our approach frames the problem as disentangling each attribute's role in total consumption (e.g., dishwashers, fridges). Unlike existing methods assuming attribute independence, we acknowledge real-world time series attribute correlations, like the operation of dishwashers and washing machines during the winter season. To tackle this, we employ weakly supervised contrastive disentanglement, facilitating representation generalization across diverse correlated scenarios and new households. Our method utilizes innovative $l$-variational inference layers with self-attention, effectively addressing temporal dependencies across bottom-up and top-down networks. We find that DisCoV (Disentangling via Contrastive $l$-Variational) can enhance the task of reconstructing electricity consumption for individual appliances. We introduce TDS (Time Disentangling Score) to gauge disentanglement quality. TDS reliably reflects disentanglement performance, making it a valuable metric for evaluating time series representations. Code available at https://anonymous.4open.science/r/DisCo.
Khalid OUBLAL · Said Ladjal · David Benhaiem · Emmanuel LE BORGNE · François Roueff
-
|
Mixture of Multimodal Interaction Experts
(
Poster
)
link »
Multimodal machine learning, which studies the information and interactions across various input modalities, has made significant advancements in understanding the relationship between images and descriptive text. Yet this is just a portion of the potential multimodal interactions in the real world, such as sarcasm conveyed through conflicting utterances and gestures. Notably, current methods for capturing this shared information often don't extend well to these more nuanced interactions. In fact, current models show particular weaknesses on disagreement and synergistic interactions, sometimes performing as low as 50\% in binary classification. In this paper, we address this problem via a new approach called mixture of multimodal interaction experts. This method automatically classifies datapoints from an unlabeled multimodal dataset by their interaction type, then employs specialized models for each specific interaction. Based on our experiments, this approach improves performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction. As a result, interaction quantification not only provides new insights for dataset analysis but also yields a simple approach to obtain state-of-the-art performance. |
Haofei Yu · Paul Pu Liang · Russ Salakhutdinov · Louis-Philippe Morency 🔗 |
-
|
Inverted-Attention Transformers can Learn Object Representations: Insights from Slot Attention
(
Poster
)
link »
Visual reasoning is supported by a causal understanding of the physical world, and theories of human cognition suppose that a necessary step to causal understanding is the discovery and representation of high-level entities like objects. Slot Attention is a popular method aimed at object-centric learning, and its popularity has resulted in dozens of variants and extensions. To help understand the core assumptions that lead to successful object-centric learning, we take a step back and identify the minimal set of changes to a standard Transformer architecture to obtain the same performance as the specialized Slot Attention models. We systematically evaluate the performance and scaling behaviour of several ``intermediate'' architectures on seven image and video datasets from prior work. Our analysis reveals that by simply inverting the attention mechanism of Transformers, we obtain performance competitive with state-of-the-art Slot Attention in several domains. |
Yi-Fu Wu · Klaus Greff · Gamaleldin Elsayed · Michael Mozer · Thomas Kipf · Sjoerd van Steenkiste 🔗 |
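A rough illustration of the inversion discussed above: standard attention normalizes over the keys, whereas the Slot-Attention-style variant normalizes over the queries (slots), so slots compete for input tokens. Tensor names and the renormalization step are assumptions for this sketch, not the authors' code.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    # q: (B, Nq, d); k, v: (B, Nk, d). Softmax over keys (last dim).
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits, dim=-1) @ v

def inverted_attention(q, k, v, eps=1e-8):
    # Same logits, but softmax over the queries: each input token distributes
    # its attention across slots, making the slots compete for tokens.
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = F.softmax(logits, dim=-2)
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)  # weighted mean per slot
    return attn @ v
```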
-
|
Increasing Brain-LLM Alignment via Information-Theoretic Compression
(
Poster
)
link »
Recent work has discovered similarities between learned representations in large language models (LLMs) and human brain activity during language processing. However, it remains unclear what information LLM and brain representations share. In this work, inspired by a notion that brain data may include information not captured by LLMs, we apply an information bottleneck method to generate compressed representations of fMRI data. For certain brain regions in the frontal cortex, we find that compressing brain representations by a small amount increases their similarity to both BERT and GPT2 embeddings. Thus, our method not only improves LLM-brain alignment scores but also suggests important characteristics about the amount of information captured by each representation scheme. |
Mycal Tucker · Greta Tuckute 🔗 |
-
|
NoPose-NeuS: Jointly Optimizing Camera Poses with Neural Implicit Surfaces for Multi-view Reconstruction
(
Poster
)
link »
Learning neural implicit surfaces from volume rendering has become popular for multi-view reconstruction. Neural surface reconstruction approaches can recover complex 3D geometry that is difficult for classical Multi-view Stereo (MVS) approaches, such as non-Lambertian surfaces and thin structures. However, one key assumption for these methods is knowing accurate camera parameters for the input multi-view images, which are not always available. In this paper, we present NoPose-NeuS, a neural implicit surface reconstruction method that extends NeuS to jointly optimize camera poses with the geometry and color networks. We encode the camera poses as a multi-layer perceptron (MLP) and introduce two additional losses, multi-view feature consistency and rendered depth losses, to constrain the learned geometry for better estimated camera poses and scene surfaces. Extensive experiments on the DTU dataset show that the proposed method can estimate relatively accurate camera poses, while maintaining a high surface reconstruction quality with 0.89 mean Chamfer distance. |
Mohamed Sabae · Hoda Baraka · Mayada Hadhoud 🔗 |
-
|
Functional Modularity in Mind and Machine
(
Poster
)
link »
Modularity is a well-established and foundational organisational principle of the brain. Neural modules are composed of neurons that are selective to particular sensory inputs or situations and tend to be organised in close proximity. Yet establishing which neurons are coupled to implement a neural module is difficult, and consequently so is establishing what exactly a neural module is selective for. In both cases this is due to the difference between functional and architectural modularity. Architectural modularity results from the explicit connections between neurons in a network: neurons which are connected form a module, and the physical module can be probed to determine what it is selective for. Functional modularity, however, is only detectable in the behaviour of a subset of neurons in the network, and has no explicit pressure forcing its emergence beyond the learning algorithm interacting with the statistics of sensory experience. Thus, while we understand how broad regions of the brain are connected, more nuance is still required to obtain a better understanding of the degree of modularity. This problem is not limited to biological neural networks; it applies to artificial ones as well. ReLU networks, for example, can switch off regions of the hidden layer depending on the input data being presented. However, what each hidden neuron is selective for, which hidden neurons are functionally coupled, and the meso-scale behaviour of the hidden layer are not well understood. In this work, we begin to understand the emergence and behaviour of functional neural modules in both ReLU and biological neural networks. We achieve this by drawing an equivalence between Gated Deep Linear Networks (GDLNs) and the respective networks, mapping functional neural modules onto architectural modules of the GDLN. Through the lens of the GDLN we obtain a number of insights into how information is distributed in artificial and biological brains to support context-sensitive controlled semantic cognition. |
Devon Jarvis · Richard Klein · Benjamin Rosman · Andrew Saxe 🔗 |
-
|
Are Spiking Neural Networks more expressive than Artificial Neural Networks?
(
Poster
)
link »
This article studies the expressive power of spiking neural networks with firing-time-based information encoding, highlighting their potential for future energy-efficient AI applications when deployed on neuromorphic hardware. The computational power of a network of spiking neurons has already been studied via their capability of approximating any continuous function. By using the Spike Response Model as a mathematical model of a spiking neuron and assuming a linear response function, we delve deeper into this analysis and prove that spiking neural networks generate continuous piecewise linear mappings. We also show that they can emulate any multi-layer (ReLU) neural network with similar complexity. Furthermore, we prove that the maximum number of linear regions generated by a spiking neuron scales exponentially with respect to the input dimension, a characteristic that distinguishes it significantly from an artificial (ReLU) neuron. Our results further extend the understanding of the approximation properties of spiking neural networks and open up new avenues where spiking neural networks can be deployed instead of artificial neural networks without any performance loss. |
Manjot Singh · Adalbert Fono · Gitta Kutyniok 🔗 |
-
|
Disentangling Linear Mode Connectivity
(
Poster
)
link »
Linear mode connectivity (LMC) (or the lack thereof) is one of the intriguing characteristics of neural network loss landscapes. While empirically well established, it still lacks a proper theoretical understanding. Worse, although empirical data points abound, a systematic study of when networks exhibit LMC is largely missing from the literature. In this work we aim to close this gap. We explore how LMC is affected by three factors: (1) architecture (sparsity, weight-sharing), (2) training strategy (optimization setup), and (3) the underlying dataset. We place particular emphasis on minimal but non-trivial settings, removing as much unnecessary complexity as possible. We believe that our insights can guide future theoretical work on uncovering the inner workings of LMC. |
Gül Sena Altıntaş · Gregor Bachmann · Lorenzo Noci · Thomas Hofmann 🔗 |
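A minimal sketch of the standard probe behind studies like this one: interpolate linearly between two trained checkpoints and measure the loss barrier along the segment. The helper names and the user-supplied eval_loss function are assumptions for illustration.

```python
import torch

def interpolate_state_dicts(sd_a, sd_b, alpha):
    # Elementwise linear interpolation between two same-architecture state dicts.
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

@torch.no_grad()
def loss_barrier(model, sd_a, sd_b, eval_loss, num_points=11):
    """Max loss along the segment minus the mean endpoint loss."""
    losses = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        model.load_state_dict(interpolate_state_dicts(sd_a, sd_b, alpha))
        losses.append(eval_loss(model))  # user-supplied loss on a fixed batch
    return max(losses) - 0.5 * (losses[0] + losses[-1])  # ~0 suggests LMC
```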
-
|
Model Merging by Gradient Matching
(
Poster
)
link »
Models trained on different datasets can be merged by weighted averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. |
Nico Daheim · Thomas Möllenhoff · Edoardo Maria Ponti · Iryna Gurevych · Mohammad Emtiyaz Khan 🔗 |
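For context, a minimal sketch of the weighted parameter averaging that the paper analyzes, with fixed scalar weights; the authors' uncertainty-based reweighting would replace these scalars and is not reproduced here.

```python
def weighted_average(state_dicts, weights):
    """Merge same-architecture models by weighted averaging of parameters.

    Assumes floating-point parameters; integer buffers would need special care.
    """
    total = sum(weights)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts)) / total
    return merged

# e.g. merged = weighted_average([model_a.state_dict(), model_b.state_dict()], [0.5, 0.5])
```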
-
|
Object-Centric Semantic Vector Quantization
(
Poster
)
link »
Neural discrete representations are crucial components of modern neural networks. However, their main limitation is that the primary strategies such as VQ-VAE can only provide representations at the patch level. Therefore, one of the main goals of representation learning, acquiring conceptual, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that a simple approach quantizing at the object level poses a significant challenge and propose constructing scene representations hierarchically, from low-level discrete concept schemas to object representations. Additionally, we suggest a novel method for training a prior over these semantic representations, enabling the ability to generate images following the underlying data distribution, which is lacking in most object-centric models. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene. |
Yi-Fu Wu · Minseung Lee · Sungjin Ahn 🔗 |
-
|
Evaluation of Representational Similarity Scores Across Human Visual Cortex
(
Poster
)
link »
We investigate several popular methods for quantifying the similarity between neural representations applied to a large-scale fMRI dataset of human ventral visual cortex. We focus on representational geometry as a framework for comparing various functionally-defined high-level regions of interest (ROIs) in the ventral stream. We benchmark Representational Similarity Analysis, Centered Kernel Alignment, and Generalized Shape Metrics. We explore how well the geometry implied by pairwise representational dissimilarity scores produced by each method matches the 2D anatomical geometry of visual cortex. Our results suggest that while these methods yield similar outcomes, Shape Metrics provide distances between representations whose relation to the anatomical geometry is most invariant across subjects. Our work establishes a criterion with which to compare methods for quantifying representational similarity with implications for studying the anatomical organization of high-level ventral visual cortex. |
Francisco Acosta · Colin Conwell · David Klindt · Nina Miolane 🔗 |
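A minimal sketch of linear Centered Kernel Alignment (CKA), one of the measures benchmarked above; X and Y are assumed to be stimulus-by-feature response matrices with matched rows.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two response matrices (rows = matched stimuli)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```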
-
|
Supervising Variational Autoencoder Latent Representations with Language
(
Poster
)
link »
Supervising latent representations of data is of great interest for modern multi-modal generative machine learning. In this work, we propose two new methods to use text to condition the latent representations of a VAE, and evaluate them on a novel conditional image-generation benchmark task. We find that the applied methods can be used to generate highly accurate reconstructed images through language querying with minimal compute resources. Our methods are quantitatively successful at conforming to textually supervised attributes of an image while keeping unsupervised attributes constant. More broadly, we present critical observations on disentanglement between supervised and unsupervised properties of images and identify common barriers to effective disentanglement. |
Thomas Lu · Aboli Marathe · Ada Martin 🔗 |
-
|
Implicit Representations for Image Segmentation
(
Poster
)
link »
Image segmentation has greatly advanced over the past ten years. Yet, even the most recent techniques face difficulties producing good results in challenging situations, e.g., if training data are scarce, out-of-distribution examples need to be segmented, or objects are occluded. In such situations, the inclusion of (geometric) constraints can improve the segmentation quality significantly. In this paper, we study the constraint that the segmented region be convex. Unlike prior work that encourages such a property with computationally expensive penalties on segmentation masks represented explicitly on a grid of pixels, our work is the first to consider an implicit representation. Specifically, we represent the segmentation as a parameterized function that maps spatial coordinates to the likelihood of a pixel belonging to the fore- or background. This enables us to provably ensure the convexity of the segmented regions with the help of input convex neural networks. Numerical experiments demonstrate how to encourage explicit and implicit representations to match in order to benefit from the convexity constraints in several challenging segmentation scenarios. |
Jan Philipp Schneider · Mishal Fatima · Jovita Lukasik · Andreas Kolb · Margret Keuper · Michael Moeller 🔗 |
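A minimal sketch of the input convex neural network (ICNN) construction underlying the convexity guarantee: weights on the hidden-state path are kept non-negative and the activations are convex and non-decreasing, so the output is convex in the input coordinates. Layer sizes and names are illustrative; thresholding the convex output at zero yields a convex foreground region.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """f(x) convex in x: non-negative z-path weights, convex activations."""

    def __init__(self, in_dim=2, hidden=64, depth=3):
        super().__init__()
        self.x_layers = nn.ModuleList(
            [nn.Linear(in_dim, hidden) for _ in range(depth)] + [nn.Linear(in_dim, 1)]
        )
        self.z_layers = nn.ModuleList(
            [nn.Linear(hidden, hidden, bias=False) for _ in range(depth - 1)]
            + [nn.Linear(hidden, 1, bias=False)]
        )

    def forward(self, x):
        z = F.relu(self.x_layers[0](x))
        for i, (x_lin, z_lin) in enumerate(zip(self.x_layers[1:], self.z_layers)):
            w = z_lin.weight.clamp(min=0.0)   # non-negativity preserves convexity
            z = F.linear(z, w) + x_lin(x)     # non-negative sum of convex + affine
            if i < len(self.z_layers) - 1:
                z = F.relu(z)                 # convex, non-decreasing activation
        return z  # the sublevel set {x : f(x) <= 0} is a convex region
```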
-
|
Ecological data and objectives align deep neural network representations with humans
(
Poster
)
link »
The many successes of deep neural networks (DNNs) over the past decade have largely been driven by computational scale rather than insights from biological intelligence. While DNNs have nevertheless been surprisingly adept at explaining behavioral and neural recordings from humans, there is a growing number of reports indicating that DNNs are becoming progressively worse models of human vision as they improve on standard computer vision benchmarks. Here, we provide evidence that one path towards improving the alignment of DNNs with human vision is to train them with data and objective functions that more closely resemble those relied on by brains. We find that DNNs trained to capture the causal structure of large spatiotemporal object datasets learn generalizable object representations that exhibit smooth equivariance to 3-Dimensional (out-of-plane) variations in object pose and are predictive of human decisions and reaction times on popular psychophysics stimuli. Our work identifies novel data diets and objective functions that better align DNN vision with humans and can be easily scaled to generate the next generation of DNNs that behave as humans do. |
Akash Nagaraj · Alekh Karkada Ashok · Drew Linsley · Francis Lewis · Peisen Zhou · Thomas Serre 🔗 |
-
|
Visual Expertise Explains Image Inversion Effects
(
Poster
)
link »
We present an anatomically-inspired neurocomputational model, including a foveated retina and the log-polar mapping from the visual field to the primary visual cortex, that recreates image inversion effects long seen in psychophysical studies. We show that visual expertise, the ability to discriminate between subordinate-level categories, changes the performance of the model on inverted images. We first explore face discrimination, which, in humans, relies on configural information. The log-polar transform disrupts configural information in an inverted image and leaves featural information relatively unaffected. We suggest this is responsible for the degradation of performance with inverted faces. We then recreate the effect with other subordinate-level category discriminators and show that the inversion effect arises as a result of visual expertise, where configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which in turn are more disrupted than non-mono-oriented objects when objects are familiar only at the basic level. It simultaneously shows that expert-level discriminators of other subordinate-level categories respond to inversion similarly to face experts. |
Martha Gahl · Shubham Kulkarni · Nikhil Pathak · Alex Russell · Gary Cottrell 🔗 |
-
|
How Good is a Single Basin?
(
Poster
)
link »
The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles in a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins. |
Kai Lion · Gregor Bachmann · Lorenzo Noci · Thomas Hofmann 🔗 |
-
|
Subjective Randomness and In-Context Learning
(
Poster
)
link »
Large language models (LLMs) exhibit intricate capabilities, often achieving high performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often unclear, with different prompts eliciting different capabilities, especially when used with in-context learning (ICL). We propose a "Cognitive Interpretability" framework that enables us to analyze ICL dynamics to understand latent concepts underlying LLMs' behavioral patterns. This provides a more nuanced understanding than posthoc evaluation benchmarks, but does not require observing model internals as a mechanistic interpretation would require. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of ICL by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking ICL dynamics where model outputs transition sharply from pseudo-random behaviors to deterministic repetition. |
Eric Bigelow · Ekdeep S Lubana · Robert Dick · Hidenori Tanaka · Tomer Ullman 🔗 |
-
|
Deep Multimodal Emotion Recognition using Modality Aware Attention Network for Unifying Representations in Neural Models
(
Poster
)
link »
This paper introduces a multi-modal system that enhances emotion recognition by integrating representations from physiological signals. To accomplish this goal, we introduce a modality-aware attention network to extract emotion-specific features by influencing and aligning the representation spaces of various modalities into a unified entity. Through a series of experiments and visualizations conducted on the AMIGO dataset, we demonstrate the efficacy of our proposed methodology for emotion classification, highlighting its capability to provide comprehensive representations of physiological signals. |
Sungpil Woo · MUHAMMAD ZUBAIR · Sunhwan Lim · Daeyoung Kim 🔗 |
-
|
On Feature Learning of Recursive Feature Machines and Automatic Relevance Determination
(
Poster
)
link »
Feature learning is a crucial element for the performance of machine learning models. Recently, the exploration of feature learning in the context of kernel methods has led to the introduction of Recursive Feature Machines (RFMs). In this work, we connect diagonal RFMs to Automatic Relevance Determination (ARD) from the Gaussian process literature. We demonstrate that diagonal RFMs, similar to ARD, serve as a weighted covariate selection technique. However, they are trained using different paradigms: RFMs use recursive iterations of the so-called Average Gradient Outer Product, while ARD employs maximum likelihood estimation. Our experiments show that while the learned features in both models correlate highly across various tabular datasets, this correlation is lower for other datasets. Furthermore, we demonstrate that the RFM effectively captures correlation between covariates, and we present instances where the RFM outperforms both ARD and diagonal RFM. |
Daniel Gedon · Amirhesam Abedsoltan · Thomas Schön · Misha Belkin 🔗 |
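A minimal sketch of the Average Gradient Outer Product that drives RFM training; the diagonal variant connected to ARD would keep only the diagonal of M. The predict callable is an assumption standing in for any differentiable scalar predictor.

```python
import torch

def average_gradient_outer_product(predict, X):
    """AGOP: average over samples of grad f(x) grad f(x)^T.

    predict: differentiable callable mapping a (d,) tensor to a scalar.
    X: (n, d) tensor of inputs.
    """
    n, d = X.shape
    M = torch.zeros(d, d)
    for x in X:
        x = x.clone().requires_grad_(True)
        (g,) = torch.autograd.grad(predict(x), x)
        M += torch.outer(g, g)
    return M / n  # RFM re-weights input coordinates by this matrix and iterates
```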
-
|
Revisiting Supervision for Continual Representation Learning
(
Poster
)
link »
In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, recent studies have highlighted the strengths of self-supervised continual representation learning. The improved transferability of representations built with self-supervised methods is often associated with the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We contend that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models, when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning. |
Daniel Marczak · Sebastian Cygert · Tomasz Trzcinski · Bartłomiej Twardowski 🔗 |
-
|
Role Taxonomy of Units in Deep Neural Networks
(
Poster
)
link »
Identifying the roles of network units in deep neural networks (DNNs) is critical in many respects, including understanding the mechanisms of DNNs and building basic connections between deep learning and neuroscience. However, it remains unclear which roles units can play in DNNs with different generalization abilities. To this end, we give a role taxonomy of units in DNNs, where units are categorized into four types in terms of their functional preferences on the training set and the testing set separately. We show that the ratios of the four categories are highly associated with the generalization ability of DNNs from two distinct perspectives, based on which we give signatures of well-generalizing DNNs. |
Yang Zhao · Hao Zhang · Xiuyuan Hu 🔗 |
-
|
A sparse null code emerges in deep neural networks
(
Poster
)
link »
The internal representations of deep vision models are often assumed to encode specific image features, such as contours, textures, and object parts. However, it is possible for deep networks to learn highly abstract representations that may not be linked to any specific image feature. Here we present evidence for one such abstract representation in transformers and modern convolutional architectures that appears to serve as a null code, indicating image regions that are non-diagnostic for the object class. These null codes are both statistically and qualitatively distinct from the more commonly reported feature-related codes of vision models. Specifically, these null codes have several distinct characteristics: they are highly sparse, they have a single unique activation pattern for each network, they emerge abruptly at intermediate network depths, and they are activated in a feature-independent manner by weakly informative image regions, such as backgrounds. Together, these findings reveal a new class of highly abstract representations in deep vision models: sparse null codes that seem to indicate the absence of relevant features. |
Brian Robinson · Nathan Drenkow · Colin Conwell · Michael Bonner 🔗 |
-
|
How does fine-tuning affect your model? Mechanistic analysis on procedural tasks
(
Poster
)
link »
Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival'' of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task. |
Samyak Jain · Robert Kirk · Ekdeep S Lubana · Robert Dick · Hidenori Tanaka · Tim Rocktäschel · Edward Grefenstette · David Krueger 🔗 |
-
|
Variational Classification
(
Poster
)
link »
We present variational classification (VC), a latent variable generalisation of neural network softmax classification under cross-entropy loss. Our approach provides a novel probabilistic interpretation of the highly familiar softmax classification model: VC relates to standard softmax classification as variational autoencoders relate to deterministic autoencoders. We derive a training objective based on the evidence lower bound (ELBO) that is non-trivial to optimize, and an adversarial approach to maximise it. We reveal an inherent inconsistency within softmax classification that VC addresses, while also allowing flexible choices of distributions in the latent space in place of assumptions implicit in standard softmax classifiers. Empirical evaluation demonstrates that VC maintains accuracy while improving properties such as calibration and adversarial robustness, particularly under distribution shift and in low data settings. By explicitly considering representations learned by supervised methods, we offer the prospect of the principled merging of supervised learning with other representation learning methods, e.g.\ contrastive learning, using a common encoder architecture. |
Shehzaad Dhuliawala · Mrinmaya Sachan · Carl Allen 🔗 |
-
|
Hybrid Early Fusion for Multi-Modal Biomedical Representations
(
Poster
)
link »
Technological advances in medical data collection such as high-resolution histopathology and high-throughput genomic sequencing have contributed to the rising requirement for multi-modal biomedical modelling, specifically for image, tabular, and graph data. Most multi-modal deep learning approaches use modality-specific architectures that are trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet) – a flexible multi-modal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multi-modal survival analysis on Whole Slide Images and Multi-omic data on four cancer cohorts of The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance, substantially improving over both uni-modal and recent multi-modal baselines, whilst being robust in scenarios with missing modalities. |
Konstantin Hemker · Nikola Simidjievski · Mateja Jamnik 🔗 |
-
|
Randomly Weighted Neuromodulation in Neural Networks Facilitates Learning of Manifolds Common Across Tasks
(
Poster
)
link »
Geometric Sensitive Hashing functions, a family of Locality-Sensitive Hashing functions, are neural network models that learn class-specific manifold geometry in supervised learning. However, given a set of supervised learning tasks, understanding the manifold geometries that can represent each task, and the kinds of relationships between tasks that these geometries imply, has received little attention. We explore a formalization of this question by considering a generative process where each task is associated with a high-dimensional manifold, which can be realized in brain-like models with neuromodulatory systems. Following this formulation, we define Task-specific Geometric Sensitive Hashing and show that a randomly weighted neural network with a neuromodulation system can realize this function. |
Jinyung Hong · Theodore P. Pavlic 🔗 |
-
|
Linear Mode Connectivity in Sparse Neural Networks
(
Poster
)
link »
With the rise of interest in sparse neural networks, we study how neural network pruning with synthetic data leads to sparse networks with unique training properties. We find that distilled data, a synthetic summarization of the real data, paired with Iterative Magnitude Pruning (IMP) unveils a new class of sparse networks that are more stable to SGD noise on the real data than either the dense model or subnetworks found with real data in IMP. That is, synthetically chosen subnetworks often train to the same minima, or exhibit linear mode connectivity. We study this through linear interpolation, loss landscape visualizations, and measuring the diagonal of the Hessian. While dataset distillation as a field is still young, we find that these properties lead to synthetic subnetworks matching the performance of traditional IMP with up to 150x fewer training points in settings where distilled data applies. |
Luke McDermott · Daniel Cummings 🔗 |
-
|
SimVAE: Narrowing the gap between Discriminative & Generative Representation Learning
(
Poster
)
link »
Self-supervised representation learning is a powerful paradigm that leverages the relationship between semantically similar data, such as augmentations, extracts of an image or sound clip, or multiple views/modalities. Recent methods, e.g. SimCLR, CLIP and DINO, have made significant strides, yielding representations that achieve state-of-the-art results on multiple downstream tasks. Though often intuitive, a comprehensive theoretical understanding of their underlying mechanisms, or of what they learn, remains elusive. Meanwhile, generative approaches, such as variational autoencoders (VAEs), fit a specific latent variable model and have principled appeal, but lag significantly in terms of performance. We present a theoretical analysis of self-supervised discriminative methods and a graphical model that reflects the assumptions they implicitly make and unifies these methods. We show that fitting this model under an ELBO objective improves representations over previous VAE methods on several common benchmarks, narrowing the gap to discriminative methods, and can also preserve information lost by discriminative approaches. This work brings new theoretical insight to modern machine learning practice. |
Alice Bizeul · Carl Allen 🔗 |
-
|
Distributional Reinforcement Learning in the Mammalian Brain
(
Poster
)
link »
Distributional reinforcement learning (dRL) — learning to predict not just the average return but the entire probability distribution of returns — has achieved impressive performance across a wide range of benchmark machine learning tasks. In vertebrates, the basal ganglia strongly encodes mean value and has long been thought to implement RL, but little is known about whether, where, and how populations of neurons in this circuit encode information about higher-order moments of reward distributions. To fill this gap, we used Neuropixels probes to acutely record striatal activity from well-trained, water-restricted mice performing a classical conditioning task. Across several measures of representational distance, odors associated with the same reward distribution were encoded more similarly to one another than to odors associated with the same mean reward but different reward variance, as predicted by dRL but not traditional RL. Optogenetic manipulations and computational modeling suggested that genetically distinct populations of neurons encoded the left and right tails of these distributions. Together, these results reveal a remarkable degree of convergence between dRL and the mammalian brain and hint at further biological specializations of the same overarching algorithm. |
Adam Lowet · Qiao Zheng · Melissa Meng · Sara Matias · Jan Drugowitsch · Naoshige Uchida 🔗 |
-
|
Soft Matching Distance: A metric on neural representations that captures single-neuron tuning
(
Poster
)
link »
Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on ``soft'' permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics. |
Meenakshi Khosla · Alex Williams 🔗 |
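A minimal sketch of the equal-size special case described above, matching neurons with SciPy's assignment solver; the paper's contribution, the "soft" optimal-transport generalization to unequal sizes, replaces the permutation with a transport plan and is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_matching_distance(X, Y):
    """Distance after optimally permuting neurons (equal-size case).

    X, Y: (num_stimuli, num_neurons) response matrices with matched stimuli.
    """
    # cost[i, j]: squared tuning-curve mismatch between neuron i of X and j of Y.
    cost = ((X[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)
    rows, cols = linear_sum_assignment(cost)  # optimal hard matching
    return float(np.sqrt(cost[rows, cols].sum()))
```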
-
|
Universality of intrinsic dimension of latent representations across models
(
Poster
)
link »
While state-of-the-art transformer networks use several hundred latent variables per layer, it has been shown that these features can actually be represented by relatively low-dimensional manifolds. The intrinsic dimension is a geometric property of the manifold that latent representations populate, viz., the minimal number of parameters needed to describe the representations. In this work, we compare the intrinsic dimensions of three image transformer networks for classes of the CIFAR-10 and CIFAR-100 datasets. We find compelling evidence that the intrinsic dimensions differ among classes but are universal across networks. This universality persists across different pretraining strategies, fine-tuning, and different model sizes. Our results strengthen the hypothesis that different models learn similar representations of data and suggest that further investigation of intrinsic dimension could lead to more insights on the universality of latent representations. |
Teresa Scheidt · Lars Kai Hansen 🔗 |
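A minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017), a common choice for this kind of analysis; the abstract does not state which estimator the authors used, so this particular estimator is an assumption for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(X):
    """Estimate intrinsic dimension from ratios of 2nd- to 1st-NN distances."""
    # Three neighbors per point: the point itself plus its two nearest neighbors.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dists[:, 2] / dists[:, 1])      # ratio r2/r1, sorted ascending
    n = len(mu)
    F = np.arange(1, n + 1) / n                  # empirical CDF of mu
    x = np.log(mu[:-1])                          # drop the last point (F = 1)
    y = -np.log(1.0 - F[:-1])
    return float((x @ y) / (x @ x))              # slope of the through-origin fit
```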
-
|
On consequences of finetuning on data with highly discriminative features
(
Poster
)
link »
In the era of transfer learning, training neural networks from scratch is becoming obsolete. Transfer learning leverages prior knowledge for new tasks, conserving computational resources. While its advantages are well-documented, we uncover a notable drawback: networks tend to prioritize basic data patterns, forsaking valuable pre-learned features. We term this behavior "feature erosion" and analyze its impact on network performance and internal representations. |
Wojciech Masarczyk · Tomasz Trzcinski · Mateusz Ostaszewski 🔗 |
-
|
An Information-Theoretic Understanding of Maximum Manifold Capacity Representations
(
Poster
)
link »
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is interesting for at least two reasons. Firstly, MMCR is an oddity in the zoo of MVSSL methods: it is not (explicitly) contrastive, applies no masking, performs no clustering, leverages no distillation, and does not (explicitly) reduce redundancy. Secondly, while many self-supervised learning (SSL) methods originate in information theory, MMCR distinguishes itself by claiming a different origin: a statistical mechanical characterization of the geometry of linear separability of data manifolds. However, given the rich connections between statistical mechanics and information theory, and given recent work showing how many SSL methods can be understood from an information-theoretic perspective, we conjecture that MMCR can be similarly understood from an information-theoretic perspective. In this paper, we leverage tools from high dimensional probability and information theory to demonstrate that an optimal solution to MMCR's nuclear norm-based objective function is the same optimal solution that maximizes a well-known lower bound on mutual information. |
Berivan Isik · Victor Lecomte · Rylan Schaeffer · Yann LeCun · Mikail Khona · Ravid Shwartz-Ziv · Sanmi Koyejo · Andrey Gromov 🔗 |
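For concreteness, a minimal sketch of the nuclear-norm objective analyzed above: unit-normalize the per-view embeddings, average them into per-image centroids, and maximize the sum of the centroid matrix's singular values. Shapes and names are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def mmcr_loss(view_embeddings):
    """Negative nuclear norm of the centroid matrix (MMCR objective sketch).

    view_embeddings: (num_views, batch, dim) tensor of embeddings.
    """
    z = F.normalize(view_embeddings, dim=-1)       # project views to unit sphere
    centroids = z.mean(dim=0)                      # (batch, dim) per-image centroids
    return -torch.linalg.svdvals(centroids).sum()  # maximize the nuclear norm
```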
-
|
SHARCS: Shared Concept Space for Explainable Multimodal Learning
(
Poster
)
link »
Multimodal learning is an essential paradigm for addressing complex real-world problems, where individual data modalities are typically insufficient for accurately solving a given modelling task. While various deep learning approaches have successfully addressed these challenges, their reasoning process is often opaque; limiting the capabilities for a principled explainable cross-modal analysis and any domain-expert intervention. In this paper, we introduce SHARCS (SHARed Concept Space) -- a novel concept-based approach for explainable multimodal learning. SHARCS learns and maps interpretable concepts from different heterogeneous modalities into a single unified concept-manifold, which leads to an intuitive projection of semantically similar cross-modal concepts. We demonstrate that such an approach can lead to inherently explainable task predictions while also improving downstream predictive performance. Moreover, we show that SHARCS can operate and significantly outperform other approaches in practically significant scenarios, such as retrieval of missing modalities and cross-modal explanations. Our approach is model agnostic and easily applicable to different types (and number) of modalities, thus advancing the development of effective, interpretable, and trustworthy multimodal approaches. |
Gabriele Dominici · Pietro Barbiero · Lucie Charlotte Magister · Pietro Lió · Nikola Simidjievski 🔗 |
-
|
Sufficient conditions for offline reactivation in recurrent neural networks
(
Poster
)
link »
During periods of quiescence, such as sleep, neural activity in many brain circuits resembles that observed during periods of task engagement. However, the precise conditions under which task-optimized networks can autonomously reactivate the same network states responsible for online behavior are poorly understood. In this study, we develop a mathematical framework that outlines sufficient conditions for the emergence of neural reactivation in circuits that encode features of smoothly varying stimuli. We demonstrate mathematically that noisy recurrent networks optimized to track environmental state variables using change-based sensory information naturally develop denoising dynamics, which, in the absence of input, cause the network to revisit state configurations observed during periods of online activity. We validate our findings using numerical experiments on two canonical neuroscience tasks: spatial position estimation based on self-motion cues, and head direction estimation based on angular velocity cues. Overall, our work provides theoretical support for modeling offline reactivation as an emergent consequence of task optimization in noisy neural circuits. |
Nanda Krishna · Colin Bredenberg · Daniel Levenstein · Blake Richards · Guillaume Lajoie 🔗 |
-
|
On the universality of neural codes in vision
(
Poster
)
link »
A high level of similarity between neural codes of natural images has been reported for both biological and artificial brains. These observations raise the question of whether this similarity of representations stems from a more fundamental similarity between neural coding strategies. In this paper, we show that neural networks trained on different image classification datasets learn similar weight summary statistics. Our results reveal the existence of a universal neural code for natural images. |
Florentin Guth · Brice Ménard 🔗 |
-
|
Event-Based Contrastive Learning for Medical Time Series
(
Poster
)
link »
In clinical practice, one often needs to identify whether a patient is at high risk of adverse outcomes after some key medical event; e.g., the short-term risk of death after an admission for heart failure. This task, however, remains challenging due to the complexity, variability, and heterogeneity of longitudinal medical data, especially for individuals suffering from chronic diseases like heart failure. In this paper, we introduce Event-Based Contrastive Learning (EBCL) - a method for learning embeddings of heterogeneous patient data that preserves temporal information before and after key index events. We demonstrate that EBCL produces models that yield better fine-tuning performance on critical downstream tasks including 30-day readmission, 1-year mortality, and 1-week length of stay relative to other representation learning methods that do not exploit temporal information surrounding key medical events. |
Hyewon Jeong · Nassim Oufattole · Aparna Balagopalan · Matthew McDermott · Payal Chandak · Marzyeh Ghassemi · Collin Stultz 🔗 |
-
|
Multi-timescale reinforcement learning in the brain
(
Poster
)
link »
To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behavior can be learned through reinforcement learning, a class of algorithms that has been successful at training artificial agents and at characterizing the firing of dopamine neurons in the midbrain. In classical reinforcement learning, agents discount future rewards exponentially according to a single time scale, known as the discount factor. This strategy is at odds with the empirical observation that humans and animals use non-exponential discounts in many situations. Here, we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopamine neurons in mice performing two behavioral tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm to understand functional heterogeneity in dopamine neurons, and open new avenues for the design of more efficient reinforcement learning algorithms. |
Paul Masset · Pablo Tano · HyungGoo Kim · Athar Malik · Alexandre Pouget · Naoshige Uchida 🔗 |
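A tiny sketch of the core computational idea: run TD(0) value learning in parallel under several discount factors, so the resulting population of value estimates spans multiple reward horizons. The tabular setting and names are assumptions for illustration, not the authors' model.

```python
import numpy as np

def multi_timescale_td(rewards, states, num_states, gammas, alpha=0.1):
    """TD(0) with one value table per discount factor (multi-timescale sketch).

    states: trajectory of state indices, length len(rewards) + 1.
    """
    g = np.asarray(gammas)                       # (K,) discount factors
    V = np.zeros((len(g), num_states))           # one row of values per timescale
    for t in range(len(rewards)):
        s, s_next = states[t], states[t + 1]
        td_error = rewards[t] + g * V[:, s_next] - V[:, s]
        V[:, s] += alpha * td_error              # each row discounts its own way
    return V
```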
-
|
Duality of Bures and Shape Distances with Implications for Comparing Neural Representations
(
Spotlight
)
link »
A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most (dis)similarity measures fall into one of two categories. First, measures such as linear regression, canonical correlation analysis (CCA), and shape distances, all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics that are already invariant to such symmetries (e.g. by comparing stimulus-by-stimulus kernel matrices). Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and draw contrasts of these measures with CKA, a popular similarity measure in the deep learning literature. |
Sarah Harvey · Brett Larsen · Alex Williams 🔗 |
-
|
ZipIt!: Multitask Model Merging without Training
(
Spotlight
)
link »
We tackle the extremely difficult problem of combining distinct models with different initializations, each solving a separate task, into one multi-task model $\textbf{without any additional training}$. Prior work in model merging permutes one model to the space of the other, then averages them together. While this works for models trained on the same task, we find that this fails to account for the differences in models trained on disjoint tasks. Thus, we introduce "ZipIt!", a general method for merging two arbitrary models of the same architecture that incorporates two simple strategies. First, in order to account for features that aren't shared between models, we expand the model merging problem to allow for merging features $\textit{within}$ each model by defining a general "zip" operation. Second, we add support for $\textit{partially zipping}$ the models up until a specified layer, naturally creating a multi-head model. We find that these two changes combined account for a staggering 20-50% improvement over prior work.
|
George Stoica · Daniel Bolya · Jakob Bjorner · Pratik Ramesh · Taylor Hearn · Judy Hoffman 🔗 |
-
|
Are Spiking Neural Networks more expressive than Artificial Neural Networks?
(
Spotlight
)
link »
This article studies the expressive power of spiking neural networks with firing-time-based information encoding, highlighting their potential for future energy-efficient AI applications when deployed on neuromorphic hardware. The computational power of a network of spiking neurons has already been studied via their capability of approximating any continuous function. By using the Spike Response Model as a mathematical model of a spiking neuron and assuming a linear response function, we delve deeper into this analysis and prove that spiking neural networks generate continuous piecewise linear mappings. We also show that they can emulate any multi-layer (ReLU) neural network with similar complexity. Furthermore, we prove that the maximum number of linear regions generated by a spiking neuron scales exponentially with respect to the input dimension, a characteristic that distinguishes it significantly from an artificial (ReLU) neuron. Our results further extend the understanding of the approximation properties of spiking neural networks and open up new avenues where spiking neural networks can be deployed instead of artificial neural networks without any performance loss. |
Manjot Singh · Adalbert Fono · Gitta Kutyniok 🔗 |
-
|
A sparse null code emerges in deep neural networks
(
Spotlight
)
link »
The internal representations of deep vision models are often assumed to encode specific image features, such as contours, textures, and object parts. However, it is possible for deep networks to learn highly abstract representations that may not be linked to any specific image feature. Here we present evidence for one such abstract representation in transformers and modern convolutional architectures that appears to serve as a null code, indicating image regions that are non-diagnostic for the object class. These null codes are both statistically and qualitatively distinct from the more commonly reported feature-related codes of vision models. Specifically, these null codes have several distinct characteristics: they are highly sparse, they have a single unique activation pattern for each network, they emerge abruptly at intermediate network depths, and they are activated in a feature-independent manner by weakly informative image regions, such as backgrounds. Together, these findings reveal a new class of highly abstract representations in deep vision models: sparse null codes that seem to indicate the absence of relevant features. |
Brian Robinson · Nathan Drenkow · Colin Conwell · Michael Bonner 🔗 |
-
|
Randomly Weighted Neuromodulation in Neural Networks Facilitates Learning of Manifolds Common Across Tasks
(
Spotlight
)
link »
Geometric Sensitive Hashing functions, a family of Locality-Sensitive Hashing functions, are neural network models that learn class-specific manifold geometry in supervised learning. However, given a set of supervised learning tasks, understanding the manifold geometries that can represent each task, and the kinds of relationships between tasks that these geometries imply, has received little attention. We explore a formalization of this question by considering a generative process where each task is associated with a high-dimensional manifold, which can be realized in brain-like models with neuromodulatory systems. Following this formulation, we define Task-specific Geometric Sensitive Hashing and show that a randomly weighted neural network with a neuromodulation system can realize this function. |
Jinyung Hong · Theodore P. Pavlic 🔗 |
-
|
Distributional Reinforcement Learning in the Mammalian Brain
(
Spotlight
)
link »
Distributional reinforcement learning (dRL) — learning to predict not just the average return but the entire probability distribution of returns — has achieved impressive performance across a wide range of benchmark machine learning tasks. In vertebrates, the basal ganglia strongly encodes mean value and has long been thought to implement RL, but little is known about whether, where, and how populations of neurons in this circuit encode information about higher-order moments of reward distributions. To fill this gap, we used Neuropixels probes to acutely record striatal activity from well-trained, water-restricted mice performing a classical conditioning task. Across several measures of representational distance, odors associated with the same reward distribution were encoded more similarly to one another than to odors associated with the same mean reward but different reward variance, as predicted by dRL but not traditional RL. Optogenetic manipulations and computational modeling suggested that genetically distinct populations of neurons encoded the left and right tails of these distributions. Together, these results reveal a remarkable degree of convergence between dRL and the mammalian brain and hint at further biological specializations of the same overarching algorithm. |
Adam Lowet · Qiao Zheng · Melissa Meng · Sara Matias · Jan Drugowitsch · Naoshige Uchida 🔗 |
-
|
Soft Matching Distance: A metric on neural representations that captures single-neuron tuning
(
Spotlight
)
link »
Common measures of neural representational (dis)similarity are designed to be insensitive to rotations and reflections of the neural activation space. Motivated by the premise that the tuning of individual units may be important, there has been recent interest in developing stricter notions of representational (dis)similarity that require neurons to be individually matched across networks. When two networks have the same size (i.e. same number of neurons), a distance metric can be formulated by optimizing over neuron index permutations to maximize tuning curve alignment. However, it is not clear how to generalize this metric to measure distances between networks with different sizes. Here, we leverage a connection to optimal transport theory to derive a natural generalization based on ``soft'' permutations. The resulting metric is symmetric, satisfies the triangle inequality, and can be interpreted as a Wasserstein distance between two empirical distributions. Further, our proposed metric avoids counter-intuitive outcomes suffered by alternative approaches, and captures complementary geometric insights into neural representations that are entirely missed by rotation-invariant metrics. |
Meenakshi Khosla · Alex Williams 🔗 |
-
|
An Information-Theoretic Understanding of Maximum Manifold Capacity Representations
(
Spotlight
)
link »
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is interesting for at least two reasons. Firstly, MMCR is an oddity in the zoo of MVSSL methods: it is not (explicitly) contrastive, applies no masking, performs no clustering, leverages no distillation, and does not (explicitly) reduce redundancy. Secondly, while many self-supervised learning (SSL) methods originate in information theory, MMCR distinguishes itself by claiming a different origin: a statistical mechanical characterization of the geometry of linear separability of data manifolds. However, given the rich connections between statistical mechanics and information theory, and given recent work showing how many SSL methods can be understood from an information-theoretic perspective, we conjecture that MMCR can be similarly understood from an information-theoretic perspective. In this paper, we leverage tools from high dimensional probability and information theory to demonstrate that an optimal solution to MMCR's nuclear norm-based objective function is the same optimal solution that maximizes a well-known lower bound on mutual information. |
Berivan Isik · Victor Lecomte · Rylan Schaeffer · Yann LeCun · Mikail Khona · Ravid Shwartz-Ziv · Sanmi Koyejo · Andrey Gromov 🔗 |
-
|
Characterizing pre-trained and task-adapted molecular representations
(
Spotlight
)
link »
Pre-trained deep learning models are emerging fast as a tool for enhancing scientific workflow and accelerating scientific discovery. Representation learning is a fundamental task to study the molecular structure–property relationship, which is then leveraged for predicting the molecular properties or designing new molecules with desired attributes. However, evaluating the emerging "zoo" of pre-trained models for various downstream tasks remains challenging. We propose an unsupervised method to characterize embeddings of pre-trained models through the lens of non-parametric group property-driven subset scanning (SS). We assess its detection capabilities with extensive experiments on diverse molecular benchmarks (ZINC-250K, MOSES, MoleculeNet) across predictive chemical language models (MoLFormer, ChemBERTa) and molecular graph generative models (GraphAF, GCPN). We further evaluate how representations evolve as a result of domain adaptation by finetuning or low-dimensional projection. Experiments reveal notable information condensation in the pre-trained embeddings upon task-specific fine-tuning as well as projection techniques. For example, among the top-$120$ most-common elements in the embedding (out of $\approx 700$), only $11$ property-driven elements are shared between the three tasks (BACE, BBBP, and HIV), while $\approx 70$-$80$ of those are unique to each task. This work provides a post-hoc quality evaluation method for representation learning models and domain adaptation methods that is task and modality-agnostic.
|
Celia Cintas · Payel Das · Jarret Ross · Brian Belgodere · Girmaw Abebe Tadesse · Vijil Chenthamarakshan · Jannis Born · Skyler D. Speakman 🔗 |
-
|
WavSpA: Wavelet Space Attention for Boosting Transformers' Long Sequence Learning Ability
(
Spotlight
)
link »
Transformer and its variants are fundamental neural architectures in deep learning. Recent works show that learning attention in the Fourier space can improve the long sequence learning capability of Transformers. We argue that the wavelet transform should be a better choice because it captures both position and frequency information with linear time complexity. Therefore, in this paper, we systematically study the synergy between the wavelet transform and Transformers. We propose Wavelet Space Attention (WavSpA) that facilitates attention learning in a learnable wavelet coefficient space, replacing the attention in Transformers by (1) applying a forward wavelet transform to project the input sequences to multi-resolution bases, (2) conducting attention learning in the wavelet coefficient space, and (3) reconstructing the representation in input space via a backward wavelet transform. Extensive experiments on the Long Range Arena demonstrate that learning attention in the wavelet space using either fixed or adaptive wavelets can consistently improve Transformer’s performance and also significantly outperform learning in Fourier space. We further show our method can enhance Transformer’s reasoning extrapolation capability over distance on the LEGO chain-of-reasoning task. |
Yufan Zhuang · Zihan Wang · Fangbo Tao · Jingbo Shang 🔗 |
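A rough sketch of the three-step recipe above, assuming a single-level discrete wavelet transform from PyWavelets and plain softmax attention with no learned projections; the paper's learnable, multi-resolution setup is more elaborate.

```python
import numpy as np
import pywt

def softmax_attention(X):
    """X: (t, d). Scaled dot-product self-attention with Q = K = V = X."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def wavspa_block(x, wavelet='db2'):
    """x: (seq_len, d). Attention is computed in wavelet coefficient space."""
    cA, cD = pywt.dwt(x, wavelet, axis=0)          # (1) forward transform
    cA, cD = softmax_attention(cA), softmax_attention(cD)  # (2) attention
    return pywt.idwt(cA, cD, wavelet, axis=0)      # (3) reconstruct input space

x = np.random.randn(128, 64)
print(wavspa_block(x).shape)  # (128, 64)
```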
-
|
NEUCORE: Neural Concept Reasoning for Composed Image Retrieval
(
Spotlight
)
link »
Composed image retrieval, which combines a reference image and a text modifier to identify the desired target image, is a challenging task that requires the model to comprehend both vision and language modalities and their interactions. Existing approaches focus on holistic multi-modal interaction modeling, and ignore the composed and complementary property between the reference image and text modifier. In order to better utilize the complementarity of multi-modal inputs for effective information fusion and retrieval, we move the multi-modal understanding to fine granularity at the concept level, and learn the multi-modal concept alignment to identify the visual location in reference or target images corresponding to the text modifier. Toward this end, we propose a NEUral COncept REasoning (NEUCORE) model which incorporates multi-modal concept alignment and progressive multi-modal fusion over aligned concepts. Specifically, considering that the text modifier may refer to semantic concepts not existing in the reference image that need to be added to the target image, we learn the multi-modal concept alignment between the text modifier and the concatenation of reference and target images, under a multiple-instance learning framework with image- and sentence-level weak supervision. Furthermore, based on aligned concepts, to form discriminative fusion features of the input modalities for accurate target image retrieval, we propose a progressive fusion strategy with a unified execution architecture instantiated by the attended language semantic concepts. Our proposed approach is evaluated on three datasets and achieves state-of-the-art results. |
Shu Zhao · Huijuan Xu 🔗 |
-
|
Grokking as Simplification: A Nonlinear Complexity Perspective
(
Spotlight
)
link »
We attribute grokking, the phenomenon where generalization is much delayed after memorization, to compression. We define the linear mapping number (LMN) to measure network complexity, which is a generalized version of the linear region number for ReLU networks. LMN can nicely characterize neural network compression before generalization. Although the $L_2$ norm has been popular for characterizing model complexity, we argue in favor of LMN for a number of reasons: (1) LMN can be naturally interpreted as information/computation, while $L_2$ cannot. (2) In the compression phase, LMN has nice linear relations with test losses, while $L_2$ is correlated with test losses in a complicated nonlinear way. (3) LMN also reveals an intriguing phenomenon of the XOR network switching between two generalization solutions, while $L_2$ does not. Besides explaining grokking, we argue that LMN is a promising candidate for the neural network version of Kolmogorov complexity, since it explicitly considers local or conditioned linear computations aligned with the nature of modern artificial neural networks.
|
Ziming Liu · Ziqian Zhong · Max Tegmark 🔗 |
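LMN is defined precisely in the paper; the sketch below is only a crude empirical proxy added for intuition: in a ReLU network, each input induces a binary on/off pattern over the hidden units, each distinct pattern corresponds to one linear mapping, so counting unique patterns over probe inputs estimates how many linear pieces the network actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 2)), rng.normal(size=32)   # toy 2-32-32 ReLU net
W2, b2 = rng.normal(size=(32, 32)), rng.normal(size=32)

def activation_pattern(x):
    """Binary on/off signature of all hidden units for input x."""
    h1 = np.maximum(W1 @ x + b1, 0)
    h2 = np.maximum(W2 @ h1 + b2, 0)
    return tuple((h1 > 0).astype(int)) + tuple((h2 > 0).astype(int))

X = rng.uniform(-2, 2, size=(10000, 2))                  # probe inputs
n_mappings = len({activation_pattern(x) for x in X})
print(f"distinct linear mappings visited: {n_mappings}")
```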
-
|
Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference
(
Spotlight
)
link »
We propose a method to improve the efficiency and accuracy of amortized Bayesian inference (ABI) by leveraging universal symmetries in the probabilistic joint model $p(\theta, y)$ of parameters $\theta$ and data $y$. In a nutshell, we invert Bayes' theorem and estimate the marginal likelihood based on approximate representations of the joint model. Upon perfect approximation, the marginal likelihood is constant across all parameter values by definition. However, approximation error leads to undesirable variance in the marginal likelihood estimates across different parameter values. We formulate violations of this symmetry as a loss function to accelerate the learning dynamics of conditional neural density estimators. We apply our method to a bimodal toy problem with an explicit likelihood (likelihood-based) and a realistic model with an implicit likelihood (simulation-based).
|
Marvin Schmitt · Daniel Habermann · Paul-Christian Bürkner · Ullrich Köthe · Stefan Radev 🔗 |
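To make the inverted Bayes' theorem concrete, here is a toy sketch with a conjugate Gaussian model (our example, not the paper's models): log p(y) = log p(theta) + log p(y|theta) - log q(theta|y) is constant in theta when q is the exact posterior, so its variance across posterior samples can serve as a self-consistency loss.

```python
import numpy as np
from scipy.stats import norm

def self_consistency_loss(y, q_mean, q_std, n_samples=64, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(q_mean, q_std, size=n_samples)  # theta ~ q(.|y)
    log_prior = norm.logpdf(theta, 0.0, 1.0)           # p(theta) = N(0, 1)
    log_lik = norm.logpdf(y, theta, 1.0)               # y | theta ~ N(theta, 1)
    log_q = norm.logpdf(theta, q_mean, q_std)          # approximate posterior
    log_marginal = log_prior + log_lik - log_q         # inverted Bayes' rule
    return np.var(log_marginal)                        # 0 iff q is exact

y = 1.3
# Exact posterior here is N(y/2, 1/2), so the loss vanishes for it:
print(self_consistency_loss(y, q_mean=y / 2, q_std=np.sqrt(0.5)))  # ~0
print(self_consistency_loss(y, q_mean=0.0, q_std=1.0))             # large
```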
-
|
Efficient Multimodal Alignment: To Freeze or Not to Freeze?
(
Spotlight
)
link »
Language-image pretraining creates a joint representation space between the two modalities where images and texts with similar semantic information lie close to each other. Language-image models are often trained from scratch without taking advantage of unimodal pretrained models. By aligning the representation spaces of two modality-specific encoders, our model achieves 74.7% accuracy on the ImageNet1K validation set, at two orders of magnitude lower training cost. In this work, we highlight the importance of unfreezing the CLS tokens of uni-modal transformer encoders to create a joint embedding space. Freezing the image and text CLS tokens reduces the mean accuracy from 37.5% to 19.4% on the 38 evaluation benchmarks. |
Till Aczel · Roger Wattenhofer 🔗 |
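A minimal PyTorch sketch of the freezing scheme discussed above: freeze a pretrained unimodal encoder while keeping its CLS token trainable. The TinyEncoder stand-in and its cls_token attribute are hypothetical names, not a specific library API.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a ViT/BERT-style encoder with a learnable CLS token."""
    def __init__(self, dim=64):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.backbone = nn.Linear(dim, dim)

def freeze_except_cls(encoder):
    for p in encoder.parameters():
        p.requires_grad = False          # freeze the pretrained backbone
    encoder.cls_token.requires_grad = True  # ...but keep the CLS token trainable

enc = TinyEncoder()
freeze_except_cls(enc)
print([(n, p.requires_grad) for n, p in enc.named_parameters()])
```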
-
|
What Does Knowledge Distillation Distill?
(
Spotlight
)
link »
Knowledge distillation is an increasingly used compression method, given the popularity of large-scale models, but it is unclear whether all the information a teacher model contains is distilled into the smaller student model. We aim to formalize the concept of `knowledge' to investigate how knowledge is transferred during distillation, focusing on shared invariances to counterfactual changes to dataset latent variables (which we call mechanisms). We define good stand-in student models for the teacher as models that share the teacher's mechanisms, and find that Jacobian matching and contrastive representation learning are viable methods for achieving good students. While these methods do not result in perfect transfer of mechanisms, they are likely to improve student fidelity or mitigate simplicity bias (as measured by teacher-student KL divergence and accuracy on various out-of-distribution test datasets), especially on datasets with certain spurious statistical correlations. |
Cindy Wu · Ekdeep S Lubana · Bruno Mlodozeniec · Robert Kirk · David Krueger 🔗 |
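A sketch of Jacobian matching as a distillation regularizer, under the simplifying assumption that the input-gradients of the summed logits are matched; the paper's exact matching target may differ.

```python
import torch
import torch.nn.functional as F

def jacobian_matching_loss(student, teacher, x):
    x = x.clone().requires_grad_(True)
    s_out, t_out = student(x), teacher(x)
    # Input-gradients of the summed logits serve as cheap Jacobian surrogates.
    s_grad = torch.autograd.grad(s_out.sum(), x, create_graph=True)[0]
    t_grad = torch.autograd.grad(t_out.sum(), x)[0]
    return F.mse_loss(s_grad, t_grad.detach())

student = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(),
                              torch.nn.Linear(16, 3))
teacher = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 3))
x = torch.randn(8, 10)
print(jacobian_matching_loss(student, teacher, x))  # add to the usual KD loss
```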
-
|
Comparing Representational and Functional Similarity in Small Transformer Language Models
(
Spotlight
)
link »
In many situations, it would be helpful to be able to characterize the solution learned by a neural network, including for answering scientific questions (e.g. how do architecture changes affect generalization) and addressing practical concerns (e.g. auditing for potentially unsafe behavior). One approach is to try to understand these models by studying the representations that they learn---for example, comparing whether two networks learn similar representations. However, it is not always clear how much representation-level analyses can tell us about how a model makes predictions. In this work, we explore this question in the context of small Transformer language models, which we train on a synthetic, hierarchical language task. We train models with different sizes and random initializations, evaluating performance over the course of training and on a variety of systematic generalization splits. We find that existing methods for measuring representation similarity are not always correlated with behavioral metrics---i.e. models with similar representations do not always make similar predictions---and the results vary depending on the choice of representation. Our results highlight the importance of understanding representations in terms of the role they play in the neural algorithm. |
Dan Friedman · Andrew Lampinen · Lucas Dixon · Danqi Chen · Asma Ghandeharioun 🔗 |
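One of the representation-similarity measures evaluated in this line of work is linear centered kernel alignment (CKA); for reference, a standard NumPy implementation (the toy data below is ours):

```python
import numpy as np

def linear_cka(X, Y):
    """X: (n, d1), Y: (n, d2) activations for the same n inputs."""
    X = X - X.mean(axis=0)   # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, 'fro') ** 2
    return hsic / (np.linalg.norm(X.T @ X, 'fro') *
                   np.linalg.norm(Y.T @ Y, 'fro'))

X, Y = np.random.randn(500, 64), np.random.randn(500, 128)
print(linear_cka(X, X @ np.random.randn(64, 32)))  # high: same info, new basis
print(linear_cka(X, Y))                            # low: unrelated features
```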
-
|
Representational constraints underlying similarity between task-optimized neural systems
(
Spotlight
)
link »
In this study, we investigate the similarity of representations between biological and artificial visual systems that are optimized for object recognition. We propose that this similarity could be a result of constraints on the representation of task-optimized systems, which necessitate the development of an abstraction from the input stimuli. To measure this, we constructed a two-dimensional coordinate system in which we measured the distance of each neural representation from the pixel space and the class space. Our results show that proximity in this space predicts the similarity of neural representations between different visual systems. We observe that the trajectories of representations in any given task-optimized visual neural network start close to the pixel space and gradually move towards higher abstract representations such as categories. This suggests that the similarity between different task-optimized systems is due to constraints on representational trajectories, as revealed by the abstraction space. We present abstraction space as a simple yet effective analysis tool to draw inferences on the representations of neural networks and to uncover the constraints that lead to similar representations in different visual systems. |
Tahereh Toosi 🔗 |
-
|
A Compact Representation for Bayesian Neural Networks By Removing Permutation Symmetry
(
Spotlight
)
link »
Bayesian neural networks (BNNs) are a principled approach to modeling predictive uncertainties in deep learning, which are important in safety-critical applications. Since exact Bayesian inference over the weights in a BNN is intractable, various approximate inference methods exist, among which sampling methods such as Hamiltonian Monte Carlo (HMC) are often considered the gold standard. While HMC provides high-quality samples, it lacks interpretable summary statistics because its sample mean and variance are meaningless in neural networks due to permutation symmetry. In this paper, we first show that the role of permutations can be meaningfully quantified by a number-of-transpositions metric. We then show that the recently proposed rebasin method allows us to summarize HMC samples into a compact representation that provides a meaningful explicit uncertainty estimate for each weight in a neural network, thus unifying sampling methods with variational inference. We show that this compact representation allows us to compare trained BNNs directly in weight space across sampling methods and variational inference, and to efficiently prune neural networks trained without explicit Bayesian frameworks by exploiting uncertainty estimates from HMC. |
Tim Xiao · Weiyang Liu · Robert Bamler 🔗 |
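A toy sketch of the alignment step underlying such rebasin-style summaries: find the hidden-unit permutation that best matches one weight sample to a reference, via the Hungarian algorithm. Real rebasin aligns all layers jointly; this single-layer version is illustrative only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_hidden_units(W_ref, W):
    """W_ref, W: (hidden, in) first-layer weights of two HMC samples."""
    similarity = W_ref @ W.T                       # match rows by inner product
    row, col = linear_sum_assignment(-similarity)  # maximize total similarity
    return W[col], col                             # permuted weights, permutation

W_ref = np.random.randn(16, 8)
perm = np.random.permutation(16)
W = W_ref[perm] + 0.01 * np.random.randn(16, 8)    # same net, shuffled units
W_aligned, col = align_hidden_units(W_ref, W)
print(np.allclose(W_aligned, W_ref, atol=0.1))     # True after un-permuting
```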
-
|
ReWaRD: Retinal Waves for Pre-Training Artificial Neural Networks Mimicking Real Prenatal Development
(
Spotlight
)
link »
Computational models trained on large amounts of natural images are the state of the art for studying human vision -- usually adult vision. Computational models of infant vision and its further development are gaining more and more attention in the community. In this work we aim at the very beginning of our visual experience -- pre- and post-natal retinal waves, which are thought to act as a pre-training mechanism for the human visual system at a very early stage of development. We see this approach as an instance of biologically plausible, data-driven inductive bias through pre-training. We built a computational model that mimics this development mechanism by pre-training different artificial convolutional neural networks with simulated retinal wave images. The resulting features of this biologically plausible pre-training closely match the V1 features of the human visual system. We show that the performance gain by pre-training with retinal waves is similar to a state-of-the-art pre-training pipeline. Our framework contains the retinal wave generator, as well as a training strategy, which can be a first step in a curriculum-learning-based training diet for various models of development. We release code, data and trained networks to lay the basis for future work on visual development based on a curriculum learning approach that includes prenatal development, supporting studies of innate vs. learned properties of the human visual system. An additional benefit of our pre-trained networks for neuroscience or computer vision applications is the absence of biases inherited from datasets like ImageNet. |
Benjamin Cappell · Andreas Stoll · Chukwudi Umah · Bernhard Egger 🔗 |
-
|
Multimodal decoding of human brain activity into images and text
(
Spotlight
)
link »
Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its enhanced flexibility as compared to brain decoding into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a unique image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We utilized the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employed the Generative Image-to-text Transformer (GIT) as our backbone for captioning and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporated depth maps from the ControlNet model to further guide the reconstruction process. We propose a multimodal approach that leverages similarities between neural and deep learning representations; by learning an alignment between these spaces, we produce textual descriptions and image reconstructions from brain activity. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications based on a combination of high-level semantic information coming from text and low-level image shape information coming from depth maps and initial guess images. |
Matteo Ferrante · Tommaso Boccato · Furkan Ozcelik · Rufin VanRullen · Nicola Toschi 🔗 |
-
|
Predictive variational autoencoder for learning robust representations of time-series data
(
Spotlight
)
link »
Variational autoencoders (VAEs) have been used extensively to discover low-dimensional latent factors governing neural activity and animal behavior. However, without careful model selection, the uncovered latent factors may reflect noise in the data rather than true underlying features, rendering such representations unsuitable for scientific interpretation. Existing solutions to this problem involve introducing additional measured variables or data augmentations specific to a particular data type. We propose a VAE architecture that predicts the next point in time and show that it mitigates the learning of spurious features. In addition, we introduce a model selection metric based on smoothness over time in the latent space. We show that, together, these two constraints encouraging VAEs to be smooth over time produce robust latent representations and faithfully recover latent factors on synthetic datasets. |
Julia Wang · Dexter Tsin · Tatiana Engel 🔗 |
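A sketch of a smoothness-over-time metric of the kind described above (our simple variant, not necessarily the paper's formula): the variance of latent steps relative to the overall latent variance, which is low for slow trajectories and high for noisy ones.

```python
import numpy as np

def latent_smoothness(Z):
    """Z: (T, d) latent trajectory. Lower values mean smoother over time."""
    step_var = np.var(np.diff(Z, axis=0), axis=0).sum()  # step-to-step jitter
    total_var = np.var(Z, axis=0).sum()                  # overall spread
    return step_var / total_var

t = np.linspace(0, 10, 500)[:, None]
smooth = np.hstack([np.sin(t), np.cos(t)])          # slow latent dynamics
noisy = smooth + 0.5 * np.random.randn(500, 2)      # spurious fast features
print(latent_smoothness(smooth), latent_smoothness(noisy))
```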
-
|
Instruction-tuned LLMs with World Knowledge are More Aligned to the Human Brain
(
Spotlight
)
link »
Instruction-tuning is a widely adopted method of finetuning that enables large language models (LLMs) to generate output that more closely resembles human responses to natural language queries, in many cases leading to human-level performance on diverse testbeds. However, it remains unclear whether instruction-tuning truly makes LLMs more similar to how humans process language. We investigate the effect of instruction-tuning on brain alignment, the similarity of LLM internal representations to neural activity in the human language system. We assess 25 vanilla and instruction-tuned LLMs across three datasets involving humans reading naturalistic stories and sentences, and discover that instruction-tuning generally enhances brain alignment by an average of 6%. To identify the factors underlying LLM-brain alignment, we compute the correlation between the brain alignment of LLMs and various model properties, such as model size, performance ability on problem-solving benchmarks, and ability on benchmarks requiring world knowledge spanning various domains. Notably, we find a strong positive correlation between brain alignment and model size (r = 0.95), as well as performance on tasks requiring world knowledge (r = 0.81). Our results demonstrate that instruction-tuning LLMs improves both world knowledge representations and human brain alignment, suggesting that mechanisms that encode world knowledge in LLMs also improve representational alignment to the human brain. |
Khai Loong Aw · Syrielle Montariol · Badr AlKhamissi · Martin Schrimpf · Antoine Bosselut 🔗 |
-
|
On Transferring Expert Knowledge from Tabular Data to Images
(
Spotlight
)
link »
Transferring knowledge across modalities has garnered significant attention in the field of machine learning as it enables the utilization of expert knowledge from diverse domains. In particular, the representation of expert knowledge in tabular form, commonly found in fields such as medicine, can greatly enhance the comprehensiveness and accuracy of image-based learning. However, the transfer of knowledge from tabular to image data presents unique challenges due to the distinct characteristics of these data types, making it difficult to determine "how to reuse" and "which subset to reuse". To address this, we propose a novel method called CHannel tAbulaR alignment with optiMal tranSport (CHARMS) that automatically and effectively transfers relevant tabular knowledge. Specifically, by maximizing the mutual information between a group of channels and tabular features, our method modifies the visual embedding and captures the semantics of tabular knowledge. The alignment between channels and attributes helps select the subset of tabular data which contains knowledge relevant to images. Experimental results demonstrate that CHARMS effectively reuses tabular knowledge to improve the performance and interpretability of visual classifiers. |
Jun-Peng Jiang · Han-Jia Ye · Leye Wang · Yang Yang · Yuan Jiang · De-Chuan Zhan 🔗 |
-
|
On the Robustness of Neural Collapse and the Neural Collapse of Robustness
(
Spotlight
)
link »
Neural Collapse refers to the curious phenomenon at the end of training of a neural network, where feature vectors and classification weights converge to a very simple geometrical arrangement (a simplex). While it has been observed empirically in various cases and has been theoretically motivated, its connection with crucial properties of neural networks, like their generalization and robustness, remains unclear. In this work, we study the stability properties of these simplices. We find that the simplex structure disappears under small adversarial attacks, and that perturbed examples ``leap'' between simplex vertices. We further analyze the geometry of networks that are optimized to be robust against adversarial perturbations of the input, and find that Neural Collapse is a pervasive phenomenon in these cases as well, with clean and perturbed representations forming aligned simplices, and giving rise to a robust simple nearest-neighbor classifier. By studying the propagation of the amount of collapse inside the network, we identify novel properties of both robust and non-robust machine learning models, and show that earlier, unlike later, layers maintain reliable simplices on perturbed data. |
Jingtong Su · Ya Shi Zhang · Nikolaos Tsilivis · Julia Kempe 🔗 |
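For reference, a NumPy sketch of two standard neural-collapse diagnostics (generic formulas, not the paper's code): within-class variability relative to between-class spread, and the spread of pairwise angles between centered class means, both of which shrink as features collapse to a simplex.

```python
import numpy as np

def collapse_diagnostics(H, y):
    """H: (n, d) last-layer features; y: (n,) integer class labels."""
    classes = np.unique(y)
    means = np.stack([H[y == c].mean(axis=0) for c in classes])
    M = means - H.mean(axis=0)                       # centered class means
    within = np.mean([np.var(H[y == c], axis=0).sum() for c in classes])
    between = np.var(M, axis=0).sum()
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)
    cos = Mn @ Mn.T                                  # pairwise mean angles
    off_diag = cos[~np.eye(len(classes), dtype=bool)]
    return within / between, off_diag.std()          # both -> 0 under collapse

H = np.random.randn(300, 32)                         # far-from-collapsed toy data
y = np.random.randint(0, 3, size=300)
print(collapse_diagnostics(H, y))
```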
-
|
MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation
(
Spotlight
)
link »
Pre-training has been an important ingredient in developing strong monocular depth estimation models in recent years. For instance, self-supervised learning (SSL) is particularly effective by alleviating the need for large datasets with dense ground-truth depth maps. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a unified framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Hence, MeSa benefits from not only general-purpose representations learnt via masked pre-training but also specialized depth-specific features acquired via geometric and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset. |
Muhammad Osama Khan · Junbang Liang · Chun-Kai Wang · Shan Yang · Yu Lou 🔗 |
-
|
Linearly Structured World Representations in Maze-Solving Transformers
(
Spotlight
)
link »
The emergence of seemingly similar representations across tasks and neural architectures suggests that convergent properties may underlie sophisticated behavior. One form of representation that seems particularly fundamental to reasoning in many artificial (and perhaps natural) networks is the formation of world models, which decompose observed task structures into re-usable perceptual primitives and task-relevant relations. In this work, we show that auto-regressive transformers tasked with solving mazes learn to linearly represent the structure of mazes, and that the formation of these representations coincides with a sharp increase in generalization performance. Furthermore, we find preliminary evidence for Adjacency Heads which may play a role in computing valid paths through mazes. |
Michael Ivanitskiy · Alexander Spies · Tilman Räuker · Guillaume Corlouer · Christopher Mathwin · Lucia Quirke · Can Rager · Rusheb Shah · Dan Valentine · Cecilia Diniz Behn · Katsumi Inoue · Samy Wu Fung
|
-
|
A General Method for Testing Bayesian Models using Neural Data
(
Spotlight
)
link »
Bayesian models have been successful in explaining human and animal behavior, but the extent to which they can also explain neural activity is still an open question. A major obstacle to answering this question is that current methods for generating neural predictions require detailed and specific assumptions about the encoding of posterior beliefs in neural responses, with no consensus or decisive data about the nature of this encoding. Here, we present a new method that overcomes these challenges for a wide class of probabilistic encodings -- including the two major classes of neural sampling and distributed distributional codes -- and prove conditions for its validity. Our method tests whether the relationships between the model posteriors for different stimuli match the relationships between the corresponding neural responses -- akin to representational similarity analysis (RSA), a widely used method for nonprobabilistic models. Finally, we present a new model comparison diagnostic for our method, based not on the agreement of the model with the data directly, but on the alignment of the model and data when injecting noise into our neural prediction generation method. We illustrate our method using simulated V1 data and compare two Bayesian models that are practically indistinguishable using behavior alone. Our results show a powerful new way to rigorously test Bayesian models on neural data. |
Gabor Lengyel · Sabyasachi Shivkumar · Ralf Haefner 🔗 |
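Since the method is described as akin to RSA, here is a generic RSA sketch for orientation (not the paper's procedure): build stimulus-by-stimulus dissimilarity matrices from model predictions and from neural responses, then compare them with a rank correlation.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_score(model_features, neural_responses):
    """Each input: (n_stimuli, dim). Returns Spearman rho between RDMs."""
    rdm_model = pdist(model_features)      # condensed dissimilarity vectors
    rdm_neural = pdist(neural_responses)
    return spearmanr(rdm_model, rdm_neural)[0]

model = np.random.randn(40, 10)            # e.g., posterior summaries per stimulus
neural = model @ np.random.randn(10, 200)  # toy "responses" encoding the model
print(rsa_score(model, neural))            # high when geometries match
```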
-
|
Degradation and plasticity in convolutional neural networks: An investigation of internal representations
(
Spotlight
)
link »
The architecture and information processing of convolutional neural networks were originally heavily inspired by the biological visual system. In this work, we make use of these similarities to create an in silico model of neurodegenerative diseases affecting the visual system. We examine layer-wise internal representations and accuracy levels of the model as it is subjected to synaptic decay and retraining to investigate if it is possible to capture a biologically realistic profile of visual cognitive decline. Therefore, we progressively decay and freeze model synapses in a highly compressed model trained for object recognition. Between each iteration of progressive model degradation, we retrain the remaining unaffected synapses on subsets of initial training data to simulate continual neuroplasticity. The results of this work show that even with high levels of synaptic decay and limited retraining data, the model is able to regain internal representations similar to that of the unaffected, healthy model. We also demonstrate that throughout a complete cycle of model degradation, the early layers of the model retain high levels of centered kernel alignment similarity, while later layers containing high-level information are much more susceptible to deviate from the healthy model. |
Jasmine Moore · Vibujithan Vigneshwaran · Matthias Wilms · Nils Daniel Forkert 🔗 |
-
|
On the Direct Alignment of Latent Spaces
(
Spotlight
)
link »
With the wide adoption of deep learning and pre-trained models rises the question of how to effectively reuse existing latent spaces for new applications. One important question is how the geometry of the latent space changes between different training runs of the same architecture and different architectures trained for the same task. Previous works proposed that the latent spaces for similar tasks are approximately isometric. However, in this work we show that methods restricted to this assumption perform worse than simply using a linear transformation to align the latent spaces. We propose directly computing a transformation between the latent codes of different architectures, which is more efficient than previous approaches and flexible with respect to the type of transformation used. Our experiments show that aligning the latent spaces with a linear transformation performs best while not requiring more prior knowledge. |
Zorah Lähner · Michael Moeller 🔗 |
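A small NumPy sketch of the comparison made above: fit an orthogonal (isometry-assuming) map and an unconstrained linear map between paired latent codes, then compare fit quality. The shapes and the synthetic ground-truth map are assumptions for illustration.

```python
import numpy as np

def orthogonal_align(Z1, Z2):
    """Procrustes: best rotation/reflection R with Z1 @ R ~= Z2."""
    U, _, Vt = np.linalg.svd(Z1.T @ Z2)
    return U @ Vt

def linear_align(Z1, Z2):
    """Unconstrained least-squares map A with Z1 @ A ~= Z2."""
    return np.linalg.lstsq(Z1, Z2, rcond=None)[0]

Z1 = np.random.randn(1000, 32)                     # anchors from encoder 1
A_true = np.random.randn(32, 32)                   # non-isometric ground truth
Z2 = Z1 @ A_true                                   # same inputs, encoder 2
for name, M in [("orthogonal", orthogonal_align(Z1, Z2)),
                ("linear", linear_align(Z1, Z2))]:
    err = np.linalg.norm(Z1 @ M - Z2) / np.linalg.norm(Z2)
    print(name, err)   # the linear map wins when spaces are not isometric
```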
-
|
Reinforcement Learning with Augmentation Invariant Representation: A Non-contrastive Approach
(
Spotlight
)
link »
Data augmentation has been proven as an effective measure to improve generalization performance in reinforcement learning (RL). However, recent approaches directly use the augmented data to learn the value estimate or regularize the estimation, often ignoring the core essence that the model needs to learn that augmented data indeed represents the same state. In this work, we present RAIR: Reinforcement learning with Augmentation Invariant Representation that disentangles the representation learning task from the RL task and aims to learn similar latent representations for the original observation and the augmented one. Our approach learns the representation of high-dimensional visual observations in a non-contrastive self-supervised way combined with the standard RL objective. In particular, RAIR gradually pushes the latent representation of an observation closer to the representation produced for the corresponding augmented observations. Thus, our agent is more resilient to the changes in the environment. We evaluate RAIR on all sixteen environments from the RL generalization benchmark Procgen. The experimental results indicate that RAIR outperforms PPO and other data augmentation-based approaches under the standard evaluation protocol. |
Nasik Muhammad Nafi · William Hsu 🔗 |
-
|
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
(
Spotlight
)
link »
The landscape of publicly available vision foundation models (VFMs), such as CLIP and SAM, is expanding rapidly. VFMs are endowed with distinct capabilities stemming from their pretraining objectives. For instance, CLIP excels in semantic understanding, while SAM specializes in spatial understanding for segmentation. In this work, we introduce a simple recipe based on multi-task distillation to efficiently merge VFMs into a unified model that assimilates their expertise. By applying our method to SAM and CLIP, we derive SAM-CLIP: a unified model that amalgamates the strengths of SAM and CLIP into a single backbone, making it apt for edge device applications. We show that SAM-CLIP learns richer visual representations, equipped with both localization and semantic features, suitable for a broad range of vision tasks. We further show that SAM-CLIP not only retains the foundational strengths of its precursor models but also introduces \emph{synergistic functionalities}, most notably in zero-shot semantic segmentation, where SAM-CLIP establishes new state-of-the-art results. It outperforms previous models that are specifically designed for this task by a large margin, including +6.8\% and +5.9\% mean IoU improvement on Pascal-VOC and COCO-Stuff datasets, respectively. |
Haoxiang Wang · Pavan Kumar Anasosalu Vasu · Fartash Faghri · Raviteja Vemulapalli · Mehrdad Farajtabar · Sachin Mehta · Mohammad Rastegari · Oncel Tuzel · Hadi Pouransari 🔗 |
-
|
Comparing neural models using their perceptual discriminability predictions
(
Spotlight
)
link »
A variety of methods have been developed to compare models of visual representation. However, internal representations are not uniquely identifiable from perceptual measurements: different representations can generate identical perceptual predictions, and dissimilar model representations (according to existing model comparison methods) do not guarantee dissimilar perceptual predictions. Here, we generalize a previous method (“eigendistortions” - Berardino et al, 2017) to compare models based on their metric tensors. Metric tensors characterize a model’s sensitivity to stimulus perturbations, reflecting both the geometric and stochastic properties of the representation, and providing an explicit prediction of perceptual discriminability. Brute-force comparison of model-predicted metric tensors using human perceptual thresholds would require an impossibly large set of measurements, since one needs to perturb a stimulus in all possible orthogonal directions. To circumvent this “perceptual curse of dimensionality”, we compute and measure discrimination capabilities for a small set of most-informative perturbations, reducing the measurement cost from thousands of hours (a conservative estimate) to a single trial. We show that this single measurement, made for a variety of different test stimuli, is sufficient to differentiate models, select models that better match human perception, or generate new models that combine the advantages of both. We demonstrate the power of this method in assessing two examples: 1) comparing models for color discrimination; 2) comparing autoencoders trained with different regularizers. |
Jingyang Zhou · Chanwoo Chun · Ajay Subramanian · Eero Simoncelli 🔗 |
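A simplified sketch of the eigendistortion idea for a deterministic model: take the metric tensor J^T J at a stimulus, where J is the Jacobian of the representation, and read off its top eigenvector as the model-predicted most-discriminable perturbation. The toy network below stands in for an actual perceptual model.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 64))
x = torch.randn(16)                                # the test stimulus
J = torch.autograd.functional.jacobian(model, x)   # (64, 16) Jacobian at x
metric = J.T @ J                                   # metric tensor at x
eigvals, eigvecs = torch.linalg.eigh(metric)
most_discriminable = eigvecs[:, -1]                # top-eigenvalue direction
least_discriminable = eigvecs[:, 0]
print(eigvals[-1] / eigvals[0])                    # predicted sensitivity ratio
```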
-
|
Semi-Ensemble: A Simple Approach Over-parameterize Model Interpolation
(
Spotlight
)
link »
We develop a unified framework for interpolating two models with various degrees of over-parameterization, having model merging and model ensemble as special cases. Instead of directly interpolating models in their original parameter space, the proposed Semi-Ensemble interpolates the over-parameterized versions of the models in a higher-dimensional joint parameter space. Here, the over-parameterizations recover each endpoint model when projected to some low-dimensional subspace spanned by a fraction of bases. By carefully constructing the joint parameter space, the interpolated model can achieve a smooth tradeoff between the total number of parameters and the model accuracy, outperforming existing baselines. Intriguingly, we show that Semi-ensembles can sometimes achieve a better performance than vanilla ensembles, even with a slightly smaller number of parameters. |
Jiwoon Lee · Jaeho Lee 🔗 |
-
|
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
(
Spotlight
)
link »
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models. While CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This paper studies the following question: Can we augment CLIP training with task-specific vision models from model zoos to improve its visual representations? Towards this end, we leverage open-source task-specific vision models to generate pseudo-labels for an uncurated web-scale image-text dataset. Subsequently, we train CLIP models on these pseudo-labels in addition to the contrastive training on image and text pairs. This simple setup shows substantial improvements of up to 16.3\% across different vision tasks, including segmentation, detection, depth estimation, and surface normal estimation. Importantly, these enhancements are achieved without compromising CLIP's existing capabilities, including its proficiency in promptable zero-shot classification. |
Mohammadreza (Reza) Salehi · Mehrdad Farajtabar · Maxwell Horton · Fartash Faghri · Hadi Pouransari · Raviteja Vemulapalli · Oncel Tuzel · Ali Farhadi · Mohammad Rastegari · Sachin Mehta 🔗 |
-
|
Understanding Learning Dynamics of Neural Representations via Feature Visualization at Scale
(
Spotlight
)
link »
How does feature learning happen during the training of a neural network? We developed an accelerated pipeline to synthesize maximally activating images ("prototypes") for hidden units in a parallel fashion. Through this, we were able to perform feature visualization at scale, and to track the emergence and development of visual features across the training of neural networks. Using this technique, we studied the `developmental' process of features in a convolutional neural network trained from scratch using SimCLR with or without color jittering augmentation. After creating over one million prototypes with our method, tracking and comparing these visual signatures showed that the color-jittering augmentation led to constantly diversifying high-level features during training, while no color-jittering led to more diverse low-level features but less development of high-level features. These results illustrate how feature visualization can be used to understand training dynamics under different training objectives and data distributions. |
Chandana Kuntala · Deepak Sharma · Carlos Ponce · Binxu Wang 🔗 |
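A bare-bones version of prototype synthesis by activation maximization, i.e. gradient ascent on the input; the paper's accelerated, parallel pipeline is far more elaborate, so treat this as a conceptual sketch only.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 16, 3, padding=1), torch.nn.ReLU())
unit = 5                                            # channel to visualize
img = torch.randn(1, 3, 64, 64, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for step in range(200):
    opt.zero_grad()
    activation = model(img)[0, unit].mean()         # mean response of channel
    (-activation).backward()                        # ascend the activation
    opt.step()
print(float(model(img)[0, unit].mean()))            # `img` is the prototype
```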
-
|
Unsupervised learning on spontaneous retinal activity leads to efficient neural representation geometry
(
Spotlight
)
link »
Prior to the onset of vision, neurons in the developing mammalian retina spontaneously fire in correlated activity patterns known as retinal waves. Experimental evidence suggests that retinal waves strongly influence the emergence of sensory representations before visual experience. We aim to model this early stage of functional development by using movies of neurally active developing retinas as pre-training data for neural networks. Specifically, we pre-train a ResNet-18 with an unsupervised contrastive learning objective (SimCLR) on both simulated and experimentally-obtained movies of retinal waves, then evaluate its performance on image classification tasks. We find that pre-training on retinal waves significantly improves performance on tasks that test object invariance to spatial translation, while slightly improving performance on more complex tasks like image classification. Notably, these performance boosts are realized on held-out natural images even though the pre-training procedure does not include any natural image data. We then propose a geometrical explanation for the increase in network performance, namely that the spatiotemporal characteristics of retinal waves facilitate the formation of separable feature representations. In particular, we demonstrate that networks pre-trained on retinal waves are more effective at separating image manifolds than randomly initialized networks, especially for manifolds defined by sets of spatial translations. These findings indicate that the broad spatiotemporal properties of retinal waves prepare networks for higher order feature extraction. |
Andrew Ligeralde · Yilun Kuang · Thomas Yerxa · Miah Pitcher · Marla Feller · SueYeon Chung 🔗 |
-
|
Estimating shape distances on neural representations with limited samples
(
Spotlight
)
link »
Measuring geometric similarity between high-dimensional network representations is a topic of longstanding interest to neuroscience and deep learning. Although many methods have been proposed, only a few works have rigorously analyzed their statistical efficiency or quantified estimator uncertainty in data-limited regimes. Here, we derive upper and lower bounds on the worst-case convergence of standard estimators of shape distance—a measure of representational dissimilarity proposed by Williams et al. (2021). These bounds reveal the challenging nature of the problem in high-dimensional feature spaces. To overcome these challenges, we introduce a novel method-of-moments estimator with a tunable bias-variance tradeoff parameterized by an upper bound on bias. We show that this estimator achieves superior performance to standard estimators in simulation and on neural data, particularly in high-dimensional settings. Our theoretical work and estimator thus respectively define and dramatically expand the scope of neural data for which geometric similarity can be accurately measured. |
Dean Pospisil · Brett Larsen · Sarah Harvey · Alex Williams 🔗 |
-
|
Growing Brains: Co-emergence of Anatomical and Functional Modularity in Recurrent Neural Networks
(
Spotlight
)
link »
Recurrent neural networks (RNNs) trained on compositional tasks can exhibit functional modularity, in which neurons can be clustered by activity similarity and specialization on a shared computational subtask. Unlike brains, these RNNs do not exhibit anatomical modularity, in which functional clustering is correlated with strong recurrent coupling and spatial localization of functional clusters. Contrasting with functional modularity, which can be ephemerally dependent on the input, anatomically modular networks form a robust substrate for solving the same subtasks in the future. To examine whether it is possible to grow brain-like anatomical modularity, we apply a recent machine learning method, brain-inspired modular training (BIMT), to a network being trained to solve a set of compositional tasks. We find that functional and anatomical clustering emerge together, such that functionally similar neurons also become spatially localized and interconnected. Moreover, compared to standard $L_1$ regularization or no regularization settings, the model exhibits superior performance by optimally balancing task performance and network sparsity. In addition to achieving brain-like organization in RNNs, our findings also suggest that BIMT holds promise for applications in neuromorphic computing and enhancing the interpretability of neural network architectures.
|
Ziming Liu · Mikail Khona · Ila Fiete · Max Tegmark 🔗 |
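A sketch of the anatomical ingredient in BIMT as we read it: assign neurons spatial coordinates and penalize each weight by its magnitude times the wire length it spans, pulling strongly coupled neurons together. The coordinates and penalty form below are illustrative assumptions, not the paper's exact training recipe.

```python
import torch

def wiring_cost(W, pos_in, pos_out):
    """W: (out, in) weights; pos_*: (n, 2) neuron coordinates in 2D space."""
    dist = torch.cdist(pos_out, pos_in)     # wire length of each connection
    return (W.abs() * dist).sum()           # length-weighted L1 penalty

W = torch.randn(32, 32, requires_grad=True)
pos_in = torch.rand(32, 2)                  # neurons laid out in a 2D sheet
pos_out = torch.rand(32, 2)
loss = wiring_cost(W, pos_in, pos_out)      # add to the task loss in training
loss.backward()
print(loss.item())
```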
-
|
Towards Measuring Representational Similarity of Large Language Models
(
Spotlight
)
link »
Understanding the similarity of the numerous large language models released has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need for careful study of similarity scores to avoid false conclusions. |
Max Klabunde · Mehdi Ben Amor · Michael Granitzer · Florian Lemmerich 🔗 |
-
|
Bio-inspired parameter reuse: Exploiting inter-frame representation similarity with recurrence for accelerating temporal visual processing
(
Spotlight
)
link »
Feedforward neural networks are the dominant approach in current computer vision research. They typically do not incorporate recurrence, which is a prominent feature of biological vision brain circuitry. Inspired by biological findings, we introduce $\textbf{RecSlowFast}$, a recurrent slow-fast framework aimed at showing how recurrence can be useful for temporal visual processing. We perform a variable number of recurrent steps of certain layers in a network receiving input video frames, where each recurrent step is equivalent to a feedforward layer with weight reuse. By harnessing the hidden states extracted from the previous input frame, we reduce the computation cost by executing fewer recurrent steps on temporally correlated consecutive frames, while keeping good task accuracy. The early termination of the recurrence can be dynamically determined through newly introduced criteria based on the distance between hidden states and without using any auxiliary scheduler network. RecSlowFast $\textbf{reuses a single set of parameters}$, unlike previous work which requires one computationally heavy network and one light network, to achieve the speed versus accuracy trade-off. Using a new $\textit{Temporal Pathfinder}$ dataset proposed in this work, we evaluate RecSlowFast on a task to continuously detect the longest evolving contour in a video. The slow-fast inference mechanism speeds up the average frames per second by 279% on this dataset with comparable task accuracy using a desktop GPU. We further demonstrate a similar trend on CamVid, a video semantic segmentation dataset.
|
Zuowen Wang · Longbiao Cheng · Joachim Ott · Pehuen Moure · Shih-Chii Liu 🔗 |
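A sketch of the weight-reuse-with-early-termination idea described above: run recurrent steps with a single set of weights and stop once consecutive hidden states stop changing. The update rule and the threshold are toy assumptions, not the paper's criteria.

```python
import numpy as np

def recurrent_forward(x, W, U, max_steps=20, tol=1e-3):
    """Iterate one recurrent layer (same weights reused every step) until the
    hidden state settles; returns the state and the number of steps used."""
    h = np.zeros(W.shape[0])
    for step in range(max_steps):
        h_new = np.tanh(W @ h + U @ x)
        if np.linalg.norm(h_new - h) < tol:  # hidden state has settled
            return h_new, step + 1
        h = h_new
    return h, max_steps

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(32, 32))          # contractive, so it settles fast
U = rng.normal(size=(32, 16))
h, steps = recurrent_forward(rng.normal(size=16), W, U)
print(steps)                                 # fewer steps on "easy" inputs
```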
-
|
UniCat: Crafting a Stronger Fusion Baseline for Multimodal Re-Identification
(
Spotlight
)
link »
Multimodal Re-Identification (ReID) is a popular retrieval task that aims to re-identify objects across diverse data streams, prompting many researchers to integrate multiple modalities into a unified representation. While such fusion promises a holistic view, our investigations shed light on potential pitfalls. We uncover that prevailing late-fusion techniques often produce suboptimal latent representations when compared to methods that train modalities in isolation. We argue that this effect is largely due to the inadvertent relaxation of the training objectives on individual modalities when using fusion, what others have termed modality laziness. We present a nuanced point-of-view that this relaxation can lead to certain modalities failing to fully harness available task-relevant information, and yet, offers a protective veil to noisy modalities, preventing them from overfitting to task-irrelevant data. Our findings also show that unimodal concatenation (UniCat) and other late-fusion ensembling of unimodal backbones, when paired with best-known training techniques, exceed the current state-of-the-art performance across several multimodal ReID benchmarks. By unveiling the double-edged sword of "modality laziness", we motivate future research in balancing local modality strengths with global representations. |
Jennifer Crawford · Haoli Yin · Luke McDermott · Daniel Cummings 🔗 |
-
|
Testing Assumptions Underlying a Unified Theory for the Origin of Grid Cells
(
Spotlight
)
link »
Representing and reasoning about physical space is fundamental to animal survival, and the mammalian lineage expresses a wealth of specialized neural representations that encode space. Grid cells, whose discovery earned a Nobel prize, are a striking example: a grid cell is a neuron that fires if and only if the animal is spatially located at the vertices of a regular triangular lattice that tiles all explored two-dimensional environments. Significant theoretical work has gone into understanding why mammals have learned these particular representations, and recent work has proposed a ``unified theory for the computational and mechanistic origin of grid cells," claiming to answer why the mammalian lineage has learned grid cells. However, the Unified Theory makes a series of highly specific assumptions about the target readouts of grid cells - putatively place cells. In this work, we explicitly identify what these mathematical assumptions are, then test two of the critical assumptions using biological place cell data. At both the population and single-cell levels, we find evidence suggesting that neither of the assumptions are likely true in biological neural representations. These results call the Unified Theory into question, suggesting that biological grid cells likely have a different origin than those obtained in trained artificial neural networks. |
Rylan Schaeffer · Mikail Khona · Adrian Bertagnoli · Sanmi Koyejo · Ila Fiete 🔗 |
-
|
Understanding Mode Connectivity via Parameter Space Symmetry
(
Spotlight
)
link »
It has been observed that the global minima of neural networks are connected by curves on which train and test loss are almost constant. This phenomenon, often referred to as mode connectivity, has inspired various applications such as model ensembling and fine-tuning. However, despite empirical evidence, a theoretical explanation is still lacking. We explore the connectedness of minima through a new approach, parameter space symmetry. By relating the topology of symmetry groups to the topology of minima, we give the number of connected components of the minima of full-rank linear networks. In particular, we show that skip connections reduce the number of connected components. We then prove mode connectivity up to permutation for linear networks. We also provide explicit expressions for connecting curves in the minima induced by symmetry. |
Bo Zhao · Nima Dehmamy · Robin Walters · Rose Yu 🔗 |
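For orientation, a sketch of how mode connectivity is usually probed empirically (a generic recipe, not this paper's theory): evaluate the loss along the straight line between two trained parameter vectors and look for a barrier.

```python
import copy
import torch

def loss_along_line(model_a, model_b, loss_fn, data, n_points=11):
    losses = []
    probe = copy.deepcopy(model_a)
    sa, sb = model_a.state_dict(), model_b.state_dict()
    for t in torch.linspace(0, 1, n_points):
        mixed = {k: (1 - t) * sa[k] + t * sb[k] for k in sa}  # interpolate
        probe.load_state_dict(mixed)
        with torch.no_grad():
            x, y = data
            losses.append(loss_fn(probe(x), y).item())
    return losses  # a bump in the middle indicates a loss barrier

net_a, net_b = torch.nn.Linear(4, 1), torch.nn.Linear(4, 1)  # toy endpoints
x, y = torch.randn(64, 4), torch.randn(64, 1)
print(loss_along_line(net_a, net_b, torch.nn.functional.mse_loss, (x, y)))
```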
-
|
Channel Vision Transformers: An Image Is Worth C x 16 x 16 Words
(
Spotlight
)
link »
Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a learnable channel embedding that is added to the patch tokens, similar to positional embeddings. We evaluate the performance of ChannelViT on ImageNet, JUMP-CP (microscopy cell imaging), and So2Sat (satellite imaging). Our results show that ChannelViT outperforms ViT on classification tasks and generalizes well, even when a subset of input channels is used during testing. Across our experiments, HCS proves to be a powerful regularizer, independent of the architecture employed, suggesting itself as a straightforward technique for robust ViT training. Lastly, we find that ChannelViT generalizes effectively even when there is limited access to all channels during training, highlighting its potential for multi-channel imaging under real-world conditions with sparse sensors. |
Yujia Bao · Srinivasan Sivanandan · Theofanis Karaletsos 🔗 |
-
|
DisCoV: Disentangling Time Series Representations via Contrastive based $l$-Variational Inference
(
Spotlight
)
link »
Learning disentangled representations is crucial for time series, offering benefits like feature derivation and improved interpretability, thereby enhancing task performance. We focus on disentangled representation learning for home appliance electricity usage, enabling users to understand and optimize their consumption for a reduced carbon footprint. Our approach frames the problem as disentangling each attribute's role in total consumption (e.g., dishwashers, fridges, etc.). Unlike existing methods assuming attribute independence, we acknowledge real-world time series attribute correlations, like the operation of dishwashers and washing machines during winter. To tackle this, we employ weakly supervised contrastive disentanglement, facilitating representation generalization across diverse correlated scenarios and new households. Our method utilizes innovative $l$-variational inference layers with self-attention, effectively addressing temporal dependencies across bottom-up and top-down networks. We find that DisCoV (Disentangling via Contrastive $l$-Variational) can enhance the task of reconstructing electricity consumption for individual appliances. We introduce TDS (Time Disentangling Score) to gauge disentanglement quality. TDS reliably reflects disentanglement performance, making it a valuable metric for evaluating time series representations. Code available at https://anonymous.4open.science/r/DisCo.
|
Khalid OUBLAL · Said Ladjal · David Benhaiem · Emmanuel LE BORGNE · François Roueff 🔗 |
-
|
Mixture of Multimodal Interaction Experts
(
Spotlight
)
link »
Multimodal machine learning, which studies the information and interactions across various input modalities, has made significant advancements in understanding the relationship between images and descriptive text. Yet, this is just a portion of the potential multimodal interactions in the real world, such as sarcasm in conflicting utterances and gestures. Notably, the current methods for capturing this shared information often don't extend well to these more nuanced interactions. Current models, in fact, show particular weaknesses with disagreement and synergistic interactions, sometimes performing as low as 50\% in binary classification. In this paper, we address this problem via a new approach called mixture of multimodal interaction experts. This method automatically classifies datapoints from an unlabeled multimodal dataset by their interaction types, then employs specialized models for each specific interaction. Based on our experiments, this approach has improved performance on these challenging interactions by more than 10%, leading to an overall increase of 2% for tasks like sarcasm prediction. As a result, not only does interaction quantification provide new insights for dataset analysis, but it also yields simple approaches that obtain state-of-the-art performance. |
Haofei Yu · Paul Pu Liang · Russ Salakhutdinov · Louis-Philippe Morency 🔗 |
-
|
Inverted-Attention Transformers can Learn Object Representations: Insights from Slot Attention
(
Spotlight
)
link »
Visual reasoning is supported by a causal understanding of the physical world, and theories of human cognition suppose that a necessary step to causal understanding is the discovery and representation of high-level entities like objects. Slot Attention is a popular method aimed at object-centric learning, and its popularity has resulted in dozens of variants and extensions. To help understand the core assumptions that lead to successful object-centric learning, we take a step back and identify the minimal set of changes to a standard Transformer architecture to obtain the same performance as the specialized Slot Attention models. We systematically evaluate the performance and scaling behaviour of several ``intermediate'' architectures on seven image and video datasets from prior work. Our analysis reveals that by simply inverting the attention mechanism of Transformers, we obtain performance competitive with state-of-the-art Slot Attention in several domains. |
Yi-Fu Wu · Klaus Greff · Gamaleldin Elsayed · Michael Mozer · Thomas Kipf · Sjoerd van Steenkiste 🔗 |
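A toy sketch of the inversion the abstract refers to, as we read it: standard attention normalizes scores over keys, so each query's weights sum to one, while Slot-Attention-style inverted attention normalizes over queries, making slots compete for input tokens. No learned projections, toy shapes.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, inverted=False):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])      # (n_queries, n_tokens)
    axis = 0 if inverted else 1                  # normalize over queries vs. keys
    W = softmax(scores, axis=axis)
    if inverted:
        W = W / W.sum(axis=1, keepdims=True)     # weighted-mean readout per slot
    return W @ V

slots = np.random.randn(4, 32)                   # 4 slot queries
tokens = np.random.randn(100, 32)                # 100 input tokens
print(attention(slots, tokens, tokens).shape)                 # standard: (4, 32)
print(attention(slots, tokens, tokens, inverted=True).shape)  # inverted: (4, 32)
```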
-
|
Increasing Brain-LLM Alignment via Information-Theoretic Compression
(
Spotlight
)
link »
Recent work has discovered similarities between learned representations in large language models (LLMs) and human brain activity during language processing. However, it remains unclear what information LLM and brain representations share. In this work, inspired by a notion that brain data may include information not captured by LLMs, we apply an information bottleneck method to generate compressed representations of fMRI data. For certain brain regions in the frontal cortex, we find that compressing brain representations by a small amount increases their similarity to both BERT and GPT2 embeddings. Thus, our method not only improves LLM-brain alignment scores but also suggests important characteristics about the amount of information captured by each representation scheme. |
Mycal Tucker · Greta Tuckute 🔗 |
-
|
NoPose-NeuS: Jointly Optimizing Camera Poses with Neural Implicit Surfaces for Multi-view Reconstruction
(
Spotlight
)
link »
Learning neural implicit surfaces from volume rendering has become popular for multi-view reconstruction. Neural surface reconstruction approaches can recover complex 3D geometry that is difficult for classical Multi-view Stereo (MVS) approaches, such as non-Lambertian surfaces and thin structures. However, one key assumption for these methods is knowing accurate camera parameters for the input multi-view images, which are not always available. In this paper, we present NoPose-NeuS, a neural implicit surface reconstruction method that extends NeuS to jointly optimize camera poses with the geometry and color networks. We encode the camera poses as a multi-layer perceptron (MLP) and introduce two additional losses, which are multi-view feature consistency and rendered depth losses, to constrain the learned geometry for better estimated camera poses and scene surfaces. Extensive experiments on the DTU dataset show that the proposed method can estimate relatively accurate camera poses, while maintaining a high surface reconstruction quality with 0.89 mean Chamfer distance. |
Mohamed Sabae · Hoda Baraka · Mayada Hadhoud 🔗 |
-
|
Functional Modularity in Mind and Machine
(
Spotlight
)
link »
Modularity is a well-established and foundational organisational principle of the brain. Neural modules are composed of neurons that are selective to particular sensory inputs or situations and tend to be organised in close proximity. Yet it is difficult to determine which neurons are coupled to implement a neural module, and consequently to establish what exactly a neural module is selective for. In both cases this is due to the difference between functional and architectural modularity. Architectural modularity results from the explicit connections between neurons in a network: neurons which are connected form a module, and the physical module can be probed to determine what it is selective for. Functional modularity, however, is only detectable in the behaviour of a subset of neurons in the network, and has no explicit pressure forcing its emergence beyond the learning algorithm interacting with the statistics of sensory experience. Thus, while we understand how broad regions of the brain are connected, more nuance is still required to obtain a better understanding of the degree of modularity. This problem is not limited to biological neural networks; it applies to artificial ones as well. ReLU networks, for example, can switch off regions of the hidden layer depending on the input data being presented. However, what each hidden neuron is selective for, which hidden neurons are functionally coupled, and the meso-scale behaviour of the hidden layer are not well understood. In this work, we begin to understand the emergence and behaviour of functional neural modules in both ReLU and biological neural networks. We achieve this by drawing an equivalence between Gated Deep Linear Networks (GDLNs) and the respective networks, mapping functional neural modules onto architectural modules of the GDLN. Through the lens of the GDLN we obtain a number of insights into how information is distributed in artificial and biological brains to support context-sensitive controlled semantic cognition. |
Devon Jarvis · Richard Klein · Benjamin Rosman · Andrew Saxe 🔗 |
-
|
Disentangling Linear Mode Connectivity
(
Spotlight
)
link »
Linear mode connectivity (LMC), or the lack thereof, is one of the intriguing characteristics of neural network loss landscapes. While empirically well established, it still lacks a proper theoretical understanding. Worse, although empirical data points abound, a systematic study of when networks exhibit LMC is largely missing from the literature. In this work we aim to close this gap. We explore how LMC is affected by three factors: (1) architecture (sparsity, weight-sharing), (2) training strategy (optimization setup), and (3) the underlying dataset. We place particular emphasis on minimal but non-trivial settings, removing as much unnecessary complexity as possible. We believe that our insights can guide future theoretical work on uncovering the inner workings of LMC. |
Gül Sena Altıntaş · Gregor Bachmann · Lorenzo Noci · Thomas Hofmann 🔗 |
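As a reference point for the kind of measurement involved, here is a minimal sketch, under the assumption that both models share an architecture and floating-point parameters, of probing LMC by evaluating the loss along the straight line between two weight vectors:

```python
import copy
import torch

@torch.no_grad()
def loss_barrier(model_a, model_b, loss_fn, data, n_points=11):
    """Height of the loss barrier on the segment between two trained models."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for alpha in torch.linspace(0, 1, n_points):
        probe.load_state_dict({k: (1 - alpha) * sd_a[k] + alpha * sd_b[k]
                               for k in sd_a})
        x, y = data
        losses.append(loss_fn(probe(x), y).item())
    return max(losses) - (losses[0] + losses[-1]) / 2  # ~0 indicates LMC
```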
-
|
Model Merging by Gradient Matching
(
Spotlight
)
link »
Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. |
Nico Daheim · Thomas Möllenhoff · Edoardo Maria Ponti · Iryna Gurevych · Mohammad Emtiyaz Khan 🔗 |
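For reference, the weighted-averaging baseline the paper starts from can be written in a few lines; this is a generic sketch, with uniform weights shown as a placeholder for the uncertainty-based weights the method would compute.

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Merge models trained on different datasets by weighted parameter averaging."""
    assert abs(sum(weights) - 1.0) < 1e-6
    return {key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# e.g. uniform weights over three checkpoints sharing one architecture:
# merged = merge_state_dicts([m1.state_dict(), m2.state_dict(), m3.state_dict()],
#                            [1/3, 1/3, 1/3])
```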
-
|
Object-Centric Semantic Vector Quantization
(
Spotlight
)
link »
Neural discrete representations are crucial components of modern neural networks. However, their main limitation is that primary strategies such as VQ-VAE can only provide representations at the patch level. Therefore, one of the main goals of representation learning, acquiring conceptual, semantic, and compositional abstractions such as the color and shape of an object, remains elusive. In this paper, we present the first approach to semantic neural discrete representation learning. The proposed model, called Semantic Vector-Quantized Variational Autoencoder (SVQ), leverages recent advances in unsupervised object-centric learning to address this limitation. Specifically, we observe that a simple approach that quantizes at the object level poses a significant challenge, and we propose constructing scene representations hierarchically, from low-level discrete concept schemas to object representations. Additionally, we suggest a novel method for training a prior over these semantic representations, enabling the generation of images that follow the underlying data distribution, a capability lacking in most object-centric models. In experiments on various 2D and 3D object-centric datasets, we find that our model achieves superior generation performance compared to non-semantic vector quantization methods such as VQ-VAE and previous object-centric generative models. Furthermore, we find that the semantic discrete representations can solve downstream scene understanding tasks that require reasoning about the properties of different objects in the scene. |
Yi-Fu Wu · Minseung Lee · Sungjin Ahn 🔗 |
-
|
Evaluation of Representational Similarity Scores Across Human Visual Cortex
(
Spotlight
)
link »
We investigate several popular methods for quantifying the similarity between neural representations applied to a large-scale fMRI dataset of human ventral visual cortex. We focus on representational geometry as a framework for comparing various functionally-defined high-level regions of interest (ROIs) in the ventral stream. We benchmark Representational Similarity Analysis, Centered Kernel Alignment, and Generalized Shape Metrics. We explore how well the geometry implied by pairwise representational dissimilarity scores produced by each method matches the 2D anatomical geometry of visual cortex. Our results suggest that while these methods yield similar outcomes, Shape Metrics provide distances between representations whose relation to the anatomical geometry is most invariant across subjects. Our work establishes a criterion with which to compare methods for quantifying representational similarity with implications for studying the anatomical organization of high-level ventral visual cortex. |
Francisco Acosta · Colin Conwell · David Klindt · Nina Miolane 🔗 |
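As an illustration of the family of scores being benchmarked, here is a minimal sketch of linear Centered Kernel Alignment, one of the three compared methods; `X` and `Y` are assumed to be stimulus-by-unit response matrices from two ROIs or models.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two response matrices of shape (stimuli, units)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

X = np.random.randn(200, 50)                 # 200 stimuli, 50 voxels/units
Q, _ = np.linalg.qr(np.random.randn(50, 50)) # random orthogonal transform
print(linear_cka(X, X @ Q))                  # ~1.0: CKA ignores orthogonal transforms
```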
-
|
Supervising Variational Autoencoder Latent Representations with Language
(
Spotlight
)
link »
Supervising latent representations of data is of great interest for modern multi-modal generative machine learning. In this work, we propose two new methods to use text to condition the latent representations of a VAE, and evaluate them on a novel conditional image-generation benchmark task. We find that the applied methods can be used to generate highly accurate reconstructed images through language querying with minimal compute resources. Our methods are quantitatively successful at conforming to textually-supervised attributes of an image while keeping unsupervised attributes constant. More broadly, we present critical observations on the disentanglement between supervised and unsupervised properties of images and identify common barriers to effective disentanglement. |
Thomas Lu · Aboli Marathe · Ada Martin 🔗 |
-
|
Implicit Representations for Image Segmentation
(
Spotlight
)
link »
Image segmentation has greatly advanced over the past ten years. Yet, even the most recent techniques face difficulties producing good results in challenging situations, e.g., if training data are scarce, out-of-distribution examples need to be segmented, or objects are occluded. In such situations, the inclusion of (geometric) constraints can improve segmentation quality significantly. In this paper, we study the constraint that the segmented region be convex. Unlike prior work that encourages such a property with computationally expensive penalties on segmentation masks represented explicitly on a grid of pixels, our work is the first to consider an implicit representation. Specifically, we represent the segmentation as a parameterized function that maps spatial coordinates to the likelihood of a pixel belonging to the fore- or background. This enables us to provably ensure the convexity of the segmented regions with the help of input convex neural networks. Numerical experiments demonstrate how to encourage explicit and implicit representations to match in order to benefit from the convexity constraints in several challenging segmentation scenarios. |
Jan Philipp Schneider · Mishal Fatima · Jovita Lukasik · Andreas Kolb · Margret Keuper · Michael Moeller 🔗 |
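To illustrate the guarantee, here is a minimal sketch of an input convex network over pixel coordinates; this is a generic ICNN, assumed rather than taken from the paper. Because the coordinate-to-score map is convex, any sublevel set {x : f(x) <= t}, e.g. the foreground, is provably convex.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICNN(nn.Module):
    """Input convex network: a convex map from 2D coordinates to a scalar score."""
    def __init__(self, dim=2, hidden=64, depth=3):
        super().__init__()
        self.x_layers = nn.ModuleList(nn.Linear(dim, hidden) for _ in range(depth))
        self.z_layers = nn.ModuleList(nn.Linear(hidden, hidden, bias=False)
                                      for _ in range(depth - 1))
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        z = F.relu(self.x_layers[0](x))
        for xl, zl in zip(self.x_layers[1:], self.z_layers):
            # non-negative weights on the z-path preserve convexity in x
            z = F.relu(xl(x) + F.linear(z, zl.weight.clamp(min=0)))
        return F.linear(z, self.out.weight.clamp(min=0), self.out.bias)

coords = torch.rand(1000, 2)   # pixel coordinates in [0, 1]^2
mask = ICNN()(coords) <= 0.0   # a provably convex candidate foreground region
print(mask.shape)              # torch.Size([1000, 1])
```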
-
|
Ecological data and objectives align deep neural network representations with humans
(
Spotlight
)
link »
The many successes of deep neural networks (DNNs) over the past decade have largely been driven by computational scale rather than insights from biological intelligence. While DNNs have nevertheless been surprisingly adept at explaining behavioral and neural recordings from humans, there is a growing number of reports indicating that DNNs are becoming progressively worse models of human vision as they improve on standard computer vision benchmarks. Here, we provide evidence that one path towards improving the alignment of DNNs with human vision is to train them with data and objective functions that more closely resemble those relied on by brains. We find that DNNs trained to capture the causal structure of large spatiotemporal object datasets learn generalizable object representations that exhibit smooth equivariance to 3-Dimensional (out-of-plane) variations in object pose and are predictive of human decisions and reaction times on popular psychophysics stimuli. Our work identifies novel data diets and objective functions that better align DNN vision with humans and can be easily scaled to generate the next generation of DNNs that behave as humans do. |
Akash Nagaraj · Alekh Karkada Ashok · Drew Linsley · Francis Lewis · Peisen Zhou · Thomas Serre 🔗 |
-
|
Visual Expertise Explains Image Inversion Effects
(
Spotlight
)
link »
We present an anatomically-inspired neurocomputational model, including a foveated retina and the log-polar mapping from the visual field to the primary visual cortex, that recreates image inversion effects long seen in psychophysical studies. We show that visual expertise, the ability to discriminate between subordinate-level categories, changes the performance of the model on inverted images. We first explore face discrimination, which, in humans, relies on configural information. The log-polar transform disrupts configural information in an inverted image and leaves featural information relatively unaffected. We suggest this is responsible for the degradation of performance with inverted faces. We then recreate the effect with other subordinate-level category discriminators and show that the inversion effect arises as a result of visual expertise, where configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which in turn are more disrupted than non-mono-oriented objects when objects are only familiar at a basic level. It simultaneously shows that expert-level discrimination of other subordinate-level categories responds to inversion in the same way as face expertise. |
Martha Gahl · Shubham Kulkarni · Nikhil Pathak · Alex Russell · Gary Cottrell 🔗 |
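For readers unfamiliar with the retinotopic mapping involved, the following is a minimal sketch of a log-polar resampling of an image about a fixation point; the sampling resolution and nearest-neighbour lookup are simplifying assumptions. Rotations and scalings about the center become translations, while relative (configural) spatial arrangements are distorted.

```python
import numpy as np

def log_polar(image, center, n_rho=64, n_theta=64):
    """Resample a grayscale image onto a (log-radius, angle) grid."""
    h, w = image.shape
    max_r = np.hypot(h, w) / 2
    rho = np.exp(np.linspace(0.0, np.log(max_r), n_rho))  # log-spaced radii
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(rho, theta, indexing="ij")
    ys = np.clip((center[0] + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip((center[1] + rr * np.cos(tt)).astype(int), 0, w - 1)
    return image[ys, xs]

img = np.random.rand(128, 128)
print(log_polar(img, center=(64, 64)).shape)  # (64, 64)
```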
-
|
How Good is a Single Basin?
(
Spotlight
)
link »
The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in the same basin. Through our experiments, we demonstrate that increased connectivity indeed negatively impacts performance. However, when incorporating the knowledge from other basins implicitly through distillation, we show that the gap in performance can be mitigated by re-discovering (multi-basin) deep ensembles in a single basin. Thus, we conjecture that while the extra-basin knowledge is at least partially present in any given basin, it cannot be easily harnessed without learning it from other basins. |
Kai Lion · Gregor Bachmann · Lorenzo Noci · Thomas Hofmann 🔗 |
-
|
Subjective Randomness and In-Context Learning
(
Spotlight
)
link »
Large language models (LLMs) exhibit intricate capabilities, often achieving high performance on tasks they were not explicitly trained for. The precise nature of LLM capabilities is often unclear, with different prompts eliciting different capabilities, especially when used with in-context learning (ICL). We propose a "Cognitive Interpretability" framework that enables us to analyze ICL dynamics to understand latent concepts underlying LLMs' behavioral patterns. This provides a more nuanced understanding than posthoc evaluation benchmarks, but does not require observing model internals as a mechanistic interpretation would require. Inspired by the cognitive science of human randomness perception, we use random binary sequences as context and study dynamics of ICL by manipulating properties of context data, such as sequence length. In the latest GPT-3.5+ models, we find emergent abilities to generate pseudo-random numbers and learn basic formal languages, with striking ICL dynamics where model outputs transition sharply from pseudo-random behaviors to deterministic repetition. |
Eric Bigelow · Ekdeep S Lubana · Robert Dick · Hidenori Tanaka · Tomer Ullman 🔗 |
-
|
Deep Multimodal Emotion Recognition using Modality Aware Attention Network for Unifying Representations in Neural Models
(
Spotlight
)
link »
This paper introduces a multi-modal emotion recognition system aimed at enhancing emotion recognition by integrating representations from physiological signals. To accomplish this goal, we introduce a modality aware attention network to extract emotion-specific features by influencing and aligning the representation spaces of various modalities into a unified entity. Through a series of experiments and visualizations conducted on the AMIGO dataset, we demonstrate the efficacy of our proposed methodology for emotion classification, highlighting its capability to provide comprehensive representations of physiological signals. |
Sungpil Woo · MUHAMMAD ZUBAIR · Sunhwan Lim · Daeyoung Kim 🔗 |
-
|
On Feature Learning of Recursive Feature Machines and Automatic Relevance Determination
(
Spotlight
)
link »
Feature learning is a crucial element for the performance of machine learning models. Recently, the exploration of feature learning in the context of kernel methods has led to the introduction of Recursive Feature Machines (RFMs). In this work, we connect diagonal RFMs to Automatic Relevance Determination (ARD) from the Gaussian process literature. We demonstrate that diagonal RFMs, similar to ARD, serve as a weighted covariate selection technique. However, they are trained using different paradigms: RFMs use recursive iterations of the so-called Average Gradient Outer Product, while ARD employs maximum likelihood estimation. Our experiments show that while the learned features in both models correlate highly across various tabular datasets, this correlation is lower for other datasets. Furthermore, we demonstrate that the RFM effectively captures correlation between covariates, and we present instances where the RFM outperforms both ARD and diagonal RFM. |
Daniel Gedon · Amirhesam Abedsoltan · Thomas Schön · Misha Belkin 🔗 |
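To ground the comparison, here is a minimal sketch of the Average Gradient Outer Product at the heart of RFM training; the toy predictor and its closed-form gradient are assumptions for illustration. The diagonal of M plays a role analogous to ARD's learned relevance weights.

```python
import numpy as np

def agop(grad_fn, X):
    """Average Gradient Outer Product; grad_fn(x) returns the predictor's gradient."""
    grads = np.stack([grad_fn(x) for x in X])  # (n, d)
    return grads.T @ grads / len(X)            # (d, d) feature matrix M

# toy predictor f(x) = (w . x)^2, whose gradient is 2 (w . x) w
rng = np.random.default_rng(0)
w = rng.standard_normal(5)
X = rng.standard_normal((200, 5))
M = agop(lambda x: 2 * (w @ x) * w, X)
print(np.round(np.diag(M), 2))  # diagonal up-weights covariates relevant to f
```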
-
|
Revisiting Supervision for Continual Representation Learning
(
Spotlight
)
link »
In the field of continual learning, models are designed to learn tasks one after the other. While most research has centered on supervised continual learning, recent studies have highlighted the strengths of self-supervised continual representation learning. The improved transferability of representations built with self-supervised methods is often attributed to the role played by the multi-layer perceptron projector. In this work, we depart from this observation and reexamine the role of supervision in continual representation learning. We contend that additional information, such as human annotations, should not deteriorate the quality of representations. Our findings show that supervised models, when enhanced with a multi-layer perceptron head, can outperform self-supervised models in continual representation learning. |
Daniel Marczak · Sebastian Cygert · Tomasz Trzcinski · Bartłomiej Twardowski 🔗 |
-
|
Role Taxonomy of Units in Deep Neural Networks
(
Spotlight
)
link »
Identifying the role of network units in deep neural networks (DNNs) is critical in many respects, including improving our understanding of the mechanisms of DNNs and building basic connections between deep learning and neuroscience. However, it remains unclear which roles units play in DNNs of differing generalization ability. To this end, we give a role taxonomy of units in DNNs, where units are categorized into four types in terms of their functional preference on the training set and the testing set, respectively. We show that the ratios of the four categories are highly associated with the generalization ability of DNNs from two distinct perspectives, based on which we identify signatures of DNNs that generalize well. |
Yang Zhao · Hao Zhang · Xiuyuan Hu 🔗 |
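One plausible reading of such a taxonomy, sketched below under our own assumptions since the abstract does not give the exact criterion, scores each unit by how much silencing it hurts the loss on each split and assigns one of four roles accordingly.

```python
def unit_role(delta_train, delta_test, eps=0.0):
    """delta_*: increase in loss on each split when the unit is ablated."""
    prefers_train = delta_train > eps
    prefers_test = delta_test > eps
    if prefers_train and prefers_test:
        return "shared"
    if prefers_train:
        return "train-only"
    if prefers_test:
        return "test-only"
    return "inactive"

print(unit_role(delta_train=0.3, delta_test=0.2))  # shared
```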
-
|
How does fine-tuning affect your model? Mechanistic analysis on procedural tasks
(
Spotlight
)
link »
Fine-tuning large pre-trained models has become the de facto strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in synthetic settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability in a few gradient steps. This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task. |
Samyak Jain · Robert Kirk · Ekdeep S Lubana · Robert Dick · Hidenori Tanaka · Tim Rocktäschel · Edward Grefenstette · David Krueger 🔗 |
-
|
Variational Classification
(
Spotlight
)
link »
We present variational classification (VC), a latent variable generalisation of neural network softmax classification under cross-entropy loss. Our approach provides a novel probabilistic interpretation of the highly familiar softmax classification model, relating to it as variational autoencoders relate to deterministic autoencoders. We derive a training objective based on the evidence lower bound (ELBO) that is non-trivial to optimize, and an adversarial approach to maximise it. We reveal an inherent inconsistency within softmax classification that VC addresses, while also allowing flexible choices of distributions in the latent space in place of assumptions implicit in standard softmax classifiers. Empirical evaluation demonstrates that VC maintains accuracy while improving properties such as calibration and adversarial robustness, particularly under distribution shift and in low-data settings. By explicitly considering representations learned by supervised methods, we offer the prospect of a principled merging of supervised learning with other representation learning methods, e.g. contrastive learning, using a common encoder architecture. |
Shehzaad Dhuliawala · Mrinmaya Sachan · Carl Allen 🔗 |
-
|
Hybrid Early Fusion for Multi-Modal Biomedical Representations
(
Spotlight
)
link »
Technological advances in medical data collection such as high-resolution histopathology and high-throughput genomic sequencing have contributed to the rising requirement for multi-modal biomedical modelling, specifically for image, tabular, and graph data. Most multi-modal deep learning approaches use modality-specific architectures that are trained separately and cannot capture the crucial cross-modal information that motivates the integration of different data sources. This paper presents the Hybrid Early-fusion Attention Learning Network (HEALNet) – a flexible multi-modal fusion architecture, which a) preserves modality-specific structural information, b) captures the cross-modal interactions and structural information in a shared latent space, c) can effectively handle missing modalities during training and inference, and d) enables intuitive model inspection by learning on the raw data input instead of opaque embeddings. We conduct multi-modal survival analysis on Whole Slide Images and Multi-omic data on four cancer cohorts of The Cancer Genome Atlas (TCGA). HEALNet achieves state-of-the-art performance, substantially improving over both uni-modal and recent multi-modal baselines, whilst being robust in scenarios with missing modalities. |
Konstantin Hemker · Nikola Simidjievski · Mateja Jamnik 🔗 |
-
|
Linear Mode Connectivity in Sparse Neural Networks
(
Spotlight
)
link »
With rising interest in sparse neural networks, we study how neural network pruning with synthetic data leads to sparse networks with unique training properties. We find that distilled data, a synthetic summarization of the real data, paired with Iterative Magnitude Pruning (IMP) unveils a new class of sparse networks that are more stable to SGD noise on the real data than either the dense model or subnetworks found with real data in IMP. That is, synthetically chosen subnetworks often train to the same minima, or exhibit linear mode connectivity. We study this through linear interpolation, loss landscape visualizations, and measurements of the diagonal of the Hessian. While dataset distillation as a field is still young, we find that these properties lead to synthetic subnetworks matching the performance of traditional IMP with up to 150x fewer training points in settings where distilled data applies. |
Luke McDermott · Daniel Cummings 🔗 |
-
|
SimVAE: Narrowing the gap between Discriminative & Generative Representation Learning
(
Spotlight
)
link »
Self-supervised representation learning is a powerful paradigm that leverages the relationship between semantically similar data, such as augmentations, extracts of an image or sound clip, or multiple views/modalities. Recent methods, e.g. SimCLR, CLIP and DINO, have made significant strides, yielding representations that achieve state-of-the-art results on multiple downstream tasks. Though often intuitive, a comprehensive theoretical understanding of their underlying mechanisms, and of what they learn, remains elusive. Meanwhile, generative approaches, such as variational autoencoders (VAEs), fit a specific latent variable model and have principled appeal, but lag significantly in terms of performance. We present a theoretical analysis of self-supervised discriminative methods and a graphical model that reflects the assumptions they implicitly make and unifies these methods. We show that fitting this model under an ELBO objective improves representations over previous VAE methods on several common benchmarks, narrowing the gap to discriminative methods, and can also preserve information lost by discriminative approaches. This work brings new theoretical insight to modern machine learning practice. |
Alice Bizeul · Carl Allen 🔗 |
-
|
Universality of intrinsic dimension of latent representations across models
(
Spotlight
)
link »
While state-of-the-art transformer networks use several hundred latent variables per layer, it has been shown that these features can actually be represented by relatively low-dimensional manifolds. The intrinsic dimension is a geometrical property of the manifold that latent representations populate, viz., the minimal number of parameters needed to describe the representations. In this work, we compare the intrinsic dimensions of three image transformer networks for classes of the CIFAR-10 and CIFAR-100 datasets. We find compelling evidence that the intrinsic dimensions differ among classes but are universal across networks. This universality persists across different pretraining strategies, fine-tuning, and different model sizes. Our results strengthen the hypothesis that different models learn similar representations of data and suggest that further investigation of intrinsic dimension could yield more insights into the universality of latent representations. |
Teresa Scheidt · Lars Kai Hansen 🔗 |
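For concreteness, here is a minimal sketch of the TwoNN intrinsic-dimension estimator often used in such analyses; the abstract does not name its estimator, so the choice here is an assumption. The dimension is recovered from the ratio of each point's two nearest-neighbour distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

def twonn_id(X):
    """TwoNN maximum-likelihood intrinsic dimension of the rows of X."""
    d = cdist(X, X)
    np.fill_diagonal(d, np.inf)
    r = np.sort(d, axis=1)[:, :2]        # two nearest-neighbour distances
    mu = r[:, 1] / r[:, 0]               # r2 / r1 per point
    return len(mu) / np.sum(np.log(mu))  # MLE for the dimension

X = np.random.randn(2000, 3) @ np.random.randn(3, 50)  # 3D manifold in 50D
print(round(twonn_id(X), 1))  # close to 3
```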
-
|
On consequences of finetuning on data with highly discriminative features
(
Spotlight
)
link »
In the era of transfer learning, training neural networks from scratch is becoming obsolete. Transfer learning leverages prior knowledge for new tasks, conserving computational resources. While its advantages are well-documented, we uncover a notable drawback: networks tend to prioritize basic data patterns, forsaking valuable pre-learned features. We term this behavior "feature erosion" and analyze its impact on network performance and internal representations. |
Wojciech Masarczyk · Tomasz Trzcinski · Mateusz Ostaszewski 🔗 |
-
|
SHARCS: Shared Concept Space for Explainable Multimodal Learning
(
Spotlight
)
link »
Multimodal learning is an essential paradigm for addressing complex real-world problems, where individual data modalities are typically insufficient for accurately solving a given modelling task. While various deep learning approaches have successfully addressed these challenges, their reasoning process is often opaque, limiting the capabilities for a principled, explainable cross-modal analysis and for any domain-expert intervention. In this paper, we introduce SHARCS (SHARed Concept Space), a novel concept-based approach for explainable multimodal learning. SHARCS learns and maps interpretable concepts from different heterogeneous modalities into a single unified concept-manifold, which leads to an intuitive projection of semantically similar cross-modal concepts. We demonstrate that such an approach can lead to inherently explainable task predictions while also improving downstream predictive performance. Moreover, we show that SHARCS can operate in, and significantly outperform other approaches on, practically significant scenarios such as retrieval of missing modalities and cross-modal explanations. Our approach is model-agnostic and easily applicable to different types (and numbers) of modalities, thus advancing the development of effective, interpretable, and trustworthy multimodal approaches. |
Gabriele Dominici · Pietro Barbiero · Lucie Charlotte Magister · Pietro Lió · Nikola Simidjievski 🔗 |
-
|
Sufficient conditions for offline reactivation in recurrent neural networks
(
Spotlight
)
link »
During periods of quiescence, such as sleep, neural activity in many brain circuits resembles that observed during periods of task engagement. However, the precise conditions under which task-optimized networks can autonomously reactivate the same network states responsible for online behavior are poorly understood. In this study, we develop a mathematical framework that outlines sufficient conditions for the emergence of neural reactivation in circuits that encode features of smoothly varying stimuli. We demonstrate mathematically that noisy recurrent networks optimized to track environmental state variables using change-based sensory information naturally develop denoising dynamics, which, in the absence of input, cause the network to revisit state configurations observed during periods of online activity. We validate our findings using numerical experiments on two canonical neuroscience tasks: spatial position estimation based on self-motion cues, and head direction estimation based on angular velocity cues. Overall, our work provides theoretical support for modeling offline reactivation as an emergent consequence of task optimization in noisy neural circuits. |
Nanda Krishna · Colin Bredenberg · Daniel Levenstein · Blake Richards · Guillaume Lajoie 🔗 |
-
|
On the universality of neural codes in vision
(
Spotlight
)
link »
A high level of similarity between neural codes of natural images has been reported for both biological and artificial brains. These observations beg the question whether this similarity of representations stems from a more fundamental similarity between neural coding strategies. In this paper, we show that neural networks trained on different image classification datasets learn similar weight summary statistics. Our results reveal the existence of a universal neural code for natural images. |
Florentin Guth · Brice Ménard 🔗 |
-
|
Event-Based Contrastive Learning for Medical Time Series
(
Spotlight
)
link »
In clinical practice, one often needs to identify whether a patient is at high risk of adverse outcomes after some key medical event; e.g., the short-term risk of death after an admission for heart failure. This task, however, remains challenging due to the complexity, variability, and heterogeneity of longitudinal medical data, especially for individuals suffering from chronic diseases like heart failure. In this paper, we introduce Event-Based Contrastive Learning (EBCL) - a method for learning embeddings of heterogeneous patient data that preserves temporal information before and after key index events. We demonstrate that EBCL produces models that yield better fine-tuning performance on critical downstream tasks including 30-day readmission, 1-year mortality, and 1-week length of stay relative to other representation learning methods that do not exploit temporal information surrounding key medical events. |
Hyewon Jeong · Nassim Oufattole · Aparna Balagopalan · Matthew McDermott · Payal Chandak · Marzyeh Ghassemi · Collin Stultz 🔗 |
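As an indication of how such an objective can look, here is a minimal sketch assuming an InfoNCE-style loss in which the pre-event and post-event windows of the same patient form the positive pair; the encoders producing the embeddings are omitted.

```python
import torch
import torch.nn.functional as F

def ebcl_loss(pre_emb, post_emb, temperature=0.1):
    """pre_emb, post_emb: (B, D) embeddings of windows around index events."""
    pre = F.normalize(pre_emb, dim=-1)
    post = F.normalize(post_emb, dim=-1)
    logits = pre @ post.T / temperature  # (B, B) cosine similarities
    labels = torch.arange(len(pre))      # matched pre/post pairs on the diagonal
    return F.cross_entropy(logits, labels)

pre, post = torch.randn(32, 128), torch.randn(32, 128)
print(ebcl_loss(pre, post).item())
```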
-
|
Multi-timescale reinforcement learning in the brain
(
Spotlight
)
link »
To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behavior can be learned through reinforcement learning, a class of algorithms that has been successful at training artificial agents and at characterizing the firing of dopamine neurons in the midbrain. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, known as the discount factor. This strategy is at odds with the empirical observation that humans and animals use non-exponential discounts in many situations. Here, we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopamine neurons in mice performing two behavioral tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower-timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopamine neurons, and open new avenues for the design of more efficient reinforcement learning algorithms. |
Paul Masset · Pablo Tano · HyungGoo Kim · Athar Malik · Alexandre Pouget · Naoshige Uchida 🔗 |
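To make the core computational idea concrete, here is a minimal sketch, not the paper's model, of tabular TD learning with a bank of discount factors: each value channel maintains its own estimate and emits its own prediction error, mirroring a diversity of dopamine time constants.

```python
import numpy as np

gammas = np.array([0.5, 0.9, 0.99])  # multiple discount timescales
V = np.zeros((10, len(gammas)))      # value table: states x timescales
alpha = 0.1                          # learning rate

def td_update(s, r, s_next, done):
    """One TD(0) step; returns one prediction error per discount factor."""
    target = r + (0.0 if done else 1.0) * gammas * V[s_next]
    delta = target - V[s]
    V[s] += alpha * delta
    return delta

print(td_update(0, 1.0, 1, False))   # three prediction errors, one per gamma
```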
Author Information
Marco Fumero (Sapienza, University of Rome)
Emanuele Rodolà (Sapienza University of Rome)
Francesco Locatello (AWS)
Gintare Karolina Dziugaite (Google Research, Brain Team)
Mathilde Caron (Google)