The 4th Workshop on "Self-Supervised Learning: Theory and Practice" aims to discuss the theory and practice of self-supervised learning across multiple research areas such as vision, NLP, and robotics.
Sat 7:00 a.m. - 7:15 a.m.
|
Welcome and introduction
(
Opening
)
|
🔗 |
Sat 10:00 a.m. - 11:00 a.m.
|
Poster Session I
(
Poster
)
|
🔗 |
Sat 2:10 p.m. - 3:00 p.m.
|
Poster Session II
(
Poster
)
|
🔗 |
-
|
Improving Domain Generalization in Contrastive Learning Using Adaptive Temperature Control
(
Poster
)
link »
Self-supervised pre-training with contrastive learning is a powerful method for learning from sparsely labeled data. However, performance can drop considerably when there is a shift in the distribution of data from training to test time. We study this phenomenon in a setting in which the training data come from multiple domains, and the test data come from a domain not seen at training that is subject to significant covariate shift. We present a new method for contrastive learning that incorporates domain labels to increase the domain invariance of learned representations, leading to improved out-of-distribution generalization. Our method adjusts the temperature parameter in the InfoNCE loss -- which controls the relative weighting of negative pairs -- using the probability that a negative sample comes from the same domain as the anchor. This upweights pairs from more similar domains, encouraging the model to discriminate samples based on domain-invariant attributes. Through experiments on a variant of the MNIST dataset, we demonstrate that our method yields better out-of-distribution performance than domain generalization baselines. Furthermore, our method maintains strong in-distribution task performance, substantially outperforming baselines on this measure. |
Katie Matton · Robert Lewis · Rosalind Picard · John Guttag 🔗 |
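The temperature adjustment described in this abstract can be illustrated with a minimal sketch. It assumes an InfoNCE loss for a single anchor in which each negative pair has its own temperature. The function names and the linear mapping from same-domain probability to temperature (`adaptive_temps`, `base_temp`, `min_temp`) are hypothetical and are not taken from the paper; they only show one way a per-negative temperature changes the weighting of negatives.

```python
import math


def info_nce(pos_sim, neg_sims, pos_temp, neg_temps):
    """InfoNCE loss for one anchor, with a separate temperature per negative."""
    pos_logit = pos_sim / pos_temp
    neg_logits = [s / t for s, t in zip(neg_sims, neg_temps)]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp over the positive and all negatives, shifted for stability.
    shift = max(logits)
    log_denom = shift + math.log(sum(math.exp(l - shift) for l in logits))
    return log_denom - pos_logit


def adaptive_temps(same_domain_probs, base_temp=0.5, min_temp=0.1):
    """Hypothetical mapping (an assumption, not the paper's formula):
    a higher same-domain probability gives a lower temperature."""
    return [base_temp - (base_temp - min_temp) * p for p in same_domain_probs]
```

With equal temperatures this reduces to the standard InfoNCE loss. Under the assumed mapping, a negative with positive similarity and a high same-domain probability gets a larger logit, so it contributes more to the denominator and to the loss.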
-
|
Visualizing the loss landscape of Self-supervised Vision Transformer
(
Poster
)
link »
The masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, even though MAE shows better generalization capability than fully supervised training from scratch, the reason why has not been explored. In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE) has been proposed, which adopts a self-distillation scheme in the form of an exponential moving average (EMA) teacher into MAE, and it has been shown that the EMA-teacher performs a conditional gradient correction during optimization. To further investigate the reason for better generalization of the self-supervised ViT when trained by MAE (MAE-ViT) and the effect of the gradient correction of RC-MAE from the perspective of optimization, we visualize the loss landscapes of the self-supervised vision transformer trained by both MAE and RC-MAE and compare them with the supervised ViT (Sup-ViT). Unlike previous loss landscape visualizations of neural networks based on classification task loss, we visualize the loss landscape of ViT by computing pre-training task loss. Through the lens of loss landscapes, we find two interesting observations: (1) MAE-ViT has a smoother and wider overall loss curvature than Sup-ViT. (2) The EMA-teacher allows MAE to widen the region of convexity in both pretraining and linear probing, leading to quicker convergence. To the best of our knowledge, this work is the first to investigate the self-supervised ViT through the lens of the loss landscape. |
Youngwan Lee · Jeffrey Willette · Jonghee Kim · Sung Ju Hwang 🔗 |
-
|
Adversarial perturbation based latent reconstruction for domain-agnostic self-supervised learning
(
Poster
)
link »
Most self-supervised learning (SSL) methods rely on domain-specific pretext tasks and data augmentations to learn high-quality representations from unlabeled data. Development of those pretext tasks and data augmentations requires expert domain knowledge. In addition, it is not clear why solving certain pretext tasks leads to useful representations. Those two reasons hinder wider application of SSL to different domains. To overcome such limitations, we propose adversarial perturbation based latent reconstruction (APLR) for domain-agnostic self-supervised learning. In APLR, a neural network is trained to generate adversarial noise to perturb the unlabeled training sample so that domain-specific augmentations are not required. The pretext task in APLR is to reconstruct the latent representation of a clean sample from a perturbed sample. We show that representation learning via latent reconstruction is closely related to multi-dimensional Hirschfeld-Gebelein-Rényi (HGR) maximal correlation and has theoretical guarantees on the linear probe error. To demonstrate the effectiveness of APLR, the proposed method is applied to various domains such as tabular data, images, and audios. Empirical results indicate that APLR not only outperforms existing domain-agnostic SSL methods, but also closes the performance gap to domain-specific SSL methods. In many cases, APLR also outperforms training the full network in a supervised manner. |
Kuilin Chen · Sijie Tian · Chi-Guhn Lee 🔗 |
-
|
Evolving Graph Generalization Estimation via Self-Supervised Learning
(
Poster
)
link »
Graph Neural Networks are widely deployed across many fields, but they often struggle to maintain accurate representations as graphs evolve. We theoretically establish a lower bound, proving that under mild conditions, representation distortion inevitably occurs over time. To estimate the temporal representation distortion without human annotation after deployment, one naive approach is to pre-train a recurrent model before deployment and use this model afterwards, but the estimation is far from satisfactory. In this paper, we analyze the representation distortion from an information theory perspective, and attribute it primarily to inaccurate feature extraction during evolution. Consequently, we introduce Smart, a straightforward and effective baseline enhanced by an adaptive feature extractor through self-supervised graph reconstruction. Experimental results on real-world evolving graphs demonstrate our outstanding performance, especially the necessity of self-supervised graph reconstruction. For example, on the OGB-arXiv dataset, the estimation metric MAPE deteriorates from 2.19% to 8.00% without reconstruction. |
Bin Lu · Tingyan Ma · Xiaoying Gan · Luoyi Fu · Xinbing Wang · Chenghu Zhou · Shiyu Liang 🔗 |
-
|
Uncertainty Quantification using Deep Ensembles for Safety-Critical Predictive Models
(
Poster
)
link »
This paper introduces a novel approach for uncertainty quantification in safety-critical predictive models by using a deep ensemble model, hence addressing a critical problem in predictive maintenance tasks. It builds a regression model to predict the Remaining Useful Life (RUL) of aircraft engines, utilizing the well-known run-to-failure turbo engine degradation dataset. Addressing the overlooked yet crucial aspect of uncertainty estimation in previous research, this paper revamps the LSTM architecture to facilitate uncertainty estimates, employing Negative Log Likelihood (NLL) as the training criterion. Through a series of experiments, the model demonstrated self-awareness of its uncertainty levels, correlating high confidence with low prediction errors and vice versa. This initiative not only enhances predictive maintenance strategies but also significantly improves the safety and reliability of aviation assets by offering a more nuanced understanding of predictive uncertainties. To the best of our knowledge, this is pioneering work in this application domain. |
Oishi Deb · Emmanouil Benetos · Philip Torr 🔗 |
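As a rough illustration of the training criterion and ensemble mentioned in this abstract, the sketch below assumes each ensemble member predicts a Gaussian mean and variance for the RUL. The Gaussian form and the function names are assumptions, since the abstract does not give them. The combination rule is the standard one for a uniformly weighted mixture of Gaussians, as commonly used with deep ensembles.

```python
import math


def gaussian_nll(y, mean, var):
    """Negative log-likelihood of target y under N(mean, var)."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def combine_ensemble(means, variances):
    """Mean and variance of a uniformly weighted mixture of Gaussians.

    The mixture variance is the average member variance plus the spread
    of the member means, so disagreement between members raises it.
    """
    n = len(means)
    mix_mean = sum(means) / n
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    return mix_mean, second_moment - mix_mean ** 2
```

For two members predicting 100 and 110 cycles, each with variance 4, the combined prediction is 105 with variance 29: the 4 from each member plus 25 from their disagreement.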
-
|
WERank: Rank Degradation Prevention for Self-Supervised Learning via Weight Regularization
(
Poster
)
link »
A common phenomenon in self-supervised learning is dimensional collapse (also known as rank degeneration), where the learned embeddings are mapped to a low dimensional subspace of the embedding space. Despite employing mechanisms to prevent dimensional collapse, previous self-supervised approaches have not succeeded in completely alleviating the problem. We propose WERank, a new regularizer on the weight parameters of the neural network encoder to prevent rank degeneration. Our regularization term can be applied on top of any existing self-supervised method without significant computational cost. We provide empirical and mathematical evidence to demonstrate the effectiveness of WERank in avoiding dimensional collapse. |
Ali Saheb Pasand · Reza Moravej · Mahdi Biparva · Ali Ghodsi 🔗 |
-
|
Language-Conditioned Semantic Search-Based Policy for Robotic Manipulation Tasks
(
Poster
)
link »
Solving various robotic manipulation tasks intelligently is a topic of great interest. Traditional reinforcement learning and imitation learning approaches require policy learning utilizing complex strategies that are difficult to generalize well. In this work, we propose a language-conditioned semantic search-based method to produce an online search-based policy from the available demonstration dataset of state-action trajectories. Here we directly acquire actions from the most similar manipulation trajectories found in the dataset. Our approach surpasses the performance of the baselines on the CALVIN benchmark and exhibits strong zero-shot adaptation capabilities. This holds great potential for expanding the use of our online search-based policy approach to tasks typically addressed by Imitation Learning or Reinforcement Learning-based policies. |
Jannik Sheikh · Andrew Melnik · Gora Chand Nandi · Robert Haschke 🔗 |
-
|
MeSa: Masked, Geometric, and Supervised Pre-training for Monocular Depth Estimation
(
Poster
)
link »
Self-supervised learning (SSL) has been an important ingredient in developing strong monocular depth estimation models in recent years. However, despite these improvements, our study reveals that the later layers of the SOTA SSL method are actually suboptimal. By examining the layer-wise representations, we demonstrate significant changes in these later layers during fine-tuning, indicating the ineffectiveness of their pre-trained features for depth estimation. To address these limitations, we propose MeSa, a comprehensive framework that leverages the complementary strengths of masked, geometric, and supervised pre-training. Our CKA layer-wise analysis confirms that our pre-training strategy indeed produces improved representations for the later layers, overcoming the drawbacks of the SOTA SSL method. Furthermore, via experiments on the NYUv2 and IBims-1 datasets, we demonstrate that these enhanced representations translate to performance improvements in both the in-distribution and out-of-distribution settings. We also investigate the influence of the pre-training dataset and demonstrate the efficacy of pre-training on LSUN, which yields significantly better pre-trained representations. Overall, our approach surpasses the masked pre-training SSL method by a substantial margin of 17.1% on the RMSE. Moreover, even without utilizing any recently proposed techniques, MeSa also outperforms the most recent methods and establishes a new state-of-the-art for monocular depth estimation on the challenging NYUv2 dataset. |
Muhammad Osama Khan · Junbang Liang · Chun-Kai Wang · Shan Yang · Yu Lou 🔗 |
-
|
DAPO: Self-Supervised Domain Adaptation for 6DoF Pose Estimation
(
Poster
)
link »
The main challenge of pose estimation for six degrees of freedom (6DoF) is the lack of labeled data in real environments. To overcome this problem, many recent studies have trained deep learning models with synthetic data. However, a domain gap between real and synthetic environments exists, prompting various approaches to address this issue. In this work, we propose domain adaptation for self-supervised 6DoF pose estimation (DAPO), which leverages the components and introduces an effective method to reduce domain discrepancy. First, we adopt a multi-level domain adaptation module, on image level and instance level, to learn domain-invariant features. Second, we use entropy-based alignment to minimize the entropy of representation embedding. Finally, we evaluate our approach on the LineMOD and Occlusion-LineMOD datasets. Experiments show that our proposed method achieves higher performance compared to prior methods and demonstrates effectiveness in domain shift scenarios on 6DoF pose estimation. |
Juseong Jin · Eunju Jeong · Joonmyun Cho · Jun Hee Park · Young-Gon Kim 🔗 |
-
|
Leveraging Uniformity of Normalized Embeddings for Sequential Recommendation
(
Poster
)
link »
Pointwise loss is one of the most widely adopted yet practical choices for training sequential recommendation models. Aside from their successes, only limited studies leverage normalized embeddings in their optimization, which has been actively explored and proven effective in various machine learning fields. However, we observe that the naïve adoption of normalization hinders the quality of a learned recommendation policy. In particular, we argue that the clusterization of embeddings on a unit hypersphere triggers such performance degradation. To alleviate this issue, we propose a novel training objective that enforces the uniformity of embeddings while learning the recommendation policy. We empirically validate our method on sequential recommendation tasks and show superior performance improvements compared to other approaches without normalization. |
Hyunsoo Chung · Jungtaek Kim 🔗 |
-
|
Self-Supervised Learning Meets Liver Ultrasound Imaging
(
Poster
)
link »
In the field of medical ultrasound imaging, conventional B-mode "grey scale" ultrasound and shear wave elastography (SWE) are widely used for chronic liver disease diagnosis and risk stratification. However, many abdominal ultrasound images do not include views of the liver, necessitating a pre-processing liver view detection step before feeding the image to the AI system. To address this, we propose a self-supervised learning method, SimCLR+LR, for image classification that utilizes a large set of unlabeled abdominal ultrasound images to learn image representations. These representations are then fine-tuned to the downstream task of liver view classification. This approach outperforms traditional supervised learning methods and achieves superior performance when compared to state-of-the-art (SOTA) models, ResNet-18 and MLP-Mixer. Once the liver view is detected, the next crucial phase involves the segmentation of the liver region, imperative for obtaining accurate and dependable results in SWE. For this, we present another self-supervised learning approach, SimCLR+ENet, which leverages the learned feature representations and fine-tunes them on the task of liver segmentation, followed by a refinement step using CascadePSP. The proposed approach outperforms the SOTA method U-Net. SimCLR+ENet was also used to detect poor probe contact (i.e., areas where the ultrasound probe/transducer does not have adequate contact with the patient's skin) in liver ultrasound images, an artifact that affects the reliability of SWE. The combination of the proposed self-supervised learning methods for liver view classification, liver segmentation, and poor probe contact detection not only reduces the time and cost associated with data labeling, but also optimizes the liver segmentation workflow and SWE reliability in a real-time setting. |
Abder-Rahman Ali · Anthony E Samir 🔗 |
-
|
Posterior Sampling on Simsiam: Rethinking Optimization in Siamese Self-Supervised Learning
(
Poster
)
link »
Chen & He (2020) state that self-supervised pre-training can be performed without contrastive learning (CL) (i.e., using negative pairs). Rather, the proposed approach (SimSiam) merely maximizes the similarity between two transformations of the same example. Interestingly, even though a global optimum for this task is to collapse SimSiam into a constant function ignoring input, Chen & He (2020) argue that, in practice, the training converges to non-global optima yielding useful representations of the input. A key component is a stop-gradient (SG) operation which, if not used, causes SimSiam to quickly collapse to the global optimum. In this work, we investigate whether SG is genuinely indispensable or if satisfactory outcomes can be achieved by better exploring the loss landscape. Namely, we keep the loss landscape intact by not changing SimSiam's architecture, and explore it with SGHMC (Chen et al., 2014), a sampling method known for efficiently covering distant regions of the posterior distribution. Our empirical finding is that the proposed samples of the posterior never reach collapsed points for properly chosen step-sizes of SGHMC, indicating ample room for future optimization methods other than SG that could avoid collapse. Although SGHMC turns out to be less effective than SG for improving accuracy in the downstream task, we believe our results call for more investigation into the actual necessity of SG. |
Daniel De Mello · Ruqi Zhang · Bruno Ribeiro 🔗 |
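The collapse issue in this abstract can be made concrete with a small sketch of the symmetrized negative cosine similarity that SimSiam minimizes. The vectors are plain Python lists here, so the stop-gradient appears only as a comment. In an autodiff framework, `z1` and `z2` would be wrapped in that framework's stop-gradient or detach operation. The function names are illustrative.

```python
import math


def neg_cosine(p, z):
    """Negative cosine similarity between predictor output p and target z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)


def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized SimSiam objective for two views of one example.

    During training z1 and z2 are treated as constants (stop-gradient),
    so gradients flow only through the predictor outputs p1 and p2.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

If the encoder ignores its input and always outputs the same vector `c`, then `simsiam_loss(c, c, c, c)` reaches the minimum value of -1. This is the collapsed global optimum the abstract refers to.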
-
|
No Free Lunch in Self Supervised Representation Learning
(
Poster
)
link »
Self-supervised representation learning in computer vision heavily relies on hand-crafted image transformations to derive meaningful, invariant features. Yet, the literature has limited explorations on the impact of transformation design. This work delves into this relationship, particularly its effect on domains beyond natural images. We posit that transformation design acts as beneficial supervision. We establish that transformations influence representation features and clustering relevance, and further probe transformation design's effect on microscopy images, where class differences are subtler than in natural images, leading to more pronounced impacts on encoded features. Conclusively, we showcase that careful transformation selection, based on desired features, enhances performance by refining the resulting representation. |
Ihab Bendidi · Adrien Bardes · Ethan Cohen · Alexis Lamiable · Guillaume Bollot · Auguste Genovesio 🔗 |
-
|
MOFO: MOtion FOcused Self-Supervision for Video Understanding
(
Poster
)
link »
Self-supervised learning (SSL) techniques have recently produced outstanding results in learning visual representations from unlabeled videos. However, despite the importance of motion in supervised learning techniques for action recognition, SSL methods often do not explicitly consider motion information in videos. To address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for focusing representation learning on the motion area of a video for action recognition. MOFO automatically detects motion areas in videos and uses these to guide the self-supervision task. We use a masked autoencoder that randomly masks out a high proportion of the input sequence and forces a specified percentage of the inside of the motion area to be masked and the remainder from outside. We further incorporate motion information into the finetuning step to emphasise motion in the downstream task. We demonstrate that our motion-focused innovations can significantly boost the performance of the currently leading SSL method (VideoMAE) for action recognition. Our proposed approach significantly improves the performance of the current SSL method for action recognition, indicating the importance of explicitly encoding motion in SSL. |
Mona Ahmadian · Frank Guerin · Andrew Gilbert 🔗 |
-
|
Learning to Embed Time Series Patches Independently
(
Poster
)
link »
Conventional masked time series modeling patchifies and partially masks out time series (TS), and then trains Transformers to capture the dependencies between patches by predicting masked patches from unmasked patches. However, we argue that capturing such dependencies might not be an optimal strategy for TS representation learning; rather, embedding patches independently results in better representations. Specifically, we propose to use 1) the patch reconstruction task, autoencoding each patch without looking at other patches, and 2) the MLP that embeds each patch independently. In addition, we introduce complementary contrastive learning to hierarchically capture adjacent TS information efficiently. Our proposed method improves various tasks compared to state-of-the-art Transformer-based models, while it is more efficient in terms of the number of parameters and training time. |
Seunghan Lee · Taeyoung Park · Kibok Lee 🔗 |
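The idea of embedding patches independently can be sketched as follows, assuming non-overlapping patches and a single linear layer with ReLU standing in for the MLP. The function names, the weights, and the layer choice are illustrative assumptions, not the authors' architecture.

```python
def patchify(series, patch_len):
    """Split a 1-D series into non-overlapping patches, dropping any remainder."""
    count = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(count)]


def embed_patch(patch, weights):
    """One linear layer followed by ReLU, applied to a single patch."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch))) for row in weights]


def embed_patches(series, patch_len, weights):
    """Embed every patch with shared weights and no cross-patch interaction."""
    return [embed_patch(p, weights) for p in patchify(series, patch_len)]
```

Because each embedding depends only on its own patch, changing the values in one patch leaves the embeddings of all other patches unchanged. A Transformer encoder, by contrast, mixes information across patches through attention.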
-
|
Unsupervised Segmentation of Colonoscopy Images
(
Poster
)
link »
Colonoscopy plays a crucial role in the diagnosis and prognosis of various gastrointestinal diseases. Due to the challenges of collecting large-scale high-quality ground truth annotations for colonoscopy images, and more generally medical images, we explore using self-supervised features from vision transformers in three challenging tasks for colonoscopy images. Our results indicate that image-level features learned from DINO models achieve image classification performance comparable to fully supervised models, and patch-level features contain rich semantic information for object detection. Furthermore, we demonstrate that self-supervised features combined with unsupervised segmentation can be used to ‘discover’ multiple clinically relevant structures in a fully unsupervised manner, demonstrating the tremendous potential of applying these methods in medical image analysis. |
Heming Yao · Jérôme Lüscher · Benjamin Gutierrez Becker · Josep Arús-Pous · Tommaso Biancalani · Amelie Bigorgne · David Richmond 🔗 |
-
|
A Simple Framework for Self-Supervised Learning of Sample-Efficient World Models
(
Poster
)
link »
Deep reinforcement learning algorithms suffer from low sample efficiency, which is addressed in recent approaches by building a world model and learning behaviors in imagination. We present a simple framework for self-supervised learning of world models inspired by VICReg, requiring neither image reconstructions nor specific neural network architectures. The learned representations are temporally consistent, which facilitates next state prediction and leads to good generalization properties for the policy and the value function. We build a world model for Atari consisting only of feedforward layers that is easy to implement and allows fast training and inference. By learning behaviors in imagination, we evaluate our method on the Atari 100k benchmark. |
Jan Robine · Marc Höftmann · Stefan Harmeling 🔗 |
-
|
Learning Beyond Similarities: Incorporating Dissimilarities between Positive Pairs in Self-Supervised Time Series Learning.
(
Poster
)
link »
By identifying similarities between successive inputs, Self-Supervised Learning (SSL) methods for time series analysis have demonstrated their effectiveness in encoding the inherent static characteristics of temporal data. However, an exclusive emphasis on similarities might result in representations that overlook the dynamic attributes critical for modelling cardiovascular diseases within a confined subject cohort. Introducing Distilled Encoding Beyond Similarities (DEBS), this paper pioneers an SSL approach that transcends mere similarities by integrating dissimilarities among positive pairs. The framework is applied to electrocardiogram (ECG) signals, leading to a notable enhancement of +10% in the detection accuracy of Atrial Fibrillation (AFib) across diverse subjects. DEBS underscores the potential of attaining a more refined representation by encoding the dynamic characteristics of time series data, tapping into dissimilarities during the optimization process. Broadly, the strategy delineated in this study holds the promise of unearthing novel avenues for advancing SSL methodologies tailored to temporal data. |
Adrian Atienza · Jakob Bardram · Sadasivan Puthusserypady 🔗 |
-
|
Learning Orthonormal Features in Self-Supervised Learning using Functional Maximal Correlation
(
Poster
)
link »
This paper applies statistical dependence measures to interpret self-supervised learning (SSL). Conventional applications of measures like mutual information commonly use separate procedures for feature extraction and dependence estimation, where the relationship between optimal features and the strength of dependence is unclear. This causes limitations in tasks requiring multivariate feature representations, particularly in SSL. The recently introduced multivariate measure, functional maximal correlation, is a unified framework based on orthonormal decomposition of density ratios, wherein the spectrum and the bases become the measure and the features, respectively. This paper proposes that features in SSL can also be interpreted as basis functions of the density ratio. We introduce the Hierarchical Functional Maximal Correlation Algorithm (HFMCA), a theoretically justified approach that ensures faster convergence, enhanced stability, and prevents feature collapse by learning orthonormal bases as multivariate features. |
Bo Hu · Yuheng Bu · Jose C Principe 🔗 |
-
|
Enhancing CLIP with a Third Modality
(
Poster
)
link »
We study the problem of training a third tower for a new modality given a pre-trained CLIP model. This extra part of the architecture can be used to incorporate other modalities in the model pipeline. In our setting, we consider the use of a model such as BLIP-2, which provides us with a dialogue centered around the image. We evaluate our model in the setting of image and text retrieval, and compare it against the regular image and text based one. |
Efthymios Tsaprazlis · Georgios Smyrnis · Alex Dimakis · Petros Maragos 🔗 |
-
|
Generalization properties of contrastive world models
(
Poster
)
link »
Recent work on object-centric world models aims to factorize representations in terms of objects in a completely unsupervised or self-supervised manner. Such world models are hypothesized to be a key component to address the generalization problem. While self-supervision has shown improved performance, OOD generalization has not been systematically and explicitly tested. In this paper, we conduct an extensive study on the generalization properties of contrastive world models. We systematically test the model under a number of different OOD generalization scenarios such as extrapolation to new object attributes, introducing new conjunctions or new attributes. Our experiments show that the contrastive world model fails to generalize under the different OOD tests and the drop in performance depends on the extent to which the samples are OOD. When visualizing the transition updates and convolutional feature maps, we observe that any change in object attributes (such as previously unseen colors, shapes, or conjunctions of color and shape) breaks down the factorization of object representations. Overall, our work highlights the importance of object-centric representations for generalization and shows that current models are limited in their capacity to learn the representations required for human-level generalization. |
Kandan Ramakrishnan · R. James Cotton · Xaq Pitkow · Andreas Tolias 🔗 |
-
|
Adaptive Resolution Loss: An Efficient and Effective Loss for Time Series Hierarchical Contrastive Self-Supervised Learning Framework
(
Poster
)
link »
Time series data is a crucial form of information that has vast opportunities. With the widespread use of sensor networks, large-scale time series data has become ubiquitous. One of the current state-of-the-art SSL frameworks in time series is called TS2Vec. TS2Vec specially designs a hierarchical contrastive learning framework that uses loss-based training, which performs outstandingly against benchmark testing. However, the computational cost for TS2Vec is often significantly greater than other SSL frameworks. In this paper, we present a new self-supervised learning loss named adaptive resolution loss. The proposed solution reduces the number of resolutions used for training the model via an adaptive selection score, leading to an efficient adaptive resolution loss based learning algorithm. In the experiment, we demonstrate that the proposed method preserves the original model's integrity while significantly reducing its training time. |
Kevin Garcia · Juan Manuel Perez Jr · Yifeng Gao 🔗 |
-
|
How does semi-supervised learning with pseudo-labelers work? A case study
(
Poster
)
link »
Semi-supervised learning is a popular machine learning paradigm that utilizes a large amount of unlabeled data as well as a small amount of labeled data to facilitate learning tasks. While semi-supervised learning has achieved great success in training neural networks, its theoretical understanding remains largely open. In this paper, we aim to theoretically understand a semi-supervised learning approach based on pre-training and linear probing. We prove that, under a certain data generation model and two-layer convolutional neural network, the semi-supervised learning approach can achieve nearly zero test loss, while a neural network directly trained by supervised learning on the same amount of labeled data can only achieve constant test loss. Through this case study, we demonstrate a separation between semi-supervised learning and supervised learning in terms of test loss provided the same amount of labeled data. |
Yiwen Kou · Zixiang Chen · Yuan Cao · Quanquan Gu 🔗 |
-
|
Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation
(
Poster
)
link »
Transformer-based speech self-supervised learning (SSL) models, such as HuBERT, show surprising performance in various speech processing tasks. However, the huge number of parameters in speech SSL models necessitates compression to a more compact model for wider usage in academia or small companies. In this study, we suggest reusing attention maps across the Transformer layers, so as to remove key and query parameters while retaining the number of layers. Furthermore, we propose a novel masking distillation strategy to improve the student model's speech representation quality. We extend the distillation loss to utilize both masked and unmasked speech frames to fully leverage the teacher model's high-quality representation. Our universal compression strategy yields a student model that achieves a phoneme error rate (PER) of 7.72% and a word error rate (WER) of 9.96% on the SUPERB benchmark. |
Kangwook Jang · Sungnyun Kim · Se-Young Yun · Hoi Rin Kim 🔗 |
-
|
Ring Attention with Blockwise Transformers for Near-Infinite Context
(
Poster
)
link »
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks with the computation of blockwise attention. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that are device count times longer than those of prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size and improving performance. |
Hao Liu · Matei A Zaharia · Pieter Abbeel 🔗 |
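The blockwise computation this abstract relies on can be checked on a single query with a minimal sketch: attention accumulated one key-value block at a time, with a running maximum and normalizer, gives the same result as attention over the full sequence. This is a single-process simulation under simplifying assumptions (one query, one head, hypothetical function names). It does not model the ring of devices or the overlap of communication with computation.

```python
import math


def _score(q, k):
    """Scaled dot-product score between one query and one key."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


def full_attention(q, keys, values):
    """Reference softmax attention over all keys at once."""
    scores = [_score(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, keys, values, block_size):
    """Same result, visiting one key-value block at a time."""
    dim = len(values[0])
    running_max, denom, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [_score(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial sums to the new maximum (0.0 on first block).
        scale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * scale + sum(weights)
        acc = [a * scale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]
```

Only one block of keys and values is needed at each step, so each block could live on a different device and be passed along as the loop advances.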
-
|
Soft Contrastive Learning for Time Series
(
Poster
)
link »
In contrastive learning for time series, contrasting similar time series instances or values from adjacent timestamps within a time series ignores their inherent correlations, deteriorating the quality of learned representations. To address this issue, we propose SoftCLT, a simple yet effective soft contrastive learning strategy for time series. This is achieved by introducing instance-wise and temporal contrastive loss with soft assignments. Specifically, we define soft assignments for 1) instance-wise contrastive loss by the distance between time series on the data space, and 2) temporal contrastive loss by the difference of timestamps. SoftCLT is a plug-and-play method for time series contrastive learning that improves the quality of learned representations. In experiments, we demonstrate that SoftCLT consistently improves the performance in various downstream tasks. |
Seunghan Lee · Taeyoung Park · Kibok Lee 🔗 |
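As a rough illustration of the soft assignments mentioned in the SoftCLT abstract, the sketch below maps a timestamp gap or a data-space distance to a weight in (0, 1]. The abstract does not give the functional form, so the sigmoid-shaped decay, the Euclidean distance, and the parameter names `tau` and `alpha` are all assumptions chosen for illustration rather than the authors' definitions.

```python
import math

def temporal_soft_assignment(t, t_prime, tau=0.5):
    """Assumed form 2 * sigmoid(-tau * |t - t'|): equals 1 when the
    timestamps coincide and decays toward 0 as they move apart."""
    return 2.0 / (1.0 + math.exp(tau * abs(t - t_prime)))

def instance_soft_assignment(x, y, tau=0.5, alpha=0.5):
    """Same decay applied to a distance between two series of equal length.
    Euclidean distance is a stand-in; alpha (hypothetical) caps the weight
    given to a pair of distinct instances."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 2.0 * alpha / (1.0 + math.exp(tau * dist))
```

Weights of this kind would replace the hard 0/1 positive and negative labels in a contrastive loss, so that nearby timestamps and similar instances are pushed apart less strongly than distant ones.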
-
|
SAMCLR: Contrastive pre-training on complex scenes using SAM for view sampling
(
Poster
)
link »
In computer vision, self-supervised contrastive learning enforces similar representations between different views of the same image. Pre-training is most often performed on image classification datasets, like ImageNet, where images mainly contain a single class of objects. However, when dealing with complex scenes containing multiple items, it becomes very unlikely for several views of the same image to represent the same object category. In this setting, we propose SAMCLR, an add-on to SimCLR which uses SAM to segment the image into semantic regions and then samples the two views from the same region. Preliminary results show empirically that when pre-training on Cityscapes and ADE20K and then evaluating on classification on CIFAR-10, STL10 and ImageNette, SAMCLR performs at least on par with, and most often significantly outperforms, not only SimCLR but also DINO and MoCo. |
Benjamin Missaoui · Chongbin Yuan 🔗 |
-
|
Language Model Training Paradigms for Clinical Feature Embeddings
(
Poster
)
link »
In research areas with scarce data, representation learning plays a significant role. This work aims to enhance representation learning for clinical time series by deriving universal embeddings for clinical features, such as heart rate and blood pressure. We use self-supervised training paradigms for language models to learn high-quality clinical feature embeddings, achieving a finer granularity than existing time-step and patient-level representation learning. We visualize the learnt embeddings via unsupervised dimension reduction techniques and observe a high degree of consistency with prior clinical knowledge. We also evaluate the model performance on the MIMIC-III benchmark and demonstrate the effectiveness of using clinical feature embeddings. |
Yurong Hu · Manuel Burger · Gunnar Rätsch · Rita Kuznetsova 🔗 |
-
|
Structuring Representation Geometry with Rotationally Equivariant Contrastive Learning
(
Poster
)
link »
Self-supervised learning converts raw perceptual data such as images to a compact space where simple Euclidean distances measure meaningful variations in data. In this paper, we extend this formulation by adding geometric structure to the embedding space, enforcing that transformations of input space correspond to simple (i.e., linear) transformations of embedding space. Specifically, in the contrastive learning setting, we introduce an equivariance objective and theoretically prove that its minima force augmentations on input space to correspond to rotations on the spherical embedding space. We show that merely combining our equivariant loss with a non-collapse term results in non-trivial representations, without requiring invariance to data augmentations. Optimal performance is achieved by also encouraging approximate invariance, where input augmentations correspond to small rotations. Our method, CARE (Contrastive Augmentation-induced Rotational Equivariance), leads to improved performance on downstream tasks and ensures sensitivity in embedding space to important variations in data (e.g., color) that standard contrastive methods do not achieve. |
Sharut Gupta · Joshua Robinson · Derek Lim · Soledad Villar · Stefanie Jegelka 🔗 |
-
|
Multimodal Distillation of CLIP Models
(
Poster
)
link »
CLIP-style models are powerful tools for zero-shot image-text tasks, but contain a very large number of parameters, making them expensive to deploy in hardware-constrained settings. We introduce a novel way to distill these large CLIP-based models into significantly smaller ones. Our method is called multimodal distillation because we jointly train two student networks (operating on image and text) from two teacher networks. Our loss tries to preserve the structure of the embeddings of the dataset, as provided by the image and text teacher networks. We are thus able to extract information from the interaction of the teacher embeddings, improving performance on downstream classification tasks. |
Georgios Smyrnis · Sriram Ravula · Sujay Sanghavi · Alex Dimakis 🔗 |
-
|
Augmentation matters: Representation learning for Strong Gravitational lensing
(
Poster
)
link »
Self-supervised learning has been known for learning good representations from data without the need for annotated labels. We explore the simple siamese (SimSiam) architecture for representation learning on strong gravitational lens images. Commonly used image augmentations tend to change lens properties; for example, zoom-in would affect the Einstein radius. To create image pairs representing the same underlying lens model, we introduce a lens augmentation method to preserve lens properties by fixing the lens model while varying the source galaxies. Our research demonstrates this lens augmentation works well with SimSiam for learning the lens image representation without labels, so we name it LenSiam. We also show that a pre-trained LenSiam model can benefit downstream tasks. We plan to open-source our code and datasets. |
Kuan-Wei Huang · Po-Wen Chang · Joshua Fagin · James Chan · Joshua Yao-Yu Lin 🔗 |
-
|
Online Feature Updates Improve Online (Generalized) Label Shift Adaptation
(
Poster
)
link »
This paper addresses the prevalent issue of label shift in an online setting with missing labels, where data distributions change over time and obtaining timely labels is challenging. While existing methods primarily focus on adjusting or updating the final layer of a pre-trained classifier, we delve into the untapped potential of enhancing feature representations using unlabeled data at test-time. Our novel Online Label Shift adaptation with Online Feature Updates (OLS-OFU) method harnesses self-supervised learning to refine the feature extraction process, thus improving the prediction model. Theoretical analyses confirm that OLS-OFU reduces algorithmic regret by capitalizing on self-supervised learning for feature refinement. Empirical tests on CIFAR-10 and CIFAR-10C datasets, under both online label shift and generalized label shift conditions, underscore OLS-OFU's effectiveness and robustness, especially in cases of domain shifts. |
Ruihan Wu · Siddhartha Datta · Yi Su · Dheeraj Baby · Yu-Xiang Wang · Kilian Weinberger 🔗 |
-
|
FroSSL: Frobenius Norm Minimization for Self-Supervised Learning
(
Poster
)
link »
Self-supervised learning (SSL) is an increasingly popular paradigm for representation learning. Recent methods can be classified as sample-contrastive, dimension-contrastive, or asymmetric network-based, with each family having its own approach to avoiding informational collapse. While dimension-contrastive methods converge to similar solutions as sample-contrastive methods, it can be empirically shown that some methods require more epochs of training to converge. Motivated by closing this divide, we present the objective function FroSSL which is both sample- and dimension-contrastive up to embedding normalization. FroSSL works by minimizing covariance Frobenius norms for avoiding collapse and minimizing mean-squared error for augmentation invariance. We show that FroSSL converges more quickly than a variety of other SSL methods and provide theoretical and empirical support that this faster convergence is due to how FroSSL affects the eigenvalues of the embedding covariance matrices. We also show that FroSSL learns competitive representations on linear probe evaluation when used to train a ResNet18 on the CIFAR-10, CIFAR-100, STL-10, and ImageNet datasets. |
Oscar Skean · Aayush Dhakal · Nathan Jacobs · Luis Sanchez Giraldo 🔗 |
-
|
Spectral Temporal Contrastive Learning
(
Poster
)
link »
Learning useful data representations without requiring labels is a cornerstone of modern deep learning. Self-supervised learning methods, particularly contrastive learning (CL), have proven successful by leveraging data augmentations to define positive pairs. This success has prompted a number of theoretical studies to better understand CL and investigate theoretical bounds for downstream linear probing tasks. This work is concerned with the temporal contrastive learning (TCL) setting, more commonly used in RL and robotics contexts, where the sequential structure of the data is used instead to define positive pairs. In this paper, we adapt recent work on Spectral CL to formulate Spectral Temporal Contrastive Learning (STCL). We discuss a population loss based on a state graph derived from a time-homogeneous reversible Markov chain with uniform stationary distribution. The STCL loss makes it possible to connect the linear probing performance to the spectral properties of the graph, and can be estimated by considering previously observed data sequences as an ensemble of MCMC chains. |
Sacha Morin · Somjit Nath · Samira Ebrahimi Kahou · Guy Wolf 🔗 |
-
|
Self-Supervised Pretraining for Improved Downstream Decoding of Audio-Evoked fMRI Sequences
(
Poster
)
link »
We present a sequential transfer learning framework for transformers on functional Magnetic Resonance Imaging (fMRI) data and demonstrate its significant benefits for decoding instrumental timbre. In the first of two phases, we pretrain our stacked-encoder transformer architecture on Next Thought Prediction, a self-supervised task of predicting whether or not one sequence of fMRI data follows another. This phase imparts a general understanding of the temporal and spatial dynamics of neural activity, and can be applied to any fMRI dataset. In the second phase, we finetune the pretrained models and train additional randomly initialized models on the supervised task of predicting whether or not two sequences of fMRI data were obtained while listening to the same musical timbre. The finetuned models achieve significantly higher accuracy on held-out participants than the randomly initialized models, demonstrating the efficacy of our framework for facilitating transfer learning on fMRI data. This work contributes to the growing literature on transformer architectures for sequential transfer learning on fMRI data. |
Sean Paulsen · Mike Casey 🔗 |
-
|
Exploring Target Representations for Masked Autoencoders
(
Poster
)
link »
Masked autoencoders have become popular training paradigms for self-supervised visual representation learning. These models randomly mask a portion of the input and reconstruct the masked portion according to assigned target representations. In this paper, we show that a careful choice of the target representation is unnecessary for learning good visual representation since different targets tend to derive similarly behaved models. Driven by this observation, we propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher, enabling us to effectively train high-capacity models without any effort to carefully design the target representation. On various downstream tasks, the proposed method to perform masked knowledge distillation with bootstrapped teachers (dbot) outperforms previous self-supervised methods by nontrivial margins. We hope our findings, as well as the proposed method, could motivate people to rethink the roles of target representations in pre-training masked autoencoders. |
Xingbin Liu · Jinghao Zhou · Tao Kong 🔗 |
-
|
Non-Vacuous Generalization Bounds for Large Language Models
(
Poster
)
link »
Modern language models can contain billions of parameters, raising the question of whether they can generalize beyond the training data or simply regurgitate their training corpora. We provide the first non-vacuous generalization bounds for pretrained large language models (LLMs), indicating that language models are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss, and we extend the bound to handle subsampling, accelerating bound computation on massive datasets. To achieve the extreme level of compression required for non-vacuous generalization bounds, we devise SubLoRA, a low-dimensional non-linear parameterization. Using this approach, we find that larger models have better generalization bounds and are more compressible than smaller models. |
Sanae Lotfi · Marc Finzi · Yilun Kuang · Tim G. J. Rudner · Micah Goldblum · Andrew Wilson 🔗 |
-
|
Bootstrap Your Own Variance
(
Poster
)
link »
Understanding model uncertainty is important for many applications. We propose Bootstrap Your Own Variance (BYOV), combining Bootstrap Your Own Latent (BYOL), a negative-free Self-Supervised Learning (SSL) algorithm, with Bayes by Backprop (BBB), a Bayesian method for estimating model posteriors. We find that the learned predictive standard deviation of BYOV vs. a supervised BBB model is well captured by a Gaussian distribution, providing preliminary evidence that the learned parameter posterior is useful for label-free uncertainty estimation. BYOV improves upon the deterministic BYOL baseline (+2.83% test ECE, +1.03% test Brier) and presents better calibration and reliability when tested with various augmentations (e.g., +2.4% test ECE, +1.2% test Brier for Salt & Pepper noise). |
Polina Turishcheva · Jason Ramapuram · Sinead Williamson · Dan Busbridge · Eeshan Gunesh Dhekane · Russell Webb 🔗 |
-
|
Self-supervised Representation Learning from Random Data Projectors
(
Poster
)
link »
Augmentation-based self-supervised representation learning (SSRL) algorithms have pushed the boundaries of performance in computer vision and natural language processing, but are often not directly applicable to other data modalities, and can conflict with application-specific data augmentation constraints. We present an SSRL approach that can be applied to any data modality because it does not rely on augmentations. We show that high-quality data representations can be learned by reconstructing random data projections, and evaluate the proposed approach on a range of representation learning tasks that span diverse modalities and real-world applications. |
Yi Sui · Tongzi Wu · Jesse Cresswell · Ga Wu · George Stein · Xiao Shi Huang · Xiaochen Zhang · Maksims Volkovs 🔗 |
-
|
An Information-Theoretic Understanding of Maximum Manifold Capacity Representations
(
Poster
)
link »
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is interesting for at least two reasons. Firstly, MMCR is an oddity in the zoo of MVSSL methods: it is not (explicitly) contrastive, applies no masking, performs no clustering, leverages no distillation, and does not (explicitly) reduce redundancy. Secondly, while many self-supervised learning (SSL) methods originate in information theory, MMCR distinguishes itself by claiming a different origin: a statistical mechanical characterization of the geometry of linear separability of data manifolds. However, given the rich connections between statistical mechanics and information theory, and given recent work showing how many SSL methods can be understood from an information-theoretic perspective, we conjecture that MMCR can be similarly understood from an information-theoretic perspective. In this paper, we leverage tools from high dimensional probability and information theory to demonstrate that an optimal solution to MMCR's nuclear norm-based objective function is the same optimal solution that maximizes a well-known lower bound on mutual information. |
Berivan Isik · Rylan Schaeffer · Victor Lecomte · Mikail Khona · Yann LeCun · Sanmi Koyejo · Andrey Gromov · Ravid Shwartz-Ziv 🔗 |
-
|
Self-Distilled Representation Learning for Time Series
(
Poster
)
link »
Self-supervised learning for time-series data holds potential similar to that recently unleashed in Natural Language Processing and Computer Vision. While most existing works in this area focus on contrastive learning, we propose a conceptually simple yet powerful non-contrastive approach, based on the data2vec self-distillation framework. The core of our method is a student-teacher scheme that predicts the latent representation of an input time series from masked views of the same time series. This strategy avoids strong modality-specific assumptions and biases typically introduced by the design of contrastive sample pairs. We demonstrate the competitiveness of our approach for classification and forecasting as downstream tasks, comparing with state-of-the-art self-supervised learning methods on the UCR and UEA archives as well as the ETT and Electricity datasets. |
Felix Pieper · Konstantin Ditschuneit · Martin Genzel · Alexandra Lindt · Johannes Otterbach 🔗 |
-
|
Making Self-supervised Learning Robust to Spurious Correlation via Learning-speed Aware Sampling
(
Poster
)
link »
Self-supervised learning (SSL) has emerged as a powerful technique for learning rich representations from unlabeled data. The data representations are able to capture many underlying attributes of data, and are useful in downstream prediction tasks. In real-world settings, spurious correlations between some attributes (e.g. race, gender and age) and labels for downstream tasks often exist, e.g. cancer is usually more prevalent among elderly patients. In this paper, we investigate the learning dynamics of SSL and observe that learning is slower for samples that conflict with such correlations (e.g. elderly patients without cancer). Motivated by these findings, we propose a learning-speed aware SSL (LA-SSL) approach, in which we sample each training example with a probability that is inversely related to its learning speed. We evaluate LA-SSL on three datasets that exhibit spurious correlations between different attributes, demonstrating that it improves the robustness of pretrained representations on downstream classification tasks. |
Weicheng Zhu · Sheng Liu · Carlos Fernandez-Granda · Narges Razavian 🔗 |
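The sampling rule in the LA-SSL abstract, drawing each training example with a probability inversely related to its learning speed, can be sketched as follows. The abstract does not specify how learning speed is measured or the exact inverse relationship, so the reciprocal weighting, the `eps` stabiliser, and the function names here are hypothetical; the sketch also assumes the speeds are non-negative numbers supplied by the caller.

```python
import random

def sampling_probabilities(learning_speeds, eps=1e-8):
    """Assumed rule: weight each example by 1 / (speed + eps), then normalise.
    Slowly learned examples therefore receive higher probability."""
    weights = [1.0 / (s + eps) for s in learning_speeds]
    total = sum(weights)
    return [w / total for w in weights]

def sample_indices(learning_speeds, k, seed=0):
    """Draw k example indices (with replacement) using those probabilities."""
    rng = random.Random(seed)
    probs = sampling_probabilities(learning_speeds)
    return rng.choices(range(len(probs)), weights=probs, k=k)
```

Under this rule, examples that conflict with a spurious correlation, which the abstract reports are learned more slowly, would appear more often in training batches.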
-
|
Generalized Category Discovery with Hierarchical Label Smoothing
(
Poster
)
link »
Generalized Category Discovery seeks to cluster unknown categories while simultaneously discerning known ones. Existing approaches mostly rely on contrastive learning to produce distinctive embeddings for both labeled and unlabeled data. Yet, these methods often suffer from dispersed clusters for unknown categories due to a high rate of false negatives. To alleviate this problem, we introduce label smoothing as a hyperparameter that permits ‘forgivable mistakes’ for visually similar samples. We introduce a self-supervised cluster hierarchy, which allows us to control the strength of label smoothing to apply. By assigning pseudo-labels to emerging cluster candidates and using these as ‘soft supervision’ for contrastive learning, we effectively combine the benefits of clustering-based learning and contrastive learning. We demonstrate state-of-the-art generalized category discovery performance on various fine-grained datasets. |
Sarah Rastegar · Yuki M Asano · Hazel Doughty · Cees Snoek 🔗 |
-
|
Benchmarking self-supervised video representation learning
(
Poster
)
link »
Self-supervised learning is an effective way to pre-train models without labels, especially in the video domain where labeling is expensive. Existing self-supervised works in the video domain use varying experimental setups to demonstrate their effectiveness, and comparison across approaches becomes challenging with no standard benchmark. In this work, we first provide a benchmark that enables a comparison of existing approaches on the same ground. Next, we study five different aspects of self-supervised learning important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis. To facilitate this study, we focus on seven different methods along with seven different network architectures and perform an extensive set of experiments on five different datasets with an evaluation of two different downstream tasks. We present several interesting insights from this study which span different properties of pretraining and target datasets, pretext tasks, and model architectures, among others. We further put some of these insights to the test and propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches which use 10x pretraining data. We believe this work will pave the way for researchers to a better understanding of self-supervised pretext tasks in video representation learning. |
Akash Kumar · Ashlesha Kumar · Vibhav Vineet · Yogesh Rawat 🔗 |
-
|
Neurosymbolic Grounding for Compositional Generalization
(
Poster
)
link »
We introduce COSMOS, a framework for object-centric world modeling that is designed for compositional generalization (CG), i.e., high performance on unseen input scenes obtained through the composition of known visual “atoms.” The central insight behind COSMOS is the use of a novel form of neurosymbolic grounding. Specifically, the framework introduces two new tools: (i) neurosymbolic scene encodings, which represent each entity in a scene using a real vector computed using a neural encoder, as well as a vector of composable symbols describing attributes of the entity, and (ii) a neurosymbolic attention mechanism that binds these entities to learned rules of interaction. COSMOS is end-to-end differentiable; also, unlike traditional neurosymbolic methods that require representations to be manually mapped to symbols, it computes an entity’s symbolic attributes using vision-language foundation models. Through an evaluation that considers two different forms of CG on an established blocks-pushing domain, we show that the framework establishes a new state-of-the-art for CG in world modeling. |
Atharva Sehgal · Arya Grayeli · Jennifer J Sun · Swarat Chaudhuri 🔗 |
-
|
No Representation Rules Them All in Category Discovery
(
Poster
)
link »
In this paper we tackle the problem of Generalized Category Discovery (GCD). Given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognize that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are using the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named 'Clevr-4', for category discovery. 'Clevr-4' contains four equally valid partitions of the data, i.e. based on object shape, texture, color or count. To solve the task, models are required to extrapolate the taxonomy specified by the labelled set, rather than simply latching onto a single natural grouping of the data. We use this dataset to demonstrate the limitations of unsupervised clustering in the GCD setting; showing that even very strong unsupervised models fail on 'Clevr-4', and further reveal that they each have characteristic biases from their pre-training. We also use 'Clevr-4' to examine the weaknesses of existing GCD algorithms, and propose a new method which addresses these shortcomings, outperforming state-of-the-art models on 'Clevr-4' and the challenging Semantic Shift Benchmark. |
Sagar Vaze · Andrea Vedaldi · Andrew Zisserman 🔗 |
-
|
Application of Self Supervised Vision Transformers for Multiplexed Microscopy Images and Its Challenges
(
Poster
)
link »
Multiplexed microscopy imaging enables the simultaneous use of numerous fluorescent markers on one biological sample. This technique is especially useful in cancer research, cellular and molecular biology, and drug discovery. Studying these microscopic images is challenging due to the large scale of datasets, number of channels that exceeds the natural imaging domain, and the lack of annotations. In this work, we applied a self-supervised learning method for representation learning, and then studied the quality of the learned representations visually and by classification tasks. Results show that although the model creates similar feature embeddings for the same metadata labels, the model also captures some technical variation between slides. |
Gantugs Atarsaikhan · Isabel Mogollon · Katja Välimäki · Teijo Pellinen · Tuomas Mirtti · Lassi Paavolainen 🔗 |
-
|
HyperMAE: Modulating Implicit Neural Representations for MAE Training
(
Poster
)
link »
Implicit Neural Representations (INRs) have been applied successfully for reconstruction tasks in computer vision. However, to the best of our knowledge, using INRs for self-supervised visual recognition has not been explored. In this work, we propose HyperMAE, an INR version of the masked autoencoder (MAE). HyperMAE combines a transformer and a coordinate-MLP to form an efficient decoder architecture that maps the patch coordinates to all pixels in the patch, conditioned on the encoder outputs. The conditioning is implemented as the weight modulation of the coordinate-MLP, which is an INR of the image. Compared with the standard MAE, HyperMAE achieves comparable ImageNet-1k finetuning accuracy with only 72.9% pretraining time using 56.5% GPU memory and 46.5% pretraining time using 88.6% GPU memory. We hope our work could inspire further investigation on INRs for self-supervised learning. The code is available at https://github.com/cvlab-stonybrook/HyperMAE. |
Varun Belagali · Lei Zhou · Xiang Li · Dimitris Samaras 🔗 |
-
|
The Triad of Failure Modes and a Possible Way Out
(
Poster
)
link »
We present a novel objective function for cluster-based self-supervised learning (SSL) that is designed to circumvent the triad of failure modes, namely representation collapse, cluster collapse, and the problem of invariance to permutations of cluster assignments. This objective consists of three key components: (i) a generative term that penalizes representation collapse, (ii) a term that promotes invariance to data augmentations, thereby addressing the issue of label permutations, and (iii) a uniformity term that penalizes cluster collapse. Moreover, our proposed objective possesses two notable advantages. Firstly, it can be interpreted from a Bayesian perspective as a lower bound on the data log-likelihood. Secondly, it enables the training of a standard backbone architecture without the need for asymmetric elements like stop gradients, momentum encoders, or specialized clustering layers. Due to its simplicity and theoretical foundation, our proposed objective is well-suited for optimization. Experiments on both toy and real world data demonstrate its effectiveness. |
Emanuele Sansone 🔗 |
-
|
Identifiable attribution maps using regularized contrastive learning
(
Poster
)
link »
Gradient-based attribution methods aim to explain decisions of deep learning models, but so far lack identifiability guarantees. Here, we propose a method to generate attribution maps with identifiability guarantees by developing a regularized contrastive learning algorithm trained on time series data with continuous target labels. We show theoretically that our formulation of hybrid contrastive learning has favorable properties for identifying the Jacobian matrix of the data generating process, and is unable to overfit to random training distributions. Empirically, we demonstrate robust approximation of the ground-truth attribution map on synthetic data, and significant improvements over previous attribution methods based on feature ablation, Shapley values, and other gradient-based methods. Our work constitutes a first example of identifiable inference of attribution maps, and opens avenues for improving future attribution tools and better understanding neural dynamics and neural networks. |
Steffen Schneider · Rodrigo González Laiz · Markus Frey · Mackenzie Mathis 🔗 |
-
|
On Improving the Sample Efficiency of Non-Contrastive SSL
(
Poster
)
link »
In this work, we provide theoretical insights on the implicit bias of the BarlowTwins and VICReg loss that can explain these heuristics and guide the development of more principled recommendations. Our first insight is that the orthogonality of the features is more important than projector dimensionality for learning good representations. Based on this, we empirically demonstrate that low-dimensional projector heads are sufficient with appropriate regularization, contrary to the existing heuristic. Our second theoretical insight suggests that using multiple data augmentations better represents the desiderata of the SSL objective. Based on this, we demonstrate that leveraging more augmentations per sample improves representation quality and trainability. In particular, it improves optimization convergence, leading to better features emerging earlier in the training. Remarkably, we demonstrate that we can reduce the pretraining dataset size by up to 4x while maintaining accuracy and improving convergence simply by using more data augmentations. Combining these insights, we present pretraining recommendations that improve wall-clock time by 2x and downstream performance on CIFAR-10/STL-10 datasets. |
Kumar Krishna Agrawal · Arna Ghosh · Adam Oberman · Blake Richards 🔗 |
-
|
Perceptual Group Tokenizer: Building Perception with Iterative Grouping
(
Poster
)
link »
The human visual recognition system shows an astonishing capability for compressing visual information into a set of tokens containing rich representations without label supervision. One critical driving principle behind it is perceptual grouping. Although perceptual grouping was widely used in computer vision in the early 2010s, it remains a mystery whether it can be leveraged to derive a neural visual recognition backbone that generates equally powerful representations. In this paper, we propose the Perceptual Group Tokenizer, a model that relies entirely on grouping operations to extract visual features and perform self-supervised representation learning, where a series of grouping operations is used to iteratively hypothesize the context for pixels or superpixels to refine feature representations. We show that the proposed model can achieve competitive performance compared to state-of-the-art vision architectures, and inherits desirable properties including adaptive computation without re-training, and interpretability. Specifically, Perceptual Group Tokenizer achieves 79.7% on the ImageNet-1K self-supervised learning benchmark with linear probe evaluation, marking new progress under this paradigm. |
Zhiwei Deng · Ting Chen · Yang Li 🔗 |
-
|
On the Varied Faces of Overparameterization in Supervised and Self-Supervised Learning
(
Poster
)
link »
The quality of the representations learned by neural networks depends on several factors, including the loss function, learning algorithm, and model architecture. In this work, we use information geometric measures to assess the representation quality in a principled manner. We demonstrate that the sensitivity of learned representations to input perturbations, measured by the spectral norm of the feature Jacobian, provides valuable information about downstream generalization. On the other hand, measuring the coefficient of spectral decay observed in the eigenspectrum of feature covariance provides insights into the global representation geometry. First, we empirically establish an equivalence between these notions of representation quality and show that they are inversely correlated. Second, our analysis reveals the varying roles that overparameterization plays in improving generalization. Unlike supervised learning, we observe that increasing model width leads to higher discriminability and less smoothness in the self-supervised regime. Furthermore, we report that there is no observable double descent phenomenon in SSL with non-contrastive objectives for commonly used parameterization regimes, which opens up new opportunities for tight asymptotic analysis. Taken together, our results provide a loss-aware characterization of the different role of overparameterization in supervised and self-supervised learning. |
Matteo Gamba · Arna Ghosh · Kumar Krishna Agrawal · Blake Richards · Hossein Azizpour · Mårten Björkman 🔗 |
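One way to read the "coefficient of spectral decay" in the abstract above is as the exponent α of a power law λ_i ∝ i^(-α) fitted to the eigenvalues of the feature covariance. The sketch below assumes the eigenvalues are already available and uses an ordinary least-squares fit in log-log space; the function name `decay_coefficient` and the synthetic spectrum are hypothetical, and this is not necessarily the estimator the authors use.

```python
import math

def decay_coefficient(eigenvalues):
    """Estimate alpha in lambda_i ~ i^(-alpha) by least squares in log-log space."""
    # Keep strictly positive eigenvalues, sorted from largest to smallest.
    lams = sorted((v for v in eigenvalues if v > 0), reverse=True)
    if len(lams) < 2:
        raise ValueError("need at least two positive eigenvalues")
    xs = [math.log(i) for i in range(1, len(lams) + 1)]  # log rank
    ys = [math.log(v) for v in lams]                      # log eigenvalue
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    # The fitted slope equals -alpha, so negate it.
    return -cov_xy / var_x

# Synthetic spectrum with a known exponent of 1.5 (illustrative data only).
spectrum = [i ** -1.5 for i in range(1, 101)]
alpha = decay_coefficient(spectrum)
```

A larger α means the spectrum decays faster, i.e. variance is concentrated in fewer directions of the representation space.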
-
|
Learning Useful Representations of Recurrent Neural Network Weight Matrices
(
Poster
)
link »
Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How can we learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the "mechanistic approach" directly looks at some RNN's weights to predict its behavior, the "functionalist approach" analyzes its overall functionality, specifically its input-output mapping. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. Our novel theoretical framework for the functionalist approach demonstrates conditions under which it can generate rich representations that help determine RNN behavior. RNN weight representations generated by mechanistic and functionalist approaches are compared by evaluating them in two downstream tasks. Our results show the superiority of functionalist methods. |
Vincent Herrmann · Francesco Faccio · Jürgen Schmidhuber 🔗 |
-
|
MolSiam: Simple Siamese Self-supervised Representation Learning for Small Molecules
(
Poster
)
link »
We investigate a self-supervised learning technique from the Simple Siamese (SimSiam) Representation Learning framework on 2D molecule graphs. SimSiam does not require negative samples during training, making it 1) more computationally efficient and 2) less vulnerable to faulty negatives compared with contrastive learning. Leveraging unlabeled molecular data, we demonstrate that our approach, MolSiam, effectively captures the underlying features of molecules and shows that those with similar properties tend to cluster in UMAP analysis. By fine-tuning pre-trained MolSiam models, we observe performance improvements across four downstream therapeutic property prediction tasks without training with negative pairs. |
Joshua Yao-Yu Lin · Michael Maser · Nathan Frey · Gabriele Scalia · Omar Mahmood · Pedro O. Pinheiro · Ji Won Park · Stephen Ra · Andrew Watkins · Kyunghyun Cho 🔗 |
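For readers unfamiliar with SimSiam, the objective that MolSiam builds on is a symmetrized negative cosine similarity between a predictor output from one view and the (gradient-stopped) projection of the other view, with no negative pairs. The sketch below shows only that loss value on plain Python lists; the function names and toy vectors are hypothetical, the stop-gradient appears only as a comment because there is no autodiff here, and the molecular graph encoder is not modeled.

```python
import math

def neg_cosine(p, z):
    """Negative cosine similarity between a prediction p and a target z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def simsiam_loss(p1, z1, p2, z2):
    """Symmetrized SimSiam loss for one pair of augmented views.

    p1, p2: predictor outputs for view 1 and view 2.
    z1, z2: projector outputs; in a real framework these targets are
            detached (stop-gradient) before entering the loss.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Toy vectors (illustrative only): perfectly aligned views give the minimum, -1.
aligned = simsiam_loss([1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [0.5, 0.0])
# Orthogonal views give a loss of 0.
orthogonal = simsiam_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

Because only positive pairs enter the loss, no batch of negatives has to be sampled or stored, which is the source of the efficiency argument in the abstract.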
-
|
Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations
(
Poster
)
link »
Self-supervised representation learning often uses data augmentations to induce some invariance to “style” attributes of the data. However, with downstream tasks generally unknown at training time, it is difficult to deduce a priori which attributes of the data are indeed “style” and can be safely discarded. To address this, we introduce a more principled approach that seeks to disentangle style features rather than discard them. The key idea is to add multiple style embedding spaces where: (i) each is invariant to all-but-one augmentation; and (ii) joint entropy is maximized. We empirically demonstrate the benefits of our approach on synthetic datasets and then present promising but limited results on ImageNet. |
Cian Eastwood · Julius von Kügelgen · Linus Ericsson · Diane Bouchacourt · Pascal Vincent · Bernhard Schölkopf · Mark Ibrahim 🔗 |
-
|
Does Unconstrained Unlabeled Data Help Semi-Supervised Learning?
(
Poster
)
link »
In this work, we study the role of unconstrained unlabeled data in semi-supervised learning and propose a semi-supervised learning framework which can learn effective representations from such unlabeled data. Most existing semi-supervised methods rely on the assumption that labeled and unlabeled samples are drawn from the same distribution, which limits the potential for improvement through the use of free-living unlabeled data. Consequently, the generalizability and scalability of semi-supervised learning are often hindered by this assumption. Our method, UnMixMatch, aims to overcome these constraints and effectively utilize unconstrained unlabeled data in semi-supervised learning. UnMixMatch consists of three main components: a supervised learner with hard augmentations that provides strong regularization, a contrastive consistency regularizer to learn underlying representations from the unlabeled data, and a self-supervised loss to enhance the representations that are learnt from the unlabeled data. We perform extensive experiments on 4 commonly used datasets and demonstrate superior performance over existing semi-supervised methods with a performance boost of 4.79%. Extensive ablation and sensitivity studies show the effectiveness of each of the proposed components of our method. |
Shuvendu Roy · Ali Etemad 🔗 |
-
|
Self-supervised Learning for User Sequence Modeling
(
Poster
)
link »
Self-supervised learning (SSL) has proven to be very effective in learning representations from unlabeled data, especially in vision and NLP tasks. We aim to transfer this success to user sequence modeling, where users perform a sequence of actions from a large discrete domain (e.g. video views, movie ratings). Since the data type is completely different from images or natural language, we can no longer use pretrained foundation models and must find an efficient way to train from scratch. In this work, we propose an adaptation of Barlow Twins, with a suitable augmentation method and architecture for user sequence data. We evaluate our method on the MovieLens 1M, MovieLens 20M, and Yelp datasets, observing an 8%-20% improvement in accuracy on three downstream tasks compared to the dual encoder model, which is commonly used for user modeling in recommendation systems. Our method can help to learn useful sequence-level information for user modeling, and it is especially beneficial with limited labeled data. |
Yuhan Liu · Lin Ning · Neo Wu · Karan Singhal · Philip Mansfield · Devora Berlowitz · Bradley Green 🔗 |
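As background for the abstract above, the standard Barlow Twins objective pushes the cross-correlation matrix between the embeddings of two augmented views toward the identity: diagonal entries toward 1 (invariance) and off-diagonal entries toward 0 (redundancy reduction). The sketch below implements that standard loss on small Python lists; it is not the authors' adaptation for user sequences, the names and toy batches are hypothetical, and the weight `lam` is an illustrative value.

```python
import math

def standardize(batch):
    """Center and scale each embedding dimension across the batch.

    Returns a list of d columns, each of length n. Assumes no dimension
    is constant across the batch (otherwise the std would be zero).
    """
    n, d = len(batch), len(batch[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        cols.append([(v - mean) / std for v in col])
    return cols

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Standard Barlow Twins loss for two batches of view embeddings."""
    cols_a, cols_b = standardize(z_a), standardize(z_b)
    n, d = len(z_a), len(cols_a)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix.
            c_ij = sum(a * b for a, b in zip(cols_a[i], cols_b[j])) / n
            if i == j:
                loss += (1.0 - c_ij) ** 2   # invariance term
            else:
                loss += lam * c_ij ** 2     # redundancy-reduction term
    return loss

# Toy batch of 4 embeddings with 2 uncorrelated dimensions (illustrative only).
z = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
z_flipped = [[-a, b] for a, b in z]  # first dimension anti-correlated

loss_same = barlow_twins_loss(z, z)
loss_flipped = barlow_twins_loss(z, z_flipped)
```

Identical, decorrelated views reach the minimum of zero, while flipping the sign of one dimension in the second view is penalized through the diagonal term.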
-
|
LiDAR: Sensing Linear Probing Performance in Joint Embedding SSL Architectures
(
Poster
)
link »
Joint embedding (JE) architectures have emerged as a promising avenue for acquiring transferable data representations. A key obstacle to using JE methods, however, is the inherent challenge of evaluating learned representations without access to a downstream task and an annotated dataset. Without efficient and reliable evaluation, it is difficult to iterate on architectural and training choices for JE methods. In this paper, we introduce LiDAR (Linear Discriminant Analysis Rank), a metric designed to measure the quality of representations within JE architectures. Our metric addresses several shortcomings of recent approaches based on feature covariance rank by discriminating between informative and uninformative features. In essence, LiDAR quantifies the rank of the Linear Discriminant Analysis (LDA) matrix associated with the surrogate SSL task, a measure that intuitively captures the information content as it pertains to solving the SSL task. We empirically demonstrate that LiDAR significantly surpasses naive rank-based approaches in its predictive power of optimal hyperparameters. Our proposed criterion presents a more robust and intuitive means of assessing the quality of representations within JE architectures, which we hope facilitates broader adoption of these powerful techniques in various domains. |
Vimal Thilak · Chen Huang · Omid Saremi · Laurent Dinh · Hanlin Goh · Preetum Nakkiran · Joshua Susskind · Etai Littwin 🔗 |
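Rank-based metrics of this kind typically replace the integer matrix rank with a soft version computed from eigenvalues. The sketch below shows one such notion, the effective rank (the exponential of the Shannon entropy of the normalized eigenvalues). It assumes the eigenvalues of the matrix of interest, for example an LDA matrix, have already been computed; the function name is hypothetical and the abstract does not confirm that this exact formula is the one used in LiDAR.

```python
import math

def effective_rank(eigenvalues):
    """Exponential of the entropy of the normalized positive eigenvalues."""
    positive = [v for v in eigenvalues if v > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("need at least one positive eigenvalue")
    probs = [v / total for v in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Illustrative spectra only.
flat = effective_rank([1.0, 1.0, 1.0, 1.0])       # all directions equally used
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])  # a single dominant direction
```

A flat spectrum yields a value close to the number of dimensions, while a collapsed representation yields a value close to 1.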
-
|
Multi-Task Learning with Self-Supervised Objectives can Improve Worst-Group Outcomes
(
Poster
)
link »
In order to create machine learning systems that serve a variety of users well, it is important to not only achieve high performance on average but also ensure equitable outcomes across diverse groups. In this paper, we explore the potential of multi-task learning (MTL) with self-supervised objectives as a tool to address the challenge of group-wise fairness. We show that by regularizing the joint representation space during multi-tasking, we are able to obtain improvements on worst-group error. Through comprehensive experiments across NLP and CV datasets, we demonstrate that regularized multi-tasking with self-supervised learning competes favorably with state-of-the-art distributionally robust optimization methods. Our approach -- without introducing data external to the end-task -- improves worst-case group accuracy over empirical risk minimization by as much as $\sim4\%$ on average in settings where group annotations are completely unavailable.
|
Atharva Kulkarni · Lucio M Dery · Amrith Setlur · Aditi Raghunathan · Ameet Talwalkar · Graham Neubig 🔗 |
-
|
Can semi-supervised learning use all the data effectively? A lower bound perspective
(
Poster
)
link »
In the semi-supervised learning (SSL) setting both labeled and unlabeled datasets are available to the learning algorithm. While it is well-established from prior theoretical and empirical works that the inclusion of unlabeled data can help to improve over the error of supervised learning algorithms, existing theoretical examinations of SSL suggest a limitation: these algorithms might not efficiently leverage labeled data beyond a certain threshold. In this study, we derive a tight lower bound for 2-Gaussian mixture model distributions which exhibits an explicit dependence on the sizes of both the labeled and the unlabeled dataset. Surprisingly, our lower bound indicates that no SSL algorithm can surpass the sample complexities of minimax optimal supervised (SL) or unsupervised learning (UL) algorithms, which exclusively use either the labeled or the unlabelled dataset, respectively. Despite a change in the statistical error rate being unattainable, SSL can still outperform both SL and UL (up to permutation) in terms of absolute error. To this end, we provide evidence that there exist algorithms that can provably achieve lower error than both SL and UL algorithms. We validate our theoretical findings through linear classification experiments on synthetic and real-world data. |
Alexandru Tifrea · Gizem Yüce · Amartya Sanyal · Fanny Yang 🔗 |
-
|
Self-Supervised Image Captioning with CLIP
(
Poster
)
link »
Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. Current image captioning approaches heavily rely on high-quality image-caption pairs, which can be hard to obtain for many domains. To address this, we introduce a self-supervised image captioning method. After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data, leveraging the auxiliary task of enhancing the CLIP relevance between images and generated captions. Remarkably, despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset. Human evaluations further reveal that our method produces captions with greater distinctiveness and informativeness, two attributes inherently challenging to achieve through supervised learning. |
Chuanyang Jin 🔗 |
-
|
Exploring Data Augmentations on Self-/Semi-/Fully- Supervised Pre-trained Models
(
Poster
)
link »
Data augmentation has become a standard component of vision pre-trained models to capture the invariance between augmented views. In practice, augmentation techniques that mask regions of a sample with zero/mean values or patches from other samples are commonly employed in pre-trained models with self-/semi-/fully-supervised contrastive losses. However, the underlying mechanism behind the effectiveness of these augmentation techniques remains poorly explored. To investigate this problem, we conduct an empirical study to quantify how data augmentation affects performance. Concretely, we apply 4 types of data augmentations, namely $\texttt{Random Erasing}$, $\texttt{CutOut}$, $\texttt{CutMix}$ and $\texttt{MixUp}$, to a series of self-/semi-/fully-supervised pre-trained models. We report their performance on vision tasks such as image classification, object detection, instance segmentation, and semantic segmentation. We then explicitly evaluate the invariance and diversity of the feature embedding. We observe that: 1) Masking regions of the images decreases the invariance of the learned feature embedding while providing a more considerable diversity. 2) Manual annotations do not change the invariance or diversity of the learned feature embedding. 3) The $\texttt{MixUp}$ approach improves the diversity significantly, with only a marginal decrease in terms of the invariance.
|
Shentong Mo · Zhun Sun · Chao Li 🔗 |
-
|
Simple Contrastive Representation Learning for Time Series Forecasting
(
Poster
)
link »
Contrastive learning methods have shown an impressive ability to learn meaningful representations for image and time series classification. However, these methods are less effective for time series forecasting, as optimization of instance discrimination is not directly applicable to predicting the future outcomes from the historical context. To address these limitations, we propose SimTS, a simple representation learning approach for improving time series forecasting by learning to predict the future from the past in the latent space. SimTS exclusively uses positive pairs and does not depend on negative pairs or specific characteristics of a given time series. In addition, we show the shortcomings of the common contrastive learning frameworks used for time series forecasting through a detailed ablation study. Overall, our work suggests that SimTS is a promising alternative to other contrastive learning approaches for time series forecasting. |
Xiaochen Zheng · Xingyu Chen · Manuel Schürch · Maolaaisha Aminanmu · Ahmed Allam · Michael Krauthammer 🔗 |
-
|
Iterated Piecewise Affine (IPA) Approximation for Language Modeling
(
Poster
)
link »
In this work, we demonstrate the application of a first-order Taylor expansion to approximate a generic function $F: R^{n \times m} \to R^{n \times m}$ and utilize it in language modeling. To enhance the basic Taylor expansion, we introduce iteration and piecewise modeling, leading us to name the algorithm the Iterative Piecewise Affine (IPA) approximation. The final algorithm exhibits interesting resemblances to the Transformer decoder architecture. By comparing parameter arrangements in IPA and Transformers, we observe a strikingly similar performance, with IPA outperforming Transformers by 1.5% in the next token prediction task with cross-entropy loss for smaller sequence lengths.
|
Davood Shamsi · Wen-yu Hua · Brian Williams 🔗 |
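For reference, the first-order Taylor expansion that the abstract above starts from can be written as follows. This is the textbook form, stated here for a map $F: R^{n \times m} \to R^{n \times m}$ by flattening matrices with $\mathrm{vec}(\cdot)$; the expansion point $X_0$ and the Jacobian symbol $J_F$ are notation introduced for this illustration, and the iteration and piecewise modeling steps specific to IPA are not reproduced.

```latex
\mathrm{vec}\bigl(F(X)\bigr) \;\approx\; \mathrm{vec}\bigl(F(X_0)\bigr)
  + J_F(X_0)\,\mathrm{vec}(X - X_0),
\qquad
J_F(X_0) = \left.\frac{\partial\,\mathrm{vec}(F)}{\partial\,\mathrm{vec}(X)}\right|_{X = X_0}
  \in R^{nm \times nm}
```

The right-hand side is an affine function of $X$, which is why an approximation built from such local expansions is called piecewise affine.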
-
|
CCA with Shared Weights for Self-Supervised Learning
(
Poster
)
link »
In this paper, we explore SSL-EY (Self-Supervised Learning with an Eckart-Young characterization), a novel self-supervised learning loss function directly inspired by Deep Canonical Correlation Analysis (DCCA). Our key insight is that maximizing the correlation of learned representations can serve as both an effective and interpretable objective in self-supervised learning. We demonstrate that SSL-EY not only strengthens the theoretical underpinning of existing methods, such as Barlow Twins and VICReg, but also performs competitively on benchmark datasets. |
James Chapman · Lennie Wells 🔗 |
-
|
Understanding Self-Supervised Features for Learning Unsupervised Instance Segmentation
(
Poster
)
link »
Self-supervised learning (SSL) can be used to solve complex visual tasks without human labels. Self-supervised representations encode useful semantic information about images, and as a result, they have already been used for tasks such as unsupervised semantic segmentation. In this paper, we investigate self-supervised representations for instance segmentation without any manual annotations. We find that the features of different SSL methods vary in their level of instance-awareness. In particular, DINO features, which are known to be excellent semantic descriptors, lag behind MAE features in their sensitivity for separating instances. |
Paul Engstler · Luke Melas-Kyriazi · Christian Rupprecht · Iro Laina 🔗 |
-
|
Learn to Categorize or Categorize to Learn? Self-Coding for Generalized Category Discovery
(
Poster
)
link »
In the quest for unveiling novel categories at test time, we confront the inherent limitations of traditional supervised recognition models that are restricted by a predefined category set. While strides have been made in the realms of self-supervised and open-world learning towards test-time category discovery, a crucial yet often overlooked question persists: what exactly delineates a category? In this paper, we conceptualize a category through the lens of optimization. Harnessing this unique conceptualization, we propose a novel, efficient and self-supervised method capable of discovering previously unknown categories at test time. A salient feature of our approach is the assignment of minimum length category codes to individual data instances, which encapsulates the implicit category hierarchy prevalent in real-world datasets, especially fine-grained categories. Experimental comparisons with the state of the art testify to the efficacy of our solution in managing unknown categories at test time. Furthermore, we fortify our proposition with a theoretical foundation, providing proof of its optimality. Our code is available at: https://github.com/SarahRastegar/InfoSieve. |
Sarah Rastegar · Hazel Doughty · Cees Snoek 🔗 |
-
|
Zero-shot Clustering of Embeddings with Self-Supervised Learnt Encoders
(
Poster
)
link »
We explore whether self-supervised pretrained models can provide a useful representation space for datasets they were not trained on, and whether these representations can be used to group novel unlabelled data into meaningful clusters. To this end, we conduct experiments using image representation encoders pretrained on ImageNet using a variety of self-supervised training techniques. These encoders are deployed on image datasets that were not seen during training, without fine-tuning, and we investigate whether their embeddings can be clustered with conventional clustering algorithms. We find that it is possible to create well-defined clusters using self-supervised feature encoders, especially when using the Agglomerative Clustering method, and that it is possible to do so even for very fine-grained datasets such as NABirds. We also find indications that the Silhouette score is a good proxy of cluster quality when no ground-truth is available. |
Scott Lowe · Joakim Bruslund Haurum · Sageev Oore · Thomas Moeslund · Graham Taylor 🔗 |
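The Silhouette score mentioned in the abstract above needs only the embeddings and the cluster assignments, which is why it can act as a label-free proxy for cluster quality. The sketch below follows the standard definition with Euclidean distance on a handful of toy 2D points; the function is a hypothetical pure-Python illustration rather than the authors' evaluation code, and in practice a library routine such as scikit-learn's `sklearn.metrics.silhouette_score` would normally be used.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)

# Toy embeddings forming two well-separated groups (illustrative only).
embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(embeddings, [0, 0, 1, 1])  # matches the true groups
bad = silhouette_score(embeddings, [0, 1, 0, 1])   # mixes the groups
```

Scores near 1 indicate compact, well-separated clusters, and negative scores indicate that points sit closer to another cluster than to their own.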
-
|
SurgMAE: Masked Autoencoders for Long Surgical Video Analysis
(
Poster
)
link »
There has been a growing interest in using deep learning models for processing long surgical videos, in order to automatically detect specific clinical/operational activities and extract metrics that can enable workflow efficiency tools and applications. However, training such models requires vast amounts of labeled data, which is costly and not scalable. Recently, self-supervised learning has been explored in the computer vision community to reduce the burden of the annotation cost. Masked autoencoders (MAE) have attracted attention in the self-supervised paradigm for Vision Transformers (ViTs) by predicting the randomly masked regions given the visible patches of an image or a video clip, and have shown superior performance on benchmark datasets. However, the application of MAE in surgical data remains unexplored. In this paper, we first investigate whether MAE can learn transferable representations in the surgical video domain. We propose SurgMAE, which is a novel architecture with an intelligent masking strategy based on sampling tokens corresponding to high-information spatio-temporal regions, unlike the random and tube masking used for MAE. We provide an empirical study of SurgMAE on two large-scale long surgical video datasets, and find that our method outperforms several baselines in the low-data regime. We conduct extensive ablation studies to show the efficacy of our approach and also demonstrate its superior performance on UCF-101 to prove its generalizability to non-surgical datasets as well. |
Muhammad Abdullah Jamal · Omid Mohareri 🔗 |
-
|
Bridging State and History Representations: Understanding Self-Predictive RL
(
Poster
)
link »
Representations are at the core of all deep reinforcement learning (RL) methods for both Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). Many representation learning methods and theoretical frameworks have been developed to understand what constitutes an effective representation. However, the relationships between these methods and the shared properties among them remain unclear. In this paper, we show that many of these seemingly distinct methods and frameworks for state and history abstractions are, in fact, based on a common idea of self-predictive abstraction. Furthermore, we provide theoretical insights into the widely adopted stop-gradient technique for learning self-predictive representations. |
Tianwei Ni · Benjamin Eysenbach · Erfan SeyedSalehi · Michel Ma · Clement Gehring · Aditya Mahajan · Pierre-Luc Bacon 🔗 |
-
|
Scaling may be all you need for achieving human-level object recognition with human-like visual experience
(
Poster
)
link »
This paper asks whether current self-supervised learning methods, if sufficiently scaled up, would be able to reach human-level visual object recognition capabilities with the same type and amount of visual experience humans learn from. Previous work on this question only considered the scaling of data size. Here, we consider the simultaneous scaling of data size, model size, and image resolution. We perform a scaling experiment with vision transformers up to 633M parameters in size (ViT-H/14) trained with up to 5K hours of human-like video data (long, continuous, mostly egocentric videos) with image resolutions of up to 476$\times$476 pixels. The efficiency of masked autoencoders (MAEs) as a self-supervised learning algorithm makes it possible to run this scaling experiment on an unassuming academic budget. We find that it is feasible to reach human-level object recognition capacity at sub-human scales of model size, data size, and image size, if these factors are scaled up simultaneously. To give a concrete example, we estimate that a 2.5B parameter ViT model trained with 20K hours (2.3 years) of human-like video data with a spatial resolution of 952$\times$952 pixels should be able to reach roughly human-level accuracy on ImageNet. Human-level competence is thus achievable for a fundamental perceptual capability from human-like perceptual experience (human-like in both amount and type) with extremely generic learning algorithms and architectures and without any substantive inductive biases.
|
Emin Orhan 🔗 |
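The "20K hours (2.3 years)" figure in the abstract above can be checked with a one-line conversion. The sketch below assumes the hours are counted as continuous, around-the-clock experience (24 hours a day, 365 days a year), which is an assumption that reproduces the abstract's number; the helper name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # continuous experience, no sleep discount (assumption)

def hours_to_years(hours):
    """Convert a number of video hours into years of continuous experience."""
    return hours / HOURS_PER_YEAR

years_20k = hours_to_years(20_000)  # extrapolated data size from the abstract
years_5k = hours_to_years(5_000)    # largest data size actually used in the experiment

# The extrapolated resolution is twice the largest tested one per side.
resolution_ratio = 952 / 476
```

Under this assumption, 20K hours rounds to 2.3 years, and the 5K hours used in the experiments correspond to a little over half a year.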
-
|
Augmentation-aware Self-Supervised Learning with Conditioned Projector
(
Poster
)
link »
Self-supervised learning (SSL) is a powerful technique for learning robust representations from unlabeled data. By learning to remain invariant to applied data augmentations, methods such as SimCLR and MoCo are able to reach quality on par with supervised approaches. However, this invariance may be harmful to solving some downstream tasks which depend on traits affected by augmentations used during pretraining, such as color. In this paper, we propose to foster sensitivity to such characteristics in the representation space by modifying the projector network, a common component of self-supervised architectures. Specifically, we supplement the projector with information about augmentations applied to images. In order for the projector to take advantage of this auxiliary conditioning when solving the SSL task, the feature extractor learns to preserve the augmentation information in its representations. Our approach, coined Conditional Augmentation-aware Self-supervised Learning (CASSLE), is directly applicable to typical joint-embedding SSL methods regardless of their objective functions. Moreover, it does not require major changes in the network architecture or prior knowledge of downstream tasks. In addition to an analysis of sensitivity towards different data augmentations, we conduct a series of experiments, which show that CASSLE improves over various SSL methods, reaching state-of-the-art performance in multiple downstream tasks. |
Marcin Przewięźlikowski · Mateusz Pyla · Bartosz Zieliński · Bartłomiej Twardowski · Jacek Tabor · Marek Śmieja 🔗 |
-
|
BarcodeBERT: Transformers for Biodiversity Analysis
(
Poster
)
link »
Understanding biodiversity is a global challenge, in which DNA barcodes—short snippets of DNA that cluster by species—play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT |
Pablo Millan Arias · Niousha Sadjadi · Monireh Safari · ZeMing Gong · Austin T. Wang · Scott Lowe · Joakim Bruslund Haurum · Iuliia Zarubiieva · Dirk Steinke · Lila Kari · Angel Chang · Graham Taylor
|
Author Information
Tengda Han (University of Oxford)
Ishan Misra (Facebook AI Research)
Pengtao Xie (UC San Diego)
Mathilde Caron (Google)
Hilde Kuehne (MIT-IBM Watson AI Lab, University of Bonn)