Workshop
NeurIPS 2022 Workshop on Meta-Learning
Huaxiu Yao · Eleni Triantafillou · Fabio Ferreira · Joaquin Vanschoren · Qi Lei

Fri Dec 02 07:00 AM -- 04:00 PM (PST) @ Theater C

Recent years have seen rapid progress in meta-learning methods, which transfer knowledge across tasks and domains to efficiently learn new tasks, optimize the learning process itself, and even generate new learning methods from scratch. Meta-learning can be seen as the logical conclusion of the arc that machine learning has undergone in the last decade, from learning classifiers, to learning representations, and finally to learning algorithms that themselves acquire representations, classifiers, and policies for acting in environments. In practice, meta-learning has been shown to yield new state-of-the-art automated machine learning methods, novel deep learning architectures, and substantially improved one-shot learning systems. Moreover, improving one’s own learning capabilities through experience can also be viewed as a hallmark of intelligent beings, and neuroscience shows a strong connection between human and reward learning and the growing sub-field of meta-reinforcement learning.

Some of the fundamental questions that this workshop aims to address are:
- What are the meta-learning processes in nature (e.g., in humans), and how can we take inspiration from them?
- What is the relationship between meta-learning, continual learning, and transfer learning?
- What interactions exist between meta-learning and large pretrained / foundation models?
- What principles can we learn from meta-learning to help us design the next generation of learning systems?
- What kind of theoretical principles can we develop for meta-learning?
- How can we exploit our domain knowledge to effectively guide the meta-learning process and make it more efficient?
- How can we design better benchmarks for different meta-learning scenarios?

As prospective participants, we primarily target machine learning researchers interested in the questions and foci outlined above. Specific target communities within machine learning include, but are not limited to: meta-learning, AutoML, reinforcement learning, deep learning, optimization, evolutionary computation, and Bayesian optimization. We also invite submissions from researchers who study human learning and neuroscience, to provide a broad and interdisciplinary perspective to the attendees.

Fri 7:00 a.m. - 7:10 a.m. Opening remarks
Fri 7:10 a.m. - 7:40 a.m. Invited talk: Mengye Ren
Fri 7:40 a.m. - 8:10 a.m. Invited talk: Lucas Beyer
Fri 8:10 a.m. - 8:25 a.m. Contributed talk 1: FiT: Parameter Efficient Few-shot Transfer Learning
Fri 8:25 a.m. - 8:40 a.m. Break
Fri 8:40 a.m. - 9:40 a.m. Poster session 1
Fri 9:40 a.m. - 9:55 a.m. Contributed talk 2: Optimistic Meta-Gradients
Fri 9:55 a.m. - 10:25 a.m. Invited talk: Elena Gribovskaya
Fri 10:25 a.m. - 12:00 p.m. Lunch break
Fri 12:00 p.m. - 12:30 p.m. Invited talk: Chelsea Finn
Fri 12:30 p.m. - 1:00 p.m. Invited talk: Greg Yang
Fri 1:00 p.m. - 1:15 p.m. Contributed talk 3: The Curse of Low Task Diversity: On the Failure of Transfer Learning to Outperform MAML and Their Empirical Equivalence
Fri 1:15 p.m. - 2:15 p.m. Poster session 2
Fri 2:15 p.m. - 2:30 p.m. Contributed talk 4: HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks
Fri 2:30 p.m. - 3:00 p.m. Invited talk: Percy Liang
Fri 3:00 p.m. - 3:50 p.m. Discussion panel
Fri 3:50 p.m. - 4:00 p.m. Closing remarks

- LOTUS: Learning to learn with Optimal Transport in Unsupervised Scenarios (Poster)
Automated machine learning has been widely researched and adopted for supervised tasks such as classification and regression. Unsupervised scenarios, lacking a ground truth to optimize on, are much harder to automate. We propose a novel zero-shot meta-learning approach that recommends which algorithms and hyperparameters to use on new unsupervised tasks by learning from prior supervised proxy datasets. Our premise is that the selection of optimal unsupervised algorithms depends on the inherent properties of the data distribution.
We first build a large meta-dataset evaluating many algorithms and hyperparameter settings on prior datasets, leverage optimal transport to find the prior datasets with the most similar underlying distribution, and then recommend the (tuned) algorithm that proved to work best for that data distribution. We evaluate the robustness of our approach on one particular task, i.e., outlier detection, and find that it outperforms state-of-the-art methods in unsupervised outlier detection.
prabhant singh · Joaquin Vanschoren

- Test-time adaptation with slot-centric models (Poster)
We consider the problem of segmenting scenes into constituent objects and their parts. Current supervised visual detectors, though impressive within their training distribution, often fail to segment out-of-distribution scenes into their constituent entities. Recent test-time adaptation methods use auxiliary self-supervised losses to adapt the network parameters to each test example independently and have shown promising results towards generalization outside the training distribution for the task of image classification. In our work, we find evidence that these losses can be insufficient for instance segmentation tasks, without also considering architectural inductive biases. For image segmentation, recent slot-centric generative models break such dependence on supervision by attempting to segment scenes into entities in a self-supervised manner by reconstructing pixels. Drawing upon these two lines of work, we propose Generating Fast and Slow Networks (GFS-Nets), a semi-supervised instance segmentation model equipped with a slot-centric image rendering component that is adapted per scene at test time through gradient descent on reconstruction or novel view synthesis objectives. We show that test-time adaptation greatly improves segmentation in out-of-distribution scenes.
We evaluate GFS-Nets in scene segmentation benchmarks and show substantial out-of-distribution performance improvements against state-of-the-art supervised feed-forward detectors and self-supervised domain adaptation models.
Mihir Prabhudesai · Sujoy Paul · Sjoerd van Steenkiste · Mehdi S. M. Sajjadi · Anirudh Goyal · Deepak Pathak · Katerina Fragkiadaki · Gaurav Aggarwal · Thomas Kipf

- Meta-Learning Makes a Better Multimodal Few-shot Learner (Poster)
Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. As an effort to bridge this gap, we introduce a meta-learning approach for multimodal few-shot learning, to leverage its strong ability of accruing knowledge across tasks. The full model is based on frozen foundation vision and language models to use their already learned capacity. To translate the visual features into the latent space of the language model, we introduce a light-weight meta-mapper, acting as a meta-learner. By updating only the parameters of the meta-mapper, our model learns to quickly adapt to unseen samples with only a few gradient updates. Unlike prior multimodal few-shot learners, which need a hand-engineered task induction, our model is able to induce the task in a completely data-driven manner. The experiments on recent multimodal few-shot benchmarks demonstrate that our meta-learning approach yields better multimodal few-shot learners while being computationally more efficient compared to its counterparts.
Ivona Najdenkoska · Xiantong Zhen · Marcel Worring

- Efficient Bayesian Learning Curve Extrapolation using Prior-Data Fitted Networks (Poster)
Learning curve extrapolation aims to predict model performance in later epochs of a machine learning training run, based on the performance in the first k epochs.
In this work, we argue that, while the varying difficulty of extrapolating learning curves warrants a Bayesian approach, existing methods are (i) overly restrictive, and/or (ii) computationally expensive. We describe the first application of prior-data fitted neural networks (PFNs) in this context. PFNs use a transformer, pre-trained on data generated from a prior, to perform approximate Bayesian inference in a single forward pass. We present preliminary results, demonstrating that PFNs can more accurately approximate the posterior predictive distribution multiple orders of magnitude faster than MCMC, and obtain a lower average error when predicting the final accuracy of real learning curves from LCBench.
Steven Adriaensen · Herilalaina Rakotoarison · Samuel Müller · Frank Hutter

- Adversarial Cheap Talk (Poster)
Adversarial attacks in reinforcement learning (RL) often assume highly privileged access to the victim’s parameters, environment, or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary can merely append deterministic messages to the Victim’s observation, resulting in a minimal range of influence. The Adversary cannot occlude ground truth, influence underlying environment dynamics or reward signals, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a simple meta-learning algorithm called Adversarial Cheap Talk (ACT) to train Adversaries in this setting. We demonstrate that an Adversary trained with ACT can still significantly influence the Victim’s training and testing performance, despite the highly constrained setting. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms.
More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation, or instead helping the Victim’s performance by outputting useful features. Finally, we show that an ACT Adversary can manipulate messages during train-time to directly and arbitrarily control the Victim at test-time.
Chris Lu · Timon Willi · Alistair Letcher · Jakob Foerster

- Achieving a Better Stability-Plasticity Trade-off via Auxiliary Networks in Continual Learning (Poster)
In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model's performance drops dramatically after being optimized for a new task. To address this, the continual learning community has proposed several solutions aiming to equip the neural network with the ability to learn the current task (plasticity) while still achieving high accuracy on the old tasks (stability). Despite remarkable improvements, the plasticity-stability trade-off is still far from being solved, and its underlying mechanism is poorly understood. In this work, we propose Auxiliary Network Continual Learning (ANCL), a new method that combines the continually learned model with an additional auxiliary network that is solely optimized on the new task. More concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability, surpassing strong baselines on CIFAR-100. By analyzing the solutions of several continual learning methods based on the so-called mode connectivity assumption, we propose a new hyperparameter search technique that dynamically adjusts the regularization parameter to achieve a better stability-plasticity trade-off.
Sanghwan Kim · Lorenzo Noci · Antonio Orvieto · Thomas Hofmann

- Optimistic Meta-Gradients (Poster)
We study the connection between gradient-based meta-learning and convex optimisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single task setting. While a meta-learned update rule can yield faster convergence up to a constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through the recently proposed Bootstrapped Meta-Gradient method, providing deeper insight into its underlying mechanics.
Sebastian Flennerhag · Tom Zahavy · Brendan O'Donoghue · Hado van Hasselt · András György · Satinder Singh

- Transfer NAS with Meta-learned Bayesian Surrogates (Poster)
While neural architecture search (NAS) is an intensely researched area, approaches typically still suffer from either (i) high computational costs or (ii) lack of robustness across datasets and experiments. Furthermore, most methods start searching for an optimal architecture from scratch, ignoring prior knowledge. This is in contrast to the manual design process by researchers and engineers that leverage previous deep learning experiences by, e.g., transferring architectures from previously solved, related problems. We propose to adopt this human design strategy and introduce a novel surrogate for NAS that is meta-learned from prior architecture evaluations across different datasets. We utilize Bayesian Optimization (BO) with deep-kernel Gaussian Processes, graph neural networks for the architecture embeddings and a transformer-based set encoder of datasets. As a result, our method consistently achieves state-of-the-art results on six computer vision datasets, while being as fast as one-shot NAS methods.
Gresa Shala · Thomas Elsken · Frank Hutter · Josif Grabocka

- Gray-Box Gaussian Processes for Automated Reinforcement Learning (Poster)
Despite having achieved spectacular milestones in an array of important real-world applications, most Reinforcement Learning (RL) methods are very brittle concerning their hyperparameters. Notwithstanding the crucial importance of setting the hyperparameters in training state-of-the-art agents, the task of hyperparameter optimization (HPO) in RL is understudied. In this paper, we propose a novel gray-box Bayesian Optimization technique for HPO in RL that enriches Gaussian Processes with reward curve estimations based on generalized logistic functions. We thus about the performance of learning algorithms, transferring information across configurations and about epochs of the learning algorithm. In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: Mujoco, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.
Gresa Shala · André Biedenkapp · Frank Hutter · Josif Grabocka

- AutoRL-Bench 1.0 (Poster)
It is well established that Reinforcement Learning (RL) is very brittle and sensitive to the choice of hyperparameters. This prevents RL methods from being usable out of the box. The field of automated RL (AutoRL) aims at automatically configuring the RL pipeline, both to make RL usable by a broader audience and to reveal its full potential.
Still, there has been little progress towards this goal as new AutoRL methods often are evaluated with incompatible experimental protocols. Furthermore, the typically high cost of experimentation prevents a thorough and meaningful comparison of different AutoRL methods or established hyperparameter optimization (HPO) methods from the automated Machine Learning (AutoML) community. To alleviate these issues, we propose the first tabular AutoRL Benchmark for studying the hyperparameters of RL algorithms. We consider the hyperparameter search spaces of five well established RL methods (PPO, DDPG, A2C, SAC, TD3) across 22 environments for which we compute and provide the reward curves. This enables HPO methods to simply query our benchmark as a lookup table, instead of actually training agents. Thus, our benchmark offers a testbed for very fast, fair, and reproducible experimental protocols for comparing future black-box, gray-box, and online HPO methods for RL.
Gresa Shala · Sebastian Pineda Arango · André Biedenkapp · Frank Hutter · Josif Grabocka

- PersA-FL: Personalized Asynchronous Federated Learning (Poster)
We study the personalized federated learning problem under asynchronous updates. In this problem, each client seeks to obtain a personalized model that simultaneously outperforms local and global models. We consider two optimization-based frameworks for personalization: (i) Model-Agnostic Meta-Learning (MAML) and (ii) Moreau Envelope (ME). MAML involves learning a joint model adapted for each client through fine-tuning, whereas ME requires a bi-level optimization problem with implicit gradients to enforce personalization via regularized losses. We focus on improving the scalability of personalized federated learning by removing the synchronous communication assumption. Moreover, we extend the studied function class by removing boundedness assumptions on the gradient norm.
Our main technical contribution is a unified proof for asynchronous federated learning with bounded staleness that we apply to MAML and ME personalization frameworks. For the class of smooth and non-convex functions, we show the convergence of our method to a first-order stationary point. We illustrate the performance of our method and its tolerance to staleness through experiments for classification tasks over heterogeneous datasets.
M. Taha Toghani · Soomin Lee · Cesar Uribe

- Bayesian Optimization with a Neural Network Meta-learned on Synthetic Data Only (Poster)
Bayesian Optimization (BO) is an effective approach to optimize black-box functions, relying on a probabilistic surrogate to model the response surface. In this work, we propose to use a Prior-data Fitted Network (PFN) as a cheap and flexible surrogate. PFNs are neural networks that approximate the Posterior Predictive Distribution (PPD) in a single forward-pass. Most importantly, they can approximate the PPD for any prior distribution that we can sample from efficiently. Additionally, we show what is required for PFNs to be used in a standard BO setting with common acquisition functions. We evaluated the performance of a PFN surrogate for hyperparameter optimization (HPO), a major application of BO. While the method can still fail for some search spaces, it performs comparably to or better than the state of the art on the HPO-B and PD1 benchmarks.
Samuel Müller · Sebastian Pineda Arango · Matthias Feurer · Josif Grabocka · Frank Hutter

- Recommendation for New Drugs with Limited Prescription Data (Poster)
Drug recommendation assists doctors in prescribing personalized medications to patients based on their health conditions. However, newly approved drugs do not have much historical prescription data and cannot leverage existing drug recommendation methods.
To address this, we propose EDGE, which maintains a drug-dependent multi-phenotype few-shot learner to bridge the gap between existing and new drugs. Experimental results show that EDGE can adapt to the recommendation for a new drug with limited prescription data from a few patients.
Zhenbang Wu · Huaxiu Yao · Zhe Su · David Liebovitz · Lucas Glass · James Zou · Chelsea Finn · Jimeng Sun

- Towards Automated Design of Bayesian Optimization via Exploratory Landscape Analysis (Poster)
Bayesian optimization (BO) algorithms form a class of surrogate-based heuristics, aimed at efficiently computing high-quality solutions for numerical black-box optimization problems. The BO pipeline is highly modular, with different design choices for the initial sampling strategy, the surrogate model, the acquisition function (AF), the solver used to optimize the AF, etc. We demonstrate in this work that a dynamic selection of the AF can benefit the BO design. More precisely, we show that even a naive random forest regression model, built on top of exploratory landscape analysis features that are computed from the initial design points, suffices to recommend AFs that outperform any static choice, when considering performance over the classic BBOB benchmark suite for derivative-free numerical optimization methods on the COCO platform. Our work hence paves a way towards AutoML-assisted, on-the-fly BO designs that adjust their behavior on a run-by-run basis.
Carolin Benjamins · Anja Jankovic · Elena Raponi · Koen van der Blom · Marius Lindauer · Carola Doerr

- One-Shot Optimal Design for Gaussian Process Analysis of Randomized Experiments (Poster)
Bayesian Optimization provides a sample-efficient approach to optimize Internet systems that are evaluated with randomized experiments. Such evaluations are often resource- and time-consuming in order to measure noisy and long-term outcomes. Thus, the initial randomized design, i.e.,
determining the number of test groups and sample sizes, plays a critical role in building an accurate Gaussian Process model for efficient optimization and in decreasing experimentation cost. We develop a simulation-based method with meta-learned priors to decide the optimal design for the initial batch of GP-modeled randomized experiments. The meta-learning is performed on a large corpus of randomized experiments conducted at Meta and obtains sensible GP priors for simulating across different designs. The one-shot optimal design policy is derived by training a machine learning model with simulation data to map experiment characteristics to an optimal design. Our evaluations show that our proposed optimal design significantly improves resource-efficiency while achieving a target GP model accuracy.
Jelena Markovic · Qing Feng · Eytan Bakshy

- Learning to Prioritize Planning Updates in Model-based Reinforcement Learning (Poster)
Prioritizing the states and actions from which policy improvement is performed can improve the sample efficiency of model-based reinforcement learning systems. Although much is already known about prioritizing planning updates, more needs to be understood to operationalize these ideas in complex settings that involve non-stationary and stochastic transition dynamics, large numbers of states, and scalable function approximation architectures. Our paper presents an online meta-learning algorithm to address these needs. The algorithm finds distributions that encode priority in their probability mass. The paper evaluates the algorithm in a domain with a changing goal and with a fixed, generative transition model. Results show that prioritizing planning updates from samples of the meta-learned distribution significantly improves sample efficiency over fixed baseline distributions. Additionally, they point to a number of interesting opportunities for future research.
Brad Burega · John Martin · Michael Bowling

- GraViT-E: Gradient-based Vision Transformer Search with Entangled Weights (Poster)
Differentiable one-shot neural architecture search methods have recently become popular since they can exploit weight-sharing to efficiently search in large architectural search spaces. These methods traditionally perform a continuous relaxation of the discrete search space to search for an optimal architecture. However, they suffer from large memory requirements, making their application to parameter-heavy architectures like transformers difficult. Recently, single-path one-shot methods have been introduced which often use weight entanglement to alleviate this issue by sampling the weights of the sub-networks from the largest model, which is itself the supernet. In this work, we propose a continuous relaxation of weight entanglement-based architectural representation. Our Gradient-based Vision Transformer Search with Entangled Weights (GraViT-E) combines the best properties of both differentiable one-shot NAS and weight entanglement. We observe that our method imparts much better regularization properties and memory efficiency to the trained supernet. We study three one-shot optimizers on the Vision Transformer search space and observe that our method outperforms existing baselines on multiple datasets while being up to 35% more parameter efficient on ImageNet-1k.
Rhea Sukthanker · Arjun Krishnakumar · sharat patil · Frank Hutter

- Expanding the Deployment Envelope of Behavior Prediction via Adaptive Meta-Learning (Poster)
Learning-based behavior prediction methods are increasingly being deployed in real-world autonomous systems, e.g., in fleets of self-driving vehicles, which are beginning to commercially operate in major cities across the world.
Despite their advancements, however, the vast majority of prediction systems are specialized to a set of well-explored geographic regions or operational design domains, complicating deployment to additional cities, countries, or continents. To this end, we present a novel method for efficiently adapting behavior prediction models to new environments. Our approach leverages recent advances in meta-learning, specifically Bayesian regression, to augment existing behavior prediction models with an adaptive layer that enables efficient domain transfer via offline fine-tuning, online adaptation, or both. Experiments across multiple real-world datasets demonstrate that our method can efficiently adapt to a variety of unseen environments.
Boris Ivanovic · James Harrison · Marco Pavone

- PriorBand: HyperBand + Human Expert Knowledge (Poster)
Hyperparameters of Deep Learning (DL) pipelines are crucial for their performance. While a large number of methods for hyperparameter optimization (HPO) have been developed, they are misaligned with the desiderata of a modern DL researcher. Since often only a few trials are possible in the development of new DL methods, manual experimentation is still the most prevalent approach to set hyperparameters, relying on the researcher’s intuition and cheap preliminary explorations. To resolve this shortcoming of HPO for DL, we propose PriorBand, an HPO algorithm tailored to DL, able to utilize both expert beliefs and cheap proxy tasks. Empirically, we demonstrate the efficiency of PriorBand across a range of DL models and tasks using as little as the cost of 10 training runs and show its robustness against poor expert beliefs and misleading proxy tasks.
Neeratyoy Mallik · Carl Hvarfner · Danny Stoll · Maciej Janowski · Edward Bergman · Marius Lindauer · Luigi Nardi · Frank Hutter

- The Curse of Low Task Diversity: On the Failure of Transfer Learning to Outperform MAML and Their Empirical Equivalence (Poster)
Recently, it has been observed that a transfer learning solution might be all we need to solve many few-shot learning benchmarks -- thus raising important questions about when and how meta-learning algorithms should be deployed. In this paper, we seek to clarify these questions by (1) proposing a novel metric -- the diversity coefficient -- to measure the diversity of tasks in a few-shot learning benchmark and (2) comparing Model-Agnostic Meta-Learning (MAML) and transfer learning under fair conditions (same architecture, same optimizer, and all models trained to convergence). Using the diversity coefficient, we show that the popular MiniImageNet and CIFAR-FS few-shot learning benchmarks have low diversity. This novel insight contextualizes claims that transfer learning solutions are better than meta-learned solutions in the regime of low diversity under a fair comparison. Specifically, we empirically find that a low diversity coefficient correlates with a high similarity between transfer learning and MAML-learned solutions in terms of accuracy at meta-test time and classification layer similarity (using feature-based distance metrics like SVCCA, PWCCA, CKA, and OPD). To further support our claim, we find this meta-test accuracy holds even as the model size changes. Therefore, we conclude that in the low diversity regime, MAML and transfer learning have equivalent meta-test performance when both are compared fairly. We also hope our work inspires more thoughtful constructions and quantitative evaluations of meta-learning benchmarks in the future.
Brando Miranda · Patrick Yu · Yu-Xiong Wang · Sanmi Koyejo

- Towards Discovering Neural Architectures from Scratch (Poster)
The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines. We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.
Simon Schrodi · Danny Stoll · Robin Ru · Rhea Sukthanker · Thomas Brox · Frank Hutter

- HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks (Poster)
Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals. Recent applications of INRs include image super-resolution, compression of high-dimensional signals, or 3D rendering. However, these solutions usually focus on visual data, and adapting them to the audio domain is not trivial. Moreover, it requires a separately trained model for every data sample. To address this limitation, we propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time. We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
Filip Szatkowski · Karol J. Piczak · Przemysław Spurek · Jacek Tabor · Tomasz Trzcinski

- On the Importance of Architectures and Hyperparameters for Fairness in Face Recognition (Poster)
Face recognition systems are used widely but are known to exhibit bias across a range of sociodemographic dimensions, such as gender and race. An array of works proposing pre-processing, training, and post-processing methods have failed to close these gaps. Here, we take a very different approach to this problem, identifying that both architectures and hyperparameters of neural networks are instrumental in reducing bias. We first run a large-scale analysis of the impact of architectures and training hyperparameters on several common fairness metrics and show that the implicit convention of choosing high-accuracy architectures may be suboptimal for fairness. Motivated by our findings, we run the first neural architecture search for fairness, jointly with a search for hyperparameters. We output a suite of models which Pareto-dominate all other competitive architectures in terms of accuracy and fairness. Furthermore, we show that these models transfer well to other face recognition datasets with similar and distinct protected attributes. We release our code and raw result files so that researchers and practitioners can replace our fairness metrics with a bias measure of their choice.
Samuel Dooley · Rhea Sukthanker · John Dickerson · Colin White · Frank Hutter · Micah Goldblum

- Few-Shot Calibration of Set Predictors via Meta-Learned Cross-Validation-Based Conformal Prediction (Poster)
Conventional frequentist learning is known to yield poorly calibrated models that fail to reliably quantify the uncertainty of their decisions. Bayesian learning can improve calibration, but formal guarantees apply only under restrictive assumptions about correct model specification.
Conformal prediction (CP) offers a general framework for the design of set predictors with calibration guarantees that hold regardless of the underlying data generation mechanism. However, when training data are limited, CP tends to produce large, and hence uninformative, predicted sets. This paper introduces a novel meta-learning solution that aims at reducing the set prediction size. Unlike prior work, the proposed meta-learning scheme, referred to as meta-XB, (i) builds on cross-validation-based CP, rather than the less efficient validation-based CP; and (ii) preserves formal per-task calibration guarantees, rather than less stringent task-marginal guarantees.
Sangwoo Park · Kfir M. Cohen · Osvaldo Simeone

- Multi-objective Tree-structured Parzen Estimator Meets Meta-learning (Poster)
Hyperparameter optimization (HPO) is essential for the better performance of deep learning, and practitioners often need to consider the trade-off between multiple metrics, such as error rate, latency, memory requirements, robustness, and algorithmic fairness. Due to this demand and the heavy computation of deep learning, the acceleration of multi-objective (MO) optimization becomes ever more important. Although meta-learning has been extensively studied to speed up HPO, existing methods are not applicable to the MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO HPO algorithm. In this paper, we extend TPE’s acquisition function to the meta-learning setting, using a task similarity defined by the overlap in promising regions of each task. In a comprehensive set of experiments, we demonstrate that our method accelerates MO-TPE on tabular HPO benchmarks and yields state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".
Shuhei Watanabe · Noor Awad · Masaki Onishi · Frank Hutter

- Unsupervised Meta-learning via Few-shot Pseudo-supervised Contrastive Learning (Poster)
Unsupervised meta-learning aims to learn generalizable knowledge across a distribution of tasks constructed from unlabeled data. Here, the main challenge is how to construct diverse tasks for meta-learning without label information; recent works have proposed, e.g., pseudo-labeling via pretrained representations or creating synthetic samples via generative models. However, such a task construction strategy is fundamentally limited due to heavy reliance on the immutable pseudo-labels during meta-learning and the quality of the representations or the generated samples. To overcome the limitations, we propose a simple yet effective unsupervised meta-learning framework, coined Pseudo-supervised Contrast (PsCo), for few-shot classification. We are inspired by the recent self-supervised learning literature; PsCo utilizes a momentum network and a queue of previous batches to improve pseudo-labeling and construct diverse tasks in a progressive manner. Our extensive experiments demonstrate that PsCo outperforms existing unsupervised meta-learning methods under various in-domain and cross-domain few-shot classification benchmarks. We also validate that PsCo is easily scalable to a large-scale benchmark, while recent prior-art meta-schemes are not.
Huiwon Jang · Hankook Lee · Jinwoo Shin

- Uncertainty-Aware Meta-Learning for Multimodal Task Distributions (Poster)
Meta-learning is a popular approach for learning new tasks with limited data (i.e., few-shot learning) by leveraging the commonalities among different tasks. However, meta-learned models can perform poorly when context data is limited, or when data is drawn from an out-of-distribution (OoD) task. Especially in safety-critical settings, this necessitates an uncertainty-aware approach to meta-learning.
In this work, we present UNLIMITD (uncertainty-aware meta-learning for multimodal task distributions), a novel method for meta-learning that (1) makes probabilistic predictions on in-distribution tasks efficiently, (2) is capable of detecting OoD context data at test time, and (3) performs on heterogeneous, multimodal task distributions. To achieve this goal, we take a probabilistic perspective and train a parametric, tuneable distribution over tasks on the meta-dataset. We construct this distribution by performing Bayesian inference on a linearized neural network, leveraging Gaussian process theory. We demonstrate that UNLIMITD's predictions compare favorably to, and outperform in most cases, the standard baselines, especially in the low-data regime. Furthermore, we show that UNLIMITD is effective in detecting data from OoD tasks. Finally, we confirm that both of these findings continue to hold in the multimodal task-distribution setting.
Cesar Almecija · Apoorva Sharma · Young-Jin Park · Navid Azizan

- Lightweight Prompt Learning with General Representation for Rehearsal-free Continual Learning (Poster)
Recently, prompt-based continual learning has become the new state of the art by using small prompts to steer a large pre-trained model toward each target task. However, we find that such methods still suffer from a memory problem, since the number of prompts must grow as the model learns many tasks. To address this limitation, inspired by the human hippocampus, we propose Lightweight Prompt Learning with General Representation (LPG), a novel rehearsal-free continual learning method. Throughout the study, we experimentally demonstrate LPG's promising performance and provide corresponding analyses. We expect our proposal to spotlight a novel continual learning paradigm that uses a single prompt to hedge against memory problems while sustaining precise performance.
Hyunhee Chung · Kyung Ho Park

- Meta-RL for Multi-Agent RL: Learning to Adapt to Evolving Agents (Poster)
In Multi-Agent RL, agents learn and evolve together, and each agent has to interact with a changing set of other agents. While generally viewed as a problem of non-stationarity, we propose that this can be viewed as a Meta-RL problem. We demonstrate an approach for learning Stackelberg equilibria, a type of equilibrium that features a bi-level optimization problem, where the inner level is a "best-response" of one or more follower agents to an evolving leader agent. Various approaches have been proposed in the literature to implement this best-response, most often treating each leader policy and the learning problem it induces for the follower(s) as a separate instance. We propose that the problem can be viewed as a meta (reinforcement) learning problem: Learning to learn to best-respond to different leader behaviors, by leveraging commonality in the induced follower learning problems. We demonstrate an approach using contextual policies and show that it matches the performance of existing approaches using significantly fewer environment samples in experiments. We discuss how more advanced meta-RL techniques could allow this to scale to richer domains.
Matthias Gerstgrasser · David Parkes

- Neural Architecture for Online Ensemble Continual Learning (Poster)
Continual learning with an increasing number of classes is a challenging task. The difficulty rises when each example is presented exactly once, which requires the model to learn online. Recent methods with classic parameter optimization procedures have been shown to struggle in such setups or have limitations like non-differentiable components or memory buffers. For this reason, we present a fully differentiable ensemble method that allows us to efficiently train an ensemble of neural networks in the end-to-end regime.
The proposed technique achieves SOTA results without a memory buffer and clearly outperforms the reference methods. The conducted experiments have also shown a significant increase in the performance for small ensembles, which demonstrates the capability of obtaining relatively high classification accuracy with a reduced number of classifiers.
Mateusz Wójcik · Witold Kościukiewicz · Adam Gonczarek · Tomasz Kajdanowicz

- Meta-Learning via Classifier(-free) Guidance (Poster)
We aim to develop meta-learning techniques that achieve higher zero-shot performance than the state of the art on unseen tasks. To do so, we take inspiration from recent advances in generative modeling and language-conditioned image synthesis to propose meta-learning techniques that use natural language guidance for zero-shot task adaptation. We first train an unconditional generative hypernetwork model to produce neural network weights; then we train a second "guidance" model that, given a natural language task description, traverses the hypernetwork latent space to find high-performance task-adapted weights in a zero-shot manner. We explore two alternative approaches for latent space guidance: "HyperCLIP"-based classifier guidance and a conditional Hypernetwork Latent Diffusion Model ("HyperLDM"), which we show to benefit from the classifier-free guidance technique common in image generation. Finally, we demonstrate that our approaches outperform existing meta-learning methods with zero-shot learning experiments on our Meta-VQA dataset.
Elvis Nava · Seijin Kobayashi · Yifei Yin · Robert Katzschmann · Benjamin F. Grewe

- MARS: Meta-learning as score matching in the function space (Poster)
We approach meta-learning through the lens of functional Bayesian neural network inference which views the prior as a stochastic process and performs inference in the function space.
Specifically, we view the meta-training tasks as samples from the data-generating process and formalize meta-learning as empirically estimating the law of this stochastic process. Our approach can seamlessly acquire and represent complex prior knowledge by meta-learning the score function of the data-generating process marginals. In a comprehensive benchmark, we demonstrate that our method achieves state-of-the-art performance in terms of predictive accuracy and substantial improvements in the quality of uncertainty estimates.
Kruno Lehman · Jonas Rothfuss · Andreas Krause

- Debiasing Meta-Gradient Reinforcement Learning by Learning the Outer Value Function (Poster)
Meta-gradient Reinforcement Learning (RL) allows agents to self-tune their hyper-parameters in an online fashion during training. In this paper, we identify a bias in the meta-gradient of current meta-gradient RL approaches. This bias comes from using the critic that is trained using the meta-learned discount factor for the advantage estimation in the outer objective which requires a different discount factor. Because the meta-learned discount factor is typically lower than the one used in the outer objective, the resulting bias may cause the meta-gradient to favor myopic policies. We propose a simple solution to this issue: we alleviate this bias by using an alternative, outer value function in the estimation of the outer loss. To obtain this outer value function we add a second head to the critic network and train it alongside the classic critic, using the outer loss discount factor. On an illustrative toy problem, we show that the bias can cause catastrophic failure of current meta-gradient RL approaches, and show that our proposed solution fixes it. We then apply our method to more complex environments and demonstrate that fixing the meta-gradient bias significantly improves performance.
Clément Bonnet · Laurence Midgley · Alexandre Laterre

- GramML: Exploring Context-Free Grammars with Model-Free Reinforcement Learning (Poster)
One concern of AutoML systems is how to discover the best pipeline configuration to solve a particular task in the shortest amount of time. Recent approaches tackle the problem using techniques based on learning a model that helps relate the configuration space and the objective being optimized. However, relying on such a model poses some difficulties. First, both pipelines and datasets have to be represented with meta-features. Second, there exists a strong dependence on the chosen model and its hyperparameters. In this paper, we present a simple yet effective model-free reinforcement learning approach based on an adaptation of the Monte Carlo tree search (MCTS) algorithm for trees and context-free grammars. We run experiments on the OpenML-CC18 benchmark suite and show superior performance compared to the state-of-the-art.
Hernan C. Vazquez · Jorge Sanchez · Rafael Carrascosa

- Efficient Queries Transformer Neural Processes (Poster)
Neural Processes (NPs) are popular methods in meta-learning that can estimate predictive uncertainty on target datapoints by conditioning on a context dataset. The previous state-of-the-art method, Transformer Neural Processes (TNPs), achieves strong performance but requires quadratic computation with respect to the number of context datapoints per query, limiting its applications. Conversely, existing sub-quadratic NP variants perform significantly worse than TNPs. Tackling this issue, we propose Efficient Queries Transformer Neural Processes (EQTNPs), a more computationally efficient NP variant. The model encodes the context dataset into a set of vectors that is linear in the number of context datapoints.
When making predictions, the model retrieves higher-order information from the context dataset via multiple cross-attention mechanisms on the context vectors. We empirically show that EQTNPs achieve results competitive with the state-of-the-art.
Leo Feng · Hossein Hajimirsadeghi · Yoshua Bengio · Mohamed Osama Ahmed

- Meta-learning of Black-box Solvers Using Deep Reinforcement Learning (Poster)
Black-box optimization does not require any specification on the function we are looking to optimize. As such, it represents one of the most general problems in optimization, and is central in many scientific areas. However, in many practical cases, one must solve a sequence of black-box problems from functions originating from a specific class and hence sharing similar patterns. Classical algorithms such as evolutionary or random methods would treat each problem independently and would be oblivious of the general underlying structure. In this paper, we introduce MELBA, an algorithm that exploits the similarities among a given class of functions to learn a task-specific solver that is tailored to efficiently optimize every function from this task. More precisely, given a class of functions, the proposed algorithm learns a Transformer-based Reinforcement Learning (RL) black-box solver. First, the Transformer embeds a previously gathered set of evaluation points and their image through the function into a latent state that characterizes the current stage of the optimization process. Then, the next evaluation point is sampled according to the latent state. The black-box solver is trained using PPO and the global regret on a training set. We show experimentally the effectiveness of our solvers on various synthetic and real-life tasks including the hyperparameter optimization of ML models (SVM, XGBoost) and demonstrate that our approach is competitive with existing methods.
Cedric Malherbe · Aladin Virmaux · Ludovic Dos Santos · Sofian Chaybouti

- Contextual Squeeze-and-Excitation (Poster)
Several applications require effective knowledge transfer across tasks in the low-data regime. For instance, in personalization, a pretrained system is adapted by learning on small amounts of labeled data belonging to a specific user (context). This setting requires high accuracy under low computational complexity, meaning low memory footprint in terms of parameter storage and adaptation cost. Meta-learning methods based on Feature-wise Linear Modulation generators (FiLM) satisfy these constraints as they can adapt a backbone without expensive fine-tuning. However, there has been limited research on viable alternatives to FiLM generators. In this paper we focus on this area of research and propose a new adaptive block called Contextual Squeeze-and-Excitation (CaSE). CaSE is more efficient than FiLM generators for a variety of reasons: it does not require a separate set encoder, has fewer learnable parameters, and only uses a scale vector (no shift) to modulate activations. We empirically show that CaSE is able to outperform FiLM generators in terms of parameter efficiency (a 75% reduction in the number of adaptation parameters) and classification accuracy (a 1.5% average improvement on the 26 datasets of the VTAB+MD benchmark).
Massimiliano Patacchiola · John Bronskill · Aliaksandra Shysheya · Katja Hofmann · Sebastian Nowozin · Richard Turner

- Conditional Neural Processes for Molecules (Poster)
Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs.
So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in QSAR modelling, as well as an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.
Miguel Garcia-Ortegon · Andreas Bender · Sergio Bacallado

- Meta-Learning General-Purpose Learning Algorithms with Transformers (Poster)
Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general purpose learning algorithms from scratch, using only black box models with minimal inductive bias. A general purpose learning algorithm is one which takes in training data, and produces test-set predictions across a wide range of problems, without any explicit definition of an inference model, training loss, or optimization algorithm. In this paper we show that Transformers and other black-box models can be meta-trained to act as general purpose learning algorithms, and can generalize to learn on different datasets than used during meta-training.
We characterize phase transitions between algorithms that generalize, algorithms that memorize, and algorithms that fail to meta-train at all, induced by changes in model size, number of tasks used during meta-training, and meta-optimization hyper-parameters. We further show that the capabilities of meta-trained algorithms are bottlenecked by the accessible state size (memory) determining the next prediction, unlike standard models which are thought to be bottlenecked by parameter count.
Louis Kirsch · Luke Metz · James Harrison · Jascha Sohl-Dickstein

- Betty: An Automatic Differentiation Library for Multilevel Optimization (Poster)
Gradient-based multilevel optimization (MLO) has gained attention as a framework for studying numerous problems, ranging from hyperparameter optimization and meta-learning to neural architecture search and reinforcement learning. However, gradients in MLO, which are obtained by composing best-response Jacobians via the chain rule, are notoriously difficult to implement and memory/compute intensive. We take an initial step towards closing this gap by introducing Betty, a software library for large-scale MLO. At its core, we devise a novel dataflow graph for MLO, which allows us to (1) develop efficient automatic differentiation for MLO that reduces the computational complexity from $\mathcal{O}(d^3)$ to $\mathcal{O}(d^2)$, (2) incorporate systems support such as mixed-precision and data-parallel training for scalability, and (3) facilitate implementation of MLO programs of arbitrary complexity while allowing a modular interface for diverse algorithmic and systems design choices. We empirically demonstrate that Betty can be used to implement an array of MLO programs, while also observing up to 11% increase in test accuracy, 14% decrease in GPU memory usage, and 20% decrease in training wall time over existing implementations on multiple benchmarks.
We also showcase that Betty enables scaling MLO to models with hundreds of millions of parameters.
Sang Keun Choe · Willie Neiswanger · Pengtao Xie · Eric Xing

- FiT: Parameter Efficient Few-shot Transfer Learning (Poster)
Model parameter efficiency is key for enabling few-shot learning, inexpensive model updates for personalization, and communication efficient federated learning. In this work, we develop FiLM Transfer (FiT) which combines ideas from transfer learning (fixed pretrained backbones and fine-tuned FiLM adapter layers) and meta-learning (automatically configured Naive Bayes classifiers and episodic training) to yield parameter efficient models with superior classification accuracy at low-shot. We experiment with FiT on a range of downstream datasets and show that it achieves better classification accuracy than the leading Big Transfer (BiT) algorithm at low-shot and achieves state-of-the-art accuracy on the challenging VTAB-1k benchmark, with fewer than 1% of the updateable parameters.
Aliaksandra Shysheya · John Bronskill · Massimiliano Patacchiola · Sebastian Nowozin · Richard Turner

- Topological Continual Learning with Wasserstein Distance and Barycenter (Poster)
Continual learning in neural networks suffers from a phenomenon called catastrophic forgetting, in which a network quickly forgets what was learned in a previous task. The human brain, however, is able to continually learn new tasks and accumulate knowledge throughout life. Neuroscience findings suggest that continual learning success in the human brain is potentially associated with its modular structure and memory consolidation mechanisms. In this paper we propose a novel topological regularization that penalizes cycle structure in a neural network during training using principled theory from persistent homology and optimal transport. The penalty encourages the network to learn modular structure during training.
The penalization is based on the closed-form expressions of the Wasserstein distance and barycenter for the topological features of a 1-skeleton representation for the network. Our topological continual learning method combines the proposed regularization with a tiny episodic memory to mitigate forgetting. We demonstrate that our method is effective in both shallow and deep network architectures for multiple image classification datasets.
Tananun Songdechakraiwut · Xiaoshuang Yin · Barry Van Veen

- Multiple Modes for Continual Learning (Poster)
Adapting model parameters to incoming streams of data is a crucial factor for deep learning scalability. Interestingly, prior continual learning strategies in online settings inadvertently anchor their updated parameters to a local parameter subspace to remember old tasks, or else drift away from the subspace and forget. From this observation, we formulate a trade-off between constructing multiple parameter modes and allocating tasks per mode. Mode-Optimized Task Allocation (MOTA), our contributed adaptation strategy, trains multiple modes in parallel, then optimizes task allocation per mode. We empirically demonstrate improvements over baseline continual learning strategies and across varying distribution shifts, namely sub-population, domain, and task shift.
Siddhartha Datta · Nigel Shadbolt

- Interpolating Compressed Parameter Subspaces (Poster)
Though distribution shifts have caused growing concern for machine learning scalability, solutions tend to specialize towards a specific type of distribution shift. We find that constructing a Compressed Parameter Subspace (CPS), a geometric structure representing distance-regularized parameters mapped to a set of train-time distributions, can maximize average accuracy over a broad range of distribution shifts concurrently.
We show that sampling parameters within a CPS can mitigate backdoor, adversarial, permutation, stylization and rotation perturbations. Regularizing a hypernetwork with CPS can also reduce task forgetting.
Siddhartha Datta · Nigel Shadbolt

- HARRIS: Hybrid Ranking and Regression Forests for Algorithm Selection (Poster)
It is well known that different algorithms perform differently on a given instance of an algorithmic problem, motivating algorithm selection (AS): Given an instance of an algorithmic problem, which is the most suitable algorithm to solve it? As such, the AS problem has received considerable attention resulting in various approaches -- many of which either solve a regression or ranking problem under the hood. Although both of these formulations yield very natural ways to tackle AS, they have considerable weaknesses. On the one hand, correctly predicting the performance of an algorithm on an instance is a sufficient, but not a necessary condition to produce a correct ranking over algorithms and in particular ranking the best algorithm first. On the other hand, classical ranking approaches often do not account for concrete performance values available in the training data, but only leverage rankings composed from such data. We propose HARRIS - Hybrid rAnking and RegRessIon foreSts - a new algorithm selector leveraging special forests, combining the strengths of both approaches while alleviating their weaknesses. HARRIS' decisions are based on a forest model, whose trees are created based on splits optimized on a hybrid ranking and regression loss function. As our preliminary experimental study on ASLib shows, HARRIS improves over standard algorithm selection approaches in some scenarios, showing that combining ranking and regression in trees is indeed promising for AS.
Lukas Fehring · Jonas Hanselle · Alexander Tornede
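
The HARRIS abstract above contrasts regression and ranking objectives for algorithm selection. The following is a minimal illustrative sketch of that contrast, not the authors' loss: the function names, the mean squared error term, the pairwise misranking term, and the mixing parameter `weight` are all assumptions introduced here. It only shows how a single score can blend the two signals, and why exact performance prediction is sufficient but not necessary for a correct ranking.

```python
from itertools import combinations


def squared_error(pred, true):
    """Mean squared error between predicted and true performance values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pairwise_ranking_error(pred, true):
    """Fraction of algorithm pairs whose predicted order contradicts the true order."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2)
             if true[i] != true[j]]
    if not pairs:
        return 0.0
    wrong = sum(1 for i, j in pairs
                if (pred[i] - pred[j]) * (true[i] - true[j]) <= 0)
    return wrong / len(pairs)


def hybrid_loss(pred, true, weight=0.5):
    """Hypothetical convex combination of a regression term and a ranking term."""
    return (weight * squared_error(pred, true)
            + (1 - weight) * pairwise_ranking_error(pred, true))


# True runtimes of three algorithms on one instance (made-up numbers).
true_runtimes = [1.0, 2.0, 3.0]
# Predictions that are all off by 10 still rank the algorithms correctly.
shifted = [11.0, 12.0, 13.0]
print(squared_error(shifted, true_runtimes))           # large regression error
print(pairwise_ranking_error(shifted, true_runtimes))  # no ranking error
print(hybrid_loss(shifted, true_runtimes, weight=0.5))
```

With `weight=1.0` the score reduces to pure regression and with `weight=0.0` to pure ranking; how HARRIS actually weights and computes the two terms inside its tree splits is not stated in the abstract.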