## Workshop: Meta-Learning

### Jane Wang, Joaquin Vanschoren, Erin Grant, Jonathan Schwarz, Francesco Visin, Jeff Clune, Roberto Calandra

2020-12-11T03:00:00-08:00 - 2020-12-11T12:00:00-08:00
Abstract: Recent years have seen rapid progress in meta-learning methods, which transfer knowledge across tasks and domains to learn new tasks more efficiently, optimize the learning process itself, and even generate new learning methods from scratch. Meta-learning can be seen as the logical conclusion of the arc that machine learning has undergone in the last decade, from learning classifiers and policies over hand-crafted features, to learning representations over which classifiers and policies operate, and finally to learning algorithms that themselves acquire representations, classifiers, and policies.

Meta-learning methods are of substantial practical interest. For instance, they have been shown to yield new state-of-the-art automated machine learning algorithms and architectures, and have substantially improved few-shot learning systems. Moreover, the ability to improve one’s own learning capabilities through experience can also be viewed as a hallmark of intelligent beings, and there are strong connections with work on human learning in cognitive science and reward learning in neuroscience.

### Schedule

2020-12-11T03:00:00-08:00 - 2020-12-11T03:10:00-08:00
##### Introduction and opening remarks
2020-12-11T03:10:00-08:00 - 2020-12-11T03:11:00-08:00
##### Introduction for invited speaker, Frank Hutter
Jane Wang
2020-12-11T03:11:00-08:00 - 2020-12-11T03:36:00-08:00
##### Meta-learning neural architectures, initial weights, hyperparameters, and algorithm components
Frank Hutter
Meta-learning is a powerful set of approaches that promises to replace many components of the deep learning toolbox by learned alternatives, such as learned architectures, optimizers, hyperparameters, and weight initializations. While typical approaches focus on only one of these components at a time, in this talk, I will discuss various efficient approaches for tackling two of them simultaneously. I will also highlight the advantages of *not* learning complete algorithms from scratch but of rather exploiting the inductive bias of existing algorithms by learning to improve existing algorithms. Finally, I will briefly discuss the connection of meta-learning and benchmarks.
2020-12-11T03:36:00-08:00 - 2020-12-11T03:40:00-08:00
##### Q/A for invited talk #1
Frank Hutter
2020-12-11T03:40:00-08:00 - 2020-12-11T03:55:00-08:00
##### On episodes, Prototypical Networks, and few-shot learning
Steinar Laenen, Luca Bertinetto
Episodic learning is a popular practice among researchers and practitioners interested in few-shot learning. It consists of organising training in a series of learning problems, each relying on small support and query sets to mimic the few-shot circumstances encountered during evaluation. In this paper, we investigate the usefulness of episodic learning in Prototypical Networks, one of the most popular algorithms making use of this practice. Surprisingly, in our experiments we found that episodic learning is detrimental to performance, and that it is under no circumstance beneficial to differentiate between a support and query set within a training batch. This non-episodic version of Prototypical Networks, which corresponds to the classic Neighbourhood Component Analysis, reliably improves over its episodic counterpart in multiple datasets, achieving an accuracy that is competitive with the state-of-the-art, despite being extremely simple.
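To make the support/query distinction concrete, here is a minimal sketch of how a single episode is classified in a standard Prototypical Network: each class prototype is the mean of that class's support embeddings, and a query is scored by a softmax over negative squared Euclidean distances to the prototypes. The 2-D "embeddings", the class names, and the helper function names are all hypothetical; in practice the embeddings come from a learned network, and this sketch does not reproduce the paper's non-episodic variant.

```python
import math
from collections import defaultdict


def class_prototypes(support):
    """Average the embeddings of each class in the support set."""
    sums = {}
    counts = defaultdict(int)
    for embedding, label in support:
        if label not in sums:
            sums[label] = [0.0] * len(embedding)
        sums[label] = [s + e for s, e in zip(sums[label], embedding)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def query_probabilities(query, prototypes):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {label: -squared_distance(query, proto)
              for label, proto in prototypes.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# A toy 2-way, 2-shot episode with hand-written 2-D embeddings (assumed data).
support = [([0.0, 0.0], "cat"), ([0.0, 2.0], "cat"),
           ([4.0, 0.0], "dog"), ([4.0, 2.0], "dog")]
prototypes = class_prototypes(support)
probs = query_probabilities([1.0, 1.0], prototypes)
prediction = max(probs, key=probs.get)
```

During episodic training, the loss would be the negative log of the probability assigned to the query's true class, averaged over the query set of each episode.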
2020-12-11T04:00:00-08:00 - 2020-12-11T05:00:00-08:00
##### Poster session #1
2020-12-11T05:00:00-08:00 - 2020-12-11T05:01:00-08:00
##### Introduction for invited speaker, Luisa Zintgraf
Francesco Visin
2020-12-11T05:01:00-08:00 - 2020-12-11T05:26:00-08:00
##### Exploration in meta-reinforcement learning
Luisa Zintgraf
Learning a new task often requires exploration: gathering data to learn about the environment and how to solve the task. But how do we efficiently explore, and how can an agent make the best use of prior knowledge it has about the environment? Meta-reinforcement learning allows us to learn inductive biases for exploration from data, which plays a crucial role in enabling agents to rapidly pick up new tasks. In the first part of this talk, I look at different meta-learning problem settings that exist in the literature, and what type of exploratory behaviour is necessary in these settings. This generally depends on how much time the agent has to interact with the environment, before its performance is evaluated. In the second part of the talk, we take a step back and consider how to meta-learn exploration strategies in the first place, which might require a different type of exploration during meta-learning. Throughout the talk, I will focus on the "online adaptation" setting where the agent has to perform well from the very first time step in a new environment. In these settings the agent has to very carefully trade off exploration and exploitation, since each action counts towards its final performance.
2020-12-11T05:26:00-08:00 - 2020-12-11T05:30:00-08:00
##### Q/A for invited talk #2
Luisa Zintgraf
2020-12-11T05:30:00-08:00 - 2020-12-11T05:31:00-08:00
##### Introduction for invited speaker, Tim Hospedales
Jonathan Schwarz
2020-12-11T05:31:00-08:00 - 2020-12-11T05:56:00-08:00
##### Meta-Learning: Representations and Objectives
Timothy Hospedales
In this talk, I will first give an overview perspective and taxonomy of major work in the field, as motivated by our recent survey paper on meta-learning in neural networks. I hope that this will be informative for newcomers, as well as reveal some interesting connections and differences between the methods that will be thought-provoking for experts. I will then give a brief overview of recent meta-learning work from my group, which covers some broad issues in machine learning where meta-learning can be applied, including dealing with domain-shift, data augmentation, learning with label noise, and accelerating single task RL. Along the way, I will point out some of the many open questions that remain to be studied in the field.
2020-12-11T05:56:00-08:00 - 2020-12-11T06:00:00-08:00
##### Q/A for invited talk #3
Timothy Hospedales
2020-12-11T06:00:00-08:00 - 2020-12-11T07:00:00-08:00
##### Break
2020-12-11T07:00:00-08:00 - 2020-12-11T08:00:00-08:00
##### Poster session #2
2020-12-11T08:00:00-08:00 - 2020-12-11T08:01:00-08:00
##### Introduction for invited speaker, Louis Kirsch
Joaquin Vanschoren
2020-12-11T08:01:00-08:00 - 2020-12-11T08:26:00-08:00
##### General meta-learning
Louis Kirsch
Humans develop learning algorithms that are incredibly general and can be applied across a wide range of tasks. Unfortunately, this process is often tedious trial and error with numerous possibilities for suboptimal choices. General meta-learning seeks to automate many of these choices, generating new learning algorithms automatically. Different from contemporary meta-learning, where the generalization ability has been limited, these learning algorithms ought to be general-purpose. This allows us to leverage data at scale for learning algorithm design that is difficult for humans to consider. I present a General Meta Learner, MetaGenRL, that meta-learns novel Reinforcement Learning algorithms that can be applied to significantly different environments. We further investigate how we can reduce inductive biases and simplify meta-learning. Finally, I introduce variable-shared meta-learning (VS-ML), a novel principle that generalizes learned learning rules, fast weights, and meta-RNNs (learning in activations). This enables (1) implementing backpropagation purely in the recurrent dynamics of an RNN and (2) meta-learning algorithms for supervised learning from scratch.
2020-12-11T08:26:00-08:00 - 2020-12-11T08:30:00-08:00
##### Q/A for invited talk #4
Louis Kirsch
2020-12-11T08:30:00-08:00 - 2020-12-11T08:31:00-08:00
##### Introduction for invited speaker, Fei-Fei Li
Erin Grant
2020-12-11T08:31:00-08:00 - 2020-12-11T08:56:00-08:00
##### Creating diverse tasks to catalyze robot learning
Li Fei-Fei
Data has become an essential catalyst for the development of artificial intelligence. But it is challenging to obtain data for robotic learning. So how should we tackle this issue? In this talk, we start with a retrospective of how ImageNet and other large-scale datasets incentivized the deep learning revolution in the past decade, and aim to tackle the new challenges faced by robotic data. To this end, we introduce two lines of work in the Stanford Vision and Learning Lab on creating tasks to catalyze robot learning in this new era. We first present the design of a large-scale and realistic environment in simulation that enables human and robotic agents to perform interactive tasks. We further propose a novel approach for automatically generating suitable tasks as curricula to expedite reinforcement learning in hard-exploration problems.
2020-12-11T08:56:00-08:00 - 2020-12-11T09:00:00-08:00
##### Q/A for invited talk #5
Li Fei-Fei
2020-12-11T09:00:00-08:00 - 2020-12-11T10:00:00-08:00
##### Poster session #3
2020-12-11T10:00:00-08:00 - 2020-12-11T10:01:00-08:00
##### Introduction for invited speaker, Kate Rakelly
Erin Grant
2020-12-11T10:01:00-08:00 - 2020-12-11T10:26:00-08:00
##### An inference perspective on meta-reinforcement learning
Kate Rakelly
While meta-learning algorithms are often viewed as algorithms that learn to learn, an alternative viewpoint frames meta-learning as inferring a hidden task variable from experience consisting of observations and rewards. From this perspective, learning-to-learn is learning-to-infer. This viewpoint can be useful in solving problems in meta-reinforcement learning, which I’ll demonstrate through two examples: (1) enabling off-policy meta-learning and (2) performing efficient meta-reinforcement learning from image observations. Finally, I’ll discuss how I think this perspective can inform future meta-reinforcement learning research.
2020-12-11T10:26:00-08:00 - 2020-12-11T10:30:00-08:00
##### Q/A for invited talk #6
Kate Rakelly
2020-12-11T10:30:00-08:00 - 2020-12-11T10:45:00-08:00
##### Reverse engineering learned optimizers reveals known and novel mechanisms
Niru Maheswaranathan, David Sussillo, Luke Metz, Ruoxi Sun, Jascha Sohl-Dickstein
Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a mystery. How is a learned optimizer able to outperform a well tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions by careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on three disparate tasks, and discover that they have learned interpretable mechanisms, including: momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation. Moreover, we show how the dynamics of learned optimizers enables these behaviors. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
2020-12-11T10:45:00-08:00 - 2020-12-11T11:00:00-08:00
##### Bayesian optimization by density ratio estimation
Louis Tiao, Aaron Klein, Cedric Archambeau, Edwin Bonilla, Matthias W Seeger, Fabio Ramos
Bayesian optimization (BO) is among the most effective and widely-used blackbox optimization methods. BO proposes solutions according to an explore-exploit trade-off criterion encoded in an acquisition function, many of which are derived from the posterior predictive of a probabilistic surrogate model. Prevalent among these is the expected improvement (EI). Naturally, the need to ensure analytical tractability in the model poses limitations that can ultimately hinder the efficiency and applicability of BO. In this paper, we cast the computation of EI as a binary classification problem, building on the well-known link between class probability estimation (CPE) and density ratio estimation (DRE), and the lesser-known link between density ratios and EI. By circumventing the tractability constraints imposed on the model, this reformulation provides several natural advantages, not least in scalability, increased flexibility, and greater representational capacity.
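The link between class probability estimation and density ratio estimation mentioned above can be checked numerically. The sketch below is only an illustration of that general identity, not the paper's method: if points from a density `l` carry label 1 with prior `gamma` and points from a density `g` carry label 0, then the Bayes-optimal class probability `pi(x)` determines the ratio `l(x) / g(x)`. The two Gaussian densities, the value of `gamma`, and all function names are assumptions made for the example.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def class_probability(x, gamma):
    """Bayes-optimal P(z = 1 | x) when class 1 has density l and prior gamma."""
    l = gaussian_pdf(x, 0.0, 1.0)  # hypothetical density of class 1
    g = gaussian_pdf(x, 2.0, 1.0)  # hypothetical density of class 0
    return gamma * l / (gamma * l + (1.0 - gamma) * g)


def density_ratio_from_probability(pi, gamma):
    """Recover l(x) / g(x) from the class probability pi = P(z = 1 | x)."""
    return (1.0 - gamma) / gamma * pi / (1.0 - pi)


gamma = 0.25
x = 0.5
pi = class_probability(x, gamma)
ratio = density_ratio_from_probability(pi, gamma)
true_ratio = gaussian_pdf(x, 0.0, 1.0) / gaussian_pdf(x, 2.0, 1.0)
```

Here `ratio` and `true_ratio` agree up to floating-point error, which is why a well-calibrated probabilistic classifier can stand in for an explicit density ratio estimate.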
2020-12-11T11:00:00-08:00 - 2020-12-11T12:00:00-08:00
##### Putting Theory to Work: From Learning Bounds to Meta-Learning Algorithms
Quentin Bouniot
In this paper, we review the recent advances in meta-learning theory and show how they can be used in practice both to better understand the behavior of popular meta-learning algorithms and to improve their generalization capacity. The latter is achieved by integrating the theoretical assumptions ensuring efficient meta-learning in the form of regularization terms into several popular meta-learning algorithms, for which we provide a large study of their behavior on classic few-shot classification benchmarks. To the best of our knowledge, this is the first contribution that puts the most recent learning bounds of meta-learning theory into practice for the popular task of few-shot classification.
##### Prototypical Region Proposal Networks for Few-shot Localization and Classification
Elliott Skomski
Recently proposed few-shot image classification methods have generally focused on use cases where the objects to be classified are the central subject of images. Despite success on benchmark vision datasets aligned with this use case, these methods typically fail on use cases involving densely-annotated, busy images: images common in the wild where objects of relevance are not the central subject, instead appearing potentially occluded, small, or among other incidental objects belonging to other classes of potential interest. To localize relevant objects, we employ a prototype-based few-shot segmentation model which compares the encoded features of unlabeled query images with support class centroids to produce region proposals indicating the presence and location of support set classes in a query image. These region proposals are then used as additional conditioning input to few-shot image classifiers. We develop a framework to unify the two stages (segmentation and classification) into an end-to-end classification model, PRoPnet, and empirically demonstrate that our methods improve accuracy on image datasets with natural scenes containing multiple object classes.
##### Defining Benchmarks for Continual Few-Shot Learning
Massimiliano Patacchiola
In recent years there has been substantial progress in few-shot learning, where a model is trained on a small labeled dataset related to a specific task, and in continual learning, where a model has to retain knowledge acquired on a sequence of datasets. However, the field has still to frame a suite of benchmarks for the hybrid setting combining these two paradigms, where a model is trained on several sequential few-shot tasks, and then tested on a validation set stemming from all those tasks. In this paper we propose such a setting, naming it Continual Few-Shot Learning (CFSL). We first define a theoretical framework for CFSL, then we propose a range of flexible benchmarks to unify the evaluation criteria. As part of the benchmark, we introduce a compact variant of ImageNet, called SlimageNet64, which retains all original 1000 classes but only contains 200 instances of each one (a total of 200K data-points) downscaled to 64 by 64 pixels. We provide baselines for the proposed benchmarks using a number of popular few-shot and continual learning methods, exposing previously unknown strengths and weaknesses of those algorithms. The dataloader and dataset will be released with an open-source license.
##### Decoupling Exploration and Exploitation in Meta-Reinforcement Learning without Sacrifices
Evan Liu
The goal of meta-reinforcement learning (meta-RL) is to build agents that can quickly learn new tasks by leveraging prior experience on related tasks. Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task. In principle, optimal exploration and exploitation can be learned end-to-end by simply maximizing task performance. However, such meta-RL approaches struggle with local optima due to a chicken-and-egg problem: learning to explore requires good exploitation to gauge the exploration’s utility, but learning to exploit requires information gathered via exploration. Optimizing separate objectives for exploration and exploitation can avoid this problem, but prior meta-RL exploration objectives yield suboptimal policies that gather information irrelevant to the task. We alleviate both concerns by constructing an exploitation objective that automatically identifies task-relevant information and an exploration objective to recover only this information. This avoids local optima in end-to-end training, without sacrificing optimal exploration. Empirically, DREAM substantially outperforms existing approaches on complex meta-RL problems, such as sparse-reward 3D visual navigation.
##### Is Support Set Diversity Necessary for Meta-Learning?
Oscar Li
Meta-learning is a popular framework for learning with limited data in which a model is produced by training over multiple few-shot learning tasks. For classification problems, these tasks are typically constructed by sampling a small number of support and query examples from a subset of the classes. While conventional wisdom is that task diversity should improve the performance of meta-learning, in this work we find evidence to the contrary: we propose a modification to traditional meta-learning approaches in which we keep the support sets fixed across tasks, thus reducing task diversity. Surprisingly, we find that not only does this modification not result in adverse effects, it almost always improves the performance for a variety of datasets and meta-learning methods. We also provide several initial analyses to understand this phenomenon. Our work serves to: (i) more closely investigate the effect of support set construction for the problem of meta-learning, and (ii) suggest a simple, general, and competitive baseline for few-shot learning.
##### MobileDets: Searching for Object Detection Architecture for Mobile Accelerators
Yunyang Xiong
Inverted bottleneck layers, which are built upon depthwise convolutions, have been the predominant building blocks in state-of-the-art object detection models on mobile devices. In this work, we investigate the optimality of this design pattern over a broad range of mobile accelerators by revisiting the usefulness of regular convolutions. We achieve substantial improvements in the latency-accuracy trade-off by incorporating regular convolutions in the search space, effectively placing them in the network via neural architecture search, and directly optimizing the network architectures for object detection. We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators. On the COCO object detection task, MobileDets outperform MobileNetV3+SSDLite by 1.7 mAP at comparable mobile CPU inference latencies. MobileDets also outperform MobileNetV2+SSDLite by 1.9 mAP on mobile CPUs, 3.7 mAP on EdgeTPUs, 3.4 mAP on DSPs and 2.7 mAP on edge GPUs without latency increase. Moreover, MobileDets are comparable with the state-of-the-art MnasFPN on mobile CPUs even without using the feature pyramid, and achieve better mAP scores on both EdgeTPUs and DSPs with up to 2x speedup.
##### Flexible Dataset Distillation: Learn Labels Instead of Images
Ondrej Bohdal
We study the problem of dataset distillation - creating a small set of synthetic examples capable of training a good model. In particular, we study the problem of label distillation - creating synthetic labels for a small set of real images, and show it to be more effective than the prior image-based approach to dataset distillation. Methodologically, we introduce a more robust and flexible meta-learning algorithm for distillation, as well as an effective first-order strategy based on convex optimization layers. Distilling labels with our new algorithm leads to improved results over prior image-based distillation. More importantly, it leads to clear improvements in flexibility of the distilled dataset in terms of compatibility with off-the-shelf optimizers and diverse neural architectures. Interestingly, label distillation can be applied across datasets, for example enabling learning Japanese character recognition by training only on synthetically labeled English letters.
##### Continual Model-Based Reinforcement Learning with Hypernetworks
Yizhou Huang
Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model - and the pause required between plan executions - grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it enables constant-time dynamics learning sessions between planning and only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and does competitively with baselines that remember an ever increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening.
Pan Zhou
##### Task Meta-Transfer from Limited Parallel Labels
Yiren Jian
In this work we introduce a novel meta-learning algorithm that learns to utilize the gradient information of auxiliary tasks to improve the performance of a model on a given primary task. Our proposed method learns to project gradients from the auxiliary tasks to the primary task from a *small* training set with "parallel labels," i.e., examples annotated with respect to both the primary task and the auxiliary tasks. This strategy enables the learning of models with strong performance on the primary task by leveraging a large collection of auxiliary examples and few primary examples. Our scheme differs from methods for transfer learning, multi-task learning or domain adaptation in several ways: unlike naïve transfer learning, our strategy uses auxiliary examples to directly optimize the model with respect to the primary task instead of the auxiliary task; unlike hard-sharing multi-task learning methods, our algorithm devotes the entire capacity of the backbone model to attend to the primary task instead of splitting it over multiple tasks; unlike most domain adaptation techniques, our scheme does not require any overlap in labels between the auxiliary and the primary task, thus enabling knowledge transfer between completely disjoint tasks. Experiments on two image analysis benchmarks involving multiple tasks demonstrate the performance improvements of our meta-learning scheme over naïve transfer learning, multi-task learning as well as prior related work.
##### Synthetic Petri Dish: A Novel Surrogate Model for Rapid Architecture Search
Neural Architecture Search (NAS) explores a large space of architectural motifs -- a compute-intensive process that often involves ground-truth evaluation of each motif by instantiating it within a large network, and training and evaluating the network with thousands or more data samples. Inspired by how biological motifs such as cells are sometimes extracted from their natural environment and studied in an artificial Petri dish setting, this paper proposes the Synthetic Petri Dish model for evaluating architectural motifs. In the Synthetic Petri Dish, architectural motifs are instantiated in very small networks and evaluated using very few learned synthetic data samples (to effectively approximate performance in the full problem). The relative performance of motifs in the Synthetic Petri Dish can substitute for their ground-truth performance, thus accelerating the most expensive step of NAS. Unlike other neural network-based prediction models that parse the structure of the motif to estimate its performance, the Synthetic Petri Dish predicts motif performance by training the actual motif in an artificial setting, thus deriving predictions from its true intrinsic properties. Experiments in this paper demonstrate that the Synthetic Petri Dish can therefore predict the performance of new motifs with significantly higher accuracy, especially when insufficient ground truth data is available. Our hope is that this work can inspire a new research direction in studying the performance of extracted components of models in a synthetic diagnostic setting optimized to provide informative evaluations.
##### Contextual HyperNetworks for Novel Feature Adaptation
Angus Lamb
While deep learning has obtained state-of-the-art results in many applications, the adaptation of neural network architectures to incorporate new output features remains a challenge, as neural networks are commonly trained to produce a fixed output dimension. This issue is particularly severe in online learning settings, where new output features, such as items in a recommender system, are added continually with few or no associated observations. As such, methods for adapting neural networks to novel features which are both time and data-efficient are desired. To address this, we propose the Contextual HyperNetwork (CHN), an auxiliary model which generates parameters for extending the base model to a new feature, by utilizing both existing data as well as any observations and/or metadata associated with the new feature. At prediction time, the CHN requires only a single forward pass through a neural network, yielding a significant speed-up when compared to re-training and fine-tuning approaches. To assess the performance of CHNs, we use a CHN to augment a partial variational autoencoder (P-VAE), a deep generative model which can impute the values of missing features in sparsely-observed data. We show that this system obtains improved few-shot learning performance for novel features over existing imputation and meta-learning baselines across recommender systems, e-learning, and healthcare tasks.
##### Tailoring: encoding inductive biases by optimizing unsupervised objectives at prediction time
Ferran Alet
From CNNs to attention mechanisms, encoding inductive biases into neural networks has been a fruitful source of improvement in machine learning. Auxiliary losses are a general way of encoding biases in order to help networks learn better representations by adding extra terms to the loss function. However, since they are minimized on the training data, they suffer from the same generalization gap as regular task losses. Moreover, by changing the loss function, the network is optimizing a different objective than the one we care about. In this work we solve both problems: first, we take inspiration from transductive learning and note that, after receiving an input but before making a prediction, we can fine-tune our models on any unsupervised objective. We call this process tailoring, because we customize the model to each input. Second, we formulate a nested optimization (similar to those in meta-learning) and train our models to perform well on the task loss after adapting to the tailoring loss. The advantages of tailoring and meta-tailoring are discussed theoretically and demonstrated empirically on several diverse examples: encoding inductive conservation laws from physics to improve predictions, improving local smoothness to increase robustness to adversarial examples, and using contrastive losses on the query image to improve generalization.
##### MPLP: Learning a Message Passing Learning Protocol
Ettore Randazzo
We present a novel method for learning the weights of an artificial neural network: a Message Passing Learning Protocol (MPLP). In MPLP, we abstract every operation occurring in ANNs as independent agents. Each agent is responsible for ingesting incoming multidimensional messages from other agents, updating its internal state, and generating multidimensional messages to be passed on to neighbouring agents. We demonstrate the viability of MPLP as opposed to traditional gradient-based approaches on simple feed-forward neural networks, and present a framework capable of generalizing to non-traditional neural network architectures. MPLP is meta learned using end-to-end gradient-based meta-optimisation.
##### Meta-Learning Bayesian Neural Network Priors Based on PAC-Bayesian Theory
Jonas Rothfuss
Bayesian Neural Networks (BNNs) are a promising approach towards improved uncertainty quantification and sample efficiency. Due to their complex parameter space, choosing informative priors for BNNs is challenging. Thus, often a naive, zero-centered Gaussian is used, resulting both in bad generalization and poor uncertainty estimates when training data is scarce. In contrast, meta-learning aims to extract such prior knowledge from a set of related learning tasks. We propose a principled and scalable algorithm for meta-learning BNN priors based on PAC-Bayesian bounds. Whereas previous approaches require optimizing the prior and multiple variational posteriors in an interdependent manner, our method does not rely on difficult nested optimization problems. Our experiments show that the proposed method is not only computationally more efficient but also yields better predictions and uncertainty estimates when compared to previous meta-learning methods and BNNs with standard priors.
##### How Important is the Train-Validation Split in Meta-Learning?
Yu Bai
Meta-learning aims to perform fast adaptation on a new task through learning a “prior” from multiple existing tasks. A common practice in meta-learning is to perform a *train-validation split* where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct *non-splitting* method, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful on the linear centroid meta-learning problem, in the asymptotic setting where the number of tasks goes to infinity. We show that the splitting method converges to the optimal prior as expected, whereas the non-splitting method does not in general without structural assumptions on the data. In contrast, if the data are generated from linear models (the realizable regime), we show that both the splitting and non-splitting methods converge to the optimal prior. Further, perhaps surprisingly, our main result shows that the non-splitting method achieves a *strictly better* asymptotic excess risk under this data distribution, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that data splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the non-splitting method can indeed outperform the splitting method, on both simulations and real meta-learning tasks.
##### Meta-Learning Initializations for Image Segmentation
Sean Hendryx
We evaluate first-order model agnostic meta-learning algorithms (including FOMAML and Reptile) on few-shot image segmentation, present a novel neural network architecture built for fast learning which we call EfficientLab, and leverage a formal definition of the test error of meta-learning algorithms to decrease error on out of distribution tasks. We show state of the art results on the FSS-1000 dataset by meta-training EfficientLab with FOMAML and using Bayesian optimization to infer the optimal test-time adaptation routine hyperparameters. We also construct a benchmark dataset, binary PASCAL, for the empirical study of how image segmentation meta-learning systems improve as a function of the number of labeled examples. On the binary PASCAL dataset, we show that when generalizing out of meta-distribution, meta-learned initializations provide only a small improvement over joint training in accuracy but require significantly fewer gradient updates. Our code and meta-learned model are available at https://drive.google.com/drive/folders/1VhTJtYQ_byC9woS1fBaRi-hdWksfm5qq?usp=sharing.
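As a rough illustration of the first-order meta-learning family mentioned in this abstract, the following is a minimal sketch of the Reptile update on a toy problem. It is not the EfficientLab setup: the scalar parameter, the quadratic per-task loss, the task optima in `task_centers`, and all step sizes are assumptions chosen only to keep the example self-contained.

```python
from typing import List


def sgd_adapt(w: float, center: float, lr: float = 0.1, steps: int = 5) -> float:
    """Inner loop: a few SGD steps on the toy task loss 0.5 * (w - center) ** 2."""
    for _ in range(steps):
        grad = w - center  # gradient of the toy quadratic loss
        w -= lr * grad
    return w


def reptile(centers: List[float], meta_lr: float = 0.1, meta_steps: int = 300) -> float:
    """Outer loop: move the initialization toward each task's adapted weights."""
    w = 0.0  # meta-learned initialization (a single scalar in this toy setup)
    for step in range(meta_steps):
        center = centers[step % len(centers)]  # cycle through tasks deterministically
        adapted = sgd_adapt(w, center)
        w += meta_lr * (adapted - w)  # Reptile update: interpolate toward adapted weights
    return w


task_centers = [-1.0, 0.0, 4.0]  # hypothetical task optima, not from the paper
init = reptile(task_centers)
print(f"meta-learned initialization: {init:.3f}")
```

In this toy setting the initialization settles close to the mean of the task optima, from which a few gradient steps reach any single task. No second-order derivatives are needed, which is what makes the method first-order.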
##### Open-Set Incremental Learning via Bayesian Prototypical Embeddings
John Willes
As autonomous decision-making agents move from narrow operating environments to unstructured worlds, learning systems must move from a closed-world formulation to an open-world, incremental, few-shot setting in which agents continuously learn new labels from small amounts of information. This stands in stark contrast to modern machine learning systems that are typically designed with a known set of classes and a large number of examples for each class. In this work, we extend embedding-based few-shot learning algorithms toward open-world problems. In particular, we investigate both the lifelong setting, in which an entirely new set of classes exists at evaluation time, and the incremental setting, in which new classes are added to a set of base classes available at training time. We combine Bayesian non-parametric class priors with an embedding-based pre-training scheme to yield a highly flexible framework for use in both the lifelong and the incremental settings. We benchmark our framework on MiniImageNet and show strong performance compared to baseline methods.
##### Learning not to learn: Nature versus nurture in silico
Rob Lange
Animals are equipped with a rich innate repertoire of sensory, behavioral and motor skills, which allows them to interact with the world immediately after birth. At the same time, many behaviors are highly adaptive and can be tailored to specific environments by means of learning. In this work, we use mathematical analysis and the framework of meta-learning (or 'learning to learn') to answer when it is beneficial to learn such an adaptive strategy and when to hard-code a heuristic behavior. We find that the interplay of ecological uncertainty, task complexity and the agents' lifetime has crucial effects on the meta-learned amortized Bayesian inference performed by an agent. There exist two regimes: One in which meta-learning yields a learning algorithm that implements task-dependent information-integration and a second regime in which meta-learning imprints a heuristic or 'hard-coded' behavior. Further analysis reveals that non-adaptive behaviors are not only optimal for aspects of the environment that are stable across individuals, but also in situations where an adaptation to the environment would in fact be highly beneficial, but could not be done quickly enough to be exploited within the remaining lifetime. Hard-coded behaviors should hence not only be those that always work, but also those that are too complex to be learned within a reasonable time frame.
##### Prior-guided Bayesian Optimization
Artur Souza
While Bayesian Optimization (BO) is a very popular method for optimizing expensive black-box functions, it fails to leverage the knowledge of domain experts. This causes BO to waste function evaluations on bad design choices (e.g., machine learning hyperparameters) that the expert already knows to work poorly. To address this issue, we introduce Prior-guided Bayesian Optimization (PrBO). PrBO allows users to transfer their knowledge into the optimization process in the form of priors about which parts of the input space will yield the best performance, rather than BO’s standard priors over functions (which are much less intuitive for users). PrBO then combines these priors with BO’s standard probabilistic model to form a pseudo-posterior used to select which points to evaluate next. We show that PrBO is around 12x faster than state-of-the-art methods without user priors and 10,000x faster than random search on a common suite of benchmarks. PrBO also converges faster even if the user priors are not entirely accurate and robustly recovers from misleading priors.
Luke Metz
We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional neural networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets. As an example application of such a dataset we explore meta-learning an ordered list of hyperparameters to try sequentially. By learning this hyperparameter list from data generated using TaskSet we achieve large speedups in sample efficiency over random search. Next we use the diversity of TaskSet and our method for learning hyperparameter lists to empirically explore the generalization of these lists to new optimization tasks in a variety of settings including ImageNet classification with ResNet50 and LM1B language modeling with transformers. As part of this work we have open-sourced code for all tasks, as well as ~29 million training curves for these problems and the corresponding hyperparameters.
##### Exploring Representation Learning for Flexible Few-Shot Tasks
Mengye Ren
Existing approaches to few-shot learning deal with tasks that have persistent, rigid notions of classes. Typically, the learner observes data only from a fixed number of classes at training time and is asked to generalize to a new set of classes at test time. Two examples from the same class would always be assigned the same labels in any episode. In this work, we consider a realistic setting where the relationship between examples can change from episode to episode depending on the task context, which is not given to the learner. We define two new benchmark datasets for this flexible few-shot scenario, where the tasks are based on images of faces (Celeb-A) and shoes (Zappos50K). While classification baselines learn representations that work well for standard few-shot learning, they suffer in our flexible tasks since the classification criteria shift from training to testing. On the other hand, unsupervised contrastive representation learning with instance-based invariance objectives preserves such flexibility. A combination of instance and class invariance learning objectives is found to perform best on our new flexible few-shot learning benchmarks, and a novel variant of Prototypical Networks is proposed for selecting useful feature dimensions.
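For readers unfamiliar with the Prototypical Networks baseline this abstract builds on, here is a minimal sketch of the standard classification rule: each class prototype is the mean of its support embeddings, and a query is assigned to the nearest prototype. This shows the standard rule only, not the novel variant proposed in the talk. The two-dimensional embeddings and the labels `"a"` and `"b"` are made-up values, and a real system would obtain the embeddings from a learned encoder.

```python
from typing import Dict, List, Sequence


def prototypes(support: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """Compute one prototype per class as the mean of its support embeddings."""
    protos = {}
    for label, embeddings in support.items():
        dim = len(embeddings[0])
        protos[label] = [
            sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)
        ]
    return protos


def classify(query: Sequence[float], protos: Dict[str, List[float]]) -> str:
    """Return the label whose prototype is closest in squared Euclidean distance."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(protos, key=lambda label: sq_dist(query, protos[label]))


# Hypothetical 2-D embeddings for a 2-way, 2-shot episode.
support = {
    "a": [[0.0, 0.0], [0.2, 0.0]],
    "b": [[1.0, 1.0], [1.0, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.3, 0.1], protos)
print(prediction)
```

Because the prototypes are recomputed from each episode's support set, the same embedding can serve episodes with different label assignments, which is the property the flexible few-shot setting stresses.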
##### Hyperparameter Transfer Across Developer Adjustments
Danny Stoll
After developer adjustments to a machine learning (ML) system, how can the results of an old hyperparameter optimization automatically be used to speed up a new hyperparameter optimization? This question poses a challenging problem, as developer adjustments can change which hyperparameter settings perform well, or even the hyperparameter space itself. While many approaches exist that leverage knowledge obtained on previous tasks, so far, knowledge from previous development steps remains entirely untapped. In this work, we remedy this situation and propose a new research framework: hyperparameter transfer across adjustments (HT-AA). To lay a solid foundation for this research framework, we provide four HT-AA baseline algorithms and eight benchmarks. The best baseline, on average, reaches a given performance 2x faster than a prominent HPO algorithm without transfer. As hyperparameter optimization is a crucial step in ML development but requires extensive computational resources, this speedup would lead to faster development cycles, lower costs, and reduced environmental impacts. To make these benefits available to ML developers off-the-shelf, we provide a Python package that implements the proposed transfer algorithm.
##### Towards Meta-Algorithm Selection
Alexander Tornede
Instance-specific algorithm selection (AS) deals with the automatic selection of an algorithm from a fixed set of candidates most suitable for a specific instance of an algorithmic problem class, where "suitability" often refers to an algorithm's runtime. Over the past years, a plethora of algorithm selectors have been proposed. As an AS selector is again an algorithm solving a specific problem, the idea of algorithm selection could also be applied to AS algorithms, leading to a meta-AS approach: Given an instance, the goal is to select an algorithm selector, which is then used to select the actual algorithm for solving the problem instance. We elaborate on consequences of applying AS on a meta-level and identify possible problems. Empirically, we show that meta-algorithm selection can indeed prove beneficial in some cases. In general, however, successful AS approaches have problems with solving the meta-level problem.
##### Continual learning with direction-constrained optimization
Yunfei Teng
This paper studies a new design of the optimization algorithm for training deep learning models with a fixed architecture of the classification network in a continual learning framework, where the training data is non-stationary and the non-stationarity is imposed by a sequence of distinct tasks. This setting implies the existence of a manifold of network parameters that correspond to good performance of the network on all tasks. Our algorithm is derived from the geometrical properties of this manifold. We first analyze a deep model trained on only one learning task in isolation and identify a region in network parameter space, where the model performance is close to the recovered optimum. We provide empirical evidence that this region resembles a cone that expands along the convergence direction. We study the principal directions of the trajectory of the optimizer after convergence and show that traveling along a few top principal directions can quickly bring the parameters outside the cone but this is not the case for the remaining directions. We argue that catastrophic forgetting in a continual learning setting can be alleviated when the parameters are constrained to stay within the intersection of the plausible cones of individual tasks that were so far encountered during training. Enforcing this is equivalent to preventing the parameters from moving along the top principal directions of convergence corresponding to the past tasks. For each task we introduce a new linear autoencoder to approximate its corresponding top forbidden principal directions. They are then incorporated into the loss function in the form of a regularization term for the purpose of learning the coming tasks without forgetting. We empirically demonstrate that our algorithm performs favorably compared to other state-of-the-art regularization-based continual learning methods, including EWC and SI.
##### Meta-Learning of Compositional Task Distributions in Humans and Machines
Sreejan Kumar
Modern machine learning systems struggle with sample efficiency and are usually trained with enormous amounts of data for each task. This is in sharp contrast with humans, who often learn with very little data. In recent years, meta-learning, in which one trains on a family of tasks (i.e. a task distribution), has emerged as an approach to improving the sample complexity of machine learning systems and to closing the gap between human and machine learning. However, in this paper, we argue that current meta-learning approaches still differ significantly from human learning. We argue that humans learn over tasks by constructing compositional generative models and using these to generalize, whereas current meta-learning methods are biased toward the use of simpler statistical patterns. To highlight this difference, we construct a new meta-reinforcement learning task with a compositional task distribution. We also introduce a novel approach to constructing a "null task distribution" with the same statistical complexity as the compositional distribution but without explicit compositionality. We train a standard meta-learning agent, a recurrent network trained with model-free reinforcement learning, and compare it with human performance across the two task distributions. We find that humans do better in the compositional task distribution whereas the agent does better in the non-compositional null task distribution -- despite comparable statistical complexity. This work highlights a particular difference between human learning and current meta-learning models, introduces a task that displays this difference, and paves the way for future work on human-like meta-learning.
##### Learning to Generate Noise for Multi-Attack Robustness
The majority of existing adversarial defense methods are tailored to defend against a single category of adversarial perturbation (e.g. $\ell_\infty$-attack). However, this makes these methods extraneous as the attacker can adopt diverse adversaries to deceive the system. Moreover, training on multiple perturbations simultaneously significantly increases the computational overhead during training. To address these challenges, we propose a novel meta-learning framework that explicitly learns to generate noise to improve the model's robustness against multiple types of attacks. Its key component is Meta Noise Generator (MNG) that outputs optimal noise to stochastically perturb a given sample, such that it helps lower the error on diverse adversarial perturbations. By utilizing samples generated by MNG, we train a model by enforcing the label consistency across multiple perturbations. We validate the robustness of models trained by our scheme on various datasets and against a wide variety of perturbations, demonstrating that it significantly outperforms the baselines across multiple perturbations with a marginal computational cost.
Davide Buffelli
##### Multi-Objective Multi-Fidelity Hyperparameter Optimization with Application to Fairness
Robin Schmucker
In many real-world applications, the performance of machine learning models is evaluated not along a single objective, but across multiple, potentially competing ones. For instance, for a model deciding whether to grant or deny loans, it is critical to make sure decisions are fair and not only accurate. As it is often infeasible to find a single model performing best across all objectives, practitioners are forced to find a trade-off between the individual objectives. While several multi-objective optimization (MO) techniques have been proposed in the machine learning literature (and beyond), little effort has been put towards using MO for hyperparameter optimization (HPO) problems; a task that has gained immense relevance and adoption in recent years. In this paper, we evaluate the suitability of existing MO algorithms for HPO and propose a novel multi-fidelity method for this problem. We evaluate our approach on public datasets with a special emphasis on fairness-motivated applications, and report substantially lower wall-clock times when approximating Pareto frontiers compared to the state-of-the-art.
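The trade-off described in this abstract is usually summarized by a Pareto front: the set of configurations that no other configuration beats on every objective at once. The following is a minimal sketch of that filtering step, assuming all objectives are minimized. The `(error, unfairness)` pairs are made-up numbers, and the quadratic-time scan stands in for the more efficient methods a real multi-objective HPO tool would use.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dominates(q: Point, p: Point) -> bool:
    """True if q is no worse than p on every objective and strictly better on one."""
    no_worse = all(q_i <= p_i for q_i, p_i in zip(q, p))
    strictly_better = any(q_i < p_i for q_i, p_i in zip(q, p))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (validation error, unfairness) pairs for five configurations.
candidates = [(0.10, 0.30), (0.12, 0.10), (0.20, 0.05), (0.15, 0.20), (0.25, 0.25)]
front = pareto_front(candidates)
print(front)
```

Here `(0.15, 0.20)` and `(0.25, 0.25)` are dropped because `(0.12, 0.10)` is better on both objectives, while the three remaining points each represent a different accuracy-fairness trade-off.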
##### Measuring few-shot extrapolation with program induction
Ferran Alet
Neural networks are capable of learning complex functions, but still have problems generalizing from few examples and beyond their training distribution. Meta-learning provides a paradigm to train networks to learn from few examples, but it has been shown that some of its most popular benchmarks do not require significant adaptation to each task nor learning representations that extrapolate beyond the training distribution. Program induction lies at the opposite end of the spectrum: programs are capable of extrapolating from very few examples, but we still do not know how to efficiently search these discrete spaces. We propose a common benchmark for both communities, by learning to extrapolate from few examples coming from the execution of small programs. These are obtained by leveraging a C++ interpreter on codes from programming competitions and extracting small sub-codes with their corresponding input-output pairs. Statistical analysis and preliminary human experiments show the potential of this benchmark for enabling progress in few-shot extrapolation.
##### NAS-Bench-301 and the Case for Surrogate Benchmarks for Neural Architecture Search
Julien Siems
Several tabular NAS benchmarks have been proposed to simulate runs of NAS methods in seconds in order to allow scientifically sound empirical evaluations. However, all existing tabular NAS benchmarks are limited to extremely small architectural spaces since they rely on exhaustive evaluations of the space. This leads to unrealistic results that do not transfer to larger search spaces. Motivated by the fact that similar architectures tend to yield comparable results, we propose NAS-Bench-301 which covers a search space many orders of magnitude larger than any previous NAS benchmark. We achieve this by meta-learning a performance predictor that predicts the capability of different neural architectures to facilitate base-level learning, and using it to define a surrogate benchmark. We fit various regression models on our dataset, which consists of ~60k architecture evaluations, and build surrogates via deep ensembles to also model uncertainty. We benchmark a wide range of NAS algorithms using NAS-Bench-301 and obtain comparable results to the true benchmark at a fraction of the real cost.
##### Model-Agnostic Graph Regularization for Few-Shot Learning
Ethan Z Shen
In many domains, relationships between categories are encoded in a knowledge graph. Recently, promising results have been achieved by incorporating knowledge graphs as side information in hard classification tasks with severely limited data. However, prior models consist of highly complex architectures with many sub-components that all seem to impact performance. In this paper, we present a comprehensive empirical study on graph embedded few-shot learning. We introduce a graph regularization approach that allows deeper understanding of the impact of incorporating graph information between labels. Our proposed regularization is widely applicable and model-agnostic, and boosts performance of any few-shot learning model, including metric-learning, meta-learning, and fine-tuning. Our approach improves strong base learners by up to 2% on Mini-ImageNet and 6.7% on ImageNet-FS, outperforming state-of-the-art models and other graph embedded methods. Additional analyses reveal that graph regularizing models results in lower loss for more difficult tasks such as lower-shot and less informative few-shot episodes.
##### Uniform Priors for Meta-Learning
Samarth Sinha
Deep Neural Networks have shown great promise on a variety of downstream applications, but their ability to adapt and generalize to new data and tasks remains a challenging problem. However, the ability to perform few-shot adaptation to novel tasks is important for the scalability and deployment of machine learning models. It is therefore crucial to understand what makes for good, transferable features in deep networks that best allow for such adaptation. In this paper, we shed light on this by showing that features that are most transferable have high uniformity in the embedding space and propose a uniformity regularization scheme that encourages better transfer and feature reuse for few-shot learning. We evaluate our regularization on few-shot Meta-Learning benchmarks and show that uniformity regularization consistently offers benefits over baseline methods while also being able to achieve state-of-the-art on the Meta-Dataset.
##### Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift
Marvin Zhang
A fundamental assumption of most machine learning algorithms is that the training and test data are drawn from the same underlying distribution. However, this assumption is violated in almost all practical applications: machine learning systems are regularly tested under distribution shift, due to temporal correlations, particular end users, or other factors. In this work, we consider the setting where the training data are structured into groups and test time shifts correspond to changes in the group distribution. Prior work has approached this problem by attempting to be robust to all possible test time distributions, which may degrade average performance. In contrast, we propose to use ideas from meta-learning to learn models that are adaptable, such that they can adapt to shift at test time using a batch of unlabeled test points. We acquire such models by learning to adapt to training batches sampled according to different distributions, which simulate structural shifts that may occur at test time. Our primary contribution is to introduce the framework of adaptive risk minimization (ARM), a formalization of this setting that lends itself to meta-learning. We develop meta-learning methods for solving the ARM problem, and compared to a variety of prior methods, these methods provide substantial gains on image classification problems in the presence of shift.
Cuong C Nguyen
##### HyperVAE: Variational Hyper-Encoding Network
Phuoc Nguyen
We propose a framework called HyperVAE for encoding distributions of distributions. When a target distribution is modeled by a VAE, its neural network parameters $\theta$ are drawn from a distribution $p(\theta)$ which is modeled by a hyper-level VAE. Given a target distribution, we predict the posterior distribution of the latent code, then use a matrix-network decoder to generate a posterior distribution $q(\theta)$. HyperVAE can encode the parameters $\theta$ in full in contrast to common hyper-networks practices, which generate only the scale and bias vectors to modify the target-network parameters. Thus HyperVAE preserves information about the model for each task in the latent space. We evaluate HyperVAE in density estimation tasks, outlier detection and discovery of novel design classes.
##### Meta-Learning via Hypernetworks
Dominic Zhao
Recent developments in few-shot learning have shown that during fast adaptation, gradient-based meta-learners mostly rely on embedding features of powerful pretrained networks. This leads us to research ways to effectively adapt features and utilize the meta-learner's full potential. Here, we demonstrate the effectiveness of hypernetworks in this context. We propose a soft row-sharing hypernetwork architecture and show that training the hypernetwork with a variant of MAML is tightly linked to meta-learning a curvature matrix used to condition gradients during fast adaptation. We achieve results similar to state-of-the-art model-agnostic methods in the overparametrized case, while outperforming many MAML variants without using different optimization schemes in the compressive regime. Furthermore, we empirically show that hypernetworks do leverage the inner loop optimization for better adaptation, and analyse how they naturally try to learn the shared curvature of constructed tasks on a toy problem when using our proposed training algorithm.
##### Learning in Low Resource Modalities via Cross-Modal Generalization
Paul Pu Liang
The natural world is abundant with underlying concepts expressed naturally in multiple heterogeneous sources such as the visual, acoustic, tactile, and linguistic modalities. Despite vast differences in these raw modalities, humans seamlessly perceive multimodal data, learn new concepts, and show extraordinary capabilities in generalizing across input modalities. Much of the existing progress in multimodal learning, however, focuses primarily on problems where the same set of modalities are present at train and test time, which makes learning in low-resource modalities particularly difficult. In this work, we propose a general algorithm for cross-modal generalization: a learning paradigm where data from more abundant source modalities is used to learn useful representations for scarce target modalities. Our algorithm is based on meta-alignment, a novel method to align representation spaces across modalities while ensuring quick generalization to new concepts. Experimental results on generalizing from image to audio classification and from text to speech classification demonstrate strong performance on classifying data from an entirely new target modality with only a few (1-10) labeled samples. In addition, our method works particularly well when the target modality suffers from noisy or limited labels, a scenario particularly prevalent in low-resource modalities.
##### Learning Flexible Classifiers with Shot-CONditional Episodic (SCONE) Training
Eleni Triantafillou
Early few-shot classification work advocates for episodic training, i.e. training over learning episodes each posing a few-shot classification task. However, the role of this training regime remains poorly understood. Standard classification methods ("pre-training") followed by episodic fine-tuning have recently achieved strong results. We aim to understand the role of this episodic fine-tuning phase through an exploration of the effect of the "shot" (number of examples per class) that is used during fine-tuning. We discover that using a fixed shot can specialize the pre-trained model to solving episodes of that shot at the expense of performance on other shots, in agreement with a trade-off recently observed in the context of end-to-end episodic training. To amend this, we propose a shot-conditional form of episodic fine-tuning, inspired by recent work that trains a single model on a distribution of losses. We show that this flexible approach constitutes an effective general solution that does not suffer disproportionately on any shot. We then subject it to the large-scale Meta-Dataset benchmark of varying shots and imbalanced episodes and observe performance gains in that challenging environment.
##### Few-shot Sequence Learning with Transformers
Lajanugen Logeswaran
Few-shot algorithms aim at learning new tasks provided only a handful of training examples. In this work we investigate few-shot learning in the setting where the data points are sequences (or sets) of tokens and propose an efficient learning algorithm based on Transformers. In the simplest setting, we append a token to an input sequence which represents the particular task to be undertaken, and show that the embedding of this token can be optimized on the fly given few labeled examples. Our approach does not require complicated changes to the model architecture such as adapter layers nor computing second order derivatives as is currently popular in the meta-learning and few-shot learning literature. We demonstrate our approach on a variety of tasks, and analyze the generalization properties of several model variants and baseline approaches. In particular, we show that compositional task descriptors can improve performance. Experiments show that our approach works at least as well as other methods, while being more computationally efficient.
Suneel Belkhale
##### Data Augmentation for Meta-Learning
Renkun Ni
Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, sophisticated data augmentation schemes are used to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample not only images, but classes as well. We investigate how data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
##### Pareto-efficient Acquisition Functions for Cost-Aware Bayesian Optimization
Gauthier Guinet
Bayesian optimization (BO) is a popular method to optimize expensive black-box functions. It efficiently tunes machine learning algorithms under the implicit assumption that hyperparameter evaluations cost approximately the same. In reality, the cost of evaluating different hyperparameters, be it in terms of time, dollars or energy, can span several orders of magnitude. While a number of heuristics have been proposed to make BO cost-aware, none of these have been proven to work robustly. In this work, we reformulate cost-aware BO in terms of Pareto efficiency and introduce the cost Pareto Front, a mathematical object allowing us to highlight the shortcomings of commonly used acquisition functions. Based on this, we propose a novel Pareto-efficient adaptation of the expected improvement. On 144 real-world black-box function optimization problems we show that our Pareto-efficient acquisition functions significantly outperform previous solutions, bringing up to 50% speed-ups while providing finer control over the cost-accuracy trade-off. We also revisit the common choice of Gaussian process cost models, showing that simple, low-variance cost models predict training times effectively.
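To make the cost-awareness issue concrete, the sketch below computes the closed-form expected improvement (EI) for minimization under a Gaussian prediction, and compares it with EI divided by evaluation cost, one heuristic commonly used in the literature. It does not implement the Pareto-efficient acquisition function proposed in the talk. The predictive means, standard deviations, and costs are invented for illustration.

```python
import math


def expected_improvement(mu: float, sigma: float, best: float) -> float:
    """Closed-form EI for minimization, given a Gaussian prediction N(mu, sigma^2)."""
    if sigma <= 0.0:
        return max(best - mu, 0.0)  # no uncertainty: improvement is deterministic
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (best - mu) * cdf + sigma * pdf


def ei_per_cost(mu: float, sigma: float, best: float, cost: float) -> float:
    """Cost-aware heuristic: expected improvement per unit of evaluation cost."""
    return expected_improvement(mu, sigma, best) / cost


best_so_far = 0.6  # hypothetical best observed loss
# Hypothetical candidates: (predicted mean, predicted std, evaluation cost).
expensive = (0.5, 0.2, 10.0)
cheap = (0.6, 0.1, 1.0)

ei_expensive = expected_improvement(expensive[0], expensive[1], best_so_far)
ei_cheap = expected_improvement(cheap[0], cheap[1], best_so_far)
print(ei_expensive > ei_cheap)
print(ei_per_cost(*expensive[:2], best_so_far, expensive[2])
      > ei_per_cost(*cheap[:2], best_so_far, cheap[2]))
```

Plain EI prefers the expensive candidate, while EI per unit cost prefers the cheap one. The two criteria can disagree, which is why the way cost enters the acquisition function matters.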
##### Few-Shot Unsupervised Continual Learning through Meta-Examples
Alessia Bertugli
In real-world applications, data do not reflect the ones commonly used for neural networks training, since they are usually few, unbalanced, unlabeled and can be available as a stream. Hence many existing deep learning solutions suffer from a limited range of applications, in particular in the case of online streaming data that evolve over time. To narrow this gap, in this work we introduce a novel and complex setting involving unsupervised meta-continual learning with unbalanced tasks. These tasks are built through a clustering procedure applied to a fitted embedding space. We exploit a meta-learning scheme that simultaneously alleviates catastrophic forgetting and favors the generalization to new tasks. Moreover, to encourage feature reuse during the meta-optimization, we exploit a single inner loop taking advantage of an aggregated representation achieved through the use of a self-attention mechanism. Experimental results on few-shot learning benchmarks show competitive performance even compared to the supervised case. Additionally, we empirically observe that in an unsupervised scenario, the small tasks and the variability in the clusters pooling play a crucial role in the generalization capability of the network. Further, on complex datasets, the exploitation of more clusters than the true number of classes leads to higher results, even compared to the ones obtained with full supervision, suggesting that a predefined partitioning into classes can miss relevant structural information.
##### Meta-Learning Backpropagation And Improving It
Louis Kirsch
In the past a large number of variable update rules have been proposed for meta learning such as fast weights, hyper networks, learned learning rules, and meta recurrent neural networks. We unify these architectures by demonstrating that a single weight-sharing and sparsity principle underlies them that can be used to express complex learning algorithms. We propose a simple implementation of this principle, the Variable Shared Meta RNN, and demonstrate that it allows implementing neuronal dynamics and backpropagation solely by running the recurrent neural network in forward-mode. This offers a direction for backpropagation that is biologically plausible. Then we show how backpropagation itself can be further improved through meta-learning. That is, we can use a human-engineered algorithm as an initialization for meta-learning better learning algorithms.
##### MAster of PuPpets: Model-Agnostic Meta-Learning via Pre-trained Parameters for Natural Language Generation
ChienFu Lin
Pre-trained Transformer-based language models have been an enormous success in generating realistic natural language. However, how to adapt these models to specific domains effectively remains unsolved. On the other hand, Model-Agnostic Meta-Learning (MAML) has been an influential framework for few-shot learning, while how to determine the initial parameters of MAML is still not well-researched. In this paper, we fuse the information from the pre-training stage with meta-learning to learn how to adapt a pre-trained generative model to a new domain. In particular, we find that applying the pre-trained information as the initial state of meta-learning helps the model adapt to new tasks efficiently and is competitive with the state-of-the-art results over evaluation metrics on the Persona dataset. Besides, in few-shot experiments, we show that the proposed model converges significantly faster than naive transfer learning baselines.