
Workshop
Has it Trained Yet? A Workshop for Algorithmic Efficiency in Practical Neural Network Training
Frank Schneider · Zachary Nado · Philipp Hennig · George Dahl · Naman Agarwal

Fri Dec 02 06:30 AM -- 03:00 PM (PST) @ Theater B

Workshop Description

Training contemporary neural networks is a lengthy and often costly process, both in human designer time and compute resources. Although the field has invented numerous approaches, neural network training still usually involves an inconvenient amount of “babysitting” to get the model to train properly. This not only requires enormous compute resources but also makes deep learning less accessible to outsiders and newcomers. This workshop will be centered around the question “How can we train neural networks faster?” by focusing on the effects algorithms (not hardware or software developments) have on the training time of neural networks. These algorithmic improvements can come in the form of novel methods, e.g. new optimizers or more efficient data selection strategies, or through empirical experience, e.g. best practices for quickly identifying well-working hyperparameter settings or informative metrics to monitor during training.

We all think we know how to train deep neural networks, but we all seem to have different ideas. Ask any deep learning practitioner about the best practices of neural network training, and you will often hear a collection of arcane recipes. Frustratingly, these hacks vary wildly between companies and teams. This workshop offers a platform to talk about these ideas, agree on what is actually known, and what is just noise. In this sense, this will not be an “optimization workshop” in the mathematical sense (of which there have been several in the past, of course).

To this end, the workshop’s goal is to connect two communities:

- Researchers who develop new algorithms for faster neural network training, such as new optimization methods or deep learning architectures.
- Practitioners who, through their work on real-world problems, are increasingly relying on “tricks of the trade”.

The workshop aims to close the gap between research and applications, identifying the most relevant current issues that hinder faster neural network training in practice.

Topics

Among the topics addressed by the workshop are:

- What “best practices” for faster neural network training are used in practice and can we learn from them to build better algorithms?
- What are painful lessons learned while training deep learning models?
- What are the most needed algorithmic improvements for neural network training?
- How can we ensure that research on training methods for deep learning has practical relevance?

Important Dates

- Submission Deadline: September 30, 2022, 07:00am UTC (updated!)
- Accept/Reject Notification Date: October 20, 2022, 07:00am UTC (updated!)
- Workshop Date: December 2, 2022

- Fri 6:30 a.m. - 6:40 a.m. Welcome and Opening Remarks (Opening Remarks)
- Fri 6:40 a.m. - 7:10 a.m. Invited Talk by Aakanksha Chowdhery (Talk)
- Fri 7:10 a.m. - 7:20 a.m. Q & A with Aakanksha Chowdhery (Q & A)
- Fri 7:20 a.m. - 7:50 a.m. Benchmarking Training Algorithms by Zachary Nado (Talk)
- Fri 7:50 a.m. - 8:00 a.m. Q & A with Zachary Nado (Q & A)
- Fri 8:00 a.m. - 8:15 a.m. Coffee Break (Break)
- Fri 8:15 a.m. - 8:45 a.m. Invited Talk by Jimmy Ba (Talk)
- Fri 8:45 a.m. - 8:55 a.m. Q & A with Jimmy Ba (Q & A)
- Fri 8:55 a.m. - 9:25 a.m. Invited Talk by Susan Zhang (Talk)
- Fri 9:25 a.m. - 9:35 a.m. Q & A with Susan Zhang (Q & A)
- Fri 9:35 a.m. - 10:00 a.m. Interactive Audience Session (Q & A)
- Fri 10:00 a.m. - 11:30 a.m. Lunch Break (Break)
- Fri 11:30 a.m. - 12:00 p.m. Invited Talk by Boris Dayma (Talk)
- Fri 12:00 p.m. - 12:10 p.m. Q & A with Boris Dayma (Q & A)
- Fri 12:10 p.m. - 1:00 p.m. Poster Session and Open Discussion (Poster Session)
- Fri 1:00 p.m. - 1:15 p.m. Coffee Break (Break)
- Fri 1:15 p.m. - 1:45 p.m. Invited Talk by Stanislav Fort (Talk)
- Fri 1:45 p.m. - 1:55 p.m. Q & A with Stanislav Fort (Q & A)
- Fri 1:55 p.m. - 3:00 p.m. Panel Discussion

- Can Calibration Improve Sample Prioritization? (Poster)
Calibration can reduce overconfident predictions of deep neural networks, but can calibration also accelerate training by selecting the right samples? In this paper, we show that it can. We study the effect of popular calibration techniques in selecting better subsets of samples during training (also called sample prioritization) and observe that calibration can improve the quality of subsets, reduce the number of examples per epoch (by at least 70%), and can thereby speed up the overall training process.
We further study the effect of using calibrated pre-trained models coupled with calibration during training to guide sample prioritization, which again seems to improve the quality of samples selected.
Ganesh Tata · Gautham Krishna Gudur · Gopinath Chennupati · Mohammad Emtiyaz Khan

- Unmasking the Lottery Ticket Hypothesis: Efficient Adaptive Pruning for Finding Winning Tickets (Poster)
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that require less compute and memory but can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is a state-of-the-art algorithm that can find such highly sparse matching subnetworks, known as winning tickets, that can be retrained from initialization or an early training stage. IMP operates by iterative cycles of training, masking a fraction of smallest magnitude weights, rewinding unmasked weights back to an early training point, and repeating. Despite its simplicity, the underlying principles for when and how IMP finds winning tickets remain elusive. In particular, what useful information does an IMP mask found at the end of training convey to a rewound network near the beginning of training? We find that, at higher sparsities, pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training encode information about the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. We leverage this observation to design a simple adaptive pruning heuristic for speeding up the discovery of winning tickets and achieve a 30% reduction in computation time on CIFAR-100.
These results make progress toward demystifying the existence of winning tickets with an eye towards enabling the development of more efficient pruning algorithms.
Mansheej Paul · Feng Chen · Brett Larsen · Jonathan Frankle · Surya Ganguli · Gintare Karolina Dziugaite

- Layover Intermediate Layer for Multi-Label Classification in Efficient Transfer Learning (Poster)
Transfer Learning (TL) is a promising technique to improve the performance of a target task by transferring the knowledge of models trained on relevant source datasets. With the advent of advanced deep models, various methods of exploiting pre-trained deep models at a large scale have come into the limelight. However, for multi-label classification tasks, TL approaches suffer from performance degradation in correctly predicting multiple objects in an image with significant size differences. Since such a hard instance contains imperceptible objects, most pre-trained models lose their ability during downsampling. For the hard instance, this paper proposes a simple but effective classifier for multiple predictions by using the hidden representations from the fixed backbone. To this end, we mix the pre-logit with the intermediate representation with a learnable scale. We show that our method is as effective as fine-tuning with few additional parameters, and is particularly advantageous for hard instances.
Seongha Eom · Taehyeon Kim · Se-Young Yun

- A Scalable Technique for Weak-Supervised Learning with Domain Constraints (Poster)
We propose a novel scalable end-to-end pipeline that uses symbolic domain knowledge as constraints for learning a neural network for classifying unlabeled data in a weak-supervised manner.
Our approach is particularly well-suited for settings where the data consists of distinct groups (classes), which lends itself to clustering-friendly representation learning, and where the domain constraints can be reformulated for use with efficient mathematical optimization techniques by considering multiple training examples at once. We evaluate our approach on a variant of the MNIST image classification problem where a training example consists of image sequences and the sum of the numbers represented by the sequences, and show that our approach scales significantly better than previous approaches that rely on computing all constraint-satisfying combinations for each training example.
Sudhir Agarwal · Anu Sreepathy · Lalla M

- IMPON: Efficient IMPortance sampling with ONline regression for rapid neural network training (Poster)
Modern-day deep learning models are trained efficiently at scale thanks to the widespread use of stochastic optimizers such as SGD and Adam. These optimizers update the model weights iteratively based on a batch of uniformly sampled training data at each iteration. However, it has been previously observed that the training performance and overall generalization ability of the model can be significantly improved by selectively sampling training data based on an importance criterion, known as importance sampling. Previous approaches to importance sampling use metrics such as loss, gradient norm, etc. to calculate the importance scores. These methods either attempt to directly compute these metrics, resulting in increased training time, or aim to approximate these metrics using an analytical proxy, which typically yields inferior training performance.
In this work, we propose a new sampling strategy called IMPON, which computes importance scores based on an auxiliary linear model that regresses the loss of the original deep model, given the current training context, with minimal additional computational cost. Experimental results show that IMPON is able to achieve a significantly high test accuracy, much faster than prior approaches.
Vignesh Ganapathiraman · Francisco Calderon · Anila Joshi

- Relating Regularization and Generalization through the Intrinsic Dimension of Activations (Poster)
Given a pair of models with similar training set performance, it is natural to assume that the model that possesses simpler internal representations would exhibit better generalization. In this work, we provide empirical evidence for this intuition through an analysis of the intrinsic dimension (ID) of model activations, which can be thought of as the minimal number of factors of variation in the model's representation of the data. First, we show that common regularization techniques uniformly decrease the last-layer ID (LLID) of validation set activations for image classification models and show how this strongly affects model generalization performance. We also investigate how excessive regularization decreases a model's ability to extract features from data in earlier layers, leading to a negative effect on validation accuracy even while LLID continues to decrease and training accuracy remains near-perfect. Finally, we examine the LLID over the course of training of models that exhibit grokking. We observe that well after training accuracy saturates, when models “grok” and validation accuracy suddenly improves from random to perfect, there is a co-occurring sudden drop in LLID, thus providing more insight into the dynamics of sudden generalization.
Bradley Brown · Jordan Juravsky · Anthony Caterini · Gabriel Loaiza-Ganem

- The Slingshot Mechanism: An Empirical Study of Adaptive Optimizers and the Grokking Phenomenon (Poster)
The grokking phenomenon reported by Power et al. refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism can be measured by the cyclic phase transitions between stable and unstable training regimes, and can be easily monitored by the cyclic behavior of the norm of the last layer's weights. We empirically observe that without explicit regularization, Grokking as reported by Power et al. almost exclusively happens at the onset of Slingshots, and is absent without them. While common and easily reproduced in more general settings, the Slingshot Mechanism does not follow from any known optimization theories that we are aware of, and can be easily overlooked without an in-depth examination. Our work points to a surprising and useful inductive bias of adaptive gradient optimizers at late stages of training, calling for a revised theoretical analysis of their origin.
Vimal Thilak · Etai Littwin · Shuangfei Zhai · Omid Saremi · Roni Paiss · Joshua Susskind

- Breadth-first pipeline parallelism (Poster)
We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism.
Breadth-First Pipeline Parallelism lowers the training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by making use of fully sharded data parallelism. Experimentally, we observed increases of up to 53% in training speed.
Joel Lamy-Poirier

- Fishy: Layerwise Fisher Approximation for Higher-order Neural Network Optimization (Poster)
We introduce Fishy, a local approximation of the Fisher information matrix at each layer for natural gradient descent training of deep neural networks. The true Fisher approximation for deep networks involves sampling labels from the model's predictive distribution at the output layer and performing a full backward pass. Fishy defines a Bregman exponential family distribution at each layer, performing the sampling locally. Local sampling allows for model parallelism when forming the preconditioner, removing the need for the extra backward pass. We demonstrate our approach through the Shampoo optimizer, replacing its preconditioner gradients with our locally sampled gradients. Our training results on deep autoencoder and VGG16 image classification models indicate the efficacy of our construction.
Abel Peirson · Ehsan Amid · Yatong Chen · Vladimir Feinberg · Manfred Warmuth · Rohan Anil

- Fast Implicit Constrained Optimization of Non-decomposable Objectives for Deep Networks (Poster)
We consider a popular family of constrained optimization problems in machine learning that involve optimizing a non-decomposable objective while constraining another. Different from the previous approach that expresses the classifier thresholds as a function of all model parameters, we consider an alternative strategy where the thresholds are expressed as a function of only a subset of the model parameters, i.e., the last layer of the neural network.
We propose new training procedures that optimize for the bottom and last layers separately, and solve them using standard gradient-based methods. Experiments on a benchmark dataset demonstrate our proposed method achieves performance comparable to the existing approach while being computationally efficient.
Yatong Chen · Abhishek Kumar · Yang Liu · Ehsan Amid

- Efficient regression with deep neural networks: how many datapoints do we need? (Poster)
While large datasets facilitate the learning of a robust representation of the data manifold, the ability to obtain similar performance over small datasets is clearly computationally advantageous. This work considers deep neural networks for regression and aims to better understand how to select datapoints to minimize the neural network training time; a particular focus is on gaining insight into the structure and amount of datapoints needed to learn a robust function representation and how the training time varies for deep and wide architectures.
Daniel Lengyel · Anastasia Borovykh

- Perturbing BatchNorm and Only BatchNorm Benefits Sharpness-Aware Minimization (Poster)
We investigate the connection between two popular methods commonly used in training deep neural networks: Sharpness-Aware Minimization (SAM) and Batch Normalization. We find that perturbing only the affine BatchNorm parameters in the adversarial step of SAM benefits the generalization performance, while excluding them can decrease the performance strongly. We confirm our results across several models and SAM-variants on CIFAR-10 and CIFAR-100 and show preliminary results for ImageNet. Our results provide a practical tweak for training deep networks, but also cast doubt on the commonly accepted explanation of SAM minimizing a sharpness quantity responsible for generalization.
Maximilian Mueller · Matthias Hein

- Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging (Poster)
Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and RoBERTa-Base model on WikiText-103, respectively.
Jean Kaddour

- Batch size selection by stochastic optimal control (Poster)
SGD and its variants are widespread in the field of machine learning. Although there is extensive research on the choice of step-size schedules to guarantee convergence of these methods, there is substantially less work examining the influence of the batch size on optimization. The latter is typically kept constant and chosen via experimental validation. In this work we take a stochastic optimal control perspective to understand the effect of the batch size when optimizing non-convex functions with SGD. Specifically, we define an optimal control problem, which considers the entire trajectory of SGD to choose the optimal batch size for a noisy quadratic model. We show that the batch size is inherently coupled with the step size and that for saddles there is a transition-time $t^*$, after which it is beneficial to increase the batch size to reduce the covariance of the stochastic gradients. We verify our results empirically on various convex and non-convex problems.
Jim Zhao · Aurelien Lucchi · Frank Proske · Antonio Orvieto · Hans Kersting

- MC-DARTS: Model Size Constrained Differentiable Architecture Search (Poster)
Recently, extensive research has been conducted on automated machine learning (AutoML).
Neural architecture search (NAS) in AutoML is a crucial method for automatically optimizing neural network architectures according to the data and its intended use. One promising way to search for a high-accuracy model is gradient-based NAS, known as differentiable architecture search (DARTS). Previous DARTS-based studies have proposed that the size of the optimal architecture depends on the size of the dataset. If the optimal size of the architecture is small, the search for a large model size architecture is unnecessary. The size of the architectures must be considered when deep learning is used on mobile devices and embedded systems since the memory on these platforms is limited. Therefore, in this paper, we propose a novel approach, known as model size constrained DARTS. The proposed approach adds constraints to DARTS to search for a network architecture, considering the accuracy and the model size. As a result, the proposed method can efficiently search for network architectures with short training time and high accuracy under constrained conditions.
Kazuki Hemmi · Yuki Tanigaki · Masaki Onishi

- Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models (Poster)
Adaptive gradient algorithms combine the moving average idea with heavy ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy ball acceleration in theory and in many empirical cases, is much less investigated under the adaptive gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm (Adan) to speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method that avoids the extra computation and memory overhead of computing the gradient at the extrapolation point.
Then Adan adopts NME to estimate the first- and second-order gradient moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate stationary point within $O(\epsilon^{-4})$ stochastic gradient complexity on the non-convex stochastic problems, matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers for vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, MAE, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k.
Xingyu Xie · Pan Zhou · Huan Li · Zhouchen Lin · Shuicheng Yan

- Win: Weight-Decay-Integrated Nesterov Acceleration for Adaptive Gradient Algorithms (Poster)
Training deep networks on increasingly large-scale datasets is computationally challenging. In this work, we explore the problem of “how to accelerate the convergence of adaptive gradient algorithms in a general manner”, and aim at providing practical insights to boost the training efficiency. To this end, we propose an effective Weight-decay-Integrated Nesterov acceleration (Win) for adaptive algorithms to enhance their convergence speed. Taking AdamW and Adam as examples, we minimize a dynamical loss per iteration which combines the vanilla training loss and a dynamic regularizer inspired by the proximal point method (PPM) to improve the convexity of the problem. Then we respectively use the first- and second-order Taylor approximations of the vanilla loss to update the variable twice while fixing the above dynamic regularization brought by PPM.
In this way, we arrive at our Win acceleration (like Nesterov acceleration) for AdamW and Adam that uses a conservative step and a reckless step to update twice and then linearly combines these two updates for acceleration. Next, we extend this Win acceleration to LAMB and SGD. Our transparent acceleration derivation could provide insights for other accelerated methods and their integration into adaptive algorithms. Besides, we prove the convergence of Win-accelerated adaptive algorithms and justify their convergence superiority over their non-accelerated counterparts by taking AdamW and Adam as examples. Experimental results testify to the faster convergence speed and superior performance of our Win-accelerated AdamW, Adam, LAMB and SGD over their vanilla counterparts on vision classification tasks and language modeling tasks with CNN and Transformer backbones.
Pan Zhou · Xingyu Xie · Shuicheng Yan

- Active Learning is a Strong Baseline for Data Subset Selection (Poster)
Data subset selection refers to the process of finding a small subset of training data such that the predictive performance of a classifier trained on it is close to that of a classifier trained on the full training data. A variety of sophisticated algorithms have been proposed specifically for data subset selection. A closely related problem is the active learning problem developed for semi-supervised learning. The key step of active learning is to identify an important subset of unlabeled data by making use of the currently available labeled data. This paper starts with a simple observation: one can apply any off-the-shelf active learning algorithm in the context of data subset selection. The idea is very simple: we pick a small random subset of data and pretend that this random subset is the only labeled data, and the rest is not labeled. By pretending so, one can simply apply any off-the-shelf active learning algorithm.
After each step of sample selection, we can reveal the label of the selected samples (as if we label the chosen samples in the original active learning scenario) and continue running the algorithm until one reaches the desired subset size. We observe that surprisingly, this active learning-based algorithm outperforms all the current data subset selection algorithms on the benchmark tasks. We also perform a simple controlled experiment to understand better why this approach works well. As a result, we find that it is crucial to find a balance between easy-to-classify and hard-to-classify examples when selecting a subset.
Dongmin Park · Dimitris Papailiopoulos · Kangwook Lee

- APE: Aligning Pretrained Encoders to Quickly Learn Aligned Multimodal Representations (Poster)
Recent advances in learning aligned multimodal representations have been primarily driven by training large neural networks on massive, noisy paired-modality datasets. In this work, we ask whether it is possible to achieve similar results with substantially less training time and data. We achieve this by taking advantage of existing pretrained unimodal encoders and careful curation of alignment data relevant to the downstream task of interest. We study a natural approach to aligning existing encoders via small auxiliary functions, and we find that this method is competitive with (or outperforms) state of the art in many settings while being less prone to overfitting, less costly to train, and more robust to distribution shift. With a carefully chosen alignment distribution, our method surpasses prior state of the art for ImageNet zero-shot classification on public data while using two orders of magnitude less time and data and training 77% fewer parameters.
Elan Rosenfeld · Preetum Nakkiran · Hadi Pouransari · Oncel Tuzel · Fartash Faghri

- LOFT: Finding Lottery Tickets through Filter-wise Training (Poster)
In this paper, we explore how one can efficiently identify the emergence of “winning tickets” using distributed training techniques, and use this observation to design efficient pretraining algorithms. Our focus in this work is on convolutional neural networks (CNNs), which are more complex than simple multi-layer perceptrons, but simple enough to expose our ideas. To identify good filters within winning tickets, we propose a novel filter distance metric that represents the model convergence well, without the need to know the true winning ticket or to fully train the model. Our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by such analysis, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers in CNNs by filters to train them independently on different distributed workers, leading to reduced memory and communication costs during pretraining. Experiments show that LoFT (i) preserves and finds good lottery tickets, while (ii) it achieves non-trivial savings in computation and communication, and maintains comparable or even better accuracy than other pretraining methods.
Qihan Wang · Chen Dun · Fangshuo Liao · Christopher Jermaine · Anastasios Kyrillidis

- Trajectory ensembling for fine tuning - performance gains without modifying training (Poster)
In this work, we present a simple algorithm for ensembling checkpoints from a single training trajectory (trajectory ensembling) resulting in significant gains for several fine tuning tasks.
We compare against classical ensembles and perform ablation studies showing that the important checkpoints are not necessarily the best performing models in terms of accuracy. Rather, relatively poor models with low loss are vital for the observed performance gains. We also investigate various mixtures of checkpoints from several independent training trajectories, making the surprising observation that this only leads to marginal gains in this setup. We study how calibrating constituent models with a simple temperature scaling impacts results, and find that the most important region of training is still that of the lowest loss in spite of potential poor accuracy.
Louise Anderson-Conway · Vighnesh Birodkar · Saurabh Singh · Hossein Mobahi · Alexander Alemi

- Training a Vision Transformer from scratch in less than 24 hours with 1 GPU (Poster)
Transformers have become central to recent advances in computer vision. However, training a vision Transformer (ViT) model from scratch can be resource intensive and time consuming. In this paper, we aim to explore approaches to reduce the training costs of ViT models. We introduce some algorithmic improvements to enable training a ViT model from scratch with limited hardware (1 GPU) and time (24 hours) resources. First, we propose an efficient approach to add locality to the ViT architecture. Second, we develop a new image size curriculum learning strategy, which reduces the number of patches extracted from each image at the beginning of the training. Finally, we propose a new variant of the popular ImageNet1k benchmark by adding hardware and time constraints. We evaluate our contributions on this benchmark, and show they can significantly improve performance given the proposed training budget.
Saghar Irandoust · Thibaut Durand · Yunduz Rakhmangulova · Wenjie Zi · Hossein Hajimirsadeghi

- PyHopper - A Plug-and-Play Hyperparameter Optimization Engine (Poster)
Hyperparameter tuning is a fundamental aspect of machine learning research. Setting up the infrastructure for systematic optimization of hyperparameters can take a significant amount of time. Here, we present PyHopper, an open-source black-box optimization platform designed to streamline the hyperparameter tuning workflow of machine learning research. PyHopper's goal is to integrate with existing code with minimal effort and run the optimization process with minimal necessary manual oversight. With simplicity as the primary theme, PyHopper is powered by a single robust Markov-chain Monte-Carlo optimization algorithm that scales to millions of dimensions. Compared to existing tuning packages, focusing on a single algorithm frees the user from having to decide between several algorithms and makes PyHopper easily customizable. PyHopper is publicly available under the Apache-2.0 license at (omitted for anonymity)
Mathias Lechner · Ramin Hasani · Sophie Neubauer · Philipp Neubauer · Daniela Rus

- Faster and Cheaper Energy Demand Forecasting at Scale (Poster)
Energy demand forecasting is one of the most challenging tasks for grid operators. Many approaches have been suggested over the years to tackle it. Yet, those still remain too expensive to train in terms of both time and computational resources, hindering their adoption as customers' behaviors are continuously evolving. We introduce Transplit, a new lightweight transformer-based model, which significantly decreases this cost by exploiting the seasonality property and learning typical days of power demand. We show that Transplit can be run efficiently on CPU and is several hundred times faster than state-of-the-art predictive models, while performing as well.
Fabien Bernier · Matthieu Jimenez · Maxime Cordy · YVES LE TRAON

- Late-Phase Second-Order Training (Poster)
Towards the end of training, stochastic first-order methods such as SGD and Adam go into diffusion and no longer make significant progress. In contrast, Newton-type methods are highly efficient “close” to the optimum, in the deterministic case. Therefore, these methods might turn out to be a particularly efficient tool for the final phase of training in the stochastic deep learning context as well. In our work, we study this idea by conducting an empirical comparison of a second-order Hessian-free optimizer and different first-order strategies with learning rate decays for late-phase training. We show that performing a few costly but precise second-order steps can outperform first-order alternatives in wall-clock runtime.
Lukas Tatzel · Philipp Hennig · Frank Schneider

- SADT: Combining Sharpness-Aware Minimization with Self-Distillation for Improved Model Generalization (Poster)
Methods for improving deep neural network training times and model generalizability consist of various data augmentation, regularization, and optimization approaches, which tend to be sensitive to hyperparameter settings and make reproducibility more challenging. This work jointly considers two recent training strategies that address model generalizability, sharpness-aware minimization and self-distillation, and proposes the novel training strategy of Sharpness-Aware Distilled Teachers (SADT). The experimental section of this work shows that SADT consistently outperforms previously published training strategies in model convergence time, test-time performance, and model generalizability over various neural architectures, datasets, and hyperparameter settings.
MASUD AN NUR ISLAM FAHIM · Jani Boutellier

- Learnable Graph Convolutional Attention Networks (Poster)
Existing Graph Neural Networks (GNNs) compute the message exchange between nodes by either convolving the features of all the neighboring nodes (GCNs), or by applying attention instead (GATs). In this work, we aim at exploiting the strengths of both approaches to their full extent. To this end, we first introduce a graph convolutional attention layer (CAT), which relies on convolutions to compute the attention scores, and theoretically show that there is no clear winner between the three models, as their performance depends on the nature of the data. This brings us to our main contribution, the learnable graph convolutional attention network (L-CAT): a GNN architecture that automatically interpolates between GCN, GAT and CAT in each layer, by introducing two additional (scalar) parameters. Our results demonstrate that L-CAT is able to efficiently combine different GNN layers along the network, outperforming competing methods in a wide range of datasets, and resulting in a more robust model that reduces the need for cross-validation.
Adrián Javaloy · Pablo Sanchez Martin · Amit Levi · Isabel Valera

- When & How to Transfer with Transfer Learning (Poster)
In deep learning, transfer learning (TL) has become the de facto approach when dealing with image-related tasks. Visual features learnt for one task have been shown to be reusable for other tasks, improving performance significantly. By reusing deep representations, TL enables the use of deep models in domains with limited data availability, limited computational resources and/or limited access to human experts, domains which include the vast majority of real-life applications. This paper conducts an experimental evaluation of TL, exploring its trade-offs with respect to performance, environmental footprint, human hours and computational requirements.
Results highlight the cases where a cheap feature extraction approach is preferable, and the situations where an expensive fine-tuning effort may be worth the added cost. Finally, a set of guidelines on the use of TL is proposed.
Adrián Tormos · Dario Garcia-Gasulla · Victor Gimenez-Abalos · Sergio Alvarez-Napagao

- FastCPH: Efficient Survival Analysis for Neural Networks (Poster)
The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates; it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches.
Xuelin Yang · Louis F Abraham · Sejin Kim · Petr Smirnov · Feng Ruan · Benjamin Haibe-Kains · Robert Tibshirani

- Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates (Poster)
Benchmarking the tradeoff between neural network accuracy and training time is computationally expensive. Here we show how a multiplicative cyclic learning rate schedule can be used to construct a tradeoff curve in a single training run. We generate cyclic tradeoff curves for combinations of training methods such as Blurpool, Channels Last, Label Smoothing and MixUp, and highlight how these cyclic tradeoff curves can be used to efficiently evaluate the effects of algorithmic choices on network training.
Jacob Portes · Davis Blalock · Cory Stephenson · Jonathan Frankle

- Feature Encodings for Gradient Boosting with Automunge (Poster)
Selecting a default feature encoding strategy for gradient boosted learning may take into account the training duration and predictive performance associated with each feature representation. The Automunge library for dataframe preprocessing offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations on a series of diverse data sets with tuned gradient boosted learning. We found that on average our chosen defaults were top performers both from a tuning duration and a model performance standpoint. Another key finding was that one-hot encoding did not perform well enough to serve as a categoric default when compared with categoric binarization. We present here these and further benchmarks.
Nicholas Teague
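
As a rough illustration of the encodings compared in the Automunge abstract above, the sketch below contrasts one-hot encoding with categoric binarization and shows z-score normalization for a numeric column. It is a minimal sketch in plain Python and does not use the Automunge API; the function names `one_hot`, `binarize`, and `z_score` are hypothetical, and binarization is assumed here to mean writing each category's ordinal index as fixed-width binary digits.

```python
import statistics


def one_hot(values):
    """One-hot encoding: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def binarize(values):
    """Assumed form of categoric binarization: each category's ordinal
    index written as fixed-width binary digits (most significant first)."""
    categories = sorted(set(values))
    # Number of bits needed to index every category, at least one column.
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    return [
        [(index[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]


def z_score(values):
    """Z-score normalization: subtract the mean, divide by the
    population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("z-score is undefined for a constant column")
    return [(v - mean) / std for v in values]


# Hypothetical toy column with four distinct categories.
colors = ["red", "green", "blue", "green", "yellow", "red"]
print(len(one_hot(colors)[0]), "one-hot columns")  # 4 columns
print(len(binarize(colors)[0]), "binary columns")  # 2 columns
print(z_score([1.0, 2.0, 3.0]))
```

With n distinct categories, one-hot encoding produces n columns while this form of binarization needs only about log2(n), which is one plausible reason the two representations can differ in training duration for gradient boosted models.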