Session

Deep Learning, Applications

Tue 5 Dec. 16:20 - 16:35 PST

Oral
Unsupervised learning of object frames by dense equivariant image labelling

James Thewlis · Hakan Bilen · Andrea Vedaldi

One of the key challenges of visual perception is to extract abstract models of 3D objects and object categories from visual measurements, which are affected by complex nuisance factors such as viewpoint, occlusion, motion, and deformations. Starting from the recent idea of viewpoint factorization, we propose a new approach that, given a large number of images of an object and no other supervision, can extract a dense object-centric coordinate frame. This coordinate frame is invariant to deformations of the images and comes with a dense equivariant labelling neural network that can map image pixels to their corresponding object coordinates. We demonstrate the applicability of this method to simple articulated objects and deformable objects such as human faces, learning embeddings from random synthetic transformations or optical flow correspondences, all without any manual supervision.
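
To make the equivariance constraint concrete, here is a minimal PyTorch-style sketch of the relation the labelling network is trained to satisfy; the network `phi`, the sampling grid, and the squared-error loss are illustrative assumptions standing in for the paper's actual objective, not the authors' implementation.

```python
import torch.nn.functional as F

# Minimal sketch (not the authors' code). `phi` maps an image to a per-pixel
# embedding map; `grid` is a (B, H, W, 2) sampling grid describing a known warp,
# e.g. a random synthetic transformation or one derived from optical flow.
def equivariance_loss(phi, image, grid):
    labels = phi(image)                                               # (B, C, H, W)
    warped_image = F.grid_sample(image, grid, align_corners=False)
    labels_of_warped = phi(warped_image)                              # label after warping
    warped_labels = F.grid_sample(labels, grid, align_corners=False)  # warp after labelling
    return F.mse_loss(labels_of_warped, warped_labels)                # the two should agree
```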

Tue 5 Dec. 16:35 - 16:50 PST

Oral
Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Raymond A. Yeh · Jinjun Xiong · Wen-Mei Hwu · Minh Do · Alex Schwing

Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection of the solution from a set of bounding box proposals obtained from deep net based systems. In this work, we demonstrate that we can cast the problem of textual grounding into a unified framework that permits efficient search over all possible bounding boxes. Hence, we are able to consider significantly more proposals and, due to the unified formulation, our approach does not rely on a successful first stage. Beyond that, we demonstrate that the trained parameters of our model can be used as word embeddings which capture spatial-image relationships and provide interpretability. Lastly, our approach outperforms the current state-of-the-art methods on the Flickr 30k Entities and the ReferItGame datasets by 3.08 and 7.77, respectively.
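
As a rough illustration of why exhaustive search over boxes can be efficient (this is an assumption about the mechanism, not the paper's exact energy): if each query word contributes a per-pixel score map, the score of any bounding box is a sum over that box, which a summed-area table evaluates in constant time.

```python
import numpy as np

# Illustrative only: constant-time box scores from an integral image, which is
# what makes scoring every candidate box tractable when the energy decomposes
# over pixels.
def integral_image(score_map):
    return np.pad(score_map, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def box_score(ii, y0, x0, y1, x1):
    # Sum of the score map over rows y0..y1 and columns x0..x1 (inclusive).
    return ii[y1 + 1, x1 + 1] - ii[y0, x1 + 1] - ii[y1 + 1, x0] + ii[y0, x0]
```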

Tue 5 Dec. 16:50 - 17:05 PST

Oral
Eigen-Distortions of Hierarchical Representations

Alexander Berardino · Valero Laparra · Johannes Ballé · Eero Simoncelli

We develop a method for comparing hierarchical image representations in terms of their ability to explain perceptual sensitivity in humans. Specifically, we utilize Fisher information to establish a model-derived prediction of local sensitivity to perturbations around a given natural image. For a given image, we compute the eigenvectors of the Fisher information matrix with largest and smallest eigenvalues, corresponding to the model-predicted most- and least-noticeable image distortions, respectively. For human subjects, we then measure the amount of each distortion that can be reliably detected when added to the image, and compare these thresholds to the predictions of the corresponding model. We use this method to test the ability of a variety of representations to mimic human perceptual sensitivity. We find that the early layers of VGG16, a deep neural network optimized for object recognition, provide a better match to human perception than later layers, and a better match than a 4-stage convolutional neural network (CNN) trained on a database of human ratings of distorted image quality. On the other hand, we find that simple models of early visual processing, incorporating one or more stages of local gain control, trained on the same database of distortion ratings, predict human sensitivity significantly better than both the CNN and all layers of VGG16.
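
A minimal sketch of the core computation, assuming a differentiable response model with additive white Gaussian response noise so that the Fisher information reduces to J^T J; the dense eigendecomposition below is only practical for small images, and the helper name is illustrative.

```python
import numpy as np

def eigen_distortions(jacobian):
    # jacobian: (num_responses, num_pixels) matrix of df/dx at the image.
    fisher = jacobian.T @ jacobian                 # Fisher information under Gaussian noise
    eigvals, eigvecs = np.linalg.eigh(fisher)      # eigenvalues in ascending order
    least_noticeable = eigvecs[:, 0]               # model-predicted least-visible distortion
    most_noticeable = eigvecs[:, -1]               # model-predicted most-visible distortion
    return most_noticeable, least_noticeable
```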

Tue 5 Dec. 17:05 - 17:10 PST

Spotlight
Towards Accurate Binary Convolutional Neural Network

Xiaofan Lin · Cong Zhao · Wei Pan

We introduce a novel scheme to train binary convolutional neural networks (CNNs) -- CNNs with weights and activations constrained to {-1,+1} at run-time. It has been known that using binary weights and activations drastically reduces memory size and accesses, and can replace arithmetic operations with more efficient bitwise operations, leading to much faster test-time inference and lower power consumption. However, previous works on binarizing CNNs usually result in severe prediction accuracy degradation. In this paper, we address this issue with two major innovations: (1) approximating full-precision weights with the linear combination of multiple binary weight bases; (2) employing multiple binary activations to alleviate information loss. The implementation of the resulting binary CNN, denoted as ABC-Net, is shown to achieve performance much closer to its full-precision counterpart, and even reach comparable prediction accuracy on ImageNet and forest trail datasets, given adequate binary weight bases and activations.
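
A simplified sketch of the weight-approximation step, assuming the binary bases come from sign() at mean- and standard-deviation-shifted thresholds and the combination coefficients from least squares; in the paper the shift parameters are trained rather than fixed, and activations are binarized analogously.

```python
import numpy as np

def binarize_weights(W, shifts=(-1.0, 0.0, 1.0)):
    # Approximate W by sum_m alpha_m * B_m with each B_m in {-1, +1}.
    w = W.ravel()
    mean, std = w.mean(), w.std()
    bases = np.stack([np.sign(w - mean + u * std) for u in shifts], axis=1)
    bases[bases == 0] = 1.0                                  # treat sign(0) as +1
    alphas, *_ = np.linalg.lstsq(bases, w, rcond=None)       # best linear combination
    approx = (bases @ alphas).reshape(W.shape)
    return alphas, approx
```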

Tue 5 Dec. 17:10 - 17:15 PST

Spotlight
Deep Learning for Precipitation Nowcasting: A Benchmark and A New Model

Xingjian Shi · Zhihan Gao · Leonard Lausen · Hao Wang · Dit-Yan Yeung · Wai-kin Wong · Wang-chun WOO

With the goal of making high-resolution forecasts of regional rainfall, precipitation nowcasting has become an important and fundamental technology underlying various public services ranging from rainfall alerts to flight safety. Recently, the convolutional LSTM (ConvLSTM) model has been shown to outperform traditional optical flow based methods for precipitation nowcasting, suggesting that deep learning models have a huge potential for solving the problem. However, the convolutional recurrence structure in ConvLSTM-based models is location-invariant while natural motion and transformation (e.g., rotation) are location-variant in general. Furthermore, since deep-learning-based precipitation nowcasting is a newly emerging area, clear evaluation protocols have not yet been established. To address these problems, we propose both a new model and a benchmark for precipitation nowcasting. Specifically, we go beyond ConvLSTM and propose the Trajectory GRU (TrajGRU) model that can actively learn the location-variant structure for recurrent connections. In addition, we provide a benchmark that includes a real-world large-scale dataset from the Hong Kong Observatory, a new training loss, and a comprehensive evaluation protocol to facilitate future research and gauge the state of the art.
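
A rough PyTorch sketch of the location-variant recurrence idea: a small network predicts several flow fields from the current input and previous state, the previous state is warped along each flow, and the warped copies are mixed with 1x1 convolutions. Module names, kernel sizes, and the use of normalized flow offsets are assumptions for illustration, not the released TrajGRU code, and the gating equations of the full GRU are omitted.

```python
import torch
import torch.nn.functional as F
from torch import nn

class TrajWarp(nn.Module):
    def __init__(self, in_ch, hid_ch, links=5):
        super().__init__()
        self.links = links
        self.flow_net = nn.Conv2d(in_ch + hid_ch, 2 * links, kernel_size=5, padding=2)
        self.mix = nn.Conv2d(hid_ch * links, hid_ch, kernel_size=1)

    def forward(self, x, h_prev):
        b, _, height, width = h_prev.shape
        flows = self.flow_net(torch.cat([x, h_prev], dim=1))        # (B, 2L, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, height),
                                torch.linspace(-1, 1, width), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)  # identity sampling grid
        warped = []
        for l in range(self.links):
            offset = flows[:, 2 * l:2 * l + 2].permute(0, 2, 3, 1)  # (B, H, W, 2), normalized
            warped.append(F.grid_sample(h_prev, base + offset, align_corners=False))
        return self.mix(torch.cat(warped, dim=1))                   # location-variant aggregation
```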

Tue 5 Dec. 17:15 - 17:20 PST

Spotlight
Poincaré Embeddings for Learning Hierarchical Representations

Maximilian Nickel · Douwe Kiela

Representation learning has become an invaluable approach for learning from symbolic data such as text and graphs. However, while complex symbolic datasets often exhibit a latent hierarchical structure, state-of-the-art methods typically learn embeddings in Euclidean vector spaces, which do not account for this property. To address this, we introduce a new approach for learning hierarchical representations of symbolic data by embedding them into hyperbolic space -- or more precisely into an n-dimensional Poincaré ball. Due to the underlying hyperbolic geometry, this allows us to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. We introduce an efficient algorithm to learn the embeddings based on Riemannian optimization and show experimentally that Poincaré embeddings outperform Euclidean embeddings significantly on data with latent hierarchies, both in terms of representation capacity and in terms of generalization ability.
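
The geometry is compact enough to state directly. A minimal NumPy sketch, assuming points stay inside the open unit ball (Poincaré ball model); the projection constant below is an illustrative choice.

```python
import numpy as np

def poincare_distance(u, v):
    # d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    sq_dist = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

def riemannian_sgd_step(theta, euclidean_grad, lr=0.01, eps=1e-5):
    # The Riemannian gradient rescales the Euclidean one by (1 - ||theta||^2)^2 / 4.
    scale = (1.0 - np.sum(theta ** 2)) ** 2 / 4.0
    theta = theta - lr * scale * euclidean_grad
    norm = np.linalg.norm(theta)
    if norm >= 1.0:                                   # project back into the open ball
        theta = (1.0 - eps) * theta / norm
    return theta
```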

Tue 5 Dec. 17:20 - 17:25 PST

Spotlight
Deep Hyperspherical Learning

Weiyang Liu · Yan-Ming Zhang · Xingguo Li · Zhiding Yu · Bo Dai · Tuo Zhao · Le Song

Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, deep hyperspherical convolution networks that are distinct from conventional inner product based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by a generalized angular softmax loss, a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representations and alleviate training difficulty, leading to easier optimization, faster convergence and better classification performance over convolutional counterparts. We also provide some theoretical justifications for the advantages of hyperspherical optimization. Experiments and ablation studies verify our conclusions.
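
A minimal sketch of the operator for a single patch, using the linear variant of SphereConv (the paper also describes other variants); the response depends only on the angle between filter and input, not on their magnitudes.

```python
import numpy as np

def sphere_conv_response(patch, kernel, eps=1e-8):
    cos = np.dot(patch.ravel(), kernel.ravel()) / (
        np.linalg.norm(patch) * np.linalg.norm(kernel) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))        # angle between patch and kernel
    return 1.0 - 2.0 * theta / np.pi                  # linear SphereConv, output in [-1, 1]
```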

Tue 5 Dec. 17:25 - 17:30 PST

Spotlight
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?

Alex Kendall · Yarin Gal

There are two major types of uncertainty one can model. Aleatoric uncertainty captures noise inherent in the observations. On the other hand, epistemic uncertainty accounts for uncertainty in the model -- uncertainty which can be explained away given enough data. Traditionally it has been difficult to model epistemic uncertainty in computer vision, but with new Bayesian deep learning tools this is now possible. We study the benefits of modeling epistemic vs. aleatoric uncertainty in Bayesian deep learning models for vision tasks. For this we present a Bayesian deep learning framework combining input-dependent aleatoric uncertainty together with epistemic uncertainty. We study models under the framework with per-pixel semantic segmentation and depth regression tasks. Further, our explicit uncertainty formulation leads to new loss functions for these tasks, which can be interpreted as learned attenuation. This makes the loss more robust to noisy data, also giving new state-of-the-art results on segmentation and depth regression benchmarks.
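
For regression, the learned attenuation takes a simple form. A minimal sketch, assuming the network has two heads, a prediction and a per-pixel log-variance (predicting the log keeps the loss numerically stable); epistemic uncertainty is captured separately, e.g. with Monte Carlo dropout.

```python
import torch

def heteroscedastic_regression_loss(mean, log_var, target):
    # Residuals are down-weighted where the predicted variance is high, and the
    # log-variance term stops the network from predicting infinite uncertainty.
    precision = torch.exp(-log_var)
    return torch.mean(0.5 * precision * (target - mean) ** 2 + 0.5 * log_var)
```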

Tue 5 Dec. 17:30 - 17:35 PST

Spotlight
One-Sided Unsupervised Domain Mapping

Sagie Benaim · Lior Wolf

In unsupervised domain mapping, the learner is given two unmatched datasets $A$ and $B$. The goal is to learn a mapping $G_{AB}$ that translates a sample in $A$ to its analog in $B$. Recent approaches have shown that when both $G_{AB}$ and the inverse mapping $G_{BA}$ are learned simultaneously, convincing mappings are obtained. In this work, we present a method for learning $G_{AB}$ without learning $G_{BA}$. This is done by learning a mapping that maintains the distance between a pair of samples. Moreover, good mappings are obtained even by maintaining the distance between different parts of the same sample before and after mapping. We present experimental results showing that the new method not only allows for one-sided mapping learning, but also leads to better numerical results than the existing circularity-based constraint. Our entire code will be made publicly available.
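
A minimal sketch of the pairwise distance-preservation term (the paper additionally standardizes distances per domain and proposes a within-sample self-distance variant):

```python
import torch

def distance_preservation_loss(x_i, x_j, g_x_i, g_x_j):
    # The distance between two source samples should match the distance between
    # their mapped versions, which constrains G without learning an inverse map.
    return torch.abs(torch.norm(x_i - x_j) - torch.norm(g_x_i - g_x_j))
```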

Tue 5 Dec. 17:35 - 17:40 PST

Spotlight
Deep Mean-Shift Priors for Image Restoration

Siavash Arjomand Bigdeli · Matthias Zwicker · Paolo Favaro · Meiguang Jin

In this paper we introduce a natural image prior that directly represents a Gaussian-smoothed version of the natural image distribution. We include our prior in a formulation of image restoration as a Bayes estimator that also allows us to solve noise-blind image restoration problems. The gradient of a bound of our estimator involves the gradient of the logarithm of our prior. This gradient corresponds to the mean-shift vector on the natural image distribution, and we learn the mean-shift vector field using denoising autoencoders. We demonstrate competitive results for noise-blind deblurring, super-resolution, and demosaicing.
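
The link between the prior gradient and a denoising autoencoder can be sketched compactly; the function below assumes `dae` was trained to remove Gaussian noise of standard deviation `sigma`, in which case its residual approximates the mean-shift vector, i.e. the gradient of the log of the Gaussian-smoothed prior.

```python
import torch

def log_prior_gradient(dae, x, sigma):
    # (dae(x) - x) / sigma^2 points toward nearby modes of the smoothed image
    # distribution and can drive gradient-based restoration updates.
    with torch.no_grad():
        return (dae(x) - x) / (sigma ** 2)
```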

Tue 5 Dec. 17:40 - 17:45 PST

Spotlight
Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Andrew Gibiansky · Sercan Arik · Gregory Diamos · John Miller · Kainan Peng · Wei Ping · Jonathan Raiman · Yanqi Zhou

We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline to Deep Voice 1 but constructed with higher-performance building blocks, and demonstrates a significant audio quality improvement over Deep Voice 1. We improve Tacotron by introducing a post-processing neural vocoder, and demonstrate a significant audio quality improvement. We then demonstrate our technique for multi-speaker speech synthesis for both Deep Voice 2 and Tacotron on two multi-speaker TTS datasets. We show that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
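
A toy sketch of the speaker-embedding mechanism (module names and the choice of conditioning site are hypothetical, not the Deep Voice 2 architecture): each speaker owns a trainable low-dimensional vector, and individual layers are conditioned on it, here by mapping it to a per-channel gain on a recurrent layer's input.

```python
import torch
from torch import nn

class SpeakerConditionedGRU(nn.Module):
    def __init__(self, num_speakers, speaker_dim, in_dim, hid_dim):
        super().__init__()
        self.speaker_emb = nn.Embedding(num_speakers, speaker_dim)
        self.to_gain = nn.Linear(speaker_dim, in_dim)
        self.gru = nn.GRU(in_dim, hid_dim, batch_first=True)

    def forward(self, x, speaker_id):
        # x: (B, T, in_dim); speaker_id: (B,) integer ids.
        gain = torch.sigmoid(self.to_gain(self.speaker_emb(speaker_id)))  # (B, in_dim)
        out, _ = self.gru(x * gain.unsqueeze(1))      # scale features per speaker
        return out
```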

Tue 5 Dec. 17:45 - 17:50 PST

Spotlight
Graph Matching via Multiplicative Update Algorithm

Bo Jiang · Jin Tang · Chris Ding · Yihong Gong · Bin Luo

Graph matching is a fundamental problem in computer vision and machine learning. This problem can usually be formulated as a Quadratic Programming (QP) problem with doubly stochastic and discrete (integer) constraints. Since it is NP-hard, approximate algorithms are required. In this paper, we present a new algorithm, called Multiplicative Update Graph Matching (MPGM), that develops a multiplicative update technique to solve the QP matching problem. MPGM has three main benefits: (1) Theoretically, MPGM solves the general QP problem with the doubly stochastic constraint naturally and directly, with guaranteed convergence and KKT optimality. (2) Empirically, MPGM generally returns a sparse solution and thus can also incorporate the discrete constraint approximately in its optimization. (3) It is efficient and simple to implement. Experiments on both synthetic and real-world matching tasks show the benefits of the MPGM algorithm.
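
As a generic illustration of the multiplicative-update idea (this is not the paper's exact MPGM update, which additionally enforces the doubly stochastic constraint through its KKT conditions): for maximizing x^T W x with nonnegative W over the simplex, multiplying each entry by its current reward (Wx)_i and renormalizing increases the objective while keeping the iterate nonnegative.

```python
import numpy as np

def multiplicative_update(W, x, iterations=100):
    # Replicator-style update for max x^T W x, x >= 0, sum(x) = 1, with W >= 0.
    for _ in range(iterations):
        x = x * (W @ x)          # entries with larger reward grow multiplicatively
        x = x / x.sum()          # stay on the simplex
    return x
```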

Tue 5 Dec. 17:50 - 17:55 PST

Spotlight
Dynamic Routing Between Capsules

Sara Sabour · Nicholas Frosst · Geoffrey E Hinton

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a large scalar product with the prediction coming from the lower-level capsule.
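
A compact sketch of routing-by-agreement between two capsule layers, following the description above; the tensor layout and the fixed three routing iterations are illustrative choices.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Shrinks short vectors toward zero and long vectors toward unit length.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    # u_hat: (B, num_lower, num_higher, dim) predictions from lower-level capsules.
    b = torch.zeros(u_hat.shape[:-1], device=u_hat.device)   # routing logits
    for _ in range(num_iterations):
        c = F.softmax(b, dim=-1)                              # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)              # weighted sum over lower capsules
        v = squash(s)                                         # higher-level capsule outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)          # agreement update
    return v
```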

Tue 5 Dec. 17:55 - 18:00 PST

Spotlight
Modulating early visual processing by language

Harm de Vries · Florian Strub · Jeremie Mary · Hugo Larochelle · Olivier Pietquin · Aaron Courville

It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature on computational models for language-vision tasks, where visual and linguistic inputs are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network on a language embedding. This approach, which we call MODulated RESidual Networks (MODERN), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of visual processing is beneficial.
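
A minimal sketch of the conditioning mechanism as described above (conditional batch normalization); module names are illustrative, and the single linear layer stands in for whatever small predictor maps the language embedding to per-channel deltas. The pretrained scale and shift stay frozen, so linguistic input modulates feature maps from the earliest stages of the visual pipeline.

```python
import torch
from torch import nn

class ConditionalBatchNorm(nn.Module):
    def __init__(self, num_channels, lang_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_channels), requires_grad=False)   # frozen
        self.beta = nn.Parameter(torch.zeros(num_channels), requires_grad=False)   # frozen
        self.delta = nn.Linear(lang_dim, 2 * num_channels)    # predicts (d_gamma, d_beta)

    def forward(self, x, lang_emb):
        d_gamma, d_beta = self.delta(lang_emb).chunk(2, dim=-1)
        gamma = (self.gamma + d_gamma).unsqueeze(-1).unsqueeze(-1)
        beta = (self.beta + d_beta).unsqueeze(-1).unsqueeze(-1)
        return gamma * self.bn(x) + beta
```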