Session: Spotlights
Tracking Dynamic Sources of Malicious Activity at Internet Scale
Shobha Venkataraman · Avrim Blum · Dawn Song · Subhabrata Sen · Oliver Spatscheck
We formulate and address the problem of discovering dynamic malicious regions on the Internet. We model this problem as one of adaptively pruning a known decision tree, but with additional challenges: (1) severe space requirements, since the underlying decision tree has over 4 billion leaves, and (2) a changing target function, since malicious activity on the Internet is dynamic. We present a novel algorithm that addresses this problem by combining a number of different ``experts'' algorithms and online paging algorithms. We prove guarantees on our algorithm's performance as a function of the best possible pruning of a similar size, and our experiments show that our algorithm achieves high accuracy on large real-world data sets, with significant improvements over existing approaches.
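To give a flavor of the experts-style component, here is a minimal, hypothetical Python sketch (not the paper's algorithm): each IP address activates prefix "experts" at a few granularities, each expert predicts the label it has seen most often, and wrong experts are penalized multiplicatively. The prefix granularities, the penalty rule, and the toy stream are all illustrative assumptions; the paper additionally prunes the prefix tree under a space budget and tracks a drifting target, which this toy omits.

from collections import defaultdict

BETA = 0.7                                  # multiplicative penalty for a wrong expert

weights = defaultdict(lambda: 1.0)          # weight per prefix expert
counts = defaultdict(lambda: [0, 0])        # [benign, malicious] observations per prefix

def prefixes(ip):
    parts = ip.split(".")
    return [".".join(parts[:k]) for k in (1, 2, 3)]   # /8, /16, /24 experts

def predict(ip):
    vote = 0.0
    for p in prefixes(ip):
        benign, malicious = counts[p]
        expert = 1 if malicious > benign else 0       # each expert's own prediction
        vote += weights[p] * (1 if expert == 1 else -1)
    return 1 if vote > 0 else 0                       # weighted vote of the experts

def update(ip, label):
    for p in prefixes(ip):
        benign, malicious = counts[p]
        expert = 1 if malicious > benign else 0
        if expert != label:
            weights[p] *= BETA                        # penalize mistaken experts
        counts[p][label] += 1

stream = [("10.0.0.1", 1), ("10.0.3.7", 1), ("192.168.1.5", 0), ("10.1.2.3", 1)]
for ip, label in stream:                              # label 1 = malicious
    print(ip, "predicted", predict(ip), "actual", label)
    update(ip, label)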
Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining
George Konidaris · Andrew G Barto
We introduce skill chaining, a skill discovery method for reinforcement learning agents in continuous domains that builds chains of skills leading to an end-of-task reward. We demonstrate experimentally that it creates skills that result in performance benefits in a challenging continuous domain.
Efficient Match Kernel between Sets of Features for Visual Recognition
Liefeng Bo · Cristian Sminchisescu
In visual recognition, images are frequently modeled as sets of local features (bags). We show that bag of words, a common method for handling such data, can be viewed as a special match kernel that counts 1 if two local features fall into the same region of the partition induced by the visual words and 0 otherwise. Despite its simplicity, this quantization is too coarse. It is therefore appealing to design match kernels that measure the similarity between local features more accurately. However, it is impractical to use such kernels on large datasets due to their significant computational cost. To address this problem, we propose an efficient match kernel (EMK), which maps local features to a low-dimensional feature space, averages the resulting feature vectors to form a set-level feature, and then applies a linear classifier. The local feature maps are learned so that their inner products preserve, as well as possible, the values of the specified kernel function. EMK is linear both in the number of images and in the number of local features. We demonstrate that EMK is extremely efficient and achieves current state-of-the-art performance on three difficult real-world datasets: Scene-15, Caltech-101 and Caltech-256.
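The pipeline structure (map each local feature to a low-dimensional space, average per image, train a linear classifier) can be sketched in a few lines of Python. In this sketch, random Fourier features approximating a Gaussian match kernel stand in for the paper's learned feature maps; that substitution, the toy bags, and all sizes are our assumptions for illustration only.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
d, D, gamma = 16, 128, 0.1                 # local-descriptor dim, map dim, kernel bandwidth (illustrative)

# Random Fourier features for exp(-gamma * ||x - x'||^2): stand-in for the learned maps.
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def set_feature(bag):
    # Map an (n_i, d) bag of local features and average into one set-level vector.
    z = np.sqrt(2.0 / D) * np.cos(bag @ W.T + b)
    return z.mean(axis=0)

def make_bag(label):
    # Toy "image": 20-40 local descriptors around a class-dependent center.
    center = 2.0 * label * np.ones(d)
    return rng.normal(loc=center, scale=1.0, size=(rng.integers(20, 40), d))

labels = np.array([0, 1] * 60)
X = np.stack([set_feature(make_bag(y)) for y in labels])   # cost linear in #images and #features

clf = LinearSVC().fit(X[:80], labels[:80])                 # linear classifier on set-level features
print("held-out accuracy:", clf.score(X[80:], labels[80:]))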
Orthogonal Matching Pursuit From Noisy Random Measurements: A New Analysis
Alyson Fletcher · Sundeep Rangan
Orthogonal matching pursuit (OMP) is a widely used greedy algorithm for recovering sparse vectors from linear measurements. A well-known analysis of Tropp and Gilbert shows that OMP can recover a k-sparse n-dimensional real vector from m = 4k log(n) noise-free random linear measurements with a probability that goes to one as n goes to infinity. This work strengthens that result by showing that a smaller number of measurements, m = 2k log(n-k), is in fact sufficient for asymptotic recovery. Moreover, this number of measurements is also sufficient for detection of the sparsity pattern (support) of the vector in the presence of measurement noise, provided the signal-to-noise ratio (SNR) scales to infinity. The scaling m = 2k log(n-k) exactly matches the number of measurements required by the more complex lasso for signal recovery.
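For readers unfamiliar with OMP, here is a minimal Python sketch of the greedy procedure the abstract analyzes (the dimensions, seed, and the generous Tropp-Gilbert measurement count are illustrative choices of ours, not tied to the paper's asymptotic regime).

import numpy as np

def omp(A, y, k):
    # Greedy orthogonal matching pursuit: recover a k-sparse x with y ~ A x.
    m, n = A.shape
    residual, support = y.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))      # column most correlated with the residual
        support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef             # re-fit on the support, update residual
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(1)
n, k = 256, 5
m = int(4 * k * np.log(n))                # Tropp-Gilbert scaling, generous at this small n
A = rng.normal(size=(m, n)) / np.sqrt(m)  # random Gaussian measurement matrix
true_support = rng.choice(n, k, replace=False)
x = np.zeros(n); x[true_support] = 3 + rng.random(k)
y = A @ x + 0.01 * rng.normal(size=m)     # lightly noisy measurements
print("true support:     ", sorted(true_support.tolist()))
print("recovered support:", sorted(np.flatnonzero(omp(A, y, k)).tolist()))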
Learning to Hash with Binary Reconstructive Embeddings
Brian Kulis · Trevor Darrell
Fast retrieval methods are increasingly critical for many large-scale analysis tasks, and there have been several recent methods that attempt to learn hash functions for fast and accurate nearest neighbor searches. In this paper, we develop an algorithm for learning hash functions based on explicitly minimizing the reconstruction error between the original distances and the Hamming distances of the corresponding binary embeddings. We develop a scalable coordinate-descent algorithm for our proposed hashing objective that is able to efficiently learn hash functions in a variety of settings. Unlike existing methods such as semantic hashing and spectral hashing, our method is easily kernelized and does not require restrictive assumptions about the underlying distribution of the data. We present results over several domains to demonstrate that our method outperforms existing state-of-the-art techniques.
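Schematically (our notation, not the paper's exact formulation), the learning objective has the form
\[
\min_{h}\; \sum_{(i,j)\in\mathcal{P}} \Big( \tilde{d}(\mathbf{x}_i,\mathbf{x}_j) \;-\; d_H\big(h(\mathbf{x}_i),\,h(\mathbf{x}_j)\big) \Big)^2 ,
\]
where $\tilde{d}$ is a suitably scaled input-space distance, $d_H$ is a correspondingly scaled Hamming distance between the $b$-bit codes $h(\cdot)$, and $\mathcal{P}$ is a set of training pairs; the coordinate-descent algorithm updates one hash-function parameter at a time so as to decrease this reconstruction error.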
Perceptual Multistability as Markov Chain Monte Carlo Inference
Samuel J Gershman · Edward Vul · Josh Tenenbaum
While many perceptual and cognitive phenomena are well described in terms of Bayesian inference, the necessary computations are intractable at the scale of real-world tasks, and it remains unclear how the human mind approximates Bayesian inference algorithmically. We explore the proposal that for some tasks, humans use a form of Markov Chain Monte Carlo to approximate the posterior distribution over hidden variables. As a case study, we show how several phenomena of perceptual multistability can be explained as MCMC inference in simple graphical models for low-level vision.
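A toy Python sketch conveys the qualitative idea (it is not the paper's model): a Metropolis sampler on a bimodal one-dimensional "posterior" dwells near one interpretation and occasionally switches to the other, mimicking perceptual alternation. The target density, proposal scale, and chain length are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(2)

def log_post(x):
    # Bimodal toy posterior over a latent percept: two interpretations at -2 and +2.
    return np.logaddexp(-0.5 * (x + 2) ** 2 / 0.5, -0.5 * (x - 2) ** 2 / 0.5)

x, chain = -2.0, []
for _ in range(20000):
    proposal = x + rng.normal(scale=1.0)                 # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(x):
        x = proposal                                     # Metropolis accept
    chain.append(x)

percept = np.sign(chain)                                 # which interpretation the state favors
switches = int(np.sum(percept[1:] != percept[:-1]))
print("fraction of time in the + interpretation:", float(np.mean(percept > 0)))
print("number of perceptual switches:", switches)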
Information-theoretic lower bounds on the oracle complexity of convex optimization
Alekh Agarwal · Peter Bartlett · Pradeep Ravikumar · Martin J Wainwright
Despite the large literature on upper bounds on the complexity of convex optimization, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical for characterizing the fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss the implications of these results for understanding the inherent complexity of large-scale learning and estimation problems.
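For orientation (our notation; this is the standard setup for such oracle models, not a statement of the paper's specific results): a method $\mathsf{M}$ issues $T$ queries to a stochastic first-order oracle that returns noisy values and (sub)gradients of the unknown objective, and the quantity being lower-bounded is the minimax optimization error
\[
\epsilon^{*}(T;\mathcal{F},\mathcal{S}) \;=\; \inf_{\mathsf{M}} \; \sup_{f \in \mathcal{F}} \; \mathbb{E}\Big[ f\big(x_T^{\mathsf{M}}\big) \;-\; \inf_{x \in \mathcal{S}} f(x) \Big],
\]
the best worst-case excess objective value any method can guarantee over a function class $\mathcal{F}$ on a feasible set $\mathcal{S}$ after $T$ oracle calls.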
Hierarchical Learning of Dimensional Biases in Human Categorization
Katherine Heller · Adam Sanborn · Nick Chater
Existing models of categorization typically represent to-be-classified items as points in a multidimensional space. While from a mathematical point of view, an infinite number of basis sets can be used to represent points in this space, the choice of basis set is psychologically crucial. People generally choose the same basis dimensions, and have a strong preference to generalize along the axes of these dimensions, but not "diagonally". What makes some choices of dimension special? We explore the idea that the dimensions used by people echo the natural variation in the environment. Specifically, we present a rational model that does not assume dimensions, but learns the same type of dimensional generalizations that people display. This bias is shaped by exposing the model to many categories with a structure hypothesized to be like those which children encounter. Our model can be viewed as a type of transformed Dirichlet process mixture model, where it is the learning of the base distribution of the Dirichlet process which allows dimensional generalization. The learning behaviour of our model captures the developmental shift from the roughly "isotropic" generalization seen in children to the axis-aligned generalization that adults show.
Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness
Garvesh Raskutti · Martin J Wainwright · Bin Yu
This paper uses information-theoretic techniques to determine minimax rates for estimating nonparametric sparse additive regression models under high-dimensional scaling. We assume an additive decomposition of the form $f^*(X_1, \ldots, X_p) = \sum_{j \in S} h_j(X_j)$, where each component function $h_j$ lies in some Hilbert space $\mathcal{H}$ and $S \subset \{1, \ldots, p\}$ is an unknown subset with cardinality $s = |S|$. Given $n$ i.i.d. observations of $f^*(X)$ corrupted with white Gaussian noise, where the covariate vectors $(X_1, X_2, \ldots, X_p)$ are drawn with i.i.d. components from some distribution $\mathbb{P}$, we determine tight lower bounds on the minimax rate for estimating the regression function with respect to squared $L^2(\mathbb{P})$ error. The main result shows that the minimax rates are $\max\big(\frac{s \log (p/s)}{n}, \; \epsilon_n^2(s;\mathcal{H}) \big)$. The first term reflects the difficulty of performing \emph{subset selection} and is independent of the Hilbert space $\mathcal{H}$; the second term $\epsilon_n^2(s;\mathcal{H})$ is an \emph{$s$-dimensional estimation} term, depending only on the low dimension $s$ but not the ambient dimension $p$, that captures the difficulty of estimating a sum of $s$ univariate functions in the Hilbert space $\mathcal{H}$. As a special case, if $\mathcal{H}$ corresponds to the $m$-th order Sobolev space of functions that are $m$-times differentiable, the $s$-dimensional estimation term takes the form $\epsilon_n^2(s;\mathcal{H}) \asymp s \, n^{-2m/(2m+1)}$. The minimax rates are compared with the rates achieved by an $\ell_1$-penalty based approach, and it can be shown that a certain $\ell_1$-based approach achieves the minimax-optimal rate.
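To make the tradeoff concrete (a direct instantiation of the bound above, constants suppressed), in the Sobolev case the lower bound reads
\[
\max\Big( \frac{s \log(p/s)}{n}, \; s\, n^{-2m/(2m+1)} \Big),
\]
so for twice-differentiable components ($m = 2$) the estimation term scales as $s\,n^{-4/5}$, and the subset-selection term dominates once $\log(p/s) \gtrsim n^{1/5}$.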
Neural Implementation of Hierarchical Bayesian Inference by Importance Sampling
Lei Shi · Tom Griffiths
The goal of perception is to infer the hidden states in the hierarchical process by which sensory data are generated. Human behavior is consistent with the optimal statistical solution to this problem in many tasks, including cue combination and orientation detection. Understanding the neural mechanisms underlying this behavior is of particular importance, since probabilistic computations are notoriously challenging. Here we propose a simple mechanism for Bayesian inference which involves averaging over a few feature detection neurons which fire at a rate determined by their similarity to a sensory stimulus. This mechanism is based on a Monte Carlo method known as importance sampling, commonly used in computer science and statistics. Moreover, a simple extension to recursive importance sampling can be used to perform hierarchical Bayesian inference. We identify a scheme for implementing importance sampling with spiking neurons, and show that this scheme can account for human behavior in cue combination and the oblique effect.
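A minimal Python sketch of importance sampling for cue combination (not the paper's neural implementation): samples from a prior/proposal are weighted by the likelihood of two noisy cues, and the weighted average matches the analytic precision-weighted posterior mean. The cue values, noise levels, and prior are toy choices of ours.

import numpy as np

rng = np.random.default_rng(3)

# Latent stimulus s with a Gaussian prior; two noisy cues about s (e.g. visual, haptic).
prior_mu, prior_sigma = 0.0, 3.0
cue1, sigma1 = 1.0, 0.5
cue2, sigma2 = 2.0, 1.0

# Proposal = prior: each sampled value plays the role of a "feature detector", and its
# importance weight is the likelihood of the observed cues there (the prior cancels
# because it is also the proposal).
samples = rng.normal(prior_mu, prior_sigma, size=5000)
log_w = (-0.5 * ((cue1 - samples) / sigma1) ** 2
         - 0.5 * ((cue2 - samples) / sigma2) ** 2)
w = np.exp(log_w - log_w.max())
w /= w.sum()
is_mean = np.sum(w * samples)              # self-normalized importance-sampling estimate

# Analytic posterior mean: precision-weighted combination of prior and cues.
p0, p1, p2 = 1 / prior_sigma**2, 1 / sigma1**2, 1 / sigma2**2
analytic = (p0 * prior_mu + p1 * cue1 + p2 * cue2) / (p0 + p1 + p2)
print(f"importance sampling: {is_mean:.3f}   analytic posterior mean: {analytic:.3f}")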
Posterior vs Parameter Sparsity in Latent Variable Models
Joao V Graca · Kuzman Ganchev · Ben Taskar · Fernando Pereira
In this paper we explore the problem of biasing unsupervised models to favor sparsity. We extend the posterior regularization framework [8] to encourage the model to achieve posterior sparsity on the unlabeled training data. We apply this new method to learn first-order HMMs for unsupervised part-of-speech (POS) tagging, and show that HMMs learned this way consistently and significantly outperform both EM-trained HMMs and HMMs with a sparsity-inducing Dirichlet prior trained by variational EM. We evaluate these HMMs on three languages — English, Bulgarian and Portuguese — under four conditions. We find that our method always improves performance with respect to both baselines, while variational Bayes actually degrades performance in most cases. We increase accuracy with respect to EM by 2.5%-8.7% absolute, and we see improvements even in a semi-supervised condition where a limited dictionary is provided.
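Schematically (our notation, following the general posterior regularization framework of [8] rather than the paper's exact objective): instead of the exact E-step posterior, training uses
\[
q^{*} \;=\; \arg\min_{q} \; \mathrm{KL}\big(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z}\mid\mathbf{x})\big) \;+\; \sigma \, R\big(\mathbb{E}_q[\boldsymbol{\phi}(\mathbf{x},\mathbf{z})]\big),
\]
where $\boldsymbol{\phi}$ collects word-tag indicator features and $R$ is a sparsity-inducing penalty on their expected counts (an $\ell_1/\ell_\infty$-type term encouraging each word type to concentrate its posterior mass on few tags); the penalized posterior $q^{*}$ then replaces the exact posterior in the M-step.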
Structural inference affects depth perception in the context of potential occlusion
Ian H Stevenson · Konrad Koerding
In many domains, humans appear to combine perceptual cues in a near-optimal, probabilistic fashion: two noisy pieces of information tend to be combined linearly with weights proportional to the precision of each cue. Here we present a case where structural information plays an important role. The presence of a background cue gives rise to the possibility of occlusion, and places a soft constraint on the location of a target – in effect propelling it forward. We present an ideal observer model of depth estimation for this situation where structural or ordinal information is important and then fit the model to human data from a stereo-matching task. To test whether subjects are truly using ordinal cues in a probabilistic manner we then vary the uncertainty of the task. We find that the model accurately predicts shifts in subjects' behavior. Our results indicate that the nervous system estimates depth ordering in a probabilistic fashion and estimates the structure of the visual scene during depth perception.
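The linear combination rule referenced in the first sentence, written out (a standard cue-combination result, our notation): for two Gaussian cues with means $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$, the combined estimate is
\[
\hat{d} \;=\; \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2}\,\mu_1 \;+\; \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\,\mu_2,
\]
i.e. weights proportional to each cue's precision $1/\sigma_i^2$. The occlusion cue studied here instead supplies ordinal information, a soft constraint that the target lies in front of the background, which cannot be captured by such a fixed linear weighting.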
Bilinear classifiers for visual recognition
Hamed Pirsiavash · Deva Ramanan · Charless Fowlkes
We describe an algorithm for learning bilinear SVMs. Bilinear classifiers are a discriminative variant of bilinear models, which capture the dependence of data on multiple factors. Such models are particularly appropriate for visual data that is better represented as a matrix or tensor rather than a vector. Matrix encodings allow for more natural regularization through rank restriction. For example, a rank-one scanning-window classifier yields a separable filter. Low-rank models have fewer parameters and so are easier to regularize and faster to score at run-time. We learn low-rank models with bilinear classifiers. We also use bilinear classifiers for transfer learning by sharing linear factors between different classification tasks. Bilinear classifiers are trained with biconvex programs. Such programs are optimized with coordinate descent, where each coordinate step requires solving a convex program; in our case, we use a standard off-the-shelf SVM solver. We demonstrate bilinear SVMs on the difficult problems of people detection and action classification in video sequences, achieving state-of-the-art results in both.
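The biconvex coordinate descent can be sketched in Python: parameterize W = U V^T and alternate between solving a standard linear SVM in U with V fixed and in V with U fixed, since the score <W, X> is linear in each factor given the other. The toy data, rank, sizes, and the per-factor regularization implied by each SVM call are illustrative assumptions of ours, not the paper's exact objective.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
d1, d2, r, n = 8, 6, 2, 200                 # matrix feature size, rank, sample count (illustrative)

# Toy data: matrix-valued examples whose label depends on a planted rank-1 pattern.
W_true = np.outer(rng.normal(size=d1), rng.normal(size=d2))
X = rng.normal(size=(n, d1, d2))
y = np.sign(np.einsum('ijk,jk->i', X, W_true))

U = rng.normal(size=(d1, r))
V = rng.normal(size=(d2, r))

def fit_linear_svm(feats, labels):
    clf = LinearSVC(C=1.0, fit_intercept=False, max_iter=5000).fit(feats, labels)
    return clf.coef_.ravel()

for _ in range(5):                          # biconvex coordinate descent
    # V fixed: the score <U V^T, X_i> is linear in vec(U) with features X_i @ V.
    feats_U = (X @ V).reshape(n, -1)
    U = fit_linear_svm(feats_U, y).reshape(d1, r)
    # U fixed: the score is linear in vec(V) with features X_i^T @ U.
    feats_V = np.einsum('ikj,kl->ijl', X, U).reshape(n, -1)
    V = fit_linear_svm(feats_V, y).reshape(d2, r)

scores = np.einsum('ijk,jk->i', X, U @ V.T)
print("train accuracy:", float(np.mean(np.sign(scores) == y)))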