Track: Spotlights

Mon 7 Dec. 18:33 - 18:34 PST

Learning to Rank by Optimizing NDCG Measure

Hamed Valizadegan · Rong Jin · Ruofei Zhang · Jianchang Mao

Learning to rank is a relatively new field of study, aiming to learn a ranking function from a set of training data with relevancy labels. The ranking algorithms are often evaluated using Information Retrieval measures, such as Normalized Discounted Cumulative Gain [1] and Mean Average Precision [2]. Until recently, most learning to rank algorithms were not using a loss function related to the above mentioned evaluation measures. The main difficulty in direct optimization of these measures is that they depend on the ranks of documents, not the numerical values output by the ranking function. We propose a probabilistic framework that addresses this challenge by optimizing the expectation of NDCG over all the possible permutations of documents. A relaxation strategy is used to approximate the average of NDCG over the space of permutation, and a bound optimization approach is proposed to make the computation efficient. Extensive experiments show that the proposed algorithm outperforms state-of-the-art ranking algorithms on several benchmark data sets.

Mon 7 Dec. 18:34 - 18:35 PST

On Learning Rotations

Raman Arora

An algorithm is presented for online learning of rotations. The proposed algorithm involves matrix exponentiated gradient updates and is motivated by the Von Neumann divergence. The additive updates are skew-symmetric matrices with trace zero which comprise the Lie algebra of the rotation group. The orthogonality and unit determinant of the matrix parameter are preserved using matrix logarithms and exponentials and the algorithm lends itself to interesting interpretations in terms of the computational topology of the compact Lie groups. The stability and the computational complexity of the algorithm are discussed.

Mon 7 Dec. 18:35 - 18:36 PST

Sparsistent Learning of Varying-coefficient Models with Structural Changes

Mladen Kolar · Le Song · Eric Xing

To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a sub-family of this model --- piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models.

Mon 7 Dec. 18:36 - 18:37 PST

Replacing supervised classification learning by Slow Feature Analysis in spiking neural networks

Stefan Klampfl · Wolfgang Maass

Many models for computations in recurrent networks of neurons assume that the network state moves from some initial state to some fixed point attractor or limit cycle that represents the output of the computation. However experimental data show that in response to a sensory stimulus the network state moves from its initial state through a trajectory of network states and eventually returns to the initial state, without reaching an attractor or limit cycle in between. This type of network response, where salient information about external stimuli is encoded in characteristic trajectories of continuously varying network states, raises the question how a neural system could compute with such code, and arrive for example at a temporally stable classification of the external stimulus. We show that a known unsupervised learning algorithm, Slow Feature Analysis (SFA), could be an important ingredient for extracting stable information from these network trajectories. In fact, if sensory stimuli are more often followed by another stimulus from the same class than by a stimulus from another class, SFA approaches the classification capability of Fishers Linear Discriminant (FLD), a powerful algorithm for supervised learning. We apply this principle to simulated cortical microcircuits, and show that it enables readout neurons to learn discrimination of spoken digits and detection of repeating firing patterns within a stream of spike trains with the same firing statistics, without requiring any supervision for learning.

Mon 7 Dec. 18:37 - 18:38 PST

Heavy-Tailed Symmetric Stochastic Neighbor Embedding

Zhirong Yang · Irwin King · Zenglin Xu · Erkki Oja

Stochastic Neighbor Embedding (SNE) has shown to be quite promising for data visualization. Currently, the most popular implementation, t-SNE, is restricted to a particular Student t-distribution as its embedding distribution. Moreover, it uses a gradient descent algorithm that may require users to tune parameters such as the learning step size, momentum, etc., in finding its optimum. In this paper, we propose the Heavy-tailed Symmetric Stochastic Neighbor Embedding (HSSNE) method, which is a generalization of the t-SNE to accommodate various heavy-tailed embedding similarity functions. With this generalization, we are presented with two difficulties. The first is how to select the best embedding similarity among all heavy-tailed functions and the second is how to optimize the objective function once the heave-tailed function has been selected. Our contributions then are: (1) we point out that various heavy-tailed embedding similarities can be characterized by their negative score functions. Based on this finding, we present a parameterized subset of similarity functions for choosing the best tail-heaviness for HSSNE; (2) we present a fixed-point optimization algorithm that can be applied to all heavy-tailed functions and does not require the user to set any parameters; and (3) we present two empirical studies, one for unsupervised visualization showing that our optimization algorithm runs as fast and as good as the best known t-SNE implementation and the other for semi-supervised visualization showing quantitative superiority using the homogeneity measure as well as qualitative advantage in cluster separation over t-SNE.

Mon 7 Dec. 18:38 - 18:39 PST

Spatial Normalized Gamma Processes

Vinayak Rao · Yee Whye Teh

Dependent Dirichlet processes (DPs) are dependent sets of random measures, each being marginally Dirichlet process distributed. They are used in Bayesian nonparametric models when the usual exchangebility assumption does not hold. We propose a simple and general framework to construct dependent DPs by marginalizing and normalizing a single gamma process over an extended space. The result is a set of DPs, each located at a point in a space such that neighboring DPs are more dependent. We describe Markov chain Monte Carlo inference, involving the typical Gibbs sampling and three different Metropolis-Hastings proposals to speed up convergence. We report an empirical study of convergence speeds on a synthetic dataset and demonstrate an application of the model to topic modeling through time.

Mon 7 Dec. 18:39 - 18:40 PST

Rethinking LDA: Why Priors Matter

Hanna Wallach · David Mimno · Andrew McCallum

Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such "smoothing parameters" have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document-topic distributions has substantial advantages over a symmetric prior, while an asymmetric prior over the topic-word distributions provides no real benefit. Approximation of this prior structure through simple, efficient hyperparameter optimization steps is sufficient to achieve these performance gains. The prior structure we advocate substantially increases the robustness of topic models to variations in the number of topics and to the highly skewed word frequency distributions common in natural language. Since this prior structure can be implemented using efficient algorithms that add negligible cost beyond standard inference techniques, we recommend it as a new standard for topic modeling.

Mon 7 Dec. 18:40 - 18:41 PST

Nonparametric Latent Feature Models for Link Prediction

Kurt T Miller · Tom Griffiths · Michael Jordan

As the availability and importance of relational data -- such as the friendships summarized on a social networking website -- increases, it becomes increasingly important to have good models for such data. The kinds of latent structure that have been considered for use in predicting links in such networks have been relatively limited. In particular, the machine learning community has focused on latent class models, adapting nonparametric Bayesian methods to jointly infer how many latent classes there are while learning which entities belong to each class. We pursue a similar approach with a richer kind of latent variable -- latent features -- using a nonparametric Bayesian technique to simultaneously infer the number of features at the same time we learn which entities have each feature. The greater expressiveness of this approach allows us to improve link prediction on three datasets.

Mon 7 Dec. 18:41 - 18:42 PST

A Fast, Consistent Kernel Two-Sample Test

Arthur Gretton · Kenji Fukumizu · Zaid Harchaoui · Bharath Sriperumbudur

A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P=Q) an infinite weighted sum of $\chi^2$ random variables. The main result of the present work is a novel, consistent estimate of this null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q. This estimate may be computed faster than a previous consistent estimate based on the bootstrap. Another prior approach was to compute the null distribution based on fitting a parametric family with the low order moments of the test statistic: unlike the present work, this heuristic has no guarantee of being accurate or consistent. We verify the performance of our null distribution estimate on both an artificial example and on high dimensional multivariate data.

Mon 7 Dec. 18:42 - 18:43 PST

Entropic Graph Regularization in Non-Parametric Semi-Supervised Classification

Amarnag Subramanya · Jeffrey A Bilmes

We prove certain theoretical properties of a graph-regularized transductive learning objective that is based on minimizing a Kullback-Leibler divergence based loss. These include showing that the iterative alternating minimization procedure used to minimize the objective converges to the correct solution and deriving a test for convergence. We also propose a graph node ordering algorithm that is cache cognizant and leads to a linear speedup in parallel computations. This ensures that the algorithm scales to large data sets. By making use of empirical evaluation on the TIMIT and Switchboard I corpora, we show this approach is able to out-perform other state-of-the-art SSL approaches. In one instance, we solve a problem on a 120 million node graph.

Mon 7 Dec. 18:43 - 18:44 PST

Asymptotic Analysis of MAP Estimation via the Replica Method and Compressed Sensing

Sundeep Rangan · Alyson Fletcher · Vivek K Goyal

The replica method is a non-rigorous but widely-used technique from statistical physics used in the asymptotic analysis of many large random nonlinear problems. This paper applies the replica method to non-Gaussian MAP estimation. It is shown that with large random linear measurements and Gaussian noise, the asymptotic behavior of the MAP estimate of an n-dimensional vector ``decouples as n scalar MAP estimators. The result is a counterpart to Guo and Verdus replica analysis on MMSE estimation. The replica MAP analysis can be readily applied to many estimators used in compressed sensing, including basis pursuit, lasso, linear estimation with thresholding and zero-norm estimation. In the case of lasso estimation, the scalar estimator reduces to a soft-thresholding operator and for zero-norm estimation it reduces to a hard-threshold. Among other benefits, the replica method provides a computationally tractable method for exactly computing various performance metrics including MSE and sparsity recovery.

Mon 7 Dec. 18:44 - 18:45 PST

Fast Learning from Non-i.i.d. Observations

Ingo Steinwart · Andreas Christmann

We prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from $\a$-mixing processes. To illustrate this oracle inequality, we use it to derive learning rates for some learning methods including least squares SVMs. Since the proof of the oracle inequality uses recent localization ideas developed for independent and identically distributed (i.i.d.) processes, it turns out that these learning rates are close to the optimal rates known in the i.i.d. case.

Mon 7 Dec. 18:45 - 18:46 PST

Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation

Hamid R Maei · Csaba Szepesvari · Shalabh Batnaghar · Doina Precup · David Silver · Richard Sutton

We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD($\lambda$), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to nonlinear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.

Main Navigation

Session

Spotlights