

Spotlight

Statistical Analysis of Semi-Supervised Regression

John Lafferty · Larry Wasserman
Dec 3, 8:10 PM - 8:25 PM

Semi-supervised methods use unlabeled data in addition to labeled data to construct predictors. While existing semi-supervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper we study semi-supervised learning from the viewpoint of minimax theory. Our first result shows that some common methods based on manifold regularization and graph Laplacians do not lead to faster minimax rates of convergence. Thus, the estimators that use the unlabeled data do not have smaller risk than the estimators that use only labeled data. We then develop several new approaches that provably lead to improved performance. The statistical tools of minimax analysis are thus used to offer some new perspective on the problem of semi-supervised learning.

Spotlight

Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization

XuanLong Nguyen · Martin J Wainwright · Michael Jordan
Dec 3, 8:10 PM - 8:25 PM
We develop and analyze an algorithm for nonparametric estimation of divergence functionals and the density ratio of two probability distributions. Our method is based on a non-asymptotic variational characterization of $f$-divergences, which turns the estimation problem into a penalized convex risk minimization problem. We present a derivation of our kernel-based estimation algorithm and an analysis of convergence rates for the estimator. Our simulation results demonstrate the convergence behavior of our method, which compares favorably with existing methods in the literature.
Spotlight

Boosting the Area under the ROC Curve

Phil Long · Rocco A Servedio
Dec 3, 8:10 PM - 8:25 PM
We show that any weak ranker that can achieve an area under the ROC curve slightly better than $1/2$ (which can be achieved by random guessing) can be efficiently boosted to achieve an area under the ROC curve arbitrarily close to 1. We further show that this boosting can be performed even in the presence of independent misclassification noise, given access to a noise-tolerant weak ranker.
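
Since the result is stated in terms of the area under the ROC curve, it helps to make the pairwise definition concrete: the AUC of a ranker is the probability that a randomly drawn positive example is scored above a randomly drawn negative one, with ties counted as half. The minimal Python sketch below, with hypothetical scores and labels, computes just that evaluation measure; it is not the boosting procedure from the paper.

```python
import numpy as np

def auc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half; equivalent to the area under the ROC curve."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # all positive/negative score pairs
    return np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Hypothetical weak ranker: a faint signal gives an AUC only slightly above 1/2.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
weak_scores = 0.1 * labels + rng.normal(size=2000)
print(auc(weak_scores, labels))                 # typically around 0.52-0.53
```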
Spotlight

The Value of Labeled and Unlabeled Examples when the Model is Imperfect

Kaushik Sinha · Mikhail Belkin
Dec 3, 8:10 PM - 8:25 PM

Semi-supervised learning, i.e., learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still, our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in \cite{ratsaby95,castelli96a}. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, ``cluster'' and ``manifold'' assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in \cite{ratsaby95,castelli96a}, has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, when, for example, a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

Spotlight

A Bayesian LDA-based model for semi-supervised part-of-speech tagging

Kristina N Toutanova · Mark Johnson
Dec 3, 8:10 PM - 8:25 PM

We present a novel Bayesian statistical model for semi-supervised part-of-speech tagging. Our model extends the Latent Dirichlet Allocation (LDA) model and incorporates the intuition that words' distributions over tags, p(t|w), are sparse. In addition we introduce a model for determining the set of possible tags of a word which captures important dependencies in the ambiguity classes of words. Our model outperforms the best previously proposed model for this task on a standard dataset.

Spotlight

Resampling Methods for Protein Structure Prediction with Rosetta

Ben Blum · David Baker · Michael Jordan · Philip Bradley · Rhiju Das · David Kim
Dec 3, 8:10 PM - 8:25 PM

Rosetta is one of the leading algorithms for protein structure prediction today. It is a Monte Carlo energy minimization method requiring many random restarts to find structures with low energy. In this paper we present a resampling technique for structure prediction of small alpha/beta proteins using Rosetta. From an initial round of Rosetta sampling, we learn properties of the energy landscape that guide a subsequent round of sampling toward lower-energy structures. Rather than attempt to fit the full energy landscape, we use feature selection methods--L1-regularized linear regression--to identify structural features that give rise to low energy. We then enrich these structural features in the second sampling round. Results are presented across a benchmark set of nine small alpha/beta proteins demonstrating that our methods seldom impair, and frequently improve, Rosetta's performance.

Spotlight

Structured Learning with Approximate Inference

Alex Kulesza · Fernando Pereira
Dec 3, 8:10 PM - 8:25 PM

In many structured prediction problems, the highest-scoring labeling is hard to compute exactly, leading to the use of approximate inference methods. However, when inference is used in a learning algorithm, a good approximation of the score may not be sufficient. We show in particular that learning can fail even with an approximate inference method with rigorous approximation guarantees. There are two reasons for this. First, approximate methods can effectively reduce the expressivity of an underlying model by making it impossible to choose parameters that reliably give good predictions. Second, approximations can respond to parameter changes in such a way that standard learning algorithms are misled. In contrast, we give two positive results in the form of learning bounds for the use of LP-relaxed inference in structured perceptron and empirical risk minimization settings. We argue that learning performance under approximate inference cannot be guaranteed without understanding which combinations of inference and learning, such as these, are appropriately compatible.

Spotlight

Convex Clustering with Exemplar-Based Models

Danial Lashkari · Polina Golland
Dec 3, 8:10 PM - 8:25 PM

Clustering is often formulated as the maximum likelihood estimation of a mixture model that explains the data. The EM algorithm widely used to solve the resulting optimization problem is inherently a gradient-descent method and is sensitive to initialization. The resulting solution is a local optimum in the neighborhood of the initial guess. This sensitivity to initialization presents a significant challenge in clustering large data sets into many clusters. In this paper, we present a different approach to approximate mixture fitting for clustering. We introduce an exemplar-based likelihood function that approximates the exact likelihood. This formulation leads to a convex minimization problem and an efficient algorithm with \emph{guaranteed convergence to the globally optimal solution}. The resulting clustering can be thought of as a probabilistic mapping of the data points to the set of exemplars that minimizes the average distance and the information-theoretic cost of mapping. We present experimental results illustrating the performance of our algorithm and its comparison with the conventional approach to mixture model clustering.

Spotlight

Regret Minimization in Games with Incomplete Information

Martin A Zinkevich · Michael Johanson · Michael Bowling · Carmelo Piccione
Dec 3, 8:10 PM - 8:25 PM

Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold'em with as many as 10^12 states, two orders of magnitude larger than previous methods.
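
For intuition about the regret-minimization machinery, here is a minimal regret-matching sketch on a toy zero-sum matrix game: as cumulative regret per round shrinks, the players' average strategies approach a Nash equilibrium. This illustrates the general principle the paper builds on, not the counterfactual-regret construction for extensive games; the game and iteration count are illustrative.

```python
import numpy as np

def normalize_positive(regret):
    """Regret matching: play actions in proportion to positive cumulative regret."""
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(regret), 1.0 / len(regret))

# Rock-paper-scissors payoff matrix for the row player (zero-sum game).
A = np.array([[0., -1., 1.],
              [1., 0., -1.],
              [-1., 1., 0.]])
rng = np.random.default_rng(0)
n_rows, n_cols = A.shape
regret_row, regret_col = np.zeros(n_rows), np.zeros(n_cols)
strat_sum_row, strat_sum_col = np.zeros(n_rows), np.zeros(n_cols)

T = 20000
for _ in range(T):
    p, q = normalize_positive(regret_row), normalize_positive(regret_col)
    i, j = rng.choice(n_rows, p=p), rng.choice(n_cols, p=q)
    # Regret of not having played each alternative pure action this round.
    regret_row += A[:, j] - A[i, j]
    regret_col += (-A[i, :]) - (-A[i, j])
    strat_sum_row += p
    strat_sum_col += q

print(strat_sum_row / T, strat_sum_col / T)  # both tend toward (1/3, 1/3, 1/3)
```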

Spotlight

Convex Learning with Invariances

Choon Hui Teo · Amir Globerson · Sam T Roweis · Alexander Smola
Dec 3, 8:10 PM - 8:25 PM

Incorporating invariances into a learning algorithm is a common problem in machine learning. We provide a convex formulation which can deal with arbitrary loss functions and arbitrary invariances. In addition, it is a drop-in replacement for most optimization algorithms for kernels, including solvers of the SVMStruct family. The advantage of our setting is that it relies on column generation instead of modifying the underlying optimization problem directly.

Spotlight

Discriminative Log-Linear Grammars with Latent Variables

Slav Petrov · Dan Klein
Dec 3, 8:10 PM - 8:25 PM

We demonstrate that log-linear grammars with latent variables can be practically trained using discriminative methods. Central to efficient discriminative training is a hierarchical pruning procedure which allows feature expectations to be efficiently approximated. We investigate an EM alternating procedure as well as a direct, gradient-based procedure and show that direct optimization is superior. On full-scale treebank parsing experiments, the discriminative latent models outperform both the comparable generative latent models as well as the discriminative non-latent baselines.

Spotlight

Bayesian Policy Learning with Trans-Dimensional MCMC

Matthew Hoffman · Arnaud Doucet · Nando de Freitas · Ajay Jasra
Dec 3, 8:10 PM - 8:25 PM

A recently proposed formulation of the stochastic planning and control problem as one of parameter estimation for suitable artificial statistical models has led to the adoption of inference algorithms for this notoriously hard problem. At the algorithmic level, the focus has been on developing Expectation-Maximization (EM) algorithms. In this paper, we begin by making the crucial observation that the stochastic control problem can be reinterpreted as one of trans-dimensional inference. With this new understanding, we are able to propose a novel reversible jump Markov chain Monte Carlo (MCMC) algorithm that is more efficient than its EM counterparts. Moreover, it enables us to carry out full Bayesian policy search, without the need for gradients and with one single Markov chain. The new approach involves sampling directly from a distribution that is proportional to the reward and, consequently, performs better than classic simulation methods in situations where the reward is a rare event.

Spotlight

Learning with Transformation Invariant Kernels

Christian Walder · Olivier Chapelle
Dec 3, 8:10 PM - 8:25 PM

This paper considers kernels invariant to translation, rotation and dilation. We show that no non-trivial positive definite (p.d.) kernels exist which are radial and dilation invariant, only conditionally positive definite (c.p.d.) ones. Accordingly, we discuss the c.p.d. case and provide some novel analysis, including an elementary derivation of a c.p.d. representer theorem. On the practical side, we give a support vector machine (s.v.m.) algorithm for arbitrary c.p.d. kernels. For the thin-plate kernel this leads to a classifier with only one parameter (the amount of regularisation), which we demonstrate to be as effective as an s.v.m. with the Gaussian kernel, even though the Gaussian involves a second parameter (the length scale).

Spotlight

Discovering Weakly-Interacting Factors in a Complex Stochastic Process

Charlie Frogner · Avi Pfeffer
Dec 3, 8:10 PM - 8:25 PM

Dynamic Bayesian networks are structured representations of stochastic processes. Despite their structure, exact inference in DBNs is generally intractable. One approach to approximate inference involves grouping the variables in the process into smaller factors and keeping independent beliefs over these factors. In this paper we present several techniques for decomposing a dynamic Bayesian network automatically to enable factored inference. We examine a number of features of a DBN that capture different types of dependencies that will cause error in factored inference. An empirical comparison shows that the most useful of these is a heuristic that estimates the mutual information introduced between factors by one step of belief propagation. In addition to features computed over entire factors, for efficiency we explored scores computed over pairs of variables. We present search methods that use these features, pairwise and not, to find a factorization, and compare their results on several datasets. Automatic factorization extends the applicability of factored inference to large, complex models that are undesirable to factor by hand. Moreover, tests on real DBNs show that automatic factorization can achieve significantly lower error in some cases.

Spotlight

Learning Monotonic Transformations for Classification

Andrew G Howard · Tony Jebara
Dec 3, 8:10 PM - 8:25 PM

A discriminative method is proposed for learning monotonic transformations of the training data jointly while estimating a large-margin classifier. Fixed monotonic transformations can be useful as a preprocessing step for many domains such as document classification, image histogram classification and gene microarray experiments. However, most classifiers only explore transformations through manual trial and error or via prior domain knowledge. The proposed method learns monotonic transformations automatically while training a large-margin classifier without any prior knowledge of the domain at hand. A monotonic piecewise linear function is learned which transforms data for subsequent processing by a linear hyperplane classifier. Two algorithmic implementations of the method are formalized. The first performs an alternating sequence of quadratic and linear programs to convergence until it obtains a locally optimal solution. A second algorithm is also provided using a convex semidefinite relaxation that overcomes initialization issues in the initial optimization problem. The effectiveness of these learned transformations on synthetic problems, text data and image data is demonstrated.

Spotlight

TrueSkill Through Time: Revisiting the History of Chess

Pierre Dangauthier · Ralf Herbrich · Tom Minka · Thore K Graepel
Dec 4, 9:50 AM - 10:00 AM

We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players' lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player's ability to force a draw provides significantly better predictive power.
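
As a rough illustration of the smoothing-versus-filtering distinction, the toy sketch below runs a Kalman filter and RTS smoother over a chain of yearly skill variables, collapsing each year's game evidence into a single hypothetical Gaussian observation. The actual system instead couples the yearly variables through game-outcome factors and runs EP, so this is only a stand-in; all numbers are made up.

```python
import numpy as np

def smooth_skills(obs_mean, obs_var, drift_var):
    """Toy Gaussian smoother for a yearly skill chain (random-walk dynamics).
    Each year's evidence is a single hypothetical Gaussian observation."""
    T = len(obs_mean)
    fm, fv = np.zeros(T), np.zeros(T)           # filtered means / variances
    m, v = obs_mean[0], obs_var[0]               # crude init from first year's evidence
    fm[0], fv[0] = m, v
    for t in range(1, T):
        m_pred, v_pred = m, v + drift_var        # skill drifts between years
        k = v_pred / (v_pred + obs_var[t])       # Kalman gain
        m = m_pred + k * (obs_mean[t] - m_pred)
        v = (1 - k) * v_pred
        fm[t], fv[t] = m, v
    sm, sv = fm.copy(), fv.copy()                # RTS backward smoothing pass
    for t in range(T - 2, -1, -1):
        c = fv[t] / (fv[t] + drift_var)
        sm[t] = fm[t] + c * (sm[t + 1] - fm[t])
        sv[t] = fv[t] + c ** 2 * (sv[t + 1] - (fv[t] + drift_var))
    return sm, sv

# Hypothetical per-year evidence for one player.
years_mean = np.array([20.0, 22.0, 25.0, 24.0, 28.0])
years_var = np.full(5, 4.0)
print(smooth_skills(years_mean, years_var, drift_var=1.0))
```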

Spotlight

Hidden Common Cause Relations in Relational Learning

Ricardo Silva · Wei Chu · Zoubin Ghahramani
Dec 4, 9:50 AM - 10:00 AM

When predicting class labels for objects within a relational database, it is often helpful to consider a model for relationships: this allows for information between class labels to be shared and to improve prediction performance. However, there are different ways by which objects can be related within a relational database. One traditional way corresponds to a Markov network structure: each existing relation is represented by an undirected edge. This encodes that, conditioned on input features, each object label is independent of other object labels given its neighbors in the graph. However, there is no reason why Markov networks should be the only representation of choice for symmetric dependence structures. Here we discuss the case when relationships are postulated to exist due to hidden common causes. We discuss how the resulting graphical model differs from Markov networks, and how it describes different types of real-world relational processes. A Bayesian nonparametric classification model is built upon this graphical representation and evaluated with several empirical studies.

Spotlight

Heterogeneous Component Analysis

Shigeyuki Oba · Motoaki Kawanabe · Klaus-Robert Müller · Shin Ishii
Dec 4, 9:50 AM - 10:00 AM

In bioinformatics it is often desirable to combine data from various measurement sources and thus structured feature vectors are to be analyzed that possess different intrinsic blocking characteristics (e.g., different patterns of missing values, observation noise levels, effective intrinsic dimensionalities). We propose a new machine learning tool, heterogeneous component analysis (HCA), for feature extraction in order to better understand the factors that underlie such complex structured heterogeneous data. HCA is a linear block-wise sparse Bayesian PCA based not only on a probabilistic model with block-wise residual variance terms but also on a Bayesian treatment of a block-wise sparse factor-loading matrix. We study various algorithms that implement our HCA concept extracting sparse heterogeneous structure by obtaining common components for the blocks and specific components within each block. Simulations on toy and bioinformatics data underline the usefulness of the proposed structured matrix factorization concept.

Spotlight

SpAM: Sparse Additive Models

Pradeep Ravikumar · Han Liu · John Lafferty · Larry Wasserman
Dec 4, 9:50 AM - 10:00 AM

We present a new class of models for high-dimensional nonparametric regression and classification called sparse additive models (SpAM). Our methods combine ideas from sparse linear modeling and additive nonparametric regression. We derive a method for fitting the models that is effective even when the number of covariates is larger than the sample size. A statistical analysis of the properties of SpAM is given together with empirical results on synthetic and real data, showing that SpAM can be effective in fitting sparse nonparametric models in high dimensional data.
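
The fitting procedure can be pictured as a sparse variant of backfitting: each component function is estimated by smoothing its partial residual and then shrunk as a group, so irrelevant covariates drop out entirely. The sketch below follows that outline with a simple Nadaraya-Watson smoother; the smoother, bandwidth, and regularization level are illustrative assumptions rather than the paper's exact estimator.

```python
import numpy as np

def kernel_smoother(x, r, bandwidth=0.2):
    """Nadaraya-Watson smoothing of residual r against covariate x."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / bandwidth) ** 2)
    return (w @ r) / w.sum(axis=1)

def sparse_backfit(X, y, lam=0.1, iters=20, bandwidth=0.2):
    """Sparse additive backfitting sketch: smooth each partial residual, then
    shrink the whole component (group soft-threshold) so that covariates
    carrying no signal are zeroed out."""
    n, p = X.shape
    F = np.zeros((n, p))                        # fitted component functions
    for _ in range(iters):
        for j in range(p):
            resid = y - y.mean() - F.sum(axis=1) + F[:, j]
            pj = kernel_smoother(X[:, j], resid, bandwidth)
            norm = np.sqrt(np.mean(pj ** 2))
            shrink = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
            F[:, j] = shrink * pj
            F[:, j] -= F[:, j].mean()           # keep each component centered
    return F

# Hypothetical data: only the first 2 of 10 covariates matter.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 10))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
F = sparse_backfit(X, y)
print(np.round(np.sqrt(np.mean(F ** 2, axis=0)), 3))  # most norms typically shrink to ~0
```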

Spotlight

Infinite State Bayes-Nets for Structured Domains

Max Welling · Ian Porteous · Evgeniy Bart
Dec 4, 9:50 AM - 10:00 AM

A general modeling framework is proposed that unifies nonparametric-Bayesian models, topic-models and Bayesian networks. This class of infinite state Bayes nets (ISBN) can be viewed as directed networks of `hierarchical Dirichlet processes' (HDPs) where the domain of the variables can be structured (e.g. words in documents or features in images). To model the structure and to share groups between them we use `cascades' of Dirichlet priors. We show that collapsed Gibbs sampling can be done efficiently in these models by leveraging the structure of the Bayes net and using the forward-filtering-backward-sampling algorithm for junction trees. Existing models, such as nested-DP, Pachinko allocation, mixed membership models etc. are described as ISBNs. Two experiments have been implemented to illustrate these ideas.

Spotlight

COFI RANK - Maximum Margin Matrix Factorization for Collaborative Ranking

Markus Weimer · Alexandros Karatzoglou · Quoc V Le · Alexander Smola
Dec 4, 9:50 AM - 10:00 AM

In this paper, we consider collaborative filtering as a ranking problem. We present a method which uses Maximum Margin Matrix Factorization and optimizes ranking instead of rating. We use structured output prediction to optimize for specific non-uniform ranking scores. Experimental results show that our method gives very good ranking scores and scales well on collaborative filtering tasks.

Spotlight

Density Estimation under Independent Similarly Distributed Sampling Assumptions

Tony Jebara · Yingbo Song · Kapil Thadani
Dec 4, 9:50 AM - 10:00 AM

A method is proposed for semiparametric estimation where parametric and nonparametric criteria are exploited in density estimation and unsupervised learning. This is accomplished by making sampling assumptions on a dataset that smoothly interpolate between the extreme of independently distributed (or {\em id}) sample data (as in nonparametric kernel density estimators) to the extreme of independent {\em identically} distributed (or {\em iid}) sample data. This article makes independent {\em similarly} distributed (or {\em isd}) sampling assumptions and interpolates between these two using a scalar parameter. The parameter controls a Bhattacharyya affinity penalty between pairs of distributions on samples. Surprisingly, the {\em isd} method maintains certain consistency and unimodality properties akin to maximum likelihood estimation. The proposed {\em isd} scheme is an alternative for handling nonstationarity in data without making drastic hidden variable assumptions which often make estimation difficult and laden with local optima. Experiments in density estimation on a variety of datasets confirm the superiority of {\em isd} over {\em iid} estimation, {\em id} estimation and mixture modeling.

Spotlight

Semi-Supervised Multitask Learning

Qiuhua Liu · Xuejun Liao · Lawrence Carin
Dec 4, 9:50 AM - 10:00 AM

A semi-supervised multitask learning (MTL) framework is presented, in which M parameterized semi-supervised classifiers, each associated with one of M partially labeled data manifolds, are learned jointly under the constraint of a soft-sharing prior imposed over the parameters of the classifiers. The unlabeled data are utilized by basing classifier learning on neighborhoods, induced by a Markov random walk over a graph representation of each manifold. Experimental results on real data sets demonstrate that semi-supervised MTL yields significant improvements in generalization performance over either semi-supervised single-task learning (STL) or supervised MTL.

Spotlight

Collective Inference on Markov Models for Modeling Bird Migration

Daniel Sheldon · M.A. Saleh Elmohamed · Dexter Kozen
Dec 4, 11:50 AM - 12:00 PM

We investigate a family of inference problems on Markov models, where many sample paths are drawn from a Markov chain and partial information is revealed to an observer who attempts to reconstruct the sample paths. We present algorithms and hardness results for several variants of this problem which arise by revealing different information to the observer and imposing different requirements for the reconstruction of sample paths. Our algorithms are analogous to the classical Viterbi algorithm for Hidden Markov Models, which finds the single most probable sample path given a sequence of observations. Our work is motivated by an important application in ecology: inferring bird migration paths from a large database of observations.

Spotlight

The Generalized FITC Approximation

Andrew Naish-Guzman · Sean B Holden
Dec 4, 11:50 AM - 12:00 PM

We present an efficient generalization of the sparse pseudo-input Gaussian process (SPGP) model developed by Snelson and Ghahramani (2005), applying it to binary classification problems. By taking advantage of the SPGP prior covariance structure, we derive a numerically stable algorithm with O(NM^2) training complexity---asymptotically the same as related sparse methods such as the informative vector machine (Lawrence et al. (2003)), but which more faithfully represents the posterior. We present experimental results for several benchmark problems showing that in many cases this allows an exceptional degree of sparsity without compromising accuracy. Following Snelson and Ghahramani, we locate pseudo-inputs by gradient ascent on the marginal likelihood, but exhibit occasions when this is likely to fail, for which we suggest alternative solutions.

Spotlight

Message Passing for Max-weight Independent Set

Sujay Sanghavi · Devavrat Shah · Alan S Willsky
Dec 4, 11:50 AM - 12:00 PM

We investigate the use of message-passing algorithms for the problem of finding the max-weight independent set (MWIS) in a graph. First, we study the performance of loopy max-product belief propagation. We show that, if it converges, the quality of the estimate is closely related to the tightness of an LP relaxation of the MWIS problem. We use this relationship to obtain sufficient conditions for correctness of the estimate. We then develop a modification of max-product -- one that converges to an optimal solution of the dual of the MWIS problem. We also develop a simple iterative algorithm for estimating the max-weight independent set from this dual solution. We show that the MWIS estimate obtained using these two algorithms in conjunction is correct when the graph is bipartite and the MWIS is unique. Finally, we show that any problem of MAP estimation for probability distributions over finite domains can be reduced to an MWIS problem. We believe this reduction will yield new insights and algorithms for MAP estimation.

Spotlight

Hierarchical Penalization

Marie Szafranski · Yves Grandvalet · Pierre Morizet-Mahoudeaux
Dec 4, 11:50 AM - 12:00 PM

This article presents hierarchical penalization, a generic framework for incorporating prior information in the fitting of statistical models, when the explicative variables are organized in a hierarchical structure. The penalizer, derived from an adaptive penalization formulation, is a convex functional that performs soft selection at the group level, and that shrinks variables within each group, to favor solutions with few leading terms in the final combination. The framework, originally derived for taking into account prior knowledge, is shown to be useful in kernel regression, when several parameters are used to model the influence of features.

Spotlight

Distributed Inference for Latent Dirichlet Allocation

David Newman · Arthur Asuncion · Padhraic Smyth · Max Welling
Dec 4, 11:50 AM - 12:00 PM

We investigate the problem of learning a widely-used latent-variable model -- the Latent Dirichlet Allocation (LDA) or topic model -- using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates---it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors---it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using three real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
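
The first (approximate) scheme is easy to picture: each processor runs collapsed Gibbs sampling on its own shard against a private copy of the global topic counts, and after each sweep the copies are reconciled by adding every processor's count increment back into the global table. The toy sketch below shows only that reconciliation step, with made-up count matrices; the local Gibbs sweep, hyperparameters, and convergence details are omitted.

```python
import numpy as np

def sync_counts(global_counts, local_counts_list):
    """Reconciliation in the spirit of the first scheme above: each processor
    started its sweep from a copy of `global_counts` and updated it locally;
    the new global table is the old one plus every processor's increment."""
    merged = global_counts.copy()
    for local in local_counts_list:
        merged += local - global_counts
    return merged

# Hypothetical word-topic count tables (V=4 words, K=2 topics, P=2 processors).
global_wt = np.array([[5, 1], [2, 4], [0, 3], [1, 1]], dtype=int)
local_0 = global_wt + np.array([[1, 0], [0, 2], [0, 0], [1, 0]])   # shard 0's sweep
local_1 = global_wt + np.array([[0, 1], [1, 0], [2, 0], [0, 0]])   # shard 1's sweep
print(sync_counts(global_wt, [local_0, local_1]))
```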

Spotlight

Collapsed Variational Inference for HDP

Yee Whye Teh · Kenichi Kurihara · Max Welling
Dec 4, 11:50 AM - 12:00 PM

A wide variety of Dirichlet-multinomial `topic' models have found interesting applications in recent years. While Gibbs sampling remains an important method of inference in such models, variational techniques have certain advantages such as easy assessment of convergence, easy optimization without the need to maintain detailed balance, a bound on the marginal likelihood, and side-stepping of issues with topic-identifiability. The most accurate variational technique thus far, namely collapsed variational LDA (CV-LDA)\cite{TehNewmanWelling06}, did not deal with model selection nor did it include inference for the hyper-parameters. We generalize their technique to the HDP to address these issues. The result is a collapsed variational inference technique that can add topics indefinitely, for instance through split and merge heuristics, while the algorithm will automatically remove clusters which are not supported by the data. Experiments show a very significant improvement in accuracy relative to CV-LDA.

Spotlight

Catching Up Faster in Bayesian Model Selection and Model Averaging

Tim van Erven · Peter Grünwald · Steven de Rooij
Dec 4, 11:50 AM - 12:00 PM

Bayesian model averaging, model selection and their approximations such as BIC are generally statistically consistent, but sometimes achieve slower rates of convergence than other methods such as AIC and leave-one-out cross-validation. On the other hand, these other methods can be inconsistent. We identify the "catch-up phenomenon" as a novel explanation for the slow convergence of Bayesian methods. Based on this analysis we define the switch-distribution, a modification of the Bayesian marginal distribution. We prove that in many situations model selection and prediction based on the switch-distribution is both consistent and achieves optimal convergence rates, thereby resolving the AIC/BIC dilemma. The method is practical; we give an efficient implementation.

Spotlight

Privacy-Preserving Belief Propagation and Sampling

Michael Kearns · Jinsong Tan · Jennifer Wortman Vaughan
Dec 4, 11:50 AM - 12:00 PM

We provide provably privacy-preserving versions of belief propagation, Gibbs sampling, and other local algorithms --- distributed multiparty protocols in which each party or vertex learns only its final local value, and absolutely nothing else.

Spotlight

Iterative Non-linear Dimensionality Reduction with Manifold Sculpting

Michael S Gashler · Dan Ventura · Tony Martinez
Dec 4, 3:20 PM - 3:30 PM

Many algorithms have been recently developed for reducing dimensionality by projecting data onto an intrinsic non-linear manifold. Unfortunately, existing algorithms often lose significant precision in this transformation. Manifold Sculpting is a new algorithm that iteratively reduces dimensionality by simulating surface tension in local neighborhoods. We present several experiments that show Manifold Sculpting yields more accurate results than existing algorithms with both generated and natural data-sets. Manifold Sculpting is also able to benefit from both prior dimensionality reduction efforts.

Spotlight

McRank: Learning to Rank Using Multiple Classification and Gradient Boosting

Ping Li · Chris J Burges · Qiang Wu
Dec 4, 3:20 PM - 3:30 PM

We cast the ranking problem as (1) multiple classification (``Mc'') and (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the {\em Expected Relevance} to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve {\em LambdaRank}\cite{Proc:BurgesNIPS06} and the regression-based ranker\cite{Proc:ZhangCOLT06}, in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.
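
The conversion from class probabilities to ranking scores is simple to state: score each document by its expected relevance grade under the predicted class distribution, sort by that score, and evaluate with DCG. A minimal sketch with hypothetical probabilities and grades (not the paper's boosting-tree learner):

```python
import numpy as np

def expected_relevance(class_probs, grades):
    """Ranking score = sum_k grade_k * P(relevance grade = k | document)."""
    return np.asarray(class_probs) @ np.asarray(grades, dtype=float)

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    rel = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    return np.sum((2.0 ** rel - 1.0) * discounts)

# Hypothetical: 4 documents, relevance grades 0..2, predicted P(grade | doc).
probs = np.array([[0.10, 0.30, 0.60],
                  [0.70, 0.20, 0.10],
                  [0.20, 0.50, 0.30],
                  [0.05, 0.15, 0.80]])
true_grades = np.array([2, 0, 1, 2])
scores = expected_relevance(probs, grades=[0, 1, 2])
order = np.argsort(-scores)                     # rank by expected relevance
print(dcg(true_grades[order]), dcg(sorted(true_grades, reverse=True)))  # DCG vs ideal DCG
```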

Spotlight

Consistent Minimization of Clustering Objective Functions

Ulrike von Luxburg · Sebastien Bubeck · Stefanie S Jegelka · Michael Kaufmann
Dec 4, 3:20 PM - 3:30 PM

Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of ``nearest neighbor clustering''. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound.
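
One way to make the restricted function class concrete: choose a small set of seed points, consider only partitions obtained by labeling the seeds and assigning every other point to its nearest seed's label, and return the labeling that minimizes the clustering objective. The toy sketch below does this exhaustively for two clusters with a within-cluster sum-of-squares objective; the seed count, the objective, and the brute-force search (in place of branch and bound) are simplifications.

```python
import itertools
import numpy as np

def nearest_neighbor_clustering(X, n_seeds=8, seed=0):
    """Search only over partitions induced by labelings of a few seed points;
    every other point inherits the label of its nearest seed."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=n_seeds, replace=False)]
    nearest = np.argmin(((X[:, None, :] - seeds[None, :, :]) ** 2).sum(-1), axis=1)
    best_labels, best_cost = None, np.inf
    for seed_labels in itertools.product([0, 1], repeat=n_seeds):
        labels = np.asarray(seed_labels)[nearest]
        if len(np.unique(labels)) < 2:
            continue
        cost = sum(((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
                   for c in (0, 1))             # within-cluster sum of squares
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

# Hypothetical two-blob data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
labels, cost = nearest_neighbor_clustering(X)
print(cost, np.bincount(labels))
```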

Spotlight

Boosting Algorithms for Maximizing the Soft Margin

Manfred K. Warmuth · Karen Glocer · Gunnar Rätsch
Dec 4, 3:20 PM - 3:30 PM
We present a novel boosting algorithm, called Softboost, designed for sets of binary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm aims to achieve robustness by \emph{capping} the distributions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem and our algorithm produces a convex combination of hypotheses whose \emph{soft margin} is within $\delta$ of the optimum. We employ relative entropy projection methods to prove an $O(\frac{\ln N}{\delta^2})$ iteration bound for our algorithm, where $N$ is the number of examples.
Spotlight

Classification via Minimum Incremental Coding Length (MICL)

John Wright · Yangyu Tao · Zhouchen Lin · Yi Ma · Heung-Yeung Shum
Dec 4, 3:20 PM - 3:30 PM

We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We rigorously prove asymptotic optimality of this criterion for Gaussian (normal) distributions and analyze its relationships to classical classifiers. The theoretical results provide new insights into the relationships among a variety of popular classifiers such as MAP, RDA, k-NN, and SVM, as well as unsupervised methods based on lossy coding \cite{Ma2007-PAMI}. Our formulation induces several good effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small sample setting. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. All MATLAB code and classification results will be made publicly available for peer evaluation.

Spotlight

Random Features for Large-Scale Kernel Machines

ali rahimi · Benjamin Recht
Dec 4, 3:20 PM - 3:30 PM

To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
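
The construction for the Gaussian kernel is short enough to sketch: draw random projection directions from the kernel's Fourier spectrum, apply a cosine feature map, and inner products of the mapped points approximate the kernel value. A minimal sketch with an illustrative bandwidth and feature count:

```python
import numpy as np

def random_fourier_features(X, n_features=500, gamma=1.0, seed=0):
    """Map X so that dot products approximate the Gaussian (RBF) kernel
    exp(-gamma * ||x - y||^2); directions are drawn from the kernel's spectrum."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Hypothetical check: compare the approximate and exact kernel on two points.
rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 5))
gamma = 0.1
zx = random_fourier_features(x[None, :], n_features=5000, gamma=gamma)
zy = random_fourier_features(y[None, :], n_features=5000, gamma=gamma)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
print((zx @ zy.T).item(), exact)    # the two numbers should be close
```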

Spotlight

Bundle Methods for Machine Learning

Alexander Smola · Vishwanathan S V N · Quoc V Le
Dec 4, 3:20 PM - 3:30 PM
We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in $O(1/\epsilon)$ steps to $\epsilon$ precision for general convex problems and in $O(\log(1/\epsilon))$ steps for continuously differentiable problems. We demonstrate in experiments the performance of our approach.
Spotlight

Anytime Induction of Cost-sensitive Trees

Saher Esmeir · Shaul Markovitch
Dec 4, 3:20 PM - 3:30 PM

Machine learning techniques are increasingly being used to produce a wide range of classifiers for complex real-world applications that involve nonuniform testing costs and misclassification costs. As the complexity of these applications grows, the management of resources during the learning and classification processes becomes a challenging task. In this work we introduce ACT (Anytime Cost-sensitive Trees), a novel framework for operating in such environments. ACT is an anytime algorithm that allows trading computation time for lower classification costs. It builds a tree top-down and exploits additional time resources to obtain better estimations for the utility of the different candidate splits. Using sampling techniques ACT approximates for each candidate split the cost of the subtree under it and favors the one with a minimal cost. Due to its stochastic nature ACT is expected to be able to escape local minima, into which greedy methods may be trapped. Experiments with a variety of datasets were conducted to compare the performance of ACT to that of state-of-the-art cost-sensitive tree learners. The results show that for most domains ACT produces trees of significantly lower costs. ACT is also shown to exhibit good anytime behavior with diminishing returns.

Spotlight

A Kernel Statistical Test of Independence

Arthur Gretton · Kenji Fukumizu · Choon Hui Teo · Le Song · Bernhard Schölkopf · Alexander Smola
Dec 4, 3:20 PM - 3:30 PM

Whereas kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2), where m is the sample size. We demonstrate that this test outperforms established contingency table-based tests. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
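
A sketch of the test statistic: with Gaussian kernels on the two samples, the (biased) empirical HSIC is trace(KHLH) scaled by the sample size, where H is the centering matrix. The snippet below computes it and, as a simple stand-in for the paper's asymptotic null distribution, calibrates the threshold by permutation; kernel widths and sample sizes are illustrative.

```python
import numpy as np

def rbf_gram(x, sigma=1.0):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC (one common normalization): trace(K H L H) / m^2."""
    m = len(x)
    H = np.eye(m) - np.ones((m, m)) / m
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    return np.trace(K @ H @ L @ H) / m ** 2

def permutation_test(x, y, n_perm=200, seed=0):
    """Null distribution by permuting y; only a simple stand-in for the
    asymptotic threshold derived in the paper."""
    rng = np.random.default_rng(seed)
    stat = hsic(x, y)
    null = np.array([hsic(x, rng.permutation(y)) for _ in range(n_perm)])
    return stat, float(np.mean(null >= stat))   # statistic, estimated p-value

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y_dep = x + 0.5 * rng.normal(size=200)          # dependent pair
y_ind = rng.normal(size=200)                    # independent pair
print(permutation_test(x, y_dep))               # small p-value expected
print(permutation_test(x, y_ind))               # large p-value expected
```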

Spotlight

Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

Gerald Tesauro · Rajarshi Das · Hoi Chan · Jeffrey O Kephart · David Levine · Freeman Rawson · Charles Lefurgy
Dec 4, 5:20 PM - 5:30 PM
Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO$_2$ emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers' firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious ``cookbook'' RL implementations.
Spotlight

Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion

Zico Kolter · Pieter Abbeel · Andrew Y Ng
Dec 4, 5:20 PM - 5:30 PM

We consider apprenticeship learning --- learning from expert demonstrations --- in the setting of large, complex domains. Past work in apprenticeship learning requires that the expert demonstrate complete trajectories through the domain, but in many problems where even an expert has difficulty controlling the system, this is infeasible. For example, consider the task of teaching a quadruped robot to navigate over extreme terrain; demonstrating an optimal policy (i.e., an optimal set of foot locations over the entire terrain) is a highly non-trivial task, even for an expert. In this paper we propose a method for hierarchical apprenticeship learning, which allows the algorithm to accept isolated advice at different hierarchical levels of the control task. This type of advice is often feasible for experts to give, even if the expert is unable to demonstrate full trajectories. This thus allows us to extend the apprenticeship learning paradigm to much larger, more challenging domains. In particular, in this paper we apply the hierarchical apprenticeship learning algorithm to the task of quadruped locomotion over extreme terrain, and achieve, to the best of our knowledge, results superior to any previously published work.

Spotlight

What makes some POMDP problems easy to approximate?

David Hsu · Wee Sun Lee · Nan Rong
Dec 4, 5:20 PM - 5:30 PM

Point-based algorithms have been surprisingly successful in computing approximately optimal policies for partially observable Markov decision processes (POMDPs) in high dimensional belief spaces. In this work, we seek to understand the belief-space properties that allow some POMDP problems to be approximated efficiently and thus help to explain the point-based algorithms' success often observed in the experiments. We show that an approximately optimal POMDP solution can be computed in time polynomial in the covering number of a reachable belief space, the subset of the belief space reachable from a given belief point. We also show that under the weaker condition of having a small covering number for an optimal reachable space, the subset of the belief space reachable under an optimal policy, computing an approximately optimal solution is NP-hard. However, given a set of points from an optimal reachable space that covers it well, an approximate solution can be computed in polynomial time. The covering number highlights several interesting properties that help reduce the complexity of POMDP problems in practice, such as fully observed state variables, beliefs with sparse support, smooth beliefs, and circulant state-transition matrices.

Spotlight

Bayes-Adaptive POMDPs

Stephane Ross · Brahim Chaib-draa · Joelle Pineau
Dec 4, 5:20 PM - 5:30 PM

Bayesian Reinforcement Learning has generated substantial interest recently, as it provides an elegant solution to the exploration-exploitation trade-off in reinforcement learning. However most investigations of Bayesian reinforcement learning to date focus on the standard Markov Decision Processes (MDPs). Our goal is to extend these ideas to the more general Partially Observable MDP (POMDP) framework, where the state is a hidden variable. To address this problem, we introduce a new mathematical model, the Bayes-Adaptive POMDP. This new model allows one to (1) improve knowledge of the POMDP domain through interaction with the environment, and (2) plan optimal sequences of actions which can trade-off between improving the model, identifying the state, and gathering reward. We show how the model can be finitely approximated while preserving the value function. We describe approximations for belief tracking and planning in this model. Empirical results on two domains show that the model estimate and agent's return improve over time, as the agent learns better model estimates.

Spotlight

Incremental Natural Actor-Critic Algorithms

Shalabh Bhatnagar · Richard Sutton · Mohammad Ghavamzadeh · Mark P Lee
Dec 4, 5:20 PM - 5:30 PM

We present four new reinforcement learning algorithms based on actor-critic and natural-gradient ideas, and provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradient in this way are of special interest because of their compatibility with function approximation methods, which are needed to handle large or infinite state spaces, and the use of temporal difference learning in this way is of interest because in many applications it dramatically reduces the variance of the policy gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda et al.\ by using temporal difference learning in the actor and by incorporating natural gradients, and extend prior empirical studies of natural-gradient actor-critic methods by Peters et al.\ by providing the first convergence proofs and the first fully incremental algorithms.

Spotlight

A general agnostic active learning algorithm

Sanjoy Dasgupta · Daniel Hsu · Claire Monteleoni
Dec 4, 5:20 PM - 5:30 PM

We present an agnostic active learning algorithm for any hypothesis class of bounded VC dimension under arbitrary data distributions. Most previous work on active learning either makes strong distributional assumptions, or else is computationally prohibitive. Our algorithm extends the simple scheme of Cohn, Atlas, and Ladner to the agnostic setting, using reductions to supervised learning that harness generalization bounds in a simple but subtle manner. We provide a fall-back guarantee that bounds the algorithm's label complexity by the agnostic PAC sample complexity. Our analysis yields asymptotic label complexity improvements for certain hypothesis classes and distributions. We also demonstrate improvements experimentally.

Spotlight

Stable Dual Dynamic Programming

Tao Wang · Daniel Lizotte · Michael Bowling · Dale Schuurmans
Dec 4, 5:20 PM - 5:30 PM

Recently, a novel approach to dynamic programming and reinforcement learning has been proposed based on maintaining explicit representations of stationary distributions instead of value functions. The convergence properties and practical effectiveness of these algorithms have not been previously studied however. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.

Spotlight

Reinforcement Learning in Continuous Action Spaces through Sequential Monte Carlo Methods

Alessandro Lazaric · Marcello Restelli · Andrea Bonarini
Dec 4, 5:20 PM - 5:30 PM

Learning in real-world domains often requires dealing with continuous state and action spaces. Although many solutions have been proposed to apply Reinforcement Learning algorithms to continuous state problems, the same techniques can hardly be extended to continuous action spaces, where, besides the computation of a good approximation of the value function, a fast method for the identification of the highest-valued action is needed. In this paper, we propose a novel actor-critic approach in which the policy of the actor is estimated through sequential Monte Carlo methods. The importance sampling step is performed on the basis of the values learned by the critic, while the resampling step modifies the actor's policy. The proposed approach has been empirically compared to other learning algorithms in several domains; in this paper, we report results obtained in a control problem consisting of steering a boat across a river.

Spotlight

Selecting Observations against Adversarial Objectives

Andreas Krause · H. Brendan McMahan · Carlos Guestrin · Anupam Gupta
Dec 4, 5:20 PM - 5:30 PM

In many applications, one has to actively select among a set of expensive observations before making an informed decision. Often, we want to select observations which perform well when evaluated with an objective function chosen by an adversary. Examples include minimizing the maximum posterior variance in Gaussian Process regression, robust experimental design, and sensor placement for outbreak detection. In this paper, we present the Submodular Saturation algorithm, a simple and efficient algorithm with strong theoretical approximation guarantees for the case where the possible objective functions exhibit submodularity, an intuitive diminishing returns property. Moreover, we prove that better approximation algorithms do not exist unless NP-complete problems admit efficient algorithms. We evaluate our algorithm on several real-world problems. For Gaussian Process regression, our algorithm compares favorably with state-of-the-art heuristics described in the geostatistics literature, while being simpler, faster and providing theoretical guarantees. For robust experimental design, our algorithm performs well compared to SDP-based algorithms.

Spotlight

Agreement-Based Learning

Percy Liang · Dan Klein · Michael Jordan
Dec 5, 9:50 AM - 10:00 AM

The learning of probabilistic models with many hidden variables and non-decomposable dependencies is an important but challenging problem. In contrast to traditional approaches based on approximate inference in a single intractable model, our approach is to train a set of tractable component models by encouraging them to agree on the hidden variables. This allows us to capture non-decomposable aspects of the data while still maintaining tractability. We exhibit an objective function for our approach, derive EM-style algorithms for parameter estimation, and demonstrate their effectiveness on three challenging real-world learning tasks.

Spotlight

Cooled and Relaxed Survey Propagation for MRFs

Hai Leong Chieu · Wee Sun Lee · Yee Whye Teh
Dec 5, 9:50 AM - 10:00 AM

We describe a new algorithm, Relaxed Survey Propagation (RSP), for finding MAP configurations in Markov random fields. We compare its performance with state-of-the-art algorithms including the max-product belief propagation, its sequential tree-reweighted variant, residual (sum-product) belief propagation, and tree-structured expectation propagation. We show that it outperforms all approaches for Ising models with mixed couplings, as well as on a web person disambiguation task formulated as a supervised clustering problem.

Spotlight

Multi-task Gaussian Process Prediction

Edwin Bonilla · Kian Ming A Chai · Chris Williams
Dec 5, 9:50 AM - 10:00 AM

In this paper we investigate multi-task learning in the context of Gaussian Processes (GP). We propose a model that learns a shared covariance function on input-dependent features and a ``free-form'' covariance matrix over tasks. This allows for good flexibility when modelling inter-task dependencies while avoiding the need for large amounts of data for training. We show that under the assumption of noise-free observations and block design, predictions for a given task only depend on its target values and therefore a cancellation of inter-task transfer occurs. We evaluate the benefits of our model on two practical applications: a compiler performance prediction problem and an exam score prediction task. Additionally, we make use of GP approximations and properties of our model in order to provide scalability to large data sets.
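
The prior can be sketched directly: the covariance over all (task, input) pairs is the Kronecker product of a free-form task covariance with an input covariance, after which prediction is ordinary GP regression on the stacked targets. In the toy sketch below the task covariance is fixed by hand rather than learned, there are no missing observations, and none of the paper's approximations are used.

```python
import numpy as np

def rbf(X1, X2, lengthscale=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

# Hypothetical training inputs shared by T=2 tasks, with noisy targets.
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)[:, None]
f = np.sin(X[:, 0])
Y = np.stack([f, 0.8 * f + 0.3], axis=0) + 0.1 * rng.normal(size=(2, 20))

Kf = np.array([[1.0, 0.8],         # free-form task covariance (fixed by hand here;
               [0.8, 1.0]])        # in the model it is learned from data)
Kx = rbf(X, X)
noise = 0.01
K = np.kron(Kf, Kx) + noise * np.eye(2 * len(X))   # joint covariance over (task, input)

# Predict task 0 at new inputs using observations from *both* tasks.
Xs = np.linspace(0, 5, 50)[:, None]
k_star = np.kron(Kf[0], rbf(Xs, X))                 # cross-covariance rows for task 0
mean = k_star @ np.linalg.solve(K, Y.reshape(-1))
print(mean[:5])
```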

Spotlight

Efficient Bayesian Inference for Dynamically Changing Graphs

Ozgur Sumer · Umut A Acar · Alexander T. Ihler · Ramgopal R. Mettu
Dec 5, 9:50 AM - 10:00 AM

Motivated by stochastic systems in which observed evidence and conditional dependencies between states of the network change over time, and certain quantities of interest (marginal distributions, likelihood estimates etc.) must be updated, we study the problem of dynamic inference in tree-structured Bayesian networks. We describe an algorithm for dynamic inference that handles a broad range of changes to the network and is able to maintain marginal distributions, MAP estimates, and data likelihoods, all in expected logarithmic time. We give an implementation of our algorithm and provide experiments that show that the algorithm can yield up to two orders of magnitude speedups on answering queries and responding to dynamic changes over the sum-product algorithm.

Spotlight

Augmented Functional Time Series Representation and Forecasting with Gaussian Processes

Nicolas Chapados · Yoshua Bengio
Dec 5, 9:50 AM - 10:00 AM

We introduce a functional representation of time series which allows forecasts to be performed over an unspecified horizon with progressively-revealed information sets. By virtue of using Gaussian processes, a complete covariance matrix between forecasts at several time-steps is available. This information is put to use in an application to actively trade price spreads between commodity futures contracts. The approach delivers impressive out-of-sample risk-adjusted returns after transaction costs on a portfolio of 30 spreads.
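
A practical payoff of the GP representation is that forecasts at several future steps come with a full joint covariance rather than independent error bars. The toy sketch below shows that computation on a scalar series with a standard RBF kernel; the kernel, hyperparameters, and data are placeholders, and the paper's functional representation and trading application are not reproduced.

```python
import numpy as np

def rbf(a, b, lengthscale=2.0, variance=1.0):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)

# Hypothetical observed series values at times 0..19.
t_obs = np.arange(20.0)
y_obs = np.sin(0.4 * t_obs) + 0.05 * np.random.default_rng(0).normal(size=20)

t_fut = np.arange(20.0, 26.0)                   # six future horizons
K = rbf(t_obs, t_obs) + 0.05 ** 2 * np.eye(20)  # noisy training covariance
Ks = rbf(t_fut, t_obs)
Kss = rbf(t_fut, t_fut)

alpha = np.linalg.solve(K, y_obs)
mean = Ks @ alpha                               # forecast means at all horizons
cov = Kss - Ks @ np.linalg.solve(K, Ks.T)       # full covariance across horizons
print(np.round(mean, 2))
print(np.round(cov, 3))                         # off-diagonals couple the horizons
```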

Spotlight

Bayesian Multi-View Learning

Shipeng Yu · Balaji R Krishnapuram · Romer E Rosales · Harald Steck · R. Bharat Rao
Dec 5, 9:50 AM - 10:00 AM

We propose a Bayesian undirected graphical model for semi-supervised multi-view learning (BMVL). This model makes explicit the previously unstated assumptions of a large class of semi-supervised multi-view learning algorithms, and clarifies the circumstances under which these assumptions fail. Building upon new insights from this model, we propose an improved method for BMVL: in particular, we derive a novel co-training kernel for Gaussian Process classifiers. Unlike some previous methods for multi-view learning, the resulting approach is convex and avoids local-maxima problems. Further, it automatically estimates how much each view should be trusted, and thus accommodates noisy or unreliable views. Experiments show that the approach is more accurate than previous multi-view learning algorithms.

Spotlight

An Analysis of Inference with the Universum

Fabian H Sinz · Olivier Chapelle · Alekh Agarwal · Bernhard Schölkopf
Dec 5, 9:50 AM - 10:00 AM

We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results.
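For orientation, the algorithm of Vapnik and coworkers is usually written as an SVM-style objective with an additional $\varepsilon$-insensitive penalty that encourages the decision function to stay near zero on the Universum points; the formulation below is the standard one from the literature, reproduced for context rather than taken from this abstract:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \;+\; C \sum_{i=1}^{n} \max\bigl(0,\, 1 - y_i(w^{\top}x_i + b)\bigr) \;+\; C_U \sum_{j=1}^{m} \max\bigl(0,\, |w^{\top}x^{U}_{j} + b| - \varepsilon\bigr).$$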

Spotlight

Simulated Annealing: Rigorous finite-time guarantees for optimization on continuous domains

Andrea Lecchini-Visintini · John Lygeros · Jan Maciejowski
Dec 5, 9:50 AM - 10:00 AM

Simulated annealing is a popular method for approaching the solution of a global optimization problem. Existing results on its performance apply to discrete combinatorial optimization where the optimization variables can assume only a finite set of possible values. We introduce a new general formulation of simulated annealing which allows one to guarantee finite-time performance in the optimization of functions of continuous variables. The results hold universally for any optimization problem on a bounded domain and establish a connection between simulated annealing and up-to-date theory of convergence of Markov chain Monte Carlo methods on continuous domains. This work is inspired by the concept of finite-time learning with known accuracy and confidence developed in statistical learning theory.
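For readers unfamiliar with the baseline algorithm, here is a minimal sketch of generic simulated annealing on a bounded continuous domain (a textbook Metropolis-style scheme with an illustrative cooling schedule; it is not the paper's formulation and carries no finite-time guarantee by itself).

import numpy as np

def simulated_annealing(f, lo, hi, n_iter=10000, step=0.1, seed=0):
    """Minimize f over the box [lo, hi]^d with a simple annealing schedule."""
    rng = np.random.default_rng(seed)
    d = len(lo)
    x = rng.uniform(lo, hi)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    for k in range(1, n_iter + 1):
        T = 1.0 / np.log(k + 1)                        # logarithmic cooling (illustrative)
        prop = np.clip(x + step * rng.standard_normal(d), lo, hi)
        fp = f(prop)
        if fp < fx or rng.random() < np.exp(-(fp - fx) / T):
            x, fx = prop, fp
            if fx < best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f

# Example: a multimodal test function (Himmelblau) on the box [-5, 5]^2.
f = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2
x_best, f_best = simulated_annealing(f, np.array([-5.0, -5.0]), np.array([5.0, 5.0]))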

Spotlight

Expectation Maximization, Posterior Constraints, and Statistical Alignment

Kuzman Ganchev · Joao V Graca · Ben Taskar
Dec 5, 9:50 AM - 10:00 AM

The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are hidden. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have the intended meaning for our data, and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is often very indirect or difficult to add even simple a priori information about latent variables without making models overly complex or intractable. In this paper, we present an efficient, principled way to inject constraints on the posteriors of latent variables into the EM algorithm. Our method can be viewed as a regularization of the posteriors of hidden variables, or alternatively as a restriction on the types of lower bounds used for maximizing data likelihood. Focusing on the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve performance over standard baselines and be competitive with more complex, intractable models.
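To illustrate the flavor of a constrained E-step (for one discrete latent variable and a single expectation constraint, which is far simpler than the statistical alignment setting of the paper), a common device is to replace the posterior p(z|x) by its KL-projection onto the set {q : E_q[f(z)] <= b}; that projection has the exponentially tilted form q(z) proportional to p(z|x) exp(-lambda f(z)) with lambda >= 0 found by a one-dimensional search. The sketch below is a minimal, hedged illustration of that idea; all names are assumptions.

import numpy as np

def constrained_posterior(post, f, b, lam_max=100.0, tol=1e-8):
    """KL-project a discrete posterior onto {q : E_q[f] <= b}.

    The projection is q proportional to post * exp(-lam * f), with lam >= 0
    chosen by bisection; E_q[f] is monotone decreasing in lam, so bisection
    finds the value at which the constraint is met.
    """
    def tilt(lam):
        q = post * np.exp(-lam * f)
        return q / q.sum()

    if (post * f).sum() <= b:          # constraint already satisfied
        return post
    lo, hi = 0.0, lam_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (tilt(mid) * f).sum() > b:
            lo = mid
        else:
            hi = mid
    return tilt(hi)

# Example: force the posterior probability of latent state 0 to be at most 0.3.
post = np.array([0.7, 0.2, 0.1])
f = np.array([1.0, 0.0, 0.0])          # f(z) = 1 if z == 0 else 0
q = constrained_posterior(post, f, b=0.3)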

Spotlight

Combined discriminative and generative articulated pose and non-rigid shape estimation

Leonid Sigal · Alexandru Balan · Michael J Black
Dec 5, 9:50 AM - 10:00 AM

Estimation of three-dimensional articulated human pose and motion from images is a central problem in computer vision. Much of the previous work has been limited by the use of crude generative models of humans represented as articulated collections of simple parts such as cylinders. Automatic initialization of such models has proved difficult, and most approaches assume that the size and shape of the body parts are known a priori. In this paper we propose a method for automatically recovering a detailed parametric model of non-rigid body shape and pose from monocular imagery. Specifically, we represent the body using a parameterized triangulated mesh model that is learned from a database of human range scans. We demonstrate a discriminative method to directly recover the model parameters from monocular images using a mixture of regressors. The predicted pose and shape are used to initialize a generative model for more detailed pose and shape estimation. The resulting approach allows fully automatic pose and shape recovery from monocular and multi-camera imagery. Experimental results show that our method is capable of robustly recovering articulated pose, shape and biometric measurements (e.g. height, weight, etc.) in both calibrated and uncalibrated camera environments.

Spotlight

Experience-Guided Search: A Theory of Attentional Control

Michael Mozer · David Baldwin
Dec 5, 11:50 AM - 12:00 PM

People perform a remarkable range of tasks that require search of the visual environment for a target item among distractors. The Guided Search model (Wolfe, 1994, 2007), or GS, is perhaps the best developed psychological account of human visual search. To prioritize search, GS assigns saliency to locations in the visual field. Saliency is a linear combination of activations from retinotopic maps representing primitive visual features. GS includes heuristics for setting the gain coefficient associated with each map. Variants of GS have formalized the notion of optimization as a principle of attentional control (e.g., Baldwin \& Mozer, 2006; Navalpakkam \& Itti, 2006; Rao et al., 2002), but every GS-like model must be 'dumbed down' to match human data, e.g., by corrupting the saliency map with noise and by imposing arbitrary restrictions on gain modulation. We propose a principled probabilistic formulation of GS, called Experience-Guided Search (EGS), based on a generative model of the environment that makes three claims: (1) Feature detectors produce Poisson spike trains whose rates are conditioned on feature type and whether the feature belongs to a target or distractor; (2) the environment and/or task is nonstationary and can change over a sequence of trials; and (3) a prior specifies that features are more likely to be present for target than for distractors. Through experience, EGS infers latent environment variables that determine the gains for guiding search. Control is thus cast as probabilistic inference, not optimization. We show that EGS can replicate a range of human data from visual search, including data that GS does not address.

Spotlight

Ensemble Clustering using Semidefinite Programming

Vikas Singh · Lopamudra Mukherjee · Jiming Peng · Jinhui Xu
Dec 5, 11:50 AM - 12:00 PM
We consider the ensemble clustering problem where the task is to `aggregate' multiple clustering solutions into a single consolidated clustering that maximizes the shared information among given clustering solutions. We obtain several new results for this problem. First, we note that the notion of agreement under such circumstances can be better captured using an agreement measure based on a $2D$ string encoding rather than the voting-strategy-based methods proposed in the literature. Using this generalization, we derive a nonlinear optimization model to maximize the new agreement measure. We then show that our optimization problem can be transformed into a strict $0$-$1$ Semidefinite Program (SDP) via novel convexification techniques, which can subsequently be relaxed to a polynomial-time solvable SDP. Our experiments indicate improvements not only in terms of the proposed agreement measure but also in terms of existing agreement measures based on voting strategies. We discuss extensive evaluations of the algorithm on clustering and image segmentation databases.
Spotlight

How SVMs can estimate quantiles and the median

Andreas Christmann · Ingo Steinwart
Dec 5, 11:50 AM - 12:00 PM

We investigate kernel-based quantile regression based on the pinball loss and support vector regression based on the $\epsilon$-insensitive loss. Conditions are given which guarantee that the set of exact minimizers contains only one function. Some results about oracle inequalities and learning rates of these methods are presented.
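For concreteness, the pinball loss that underlies kernel quantile regression is easy to state and check numerically: minimizing the empirical pinball loss over a constant predictor recovers the empirical $\tau$-quantile of the sample. The short sketch below illustrates this property; it is an illustration only, not the paper's estimator.

import numpy as np

def pinball_loss(residual, tau):
    """Pinball (quantile) loss for quantile level tau in (0, 1)."""
    return np.where(residual >= 0, tau * residual, (tau - 1) * residual)

# Minimizing the average pinball loss over a constant c recovers the empirical tau-quantile.
rng = np.random.default_rng(0)
y = rng.standard_normal(1000)
tau = 0.9
grid = np.linspace(y.min(), y.max(), 2001)
losses = [pinball_loss(y - c, tau).mean() for c in grid]
c_star = grid[int(np.argmin(losses))]
assert abs(c_star - np.quantile(y, tau)) < 0.05   # agrees up to the grid resolution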

Spotlight

Transfer Learning using Kolmogorov Complexity: Basic Theory and Empirical Evaluations

M. M. H Mahmud · Sylvian Ray
Dec 5, 11:50 AM - 12:00 PM
In transfer learning we aim to solve new problems quicker by using information gained from solving related problems. Transfer learning has been successful in practice, and extensive PAC analysis of these methods has been developed. However, it is not yet clear how to define relatedness between tasks. This is a major problem: aside from being conceptually troubling, it leaves unclear how much information to transfer, and when and how to transfer it. In this paper we propose to measure the amount of information one task contains about another using conditional Kolmogorov complexity between the tasks. We show how existing theory neatly solves the problem of measuring relatedness and transferring the `right' amount of information in sequential transfer learning in a Bayesian setting. The theory also suggests that, in a very formal and precise sense, no other transfer method can do much better than the Kolmogorov-complexity-theoretic transfer method, and that sequential transfer is always justified. We also develop a practical approximation to the method and use it to transfer information between $9$ arbitrarily chosen databases from the UCI ML repository.
Spotlight

Better than least squares: comparison of objective functions for estimating linear-nonlinear models

Tatyana Sharpee
Dec 5, 11:50 AM - 12:00 PM

This paper compares a family of methods for characterizing neural feature selectivity with natural stimuli in the framework of the linear-nonlinear model. In this model, the neural firing rate is a nonlinear function of a small number of relevant stimulus components. The relevant stimulus dimensions can be found by maximizing one of the family of objective functions, R\'enyi divergences of different orders. We show that maximizing one of them, the R\'enyi divergence of order 2, is equivalent to least-squares fitting of the linear-nonlinear model to neural data. Next, we derive reconstruction errors in relevant dimensions found by maximizing R\'enyi divergences of arbitrary order in the asymptotic limit of large spike numbers. We find that the smallest errors are obtained with the R\'enyi divergence of order 1, also known as the Kullback-Leibler divergence. This corresponds to finding relevant dimensions by maximizing mutual information. Finally, we numerically test how these optimization schemes perform in the regime of low signal-to-noise ratio (small number of spikes and increasing neural noise) for model visual neurons. We find that optimization schemes based on either least-squares fitting or information maximization perform well even when the number of spikes is small. Information maximization provides slightly, but significantly, better reconstructions than least-squares fitting. This makes the problem of finding relevant dimensions one of the examples where information-theoretic measures are no more data limited than those derived from least squares.
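For reference, the objective family referred to here is based on the standard R\'enyi divergence of order $\alpha$ between two densities $p$ and $q$, with the Kullback-Leibler divergence recovered in the limit $\alpha \to 1$ (the textbook definition, given for orientation, not a formula taken from the paper):

$$D_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1} \log \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx, \qquad \lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$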

Spotlight

One-Pass Boosting

Zafer Barutcuoglu · Phil Long · Rocco A Servedio
Dec 5, 11:50 AM - 12:00 PM

This paper studies boosting algorithms that make a single pass over a set of base classifiers. We first analyze a one-pass algorithm in the setting of boosting with diverse base classifiers. Our guarantee is the same as the best proved for any boosting algorithm, but our one-pass algorithm is much faster than previous approaches. We next exhibit a random source of examples for which a ``picky'' variant of AdaBoost that skips poor base classifiers can outperform the standard AdaBoost algorithm, which uses every base classifier, by an exponential factor. Experiments with Reuters and synthetic data show that one-pass boosting can substantially improve on the accuracy of Naive Bayes, and that picky boosting can sometimes lead to a further improvement in accuracy.
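The following is a hedged sketch of the flavor of algorithm described here: a single pass over a fixed, ordered set of base classifiers, where a "picky" variant skips any classifier whose weighted edge falls below a threshold and otherwise assigns it the usual AdaBoost weight before reweighting the examples. Variable names, the edge threshold, and the classification (rather than ranking) framing are illustrative assumptions, not the paper's exact algorithm.

import numpy as np

def one_pass_boost(base_preds, y, min_edge=0.05):
    """One pass over base classifiers with AdaBoost-style weighting.

    base_preds: (K, n) array of +/-1 predictions, one row per base classifier,
    in the fixed order they are visited.  y: (n,) labels in {-1, +1}.
    Classifiers whose weighted edge is below min_edge are skipped ("picky").
    Returns the list of (index, alpha) pairs defining the combined classifier.
    """
    n = y.shape[0]
    w = np.full(n, 1.0 / n)
    ensemble = []
    for k, h in enumerate(base_preds):
        eps = w[h != y].sum()                 # weighted error of this classifier
        edge = 0.5 - eps
        if edge < min_edge:
            continue                          # picky: skip weak-looking classifiers
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((k, alpha))
        w *= np.exp(-alpha * y * h)           # standard AdaBoost reweighting
        w /= w.sum()
    return ensemble

def ensemble_score(base_preds, ensemble):
    # Weighted vote of the selected base classifiers.
    return sum(alpha * base_preds[k] for k, alpha in ensemble)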

Spotlight

Nearest-Neighbor-Based Active Learning for Rare Category Detection

Jingrui He · Jaime Carbonell
Dec 5, 11:50 AM - 12:00 PM

Rare category detection is an open challenge for active learning, especially in the de-novo case (no labeled examples), but of significant practical importance for data mining - e.g. detecting new financial transaction fraud patterns, where normal legitimate transactions dominate. This paper develops a new method for detecting an instance of each minority class via an unsupervised local-density-differential sampling strategy. Essentially a variable-scale nearest neighbor process is used to optimize the probability of sampling tightly-grouped minority classes, subject to a local smoothness assumption of the majority class. Results on both synthetic and real data sets are very positive, detecting each minority class with only a fraction of the actively sampled points required by random sampling and by Pelleg's Interleave method, the prior best technique in the sparse literature on this topic.

Spotlight

A Spectral Regularization Framework for Multi-Task Structure Learning

Andreas Argyriou · Charles A. Micchelli · Massimiliano Pontil · Yiming Ying
Dec 5, 11:50 AM - 12:00 PM

Learning the common structure shared by a set of supervised tasks is an important practical and theoretical problem. Knowledge of this structure may lead to better generalization performance on the tasks and may also facilitate learning new tasks. We propose a framework for solving this problem, which is based on regularization with spectral functions of matrices. This class of regularization problems exhibits appealing computational properties and can be optimized efficiently by an alternating minimization algorithm. In addition, we provide a necessary and sufficient condition for convexity of the regularizer. We analyze concrete examples of the framework, which are equivalent to regularization with $L_p$ matrix norms. Experiments on two real data sets indicate that the algorithm scales well with the number of tasks and improves on the state of the art in statistical performance.
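As a rough sketch of what alternating minimization for a spectral regularizer can look like, the code below handles the trace norm (the $L_1$ spectral penalty) via its well-known variational form, alternating between per-task ridge solves under a shared metric and a closed-form metric update. This follows the standard recipe from the multi-task feature learning literature under assumed names and parameters; it is not claimed to be this paper's algorithm.

import numpy as np
from scipy.linalg import sqrtm

def multitask_trace_norm(Xs, ys, gamma=1.0, n_iter=20, eps=1e-6):
    """Alternating minimization for trace-norm regularized multi-task regression.

    Uses the variational bound ||W||_tr = min_D 0.5 * (tr(W' D^{-1} W) + tr(D)):
    - given D, each task reduces to an independent ridge problem with metric D^{-1};
    - given W, the minimizing D is the matrix square root of W W' (plus eps * I).
    Xs, ys: lists with one design matrix / target vector per task.
    """
    d = Xs[0].shape[1]
    D = np.eye(d)
    for _ in range(n_iter):
        Dinv = np.linalg.inv(D)
        W = np.column_stack([
            np.linalg.solve(X.T @ X + 0.5 * gamma * Dinv, X.T @ y)
            for X, y in zip(Xs, ys)
        ])
        D = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
    return W   # columns are the task weight vectors, coupled through D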

Spotlight

Kernel Measures of Conditional Dependence

Kenji Fukumizu · Arthur Gretton · Xiaohai Sun · Bernhard Schölkopf
Dec 5, 11:50 AM - 12:00 PM

We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.

Spotlight

A Bayesian Framework for Cross-Situational Word-Learning

Michael C Frank · Noah Goodman · Josh Tenenbaum
Dec 5, 3:20 PM - 3:30 PM

For infants, early word learning is a chicken-and-egg problem. One way to learn a word is to observe that it co-occurs with a particular referent across different situations. Another way is to use the social context of an utterance to infer the intended referent of a word. Here we present a Bayesian model of cross-situational word learning, and an extension of this model that also learns which social cues are relevant to determining reference. We test our model on a small corpus of mother-infant interaction and find it performs better than competing models. Finally, we show that our model accounts for experimental phenomena including mutual exclusivity, fast-mapping, and generalization from social cues.

Spotlight

Learning Visual Attributes

Vittorio Ferrari · Andrew Zisserman
Dec 5, 3:20 PM - 3:30 PM

We present a probabilistic generative model of visual attributes, together with an efficient learning algorithm. Attributes are visual qualities of objects, such as `red', `striped', or `spotted'. The model sees attributes as patterns of image segments, repeatedly sharing some characteristic properties. These can be any combination of appearance, shape, or the layout of segments within the pattern. Moreover, attributes with general appearance are taken into account, such as the pattern of alternation of {\em any} two colors which is characteristic for stripes. To enable learning from unsegmented training images, the model is learnt discriminatively, by optimizing a likelihood ratio. As demonstrated in the experimental evaluation, our model can learn in a weakly supervised setting and encompasses a broad range of attributes. We show that attributes can be learnt starting from a text query to Google image search, and can then be used to recognize the attribute and determine its spatial extent in novel real-world images.

Spotlight

Subspace-Based Face Recognition in Analog VLSI

Miguel E Figueroa · Gonzalo Carvajal · Waldo Valenzuela
Dec 5, 3:20 PM - 3:30 PM

We describe an analog-VLSI neural network for face recognition based on subspace methods. The system uses a dimensionality-reduction network whose coefficients can be either programmed or learned on-chip to perform PCA, or programmed to perform LDA. A second network with user-programmed coefficients performs classification with Manhattan distances. The system uses on-chip compensation techniques to reduce the effects of device mismatch. Using the ORL database with 12x12-pixel images, our circuit achieves up to 85\% classification performance (98\% of an equivalent software implementation).

Spotlight

Comparing Bayesian models for multisensory cue combination without mandatory integration

Ulrik Beierholm · Konrad P Kording · Ladan Shams · Wei Ji Ma
Dec 5, 3:20 PM - 3:30 PM

Bayesian models of multisensory perception traditionally address the problem of estimating a variable that is assumed to be the underlying cause of two sensory signals. The brain, however, has to solve a more general problem: it has to establish which signals come from the same source and should be integrated, and which ones do not and should be segregated. In the last couple of years, a few models have been proposed to solve this problem in a Bayesian fashion. One of these has the strength that it formalizes the causal structure of sensory signals. We describe these models and conduct an experiment to test human performance in an auditory-visual spatial localization task in which integration is not mandatory. We find that the causal Bayesian inference model accounts for the data better than other models.

Spotlight

Object Recognition by Scene Alignment

Bryan C Russell · Antonio Torralba · Ce Liu · Rob Fergus · William Freeman
Dec 5, 3:20 PM - 3:30 PM

Current object recognition systems can only recognize a limited number of object categories; scaling up to many categories is the next challenge in object recognition. We seek to build a system to recognize and localize many different object categories in complex scenes. We achieve this through a deceptively simple approach: by matching the input image, in an appropriate representation, to images in a large training set of labeled images (LabelMe). This provides us with a set of retrieval images, providing hypotheses for object identities and locations. We then transfer the labelings from the retrieval set. We demonstrate the effectiveness of this approach and study algorithm component contributions using held-out test sets from the LabelMe database.

Spotlight

Theoretical Analysis of Learning with Reward-Modulated Spike-Timing-Dependent Plasticity

Robert Legenstein · Dejan Pecevski · Wolfgang Maass
Dec 5, 3:20 PM - 3:30 PM

Reward-modulated spike-timing-dependent plasticity (STDP) has recently emerged as a candidate for a learning rule that could explain how local learning rules at single synapses support adaptive changes in complex networks of spiking neurons. However, the potential and limitations of this learning rule have so far only been tested through computer simulations. This article provides tools for an analytic treatment of reward-modulated STDP, which allow us to derive concrete conditions under which the convergence of reward-modulated STDP can be predicted. In particular, we can produce in this way a theoretical explanation and a computer model for a fundamental experimental finding on reinforcement learning in monkeys by Fetz and Baker. We also report results of computer simulations that have tested further predictions of this theory.
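For context, reward-modulated STDP is commonly written as a synaptic eligibility trace driven by the usual STDP window and gated by a global neuromodulatory reward signal; the generic form below is an illustrative textbook formulation, not the specific learning equations analyzed in the paper:

$$\tau_e \frac{d e_{ij}}{dt} = -e_{ij}(t) + \mathrm{STDP}_{ij}(t), \qquad \frac{d w_{ij}}{dt} = \eta\, d(t)\, e_{ij}(t),$$

where $\mathrm{STDP}_{ij}(t)$ collects the pre/post spike-timing contributions at synapse $ij$ and $d(t)$ is the (possibly delayed) reward signal.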

Spotlight

Sequential Hypothesis Testing under Stochastic Deadlines

Peter Frazier · Angela Yu
Dec 5, 3:20 PM - 3:30 PM

Most models of decision-making in neuroscience assume an infinite horizon, which yields an optimal solution that integrates evidence up to a fixed decision threshold. However, under most experimental as well as naturalistic behavioral settings, the decision has to be made before some finite deadline, which is often experienced as a stochastic quantity, either due to variable external constraints or internal timing uncertainty. In this work, we formulate this problem as sequential hypothesis testing under a stochastic horizon. We use dynamic programming tools to show that, for a large class of deadline distributions, the Bayes-optimal solution requires integrating evidence up to a threshold that declines monotonically over time. We use numerical simulations to illustrate the optimal policy in the special cases of a fixed deadline and one that is drawn from a gamma distribution.
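The backward induction underlying this kind of analysis is easy to sketch for the simplest case of a hard (fixed) deadline, binary hypotheses, and Bernoulli observations; the code below is only an illustration of that dynamic program, with assumed costs and parameter names, not the paper's stochastic-deadline treatment.

import numpy as np

def deadline_values(q0=0.3, q1=0.7, c=0.01, T=30, grid=501):
    """Backward induction for testing H1 (P(x=1)=q1) vs H0 (P(x=1)=q0).

    State: belief p = P(H1 | data).  Stopping cost = probability of error;
    continuing costs c per sample; a hard deadline at time T forces a decision.
    Returns the belief grid and, for each time step, the value function on it.
    """
    p = np.linspace(0.0, 1.0, grid)
    stop = np.minimum(p, 1.0 - p)              # cost of deciding now (error probability)
    V = stop.copy()                             # at the deadline we must stop
    values = [V]
    for t in range(T - 1, -1, -1):
        m1 = p * q1 + (1 - p) * q0                              # P(next obs = 1 | p)
        post1 = p * q1 / np.maximum(m1, 1e-12)                  # belief after observing 1
        post0 = p * (1 - q1) / np.maximum(1 - m1, 1e-12)        # belief after observing 0
        cont = c + m1 * np.interp(post1, p, V) + (1 - m1) * np.interp(post0, p, V)
        V = np.minimum(stop, cont)
        values.append(V)
    return p, values[::-1]

# The belief region where continuing beats stopping shrinks as the deadline
# approaches, i.e. the effective decision threshold declines over time.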

Spotlight

Retrieved context and the discovery of semantic structure

Vinayak Rao · Marc Howard
Dec 5, 3:20 PM - 3:30 PM

Semantic memory refers to our knowledge of facts and relationships between concepts. A successful semantic memory depends on inferring relationships between items that are not explicitly taught. Recent mathematical modeling of episodic memory argues that episodic recall relies on retrieval of a gradually-changing representation of temporal context. We show that retrieved context enables the development of a global memory space that reflects relationships between all items that have been previously learned. When newly-learned information is integrated into this structure, it is placed in some relationship to all other items, even if that relationship has not been explicitly learned. We demonstrate this effect for global semantic structures shaped topologically as a ring, and as a two-dimensional sheet. We also examined the utility of this learning algorithm for learning a more realistic semantic space by training it on a large pool of synonym pairs. Retrieved context enabled the model to “infer” relationships between synonym pairs that had not yet been presented.

Spotlight

Congruence between model and human attention reveals unique signatures of critical visual events

Robert J Peters · Laurent Itti
Dec 5, 3:20 PM - 3:30 PM

Current computational models of bottom-up and top-down components of attention are predictive of eye movements across a range of stimuli and of simple, fixed visual tasks (such as visual search for a target among distractors). However, to date there exists no computational framework which can reliably mimic human gaze behavior in more complex environments and tasks, such as driving a vehicle through traffic. Here, we develop a hybrid computational/behavioral framework, combining simple models for bottom-up salience and top-down relevance, and looking for changes in the predictive power of these components at different critical event times during 4.7 hours (500,000 video frames) of observers playing car racing and flight combat video games. This approach is motivated by our observation that the predictive strengths of the salience and relevance models exhibit reliable temporal signatures during critical event windows in the task sequence---for example, when the game player directly engages an enemy plane in a flight combat game, the predictive strength of the salience model increases significantly, while that of the relevance model decreases significantly. Our new framework combines these temporal signatures to implement several event detectors. Critically, we find that an event detector based on fused behavioral and stimulus information (in the form of the model's predictive strength) is much stronger than detectors based on behavioral information alone (eye position) or image information alone (model saliency maps). This approach to event detection, based on eye tracking combined with computational models applied to the visual input, may have useful applications as a less-invasive alternative to other event detection approaches based on neural signatures derived from EEG or fMRI recordings.

Spotlight

Measuring Neural Synchrony by Message Passing

Justin Dauwels · François Vialatte · Tomasz M Rutkowski · Andrzej S CICHOCKI
Dec 5, 5:20 PM - 5:30 PM

A novel approach to measure the interdependence of two time series is proposed, referred to as “stochastic event synchrony” (SES); it quantifies the alignment of two point processes by means of the following parameters: time delay, standard deviation of the timing jitter, the fraction of “spurious” events, and the average similarity of the events. In contrast to the other measures, SES quantifies the synchrony of oscillatory events (instead of more conventional amplitude or phase synchrony). Pairwise alignment of the point processes is cast as a statistical inference problem, which is solved by applying the max-product algorithm on a graphical model. The SES parameters are determined from the resulting pairwise alignment by maximum a posteriori (MAP) estimation. The proposed interdependence measure is applied to the problem of detecting anomalies in EEG synchrony of Mild Cognitive Impairment (MCI) patients.

Spotlight

Inferring Neural Firing Rates from Spike Trains Using Gaussian Processes

John P Cunningham · Byron M Yu · Krishna V Shenoy · Maneesh Sahani
Dec 5, 5:20 PM - 5:30 PM

Neural signals present challenges to analytical efforts due to their noisy, spiking nature. Many studies of neuroscientific and neural prosthetic importance rely on a smoothed, denoised neural signal considered to be the spike train's underlying firing rate. Current techniques to find time-varying firing rates require ad hoc choices of parameters, offer no confidence intervals on their estimates, and can obscure potentially important single-trial variability. We present a new method, based on a Gaussian Process prior, for inferring probabilistically optimal estimates of firing rate functions underlying single or multiple neural spike trains. We simulate spike trains to test the performance of the method and demonstrate significant average error improvement over standard smoothing techniques.
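As a simplified illustration of the Gaussian-process smoothing idea only (the paper works with point-process observations; this sketch just regresses binned counts with an assumed Gaussian noise model and illustrative parameters):

import numpy as np

def gp_rate_estimate(bin_centers, counts, bin_width, lengthscale=0.05, var=50.0, noise=5.0):
    """GP regression on binned spike counts as a crude firing-rate estimate (in Hz)."""
    y = counts / bin_width                                    # counts -> rate
    t = bin_centers[:, None]
    def k(a, b):
        return var * np.exp(-0.5 * (a - b.T) ** 2 / lengthscale**2)
    K = k(t, t) + noise**2 * np.eye(len(y))
    alpha = np.linalg.solve(K, y - y.mean())
    def predict(tq):
        tq = np.asarray(tq)[:, None]
        Ks = k(tq, t)
        mean = y.mean() + Ks @ alpha
        cov = k(tq, tq) - Ks @ np.linalg.solve(K, Ks.T)
        return mean, np.sqrt(np.clip(np.diag(cov), 0, None))  # mean and pointwise std
    return predict

# Usage: bin a spike train (e.g. 20 ms bins), then evaluate predict() on a fine
# time grid to read out a smoothed rate with pointwise error bars.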

Spotlight

Contraction Properties of VLSI Cooperative Competitive Neural Networks of Spiking Neurons

Emre Neftci · Elisabetta Chicca · Giacomo Indiveri · Jean-Jacques Slotine · Rodney J Douglas
Dec 5, 5:20 PM - 5:30 PM

A nonlinear dynamical system is called contracting if initial conditions are forgotten exponentially fast, so that all trajectories converge to a single trajectory which is the solution of the system. We use contraction theory to derive an upper bound for the strength of recurrent connections that guarantees contraction for complex neural networks. Specifically, we apply this theory to a special class of recurrent networks which are an abstract representation of the cooperative-competitive connectivity observed in cortex and often called Cooperative Competitive Networks (CCNs). This specific type of network is believed to play a major role in shaping cortical responses and selecting the relevant signal among distractors and noise. In this paper, we analyze contraction of combined CCNs of linear threshold units and verify the results of our analysis in a hybrid analog/digital VLSI CCN comprising spiking neurons and dynamic synapses.
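For reference, one standard sufficient condition of this kind (stated for a generic recurrent network as an assumption, not quoting the paper's bound): for dynamics $\tau \dot{x} = -x + W f(x) + u(t)$ with an activation $f$ of Lipschitz constant $L$ (so $L = 1$ for linear threshold units), the system is contracting whenever $\sigma_{\max}(W)\, L < 1$, since the symmetric part of the Jacobian $-I + W\,\mathrm{diag}(f'(x))$ is then uniformly negative definite.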

Spotlight

The rat as particle filter

Nathaniel D Daw · Aaron Courville
Dec 5, 5:20 PM - 5:30 PM

The core tenet of Bayesian modeling is that subjects represent beliefs as distributions over possible hypotheses. Such models have fruitfully been applied to the study of learning in the context of animal conditioning experiments (and analogously designed human learning tasks), where they explain phenomena such as retrospective revaluation that seem to demonstrate that subjects entertain multiple hypotheses simultaneously. However, a recent quantitative analysis of individual subject records by Gallistel and colleagues casts doubt on a very broad family of conditioning models by showing that all of the key features the models capture about even simple learning curves are artifacts of averaging over subjects. Rather than smooth learning curves (which Bayesian models interpret as revealing the gradual tradeoff from prior to posterior as data accumulate), subjects acquire suddenly, and their predictions continue to fluctuate abruptly. These data demand revisiting the model of the individual versus the ensemble, and also raise the worry that more sophisticated behaviors thought to support Bayesian models might also emerge artifactually from averaging over the simpler behavior of individuals. We suggest that the suddenness of changes in subjects' beliefs (as expressed in conditioned behavior) can be modeled by assuming they are conducting inference using sequential Monte Carlo sampling with a small number of samples --- one, in our simulations. Ensemble behavior resembles exact Bayesian models since, as in particle filters, it averages over many samples. Further, the model is capable of exhibiting sophisticated behaviors like retrospective revaluation at the ensemble level, even given minimally sophisticated individuals that do not track uncertainty from trial to trial. These results point to the need for more sophisticated experimental analysis to test Bayesian models, and refocus theorizing on the individual, while at the same time clarifying why the ensemble may be of interest.
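For readers less familiar with sequential Monte Carlo, here is a generic bootstrap particle filter for a drifting latent value observed through noise; with only a handful of particles the filtered estimate fluctuates abruptly from trial to trial, while averaging many independent runs (or using many particles) produces smooth, gradual-looking curves. The model, parameters, and names are illustrative assumptions, not the paper's conditioning model.

import numpy as np

def bootstrap_particle_filter(observations, n_particles=5, process_std=0.05,
                              obs_std=0.2, seed=0):
    """Track a slowly drifting latent value from noisy scalar observations."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)
    estimates = []
    for y in observations:
        particles = particles + rng.normal(0.0, process_std, n_particles)   # propagate
        log_w = -0.5 * ((y - particles) / obs_std) ** 2                      # likelihood weights
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)                 # resample
        particles = particles[idx]
        estimates.append(particles.mean())
    return np.array(estimates)

# With few particles the per-run estimates jump abruptly; averaging over many
# independent runs approaches the smooth curves predicted by exact Bayesian filtering.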

Spotlight

Near-Maximum Entropy Models for Binary Neural Representations of Natural Images

Matthias Bethge · Philipp Berens
Dec 5, 5:20 PM - 5:30 PM

Maximum entropy analysis of binary variables provides an elegant way for studying the role of pairwise correlations in neural populations. Unfortunately, these approaches suffer from their poor scalability to high dimensions. In sensory coding, however, high-dimensional data is ubiquitous. Here, we introduce a new approach using a near-maximum entropy model that makes this type of analysis feasible for very high-dimensional data---the model parameters can be derived in closed form and sampling is easy. We demonstrate its usefulness by studying a simple neural representation model of natural images. For the first time, we are able to directly compare predictions from a pairwise maximum entropy model not only in small groups of neurons, but also in larger populations of more than a thousand units. Our results indicate that in such larger networks interactions exist that are not predicted by pairwise correlations, despite the fact that pairwise correlations explain the lower-dimensional marginal statistics extremely well up to the limit of dimensionality where estimation of the full joint distribution is feasible.
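One standard construction with exactly these two properties (closed-form parameters, trivial sampling) is the dichotomized Gaussian: binary vectors are generated by thresholding a correlated Gaussian whose thresholds match the desired mean activities. The sketch below illustrates that construction under assumed parameter names; it is offered as an example of this style of model, not as the paper's exact parameterization.

import numpy as np
from scipy.stats import norm

def sample_dichotomized_gaussian(mean_rates, latent_cov, n_samples, seed=0):
    """Sample binary vectors by thresholding a correlated Gaussian.

    mean_rates[i] is the desired activation probability of unit i; thresholds are
    set so that P(z_i > t_i) = mean_rates[i] for standard-normal marginals, and
    latent_cov is the latent Gaussian correlation matrix (unit diagonal).
    """
    rng = np.random.default_rng(seed)
    thresholds = norm.ppf(1.0 - np.asarray(mean_rates))
    L = np.linalg.cholesky(latent_cov)
    z = rng.standard_normal((n_samples, len(mean_rates))) @ L.T
    return (z > thresholds).astype(int)

# Example: 100 units, 5% mean activity, uniform latent correlation of 0.1.
d = 100
C = 0.1 * np.ones((d, d)) + 0.9 * np.eye(d)
spikes = sample_dichotomized_gaussian([0.05] * d, C, n_samples=10000)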

Spotlight

An in-silico Neural Model of Dynamic Routing through Neuronal Coherence

Devarajan Sridharan · Brian Percival · john arthur · Kwabena A Boahen
Dec 5, 5:20 PM - 5:30 PM
We describe a neurobiologically plausible mechanism to implement dynamic routing using the concept of neuronal communication through neuronal coherence. The model, implemented on a neuromorphic chip, incorporates a three-tier neural network architecture: the lowest tier comprises raw input representations, the middle tier the routing neurons, and the topmost tier an invariant output representation. The correct mapping between input and output representations is realized by an appropriate alignment of the phases of their background oscillations by the routing neurons. We demonstrate that our method dramatically reduces the number of connections required, from O($N^{3}$) to O($N^{2}$).
Spotlight

Invariant Common Spatial Patterns: Alleviating Nonstationarities in Brain-Computer Interfacing

Benjamin Blankertz · Motoaki Kawanabe · Ryota Tomioka · Friederike Hohlefeld · Vadim Nikulin · Klaus-Robert Müller
Dec 5, 5:20 PM - 5:30 PM

Brain-Computer Interfaces can suffer from a large variance of the subject conditions within and across sessions. For example, vigilance fluctuations in the individual, variable task involvement, workload, etc., alter the characteristics of EEG signals and thus challenge a stable BCI operation. In the present work we aim to define features based on a variant of the common spatial patterns (CSP) algorithm that are constructed to be invariant with respect to such nonstationarities. We enforce invariance properties by adding terms, such as disturbance covariance matrices from fluctuations in visual processing, to the denominator of a Rayleigh coefficient representation of CSP. In this manner physiological prior knowledge can be used to shape the classification engine for BCI. As a proof of concept we present a BCI classifier that is robust to changes in the level of parietal alpha-activity. In other words, the EEG decoding still works when there are lapses in vigilance.
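A minimal sketch of the idea of penalizing the denominator of the CSP Rayleigh quotient with a disturbance covariance: the spatial filters solve a generalized eigenvalue problem in which a penalty matrix Delta (e.g. a covariance of visual alpha fluctuations) is added with weight xi. Names and the specific penalty are illustrative assumptions, not the paper's implementation.

import numpy as np
from scipy.linalg import eigh

def invariant_csp(cov_class1, cov_class2, penalty_cov, xi=1.0, n_filters=3):
    """CSP filters maximizing var(class1) / (var(class1 + class2) + xi * penalty).

    Solves the generalized eigenvalue problem
        Sigma_1 w = lambda (Sigma_1 + Sigma_2 + xi * Delta) w
    and returns the filters with the smallest and largest eigenvalues
    (the most discriminative directions for each class).
    """
    denom = cov_class1 + cov_class2 + xi * penalty_cov
    evals, evecs = eigh(cov_class1, denom)      # eigenvalues in ascending order
    idx = np.concatenate([np.arange(n_filters),
                          np.arange(len(evals) - n_filters, len(evals))])
    return evecs[:, idx], evals[idx]

# The filters W are applied to band-pass filtered EEG epochs X (channels x time);
# typical features are log band-power: np.log(np.var(W.T @ X, axis=1)).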

Spotlight

Extending position/phase-shift tuning to motion energy neurons improves velocity discrimination

Stanley, Yiu Man LAM · Bertram E Shi
Dec 5, 5:20 PM - 5:30 PM

We extend position and phase-shift tuning, concepts already well established in the disparity energy neuron literature, to motion energy neurons. We show that Reichardt-like detectors can be considered examples of position tuning, and that motion energy filters whose complex-valued spatio-temporal receptive fields are space-time separable can be considered examples of phase tuning. By combining these two types of detectors, we obtain an architecture for constructing motion energy neurons whose center frequencies can be adjusted by both phase and position shifts. Similar to recently described neurons in the primary visual cortex, these new motion energy neurons exhibit tuning that is between purely space-time separable and purely speed-tuned. We propose a functional role for this intermediate level of tuning by demonstrating that comparisons between pairs of these motion energy neurons can reliably discriminate between inputs whose velocities lie above or below a given reference velocity.
