Spotlights Session
Dan Klein
A Kernel Statistical Test of Independence
Arthur Gretton · Kenji Fukumizu · Choon Hui Teo · Le Song · Bernhard Schölkopf · Alexander Smola
Whereas kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs O(m^2), where m is the sample size. We demonstrate that this test outperforms established contingency table-based tests. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
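As a rough illustration of the quantity being tested, the biased empirical HSIC with Gaussian kernels can be computed from the two centered kernel matrices. The sketch below uses a simple permutation null rather than the paper's approximation to the asymptotic null distribution, and the bandwidths `sigma_x` and `sigma_y` are illustrative choices.

```python
import numpy as np

def rbf_kernel(X, sigma):
    # X: (m, d) array; Gaussian kernel from pairwise squared distances.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC with Gaussian kernels (quadratic in the sample size m)."""
    m = X.shape[0]
    K = rbf_kernel(X, sigma_x)
    L = rbf_kernel(Y, sigma_y)
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    return np.trace(K @ H @ L @ H) / (m ** 2)

def hsic_permutation_test(X, Y, n_perm=500, seed=0):
    """Approximate p-value by permuting Y to simulate the independence null
    (a generic substitute for the test's null approximation)."""
    rng = np.random.default_rng(seed)
    stat = hsic(X, Y)
    null = [hsic(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    return stat, float(np.mean([s >= stat for s in null]))
```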
Anytime Induction of Cost-sensitive Trees
Saher Esmeir · Shaul Markovitch
Machine learning techniques are increasingly being used to produce a wide range of classifiers for complex real-world applications that involve nonuniform testing costs and misclassification costs. As the complexity of these applications grows, the management of resources during the learning and classification processes becomes a challenging task. In this work we introduce ACT (Anytime Cost-sensitive Trees), a novel framework for operating in such environments. ACT is an anytime algorithm that allows trading computation time for lower classification costs. It builds a tree top-down and exploits additional time resources to obtain better estimations for the utility of the different candidate splits. Using sampling techniques, ACT approximates for each candidate split the cost of the subtree under it and favors the split with minimal cost. Due to its stochastic nature, ACT is expected to be able to escape local minima into which greedy methods may be trapped. Experiments with a variety of datasets were conducted to compare the performance of ACT to that of state-of-the-art cost-sensitive tree learners. The results show that for most domains ACT produces trees of significantly lower costs. ACT is also shown to exhibit good anytime behavior with diminishing returns.
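As a heavily simplified stand-in for ACT's split selection (not the published algorithm, which samples whole subtrees), the sketch below estimates the expected cost of each candidate split by averaging over bootstrap samples, and lets a larger time budget buy more samples per split. The data structures `rows`, `test_costs`, and `mc_costs` are hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter

def split_cost_estimate(rows, attr, test_costs, mc_costs, n_samples=10):
    """Crude estimate of the cost of splitting on `attr`: pay the test cost, then
    predict with majority-vote leaves, averaging over bootstrap samples.
    rows: list of (features_dict, label); mc_costs: dict keyed by (true, predicted)."""
    costs = []
    for _ in range(n_samples):
        sample = [random.choice(rows) for _ in rows]          # bootstrap sample
        branches = {}
        for x, y in sample:
            branches.setdefault(x[attr], []).append(y)
        cost = test_costs[attr]                               # cost of testing attr
        for labels in branches.values():
            majority = Counter(labels).most_common(1)[0][0]
            cost += sum(mc_costs[(y, majority)]
                        for y in labels if y != majority) / len(sample)
        costs.append(cost)
    return sum(costs) / len(costs)

def choose_split(rows, attrs, test_costs, mc_costs, time_budget):
    """Anytime flavour: a larger budget buys more samples per candidate split."""
    n_samples = max(1, time_budget // max(1, len(attrs)))
    return min(attrs, key=lambda a: split_cost_estimate(rows, a, test_costs,
                                                        mc_costs, n_samples))
```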
Boosting Algorithms for Maximizing the Soft Margin
Manfred K. Warmuth · Karen Glocer · Gunnar Rätsch
We present a novel boosting algorithm, called SoftBoost, designed for sets of binary labeled examples that are not necessarily separable by convex combinations of base hypotheses. Our algorithm aims to achieve robustness by \emph{capping} the distributions on the examples. Our update of the distribution is motivated by minimizing a relative entropy subject to the capping constraints and constraints on the edges of the obtained base hypotheses. The capping constraints imply a soft margin in the dual optimization problem and our algorithm produces a convex combination of hypotheses whose \emph{soft margin} is within $\delta$ of the optimum. We employ relative entropy projection methods to prove an $O(\frac{\ln N}{\delta^2})$ iteration bound for our algorithm, where $N$ is the number of examples.
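The capping step can be illustrated in isolation: the relative-entropy projection of a weight vector onto the set of distributions whose entries are at most some cap has the closed form $d_i = \min(\mathrm{cap}, \lambda w_i)$, with $\lambda$ chosen so the entries sum to one. The sketch below implements only this projection, not the full SoftBoost update (which also enforces the edge constraints on the hypotheses selected so far).

```python
import numpy as np

def kl_project_capped(w, cap):
    """Relative-entropy projection of nonnegative weights w onto
    {d : sum(d) = 1, 0 <= d_i <= cap}; solution is d_i = min(cap, lam * w_i)."""
    w = np.asarray(w, dtype=float)
    order = np.argsort(-w)                  # largest weights are capped first
    d = np.empty_like(w)
    remaining_mass, remaining_w = 1.0, w.sum()
    for k, i in enumerate(order):
        lam = remaining_mass / remaining_w  # scale for the not-yet-capped weights
        if lam * w[i] >= cap:
            d[i] = cap                      # this weight hits the cap
            remaining_mass -= cap
            remaining_w -= w[i]
        else:
            # no further weight can exceed the cap; scale the rest proportionally
            d[order[k:]] = lam * w[order[k:]]
            break
    return d
```

In a capped boosting setting the cap would typically be of the form $1/(\nu N)$, with $\nu$ controlling how many examples are allowed to hit the cap and hence how soft the resulting margin is.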
Bundle Methods for Machine Learning
Alexander Smola · Vishwanathan S V N · Quoc V Le
We present a globally convergent method for regularized risk minimization problems. Our method applies to Support Vector estimation, regression, Gaussian Processes, and any other regularized risk minimization setting which leads to a convex optimization problem. SVMPerf can be shown to be a special case of our approach. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in $O(1/\epsilon)$ steps to $\epsilon$ precision for general convex problems and in $O(\log(1/\epsilon))$ steps for continuously differentiable problems. We demonstrate the performance of our approach in experiments.
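A generic cutting-plane sketch of this style of method (not the authors' implementation) for L2-regularized hinge loss: each iteration adds a linear lower bound on the empirical risk at the current iterate and re-minimizes the regularizer plus the piecewise-linear bound formed by all cuts so far. cvxpy is used for the inner subproblem purely for brevity.

```python
import numpy as np
import cvxpy as cp

def hinge_risk_and_subgradient(w, X, y):
    """Empirical hinge-loss risk and one subgradient at w (y in {-1, +1})."""
    margins = 1 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0).mean()
    grad = -(X[active] * y[active, None]).sum(axis=0) / len(y)
    return risk, grad

def bundle_method(X, y, lam=0.1, max_iter=50, eps=1e-3):
    """Cutting-plane loop for lam/2 ||w||^2 + hinge risk."""
    d = X.shape[1]
    w = np.zeros(d)
    A, b = [], []                                   # cuts: risk(w') >= a.w' + b
    for _ in range(max_iter):
        risk, grad = hinge_risk_and_subgradient(w, X, y)
        upper = lam / 2 * w @ w + risk              # true objective at current w
        A.append(grad)
        b.append(risk - grad @ w)
        wv, xi = cp.Variable(d), cp.Variable()
        cons = [xi >= a @ wv + bi for a, bi in zip(A, b)]
        lower = cp.Problem(cp.Minimize(lam / 2 * cp.sum_squares(wv) + xi),
                           cons).solve()
        w = wv.value
        if upper - lower < eps:                     # gap between bound and objective
            break
    return w
```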
Classification via Minimum Incremental Coding Length (MICL)
John Wright · Yangyu Tao · Zhouchen Lin · Yi Ma · Heung-Yeung Shum
We present a simple new criterion for classification, based on principles from lossy data compression. The criterion assigns a test sample to the class that uses the minimum number of additional bits to code the test sample, subject to an allowable distortion. We rigorously prove asymptotic optimality of this criterion for Gaussian (normal) distributions and analyze its relationships to classical classifiers. The theoretical results provide new insights into the relationships among a variety of popular classifiers such as MAP, RDA, k-NN, and SVM, as well as unsupervised methods based on lossy coding \cite{Ma2007-PAMI}. Our formulation has several desirable effects on the resulting classifier. First, minimizing the lossy coding length induces a regularization effect which stabilizes the (implicit) density estimate in a small-sample setting. Second, compression provides a uniform means of handling classes of varying dimension. The new criterion and its kernel and local versions perform competitively on synthetic examples, as well as on real imagery data such as handwritten digits and face images. On these problems, the performance of our simple classifier approaches the best reported results, without using domain-specific information. All MATLAB code and classification results will be made publicly available for peer evaluation.
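A minimal sketch of the decision rule, assuming a Gaussian lossy coding length of the form used in \cite{Ma2007-PAMI} and omitting the mean and label-coding terms: assign the test sample to the class whose coding length grows least when the sample is appended.

```python
import numpy as np

def coding_length(X, eps=1.0):
    """Approximate bits to code the rows of X (m samples, n dims) up to distortion
    eps^2 under a Gaussian model; mean and label terms are omitted in this sketch."""
    m, n = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / m
    _, logdet = np.linalg.slogdet(np.eye(n) + (n / eps ** 2) * Sigma)
    return 0.5 * (m + n) * logdet / np.log(2)       # convert nats to bits

def micl_predict(x, class_data, eps=1.0):
    """class_data: dict mapping class label -> (m_j, n) array of training samples.
    Return the class with minimum incremental coding length for x."""
    def increment(Xj):
        return coding_length(np.vstack([Xj, x]), eps) - coding_length(Xj, eps)
    return min(class_data, key=lambda c: increment(class_data[c]))
```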
Consistent Minimization of Clustering Objective Functions
Ulrike von Luxburg · Sebastien Bubeck · Stefanie S Jegelka · Michael Kaufmann
Clustering is often formulated as a discrete optimization problem. The objective is to find, among all partitions of the data set, the best one according to some quality measure. However, in the statistical setting where we assume that the finite data set has been sampled from some underlying space, the goal is not to find the best partition of the given sample, but to approximate the true partition of the underlying space. We argue that the discrete optimization approach usually does not achieve this goal. As an alternative, we suggest the paradigm of ``nearest neighbor clustering''. Instead of selecting the best out of all partitions of the sample, it only considers partitions in some restricted function class. Using tools from statistical learning theory we prove that nearest neighbor clustering is statistically consistent. Moreover, its worst case complexity is polynomial by construction, and it can be implemented with small average case complexity using branch and bound.
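A naive sketch of the paradigm (ignoring the branch-and-bound machinery): draw a handful of seed points, consider only the partitions obtained by labelling the seeds and assigning every point to the cluster of its nearest seed, and pick the labelling that minimizes the chosen objective. Within-cluster sum of squares is used below purely as an example objective; the restricted class is searched exhaustively here.

```python
import itertools
import numpy as np

def nn_clustering(X, k=2, n_seeds=8, seed=0):
    """Nearest-neighbour clustering over the restricted class of partitions
    induced by labelling n_seeds seed points (exhaustive search for small n_seeds)."""
    rng = np.random.default_rng(seed)
    seeds = rng.choice(len(X), size=n_seeds, replace=False)
    # nearest seed for every data point
    dists = np.linalg.norm(X[:, None, :] - X[seeds][None, :, :], axis=2)
    nearest = dists.argmin(axis=1)

    def objective(labels):
        assign = labels[nearest]                      # partition induced by the seeds
        return sum(((X[assign == c] - X[assign == c].mean(axis=0)) ** 2).sum()
                   for c in range(k) if np.any(assign == c))

    best = min((np.array(lab) for lab in itertools.product(range(k), repeat=n_seeds)),
               key=objective)
    return best[nearest]
```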
Iterative Non-linear Dimensionality Reduction with Manifold Sculpting
Michael S Gashler · Dan Ventura · Tony Martinez
Many algorithms have recently been developed for reducing dimensionality by projecting data onto an intrinsic non-linear manifold. Unfortunately, existing algorithms often lose significant precision in this transformation. Manifold Sculpting is a new algorithm that iteratively reduces dimensionality by simulating surface tension in local neighborhoods. We present several experiments that show Manifold Sculpting yields more accurate results than existing algorithms with both generated and natural datasets. Manifold Sculpting is also able to benefit from prior dimensionality reduction efforts.
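The surface-tension idea can be caricatured as follows (a heavily simplified sketch, not the published algorithm, which also preserves neighborhood angles and adapts its step sizes): rotate the data with PCA, then alternately shrink the non-preserved dimensions and nudge each point so that distances to its nearest neighbours stay close to their original values.

```python
import numpy as np

def manifold_sculpting_sketch(X, target_dim=2, k=10, sigma=0.99, iters=200, step=0.01):
    """Simplified iterative dimensionality reduction in the spirit of Manifold Sculpting."""
    m, n = X.shape
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt.T                                    # rotate so principal dims come first
    d2 = ((X[:, None] - X[None, :]) ** 2).sum(-1)
    nbrs = np.argsort(d2, axis=1)[:, 1:k + 1]        # k nearest neighbours of each point
    orig = np.sqrt(np.take_along_axis(d2, nbrs, axis=1))
    for _ in range(iters):
        Y[:, target_dim:] *= sigma                   # "surface tension" on extra dims
        for i in range(m):
            diff = Y[i] - Y[nbrs[i]]
            cur = np.linalg.norm(diff, axis=1) + 1e-12
            # gradient of sum_j (cur_j - orig_j)^2 with respect to Y[i]
            grad = ((1 - orig[i] / cur)[:, None] * diff).sum(axis=0)
            Y[i, :target_dim] -= step * grad[:target_dim]
    return Y[:, :target_dim]
```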
McRank: Learning to Rank Using Multiple Classification and Gradient Boosting
Ping Li · Chris J Burges · Qiang Wu
We cast the ranking problem as (1) multiple classification (``Mc'') and (2) multiple ordinal classification, which lead to computationally tractable learning algorithms for relevance ranking in Web search. We consider the DCG criterion (discounted cumulative gain), a standard quality measure in information retrieval. Our approach is motivated by the fact that perfect classifications result in perfect DCG scores and the DCG errors are bounded by classification errors. We propose using the {\em Expected Relevance} to convert class probabilities into ranking scores. The class probabilities are learned using a gradient boosting tree algorithm. Evaluations on large-scale datasets show that our approach can improve {\em LambdaRank} \cite{Proc:BurgesNIPS06} and the regression-based ranker \cite{Proc:ZhangCOLT06} in terms of the (normalized) DCG scores. An efficient implementation of the boosting tree algorithm is also presented.
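A sketch of the Expected Relevance conversion, with scikit-learn's GradientBoostingClassifier standing in for the paper's boosting tree implementation and the relevance grades assumed to be the integer class labels:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def expected_relevance_scores(X_train, y_train, X_test, relevance_values=None):
    """Learn class probabilities with gradient-boosted trees, then convert them to
    ranking scores via Expected Relevance: sum_k P(k|x) * value(k)."""
    clf = GradientBoostingClassifier().fit(X_train, y_train)
    proba = clf.predict_proba(X_test)                        # (n_docs, n_classes)
    values = np.asarray(relevance_values if relevance_values is not None
                        else clf.classes_, dtype=float)
    return proba @ values                                    # expected relevance per doc

def dcg(relevances, scores, k=10):
    """DCG@k of documents ranked by descending score: sum (2^rel - 1) / log2(rank + 1)."""
    order = np.argsort(-scores)[:k]
    gains = 2.0 ** np.asarray(relevances)[order] - 1
    discounts = np.log2(np.arange(2, len(order) + 2))
    return float((gains / discounts).sum())
```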
Random Features for Large-Scale Kernel Machines
Ali Rahimi · Benjamin Recht
To accelerate the training of kernel machines, we propose to map the input data to a randomized low-dimensional feature space and then apply existing fast linear methods. The features are designed so that the inner products of the transformed data are approximately equal to those in the feature space of a user-specified shift-invariant kernel. We explore two sets of random features, provide convergence bounds on their ability to approximate various radial basis kernels, and show that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
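For the Gaussian (RBF) kernel, the random Fourier feature map takes the form sketched below; the dimensionality `D`, the bandwidth `sigma`, and the choice of RBF kernel are illustrative, and other shift-invariant kernels correspond to other sampling distributions for the frequencies.

```python
import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=0):
    """Map X (n, d) into D random features whose inner products approximate the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=1.0 / sigma, size=(d, D))   # frequencies ~ kernel's Fourier transform
    b = rng.uniform(0, 2 * np.pi, size=D)            # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# A linear model trained on Z = random_fourier_features(X) then approximates
# the corresponding kernel machine at a fraction of the training cost.
```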