Results 1  10
of
21
Fast and balanced: Efficient label tree learning for large scale object recognition
 In NIPS
, 2011
"... We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fin ..."
Abstract

Cited by 40 (5 self)
 Add to MetaCart
We present a novel approach to efficiently learn a label tree for large scale classification with many classes. The key contribution of the approach is a technique to simultaneously determine the structure of the tree and learn the classifiers for each node in the tree. This approach also allows fine grained control over the efficiency vs accuracy tradeoff in designing a label tree, leading to more balanced trees. Experiments are performed on large scale image classification with 10184 classes and 9 million images. We demonstrate significant improvements in test accuracy and efficiency with less training time and more balanced trees compared to the previous state of the art by Bengio et al. 1
Information, Divergence and Risk for Binary Experiments
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2009
"... We unify fdivergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROCcurves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives which all ..."
Abstract

Cited by 37 (8 self)
 Add to MetaCart
We unify fdivergences, Bregman divergences, surrogate regret bounds, proper scoring rules, cost curves, ROCcurves and statistical information. We do this by systematically studying integral and variational representations of these various objects and in so doing identify their primitives which all are related to costsensitive binary classification. As well as developing relationships between generative and discriminative views of learning, the new machinery leads to tight and more general surrogate regret bounds and generalised Pinsker inequalities relating fdivergences to variational divergence. The new viewpoint also illuminates existing algorithms: it provides a new derivation of Support Vector Machines in terms of divergences and relates Maximum Mean Discrepancy to Fisher Linear Discriminants.
Error Correcting Tournaments
, 2008
"... Abstract. We present a family of adaptive pairwise tournaments that are provably robust against large error fractions when used to determine the largest element in a set. The tournaments use nk pairwise comparisons but have only O(k + log n) depth, where n is the number of players and k is the robus ..."
Abstract

Cited by 26 (4 self)
 Add to MetaCart
(Show Context)
Abstract. We present a family of adaptive pairwise tournaments that are provably robust against large error fractions when used to determine the largest element in a set. The tournaments use nk pairwise comparisons but have only O(k + log n) depth, where n is the number of players and k is the robustness parameter (for reasonable values of n and k). These tournaments also give a reduction from multiclass to binary classification in machine learning, yielding the best known analysis for the problem. 1
Doubly Robust Policy Evaluation and Learning
"... We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including healthcare policy and Internet advertising. A ..."
Abstract

Cited by 19 (5 self)
 Add to MetaCart
(Show Context)
We study decision making in environments where the reward is only partially observed, but can be modeled as a function of an action and an observed context. This setting, known as contextual bandits, encompasses a wide variety of applications including healthcare policy and Internet advertising. A central task is evaluation of a new policy given historic data consisting of contexts, actions and received rewards. The key challenge is that the past data typically does not faithfully represent proportions of actions taken by a new policy. Previous approaches rely either on models of rewards or models of the past policy. The former are plagued by a large bias whereas the latter have a large variance. In this work, we leverage the strength and overcome the weaknesses of the two approaches by applying the doubly robust technique to the problems of policy evaluation and optimization. We prove that this approach yields accurate value estimates when we have either a good (but not necessarily consistent) model of rewards or a good (but not necessarily consistent) model of past policy. Extensive empirical comparison demonstrates that the doubly robust approach uniformly improves over existing techniques, achieving both lower variance in value estimation and better policies. As such, we expect the doubly robust approach to become common practice. 1.
Intelligent Selection of ApplicationSpecific Garbage Collectors Abstract
"... Java program execution times vary greatly with different garbage collection algorithms. Until now, it has not been possible to determine the best GC algorithm for a particular program without exhaustively profiling that program for all available GC algorithms. This paper presents a new approach. We ..."
Abstract

Cited by 18 (4 self)
 Add to MetaCart
(Show Context)
Java program execution times vary greatly with different garbage collection algorithms. Until now, it has not been possible to determine the best GC algorithm for a particular program without exhaustively profiling that program for all available GC algorithms. This paper presents a new approach. We use machine learning techniques to build a prediction model that, given a single profile run of a previously unseen Java program, can predict a good GC algorithm for that program. We implement this technique in Jikes RVM and test it on several standard benchmark suites. Our technique achieves 5 % speedup in overall execution time (averaged across all test programs for all heap sizes) compared with selecting the default GC algorithm in every trial. We present further experiments to show that an oracle predictor could achieve an average 17 % speedup on the same experiments. In addition, we provide evidence to suggest that GC behaviour is sometimes independent of program inputs. These observations lead us to propose that intelligent selection of GC algorithms is suitably straightforward, efficient and effective to merit further exploration regarding its potential inclusion in the general Java software deployment process. Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors—Memory management (garbage collection)
Multiclass learnability and the erm principle
 In COLT
, 2011
"... Multiclass learning is an area of growing practical relevance, for which the currently available theory is still far from providing satisfactory understanding. We study the learnability of multiclass prediction, and derive upper and lower bounds on the sample complexity of multiclass hypothesis clas ..."
Abstract

Cited by 8 (2 self)
 Add to MetaCart
(Show Context)
Multiclass learning is an area of growing practical relevance, for which the currently available theory is still far from providing satisfactory understanding. We study the learnability of multiclass prediction, and derive upper and lower bounds on the sample complexity of multiclass hypothesis classes in different learning models: batch/online, realizable/unrealizable, full information/bandit feedback. Our analysis reveals a surprising phenomenon: In the multiclass setting, in sharp contrast to binary classification, not all Empirical Risk Minimization (ERM) algorithms are equally successful. We show that there exist hypotheses classes for which some ERM learners have lower sample complexity than others. Furthermore, there are classes that are learnable by some ERM learners, while other ERM learner will fail to learn them. We propose a principle for designing good ERM learners, and use this principle to prove tight bounds on the sample complexity of learning symmetric multiclass hypothesis classes (that is, classes that are invariant under any permutation of label names). We demonstrate the relevance of the theory by analyzing the sample complexity of two widely used hypothesis classes: generalized linear multiclass models and reduction trees. We also obtain some practically relevant conclusions. 1
Support vector machines for noise robust ASR
 In: Proc. ASRU
, 2009
"... Abstract—Using discriminative classifiers, such as Support Vector Machines (SVMs) in combination with, or as an alternative to, Hidden Markov Models (HMMs) has a number of advantages for difficult speech recognition tasks. For example, the models can make use of additional dependencies in the observ ..."
Abstract

Cited by 6 (4 self)
 Add to MetaCart
(Show Context)
Abstract—Using discriminative classifiers, such as Support Vector Machines (SVMs) in combination with, or as an alternative to, Hidden Markov Models (HMMs) has a number of advantages for difficult speech recognition tasks. For example, the models can make use of additional dependencies in the observation sequences than HMMs provided the appropriate form of kernel is used. However standard SVMs are binary classifiers, and speech is a multiclass problem. Furthermore, to train SVMs to distinguish word pairs requires that each word appears in the training data. This paper examines both of these limitations. Treebased reduction approaches for multiclass classification are described, as well as some of the issues in applying them to dynamic data, such as speech. To address the training data issues, a simplified version of HMMbased synthesis can be used, which allows data for any wordpair to be generated. These approaches are evaluated on two noise corrupted digit sequence tasks: AURORA 2.0; and actual incar collected data. I.
Kernel Methods for TextIndependent Speaker Verification
, 2010
"... In recent years, systems based on support vector machines (SVMs) have become standard for speaker veriﬁcation (SV) tasks. An important aspect of these systems is the dynamic kernel.
These operate on sequence data and handle the dynamic nature of the speech. In this thesis a number of techniques are ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
In recent years, systems based on support vector machines (SVMs) have become standard for speaker veriﬁcation (SV) tasks. An important aspect of these systems is the dynamic kernel.
These operate on sequence data and handle the dynamic nature of the speech. In this thesis a number of techniques are proposed for improving dynamic kernelbased SV systems.
The ﬁrst contribution of this thesis is the development of alternative forms of dynamic kernel. Several popular dynamic kernels proposed for SV are based on the KullbackLeibler divergence between Gaussian mixture models. Since this has no closedform solution, typically a matchedpair upper bound is used instead. This places signiﬁcant restrictions on the forms of model structure that may be used. In this thesis, dynamic kernels are proposed based
on alternative, variational approximations to the divergence. Unlike standard approaches, these allow the use of a more ﬂexible modelling framework. Also, using a more accurate approximation may lead to performance gains.
The second contribution of this thesis is to investigate the combination of multiple systems to improve SV performance. Typically, systems are combined by fusing the output scores.
For SVM classiﬁers, an alternative strategy is to combine at the kernel level. Recently an efficient maximummargin scheme for learning kernel weights has been developed. In this thesis several modiﬁcations are proposed to allow this scheme to be applied to SV tasks.
System combination will only lead to gains when the kernels are complementary. In this thesis it is shown that many commonly used dynamic kernels can be placed into one of two broad classes, derivative and parametric kernels. The attributes of these classes are contrasted and the conditions under which the two forms of kernel are identical are described. By avoiding these conditions gains may be obtained by combining derivative and parametric kernels.
The ﬁnal contribution of this thesis is to investigate the combination of dynamic kernels with traditional static kernels for vector data. Here two general combination strategies are available: static kernel functions may be deﬁned over the dynamic feature vectors. Alternatively, a static kernel may be applied at the observation level. In general, it is not possible to explicitly train a model in the feature space associated with a static kernel. However, it is shown in this thesis that this form of kernel can be computed by using a suitable metric with approximate component posteriors. Generalised versions of standard parametric and derivative kernels, that include an observationlevel static kernel, are proposed based on this
approach.
Probabilistic Label Trees for Efficient Large Scale Image Classification
"... Largescale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows. The label tree model integrates classification with the traversal of the tree so that complexity grows logarithmically. In ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
(Show Context)
Largescale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows. The label tree model integrates classification with the traversal of the tree so that complexity grows logarithmically. In this paper, we show how the parameters of the label tree can be found using maximum likelihood estimation. This new probabilistic learning technique produces a label tree with significantly improved recognition accuracy. 1.
Multiclass Learning Approaches: A Theoretical Comparison with Implications
"... We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Treebased classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
We theoretically analyze and compare the following five popular multiclass classification methods: One vs. All, All Pairs, Treebased classifiers, Error Correcting Output Codes (ECOC) with randomly generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a reduction to binary classification. We consider the case where the binary classifier comes from a class of VC dimension d, and in particular from the class of halfspaces over R d. We analyze both the estimation error and the approximation error of these methods. Our analysis reveals interesting conclusions of practical relevance, regarding the success of the different approaches under various conditions. Our proof technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This is in contrast to most previous uses of VC theory, which only deal with estimation error. 1