Results 1–10 of 27
Label embedding trees for large multiclass tasks.
In NIPS 24, 2010
"... Abstract Multiclass classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an a ..."
Abstract

Cited by 84 (2 self)
Multiclass classification becomes challenging at test time when the number of classes is very large and testing against every possible class can become computationally infeasible. This problem can be alleviated by imposing (or learning) a structure over the set of classes. We propose an algorithm for learning a tree structure of classifiers which, by optimizing the overall tree loss, provides superior accuracy to existing tree labeling methods. We also propose a method that learns to embed labels in a low-dimensional space that is faster than non-embedding approaches and has superior accuracy to existing embedding approaches. Finally, we combine the two ideas, resulting in the label embedding tree that outperforms alternative methods, including One-vs-Rest, while being orders of magnitude faster.
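The test-time idea in this abstract, reaching a prediction in O(depth) classifier evaluations rather than testing all K classes, can be sketched as follows. This is a toy illustration, not the paper's learned model: the `Node` class, router weights, biases, and labels are all hand-picked stand-ins.

```python
# Toy sketch of tree-structured multiclass prediction: each internal node
# holds a tiny linear router, each leaf holds a class label, and predicting
# costs one classifier evaluation per level (O(log K)) instead of O(K).

class Node:
    def __init__(self, weight=0.0, bias=0.0, left=None, right=None, label=None):
        self.weight = weight   # linear router (internal nodes only)
        self.bias = bias
        self.left, self.right = left, right
        self.label = label     # class label (leaves only)

def tree_predict(node, x):
    """Follow a single root-to-leaf path, testing one router per level."""
    while node.label is None:
        node = node.left if x * node.weight + node.bias <= 0 else node.right
    return node.label

# Depth-2 tree over 4 classes: two router decisions reach a leaf.
tree = Node(weight=1.0, bias=0.0,
            left=Node(weight=1.0, bias=2.0,
                      left=Node(label="a"), right=Node(label="b")),
            right=Node(weight=1.0, bias=-2.0,
                       left=Node(label="c"), right=Node(label="d")))

print(tree_predict(tree, -3.0))  # a
print(tree_predict(tree, 1.0))   # c
print(tree_predict(tree, 5.0))   # d
```

With K classes, a balanced tree of this shape needs only about log2(K) router evaluations per test point, which is the source of the speedup over One-vs-Rest.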
R.: Analysis of a classification-based policy iteration algorithm
In: Proceedings of the 27th International Conference on Machine Learning, 2010
"... Abstract We present a classificationbased policy iteration algorithm, called Direct Policy Iteration, and provide its finitesample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the ..."
Abstract

Cited by 31 (9 self)
We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a trade-off between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
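The rollout-based improvement step this abstract describes can be illustrated with a minimal sketch. Everything below is hypothetical (the chain MDP, the helper names, the parameter choices), not the paper's algorithm or experiments: per sampled state, each action's return is estimated by rollouts under the current policy, and the resulting (state, greedy action) pairs become the training set for the classifier that defines the next policy.

```python
# Toy sketch of classification-based policy improvement: estimate each
# action's value by Monte-Carlo rollouts, then label each state with the
# empirically greedy action for a classifier to imitate.

def rollout_return(env_step, state, first_action, policy, horizon=10, gamma=0.9):
    """Estimate Q(state, first_action) by one rollout under `policy`."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        state, reward = env_step(state, action)
        total += discount * reward
        discount *= gamma
        action = policy(state)
    return total

def policy_improvement(env_step, states, actions, policy, n_rollouts=4):
    """Label each sampled state with its empirically greedy action."""
    pairs = []
    for s in states:
        q = {a: sum(rollout_return(env_step, s, a, policy)
                    for _ in range(n_rollouts)) / n_rollouts
             for a in actions}
        pairs.append((s, max(q, key=q.get)))
    return pairs

# Deterministic chain MDP: the action shifts the position, reward = -|position|.
def env_step(state, action):
    state += action
    return state, -abs(state)

greedy = dict(policy_improvement(env_step, states=[-2, 2], actions=(-1, +1),
                                 policy=lambda s: -1 if s > 0 else +1))
print(greedy)  # {-2: 1, 2: -1}: the greedy action moves toward the origin
```

In the full algorithm these labeled pairs would be fed to a classifier from the chosen policy space; the paper's bound relates the resulting error to the number of rollouts and the capacity of that space.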
Efficient Optimal Learning for Contextual Bandits
"... We address the problem of learning in an online setting where the learner repeatedly observes features x, selects among K actions, and receives reward r for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses an oracle which returns an optimal policy ..."
Abstract

Cited by 30 (3 self)
We address the problem of learning in an online setting where the learner repeatedly observes features x, selects among K actions, and receives reward r for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses an oracle which returns an optimal policy given rewards for all actions for each x. The algorithm has running time polylog(N), where N is the number of policies that we compete with. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.
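The oracle this abstract assumes can be sketched concretely. The code below is a hypothetical brute-force stand-in, not the paper's construction: given full reward vectors for every observed context, it returns the policy in a small finite class maximizing total reward, here by direct enumeration.

```python
from itertools import product

# Toy optimization oracle for a contextual bandit: the policy class is all
# mappings from the distinct contexts to one of n_actions actions, and we
# return the mapping with the highest total reward on the data.

def argmax_oracle(contexts, rewards, n_actions):
    """rewards[i][a] = reward of action a on contexts[i]."""
    best, best_val = None, float("-inf")
    distinct = sorted(set(contexts))
    for assignment in product(range(n_actions), repeat=len(distinct)):
        policy = dict(zip(distinct, assignment))
        val = sum(r[policy[c]] for c, r in zip(contexts, rewards))
        if val > best_val:
            best, best_val = policy, val
    return best

contexts = ["x1", "x2", "x1"]
rewards = [[1, 0], [0, 1], [1, 0]]  # K = 2 actions
print(argmax_oracle(contexts, rewards, 2))  # {'x1': 0, 'x2': 1}
```

The paper's point is that access to such an oracle lets the bandit algorithm run in time polylogarithmic in the number of policies, even though enumerating the policy class (as this toy does) would be exponential.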
Label Partitioning For Sublinear Ranking
"... We consider the case of ranking a very large set of labels, items, or documents, which is common to information retrieval, recommendation, and largescale annotation tasks. We present a general approach for converting an algorithm which has linear time in the size of the set to a sublinear one via l ..."
Abstract

Cited by 10 (2 self)
We consider the case of ranking a very large set of labels, items, or documents, which is common to information retrieval, recommendation, and large-scale annotation tasks. We present a general approach for converting an algorithm which has linear time in the size of the set to a sublinear one via label partitioning. Our method consists of learning an input partition and a label assignment to each partition of the space such that precision at k is optimized, which is the loss function of interest in this setting. Experiments on large-scale ranking and recommendation tasks show that our method not only makes the original linear time algorithm computationally tractable, but can also improve its performance.
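The general recipe in this abstract can be sketched as follows; the partitioner and scorer here are toy stand-ins, not the learned components from the paper. An input is mapped to one partition, and the original linear-time scorer runs only over that partition's labels, making ranking sublinear in the full label set.

```python
# Toy sketch of label partitioning for sublinear ranking: bucket the input,
# then score only the labels assigned to that bucket.

def partition_of(x, n_partitions):
    # hypothetical input partitioner: hash-style bucketing on a 1-D feature
    return int(x) % n_partitions

def rank_labels(x, score, label_assignment, n_partitions, k=2):
    """Return the top-k labels, scoring only one partition's labels."""
    candidates = label_assignment[partition_of(x, n_partitions)]
    return sorted(candidates, key=lambda y: score(x, y), reverse=True)[:k]

# Labels 0..9 split across two partitions; the scorer prefers labels near x.
assignment = {0: [0, 2, 4, 6, 8], 1: [1, 3, 5, 7, 9]}
score = lambda x, y: -abs(x - y)
print(rank_labels(4.0, score, assignment, n_partitions=2))  # [4, 2]
```

In the paper both the partitioner and the label assignment are learned to optimize precision at k; here they are fixed by hand purely to show the cost structure (scoring 5 labels instead of 10).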
Active Learning Ranking from Pairwise Preferences with Almost Optimal Query Complexity
"... Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be nontransitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measu ..."
Abstract

Cited by 10 (0 self)
Given a set V of n elements we wish to linearly order them using pairwise preference labels which may be non-transitive (due to irrationality or arbitrary noise). The goal is to linearly order the elements while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: the number of disagreements (loss) and the query complexity (number of pairwise preference labels). Our algorithm adaptively queries at most O(n · poly(log n, ε⁻¹)) preference labels for a regret of ε times the optimal loss. This is strictly better, and often significantly better, than what non-adaptive sampling could achieve. Our main result helps settle an open problem posed by learning-to-rank (from pairwise information) theoreticians and practitioners: what is a provably correct way to sample preference labels?
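The loss being minimized here is easy to make concrete; the snippet below is only an illustration of the objective, not of the adaptive sampling algorithm. It counts how many pairwise preference labels a candidate linear order disagrees with, and the non-transitive example shows why zero loss may be unattainable.

```python
# Toy illustration of the disagreement loss: given preference labels
# (u, v) meaning "u preferred over v", count pairs a linear order violates.

def disagreements(order, preferences):
    rank = {v: i for i, v in enumerate(order)}  # position in the order
    return sum(1 for u, v in preferences if rank[u] > rank[v])

prefs = {("a", "b"), ("b", "c"), ("c", "a")}  # a 3-cycle: non-transitive
print(disagreements(["a", "b", "c"], prefs))  # 1
```

Because the labels form a cycle, every ordering of these three elements disagrees with at least one label, so the optimal loss is 1; the paper's algorithm approximates this optimum while querying far fewer than all n² pairs.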
Multiclass learnability and the ERM principle
In COLT, volume 19 of JMLR Proceedings, 2011
"... Abstract We study the sample complexity of multiclass prediction in several learning settings. For the PAC setting our analysis reveals a surprising phenomenon: In sharp contrast to binary classification, we show that there exist multiclass hypothesis classes for which some Empirical Risk Minimizer ..."
Abstract

Cited by 10 (3 self)
We study the sample complexity of multiclass prediction in several learning settings. For the PAC setting our analysis reveals a surprising phenomenon: in sharp contrast to binary classification, we show that there exist multiclass hypothesis classes for which some Empirical Risk Minimizers (ERM learners) have lower sample complexity than others. Furthermore, there are classes that are learnable by some ERM learners, while other ERM learners will fail to learn them. We propose a principle for designing good ERM learners, and use this principle to prove tight bounds on the sample complexity of learning symmetric multiclass hypothesis classes, i.e., classes that are invariant under permutations of label names. We further provide a characterization of mistake and regret bounds for multiclass learning in the online setting and the bandit setting, using new generalizations of Littlestone's dimension.
Machine Learning Techniques—Reductions Between Prediction Quality Metrics
"... Abstract Machine learning involves optimizing a loss function on unlabeled data points given examples of labeled data points, where the loss function measures the performance of a learning algorithm. We give an overview of techniques, called reductions, for converting a problem of minimizing one los ..."
Abstract

Cited by 9 (0 self)
Machine learning involves optimizing a loss function on unlabeled data points given examples of labeled data points, where the loss function measures the performance of a learning algorithm. We give an overview of techniques, called reductions, for converting a problem of minimizing one loss function into a problem of minimizing another, simpler loss function. This tutorial discusses how to create robust reductions that perform well in practice. The reductions discussed here can be used to solve any supervised learning problem with a standard binary classification or regression algorithm available in any machine learning toolkit. We also discuss common design flaws in folklore reductions.
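One of the classic folklore reductions the tutorial covers, one-vs-all, can be sketched in a few lines. This is a generic illustration of the reduction pattern, not the tutorial's code; the toy binary learner is a hypothetical stand-in for any off-the-shelf classifier.

```python
# Toy one-vs-all reduction: a K-class problem becomes K binary problems
# ("is the label c?"), each solvable by any binary learner.

def one_vs_all_train(data, classes, binary_learner):
    """Train one binary scorer per class on relabeled data."""
    return {c: binary_learner([(x, 1 if y == c else 0) for x, y in data])
            for c in classes}

def one_vs_all_predict(models, x):
    """Predict the class whose binary scorer is most confident."""
    return max(models, key=lambda c: models[c](x))

# Stand-in binary learner: score = negative distance to the positives' mean.
def mean_scorer(examples):
    pos = [x for x, y in examples if y == 1]
    mu = sum(pos) / len(pos)
    return lambda x: -abs(x - mu)

data = [(0.1, "a"), (0.2, "a"), (5.0, "b"), (5.2, "b")]
models = one_vs_all_train(data, classes=("a", "b"), binary_learner=mean_scorer)
print(one_vs_all_predict(models, 0.0))  # a
print(one_vs_all_predict(models, 5.1))  # b
```

The tutorial's concern is exactly the quality of such constructions: how the binary learner's error translates into multiclass error, and which folklore variants have hidden flaws in that translation.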
A hierarchical classifier applied to multi-way sentiment detection
In COLING 2010: Proceedings of the 23rd International Conference on Computational Linguistics, 2010
"... This paper considers the problem of documentlevel multiway sentiment detection, proposing a hierarchical classifier algorithm that accounts for the interclass similarity of tagged sentimentbearing texts. This type of classifier also provides a natural mechanism for reducing the feature space of ..."
Abstract

Cited by 6 (2 self)
This paper considers the problem of document-level multi-way sentiment detection, proposing a hierarchical classifier algorithm that accounts for the inter-class similarity of tagged sentiment-bearing texts. This type of classifier also provides a natural mechanism for reducing the feature space of the problem. Our results show that this approach improves on state-of-the-art predictive performance for movie reviews with three-star and four-star ratings, while simultaneously reducing training times and memory requirements.
Logarithmic time online multiclass prediction. arXiv preprint arXiv:1406.1822, 2014
"... We study the problem of multiclass classification with an extremely large number of classes, with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop topdown tree construction approaches for constructing logarithmic depth trees. On the theoretical f ..."
Abstract

Cited by 3 (0 self)
We study the problem of multiclass classification with an extremely large number of classes, with the goal of obtaining train and test time complexity logarithmic in the number of classes. We develop top-down tree construction approaches for constructing logarithmic depth trees. On the theoretical front, we formulate a new objective function, which is optimized at each node of the tree and creates dynamic partitions of the data which are both pure (in terms of class labels) and balanced. We demonstrate that under favorable conditions, we can construct logarithmic depth trees that have leaves with low label entropy. However, the objective function at the nodes is challenging to optimize computationally. We address the empirical problem with a new online decision tree construction procedure. Experiments demonstrate that this online algorithm quickly achieves small error rates relative to more common O(k) approaches and simultaneously achieves significant improvement in test error compared to other logarithmic training time approaches.
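The two desiderata the node objective trades off, balance and purity, can be made concrete with a simplified scoring function. This is a hedged illustration only, not the paper's exact objective: a good split sends about half the examples left (balance) and routes each class almost entirely to one side (purity).

```python
from collections import defaultdict

# Toy scores for a candidate split at a tree node:
#   balance = 1 when exactly half the examples go left, 0 when all go one way
#   purity  = average over classes of the fraction on that class's majority side

def split_quality(examples, goes_left):
    left_count = sum(1 for x, _ in examples if goes_left(x))
    balance = 1.0 - abs(left_count / len(examples) - 0.5) * 2
    per_class = defaultdict(lambda: [0, 0])
    for x, y in examples:
        per_class[y][0 if goes_left(x) else 1] += 1
    purity = sum(max(l, r) / (l + r) for l, r in per_class.values()) / len(per_class)
    return balance, purity

# A perfect split: half the data goes left, and each class stays on one side.
data = [(0, "a"), (1, "a"), (8, "b"), (9, "b")]
balance, purity = split_quality(data, goes_left=lambda x: x < 5)
print(balance, purity)  # 1.0 1.0
```

Recursively choosing splits that score well on both criteria is what yields logarithmic-depth trees with low-entropy leaves; the paper's contribution is an objective and online procedure that make this tractable.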
Discriminative probabilistic prototype learning
"... In this paper we propose a simple yet powerful method for learning representations in supervised learning scenarios where an input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities. We represent an ..."
Abstract

Cited by 3 (0 self)
In this paper we propose a simple yet powerful method for learning representations in supervised learning scenarios where an input datapoint is described by a set of feature vectors and its associated output may be given by soft labels indicating, for example, class probabilities. We represent an input datapoint as a K-dimensional vector, where each component is a mixture of probabilities over its corresponding set of feature vectors. Each probability indicates how likely a feature vector is to belong to one-out-of-K unknown prototype patterns. We propose a probabilistic model that parameterizes these prototype patterns in terms of hidden variables and therefore it can be trained with conventional approaches based on likelihood maximization. More importantly, both the model parameters and the prototype patterns can be learned from data in a discriminative way. We show that our model can be seen as a probabilistic generalization of learning vector quantization (LVQ). We apply our method to the problems of shape classification, hyperspectral imaging classification and people's work class categorization, showing the superior performance of our method compared to the standard prototype-based classification approach and other competitive benchmarks.
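The baseline this paper generalizes, standard prototype-based (LVQ-style) classification, can be sketched in a few lines. This is the classical nearest-prototype decision rule only, not the paper's probabilistic model; the prototypes below are hand-picked toys.

```python
# Toy nearest-prototype classifier: each class keeps one prototype vector,
# and an input takes the label of the closest prototype (squared Euclidean).

def nearest_prototype(prototypes, x):
    return min(prototypes,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(prototypes[c], x)))

prototypes = {"circle": (0.0, 0.0), "square": (5.0, 5.0)}
print(nearest_prototype(prototypes, (0.5, -0.2)))  # circle
print(nearest_prototype(prototypes, (4.8, 5.1)))   # square
```

The paper replaces this hard nearest-prototype assignment with a probabilistic mixture over K latent prototypes, which is what allows discriminative training and soft labels.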