Results 1  10
of
259
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2000
"... We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a marginbased binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class ..."
Abstract

Cited by 561 (20 self)
 Add to MetaCart
(Show Context)
We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a marginbased binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with errorcorrecting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including supportvector machines, AdaBoost, regression, logistic regression and decisiontree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
"... Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he ..."
Abstract

Cited by 373 (28 self)
 Add to MetaCart
Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ̸ = F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to crossvalidation, and propose a novel form of crossvalidation known as randomfold crossvalidation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile
Putting objects in perspective
 In CVPR
, 2006
"... Image understanding requires not only individually estimating elements of the visual world but also capturing the interplay among them. In this paper, we provide a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface ..."
Abstract

Cited by 307 (14 self)
 Add to MetaCart
(Show Context)
Image understanding requires not only individually estimating elements of the visual world but also capturing the interplay among them. In this paper, we provide a framework for placing local object detection in the context of the overall 3D scene by modeling the interdependence of objects, surface orientations, and camera viewpoint. Most object detection methods consider all scales and locations in the image as equally likely. We show that with probabilistic estimates of 3D geometry, both in terms of surfaces and world coordinates, we can put objects into perspective and model the scale and location variance in the image. Our approach reflects the cyclical nature of the problem by allowing probabilistic object hypotheses to refine geometry and viceversa. Our framework allows painless substitution of almost any object detector and is easily extended to include other aspects of image understanding. Our results confirm the benefits of our integrated approach. 1.
Geometric context from a single image.
 In Proc. Int. Conf. on Computer Vision.
, 2005
"... ..."
(Show Context)
Boosting with the L_2Loss: Regression and Classification
, 2001
"... This paper investigates a variant of boosting, L 2 Boost, which is constructed from a functional gradient descent algorithm with the L 2 loss function. Based on an explicit stagewise re tting expression of L 2 Boost, the case of (symmetric) linear weak learners is studied in detail in both regressi ..."
Abstract

Cited by 208 (17 self)
 Add to MetaCart
This paper investigates a variant of boosting, L 2 Boost, which is constructed from a functional gradient descent algorithm with the L 2 loss function. Based on an explicit stagewise re tting expression of L 2 Boost, the case of (symmetric) linear weak learners is studied in detail in both regression and twoclass classification. In particular, with the boosting iteration m working as the smoothing or regularization parameter, a new exponential biasvariance trade off is found with the variance (complexity) term bounded as m tends to infinity. When the weak learner is a smoothing spline, an optimal rate of convergence result holds for both regression and twoclass classification. And this boosted smoothing spline adapts to higher order, unknown smoothness. Moreover, a simple expansion of the 01 loss function is derived to reveal the importance of the decision boundary, bias reduction, and impossibility of an additive biasvariance decomposition in classification. Finally, simulation and real data set results are obtained to demonstrate the attractiveness of L 2 Boost, particularly with a novel componentwise cubic smoothing spline as an effective and practical weak learner.
Automatic photo popup
 in ACM SIGGRAPH
, 2005
"... Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of ..."
Abstract

Cited by 194 (9 self)
 Add to MetaCart
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions
An introduction to boosting and leveraging
 Advanced Lectures on Machine Learning, LNCS
, 2003
"... ..."
(Show Context)
A Generalized Maximum Entropy Approach to Bregman Coclustering and Matrix Approximation
 In KDD
, 2004
"... Coclustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an informationtheoretic coclustering approach applicable to empirical joint probability distributions was proposed. In many situations, coclust ..."
Abstract

Cited by 135 (29 self)
 Add to MetaCart
(Show Context)
Coclustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an informationtheoretic coclustering approach applicable to empirical joint probability distributions was proposed. In many situations, coclustering of more general matrices is desired. In this paper, we present a substantially generalized coclustering framework wherein any Bregman divergence can be used in the objective function, and various conditional expectation based constraints can be considered based on the statistics that need to be preserved. Analysis of the coclustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle, and yields an elegant meta algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and coclustering algorithms based on alternate minimization.
LogLinear Models for Label Ranking
, 2003
"... Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferenc ..."
Abstract

Cited by 109 (5 self)
 Add to MetaCart
Label ranking is the task of inferring a total order over a predefined set of labels for each given instance. We present a general framework for batch learning of label ranking functions from supervised data. We assume that each instance in the training data is associated with a list of preferences over the labelset, however we do not assume that this list is either complete or consistent. This enables us to accommodate a variety of ranking problems. In contrast to the general form of the supervision, our goal is to learn a ranking function that induces a total order over the entire set of labels. Special cases of our setting are multilabel categorization and hierarchical classification. We present a general boostingbased learning algorithm for the label ranking problem and prove a lower bound on the progress of each boosting iteration. The applicability of our approach is demonstrated with a set of experiments on a largescale text corpus.
A comparison of numerical optimizers for logistic regression
, 2003
"... Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum aposteriori parameter estimate. A full derivation of each alg ..."
Abstract

Cited by 108 (0 self)
 Add to MetaCart
(Show Context)
Logistic regression is a workhorse of statistics and is closely related to methods used in Machine Learning, including the Perceptron and the Support Vector Machine. This note compares eight different algorithms for computing the maximum aposteriori parameter estimate. A full derivation of each algorithm is given. In particular, a new derivation of Iterative Scaling is given which applies more generally than the conventional one. A new derivation is also given for the Modified Iterative Scaling algorithm of Collins et al. (2002). Most of the algorithms operate in the primal space, but can also work in dual space. All algorithms are compared in terms of computational complexity by experiments on large data sets. The fastest algorithms turn out to be conjugate gradient ascent and quasiNewton algorithms, which far outstrip Iterative Scaling and its variants. 1