Results 1  10
of
483
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
, 2010
"... ..."
(Show Context)
On the algorithmic implementation of multiclass kernelbased vector machines
 Journal of Machine Learning Research
"... In this paper we describe the algorithmic implementation of multiclass kernelbased vector machines. Our starting point is a generalized notion of the margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic ob ..."
Abstract

Cited by 547 (14 self)
 Add to MetaCart
In this paper we describe the algorithmic implementation of multiclass kernelbased vector machines. Our starting point is a generalized notion of the margin to multiclass problems. Using this notion we cast multiclass categorization problems as a constrained optimization problem with a quadratic objective function. Unlike most of previous approaches which typically decompose a multiclass problem into multiple independent binary classification tasks, our notion of margin yields a direct method for training multiclass predictors. By using the dual of the optimization problem we are able to incorporate kernels with a compact set of constraints and decompose the dual problem into multiple optimization problems of reduced size. We describe an efficient fixedpoint algorithm for solving the reduced optimization problems and prove its convergence. We then discuss technical details that yield significant running time improvements for large datasets. Finally, we describe various experiments with our approach comparing it to previously studied kernelbased methods. Our experiments indicate that for multiclass problems we attain stateoftheart accuracy.
Sequential minimal optimization: A fast algorithm for training support vector machines
 Advances in Kernel MethodsSupport Vector Learning
, 1999
"... This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possi ..."
Abstract

Cited by 451 (3 self)
 Add to MetaCart
(Show Context)
This paper proposes a new algorithm for training support vector machines: Sequential Minimal Optimization, or SMO. Training a support vector machine requires the solution of a very large quadratic programming (QP) optimization problem. SMO breaks this large QP problem into a series of smallest possible QP problems. These small QP problems are solved analytically, which avoids using a timeconsuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems, while the standard chunking SVM algorithm scales somewhere between linear and cubic in the training set size. SMO’s computation time is dominated by SVM evaluation, hence SMO is fastest for linear SVMs and sparse data sets. On realworld sparse data sets, SMO can be more than 1000 times faster than the chunking algorithm. 1.
Clustering with Bregman Divergences
 JOURNAL OF MACHINE LEARNING RESEARCH
, 2005
"... A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergence ..."
Abstract

Cited by 441 (59 self)
 Add to MetaCart
(Show Context)
A wide variety of distortion functions are used for clustering, e.g., squared Euclidean distance, Mahalanobis distance and relative entropy. In this paper, we propose and analyze parametric hard and soft clustering algorithms based on a large class of distortion functions known as Bregman divergences. The proposed algorithms unify centroidbased parametric clustering approaches, such as classical kmeans and informationtheoretic clustering, which arise by special choices of the Bregman divergence. The algorithms maintain the simplicity and scalability of the classical kmeans algorithm, while generalizing the basic idea to a very large class of clustering loss functions. There are two main contributions in this paper. First, we pose the hard clustering problem in terms of minimizing the loss in Bregman information, a quantity motivated by ratedistortion theory, and present an algorithm to minimize this loss. Secondly, we show an explicit bijection between Bregman divergences and exponential families. The bijection enables the development of an alternative interpretation of an ecient EM scheme for learning models involving mixtures of exponential distributions. This leads to a simple soft clustering algorithm for all Bregman divergences.
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
"... Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he ..."
Abstract

Cited by 357 (27 self)
 Add to MetaCart
(Show Context)
Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ̸ = F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to crossvalidation, and propose a novel form of crossvalidation known as randomfold crossvalidation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile
Stochastic Approximation Approach to Stochastic Programming
"... In this paper we consider optimization problems where the objective function is given in a form of the expectation. A basic difficulty of solving such stochastic optimization problems is that the involved multidimensional integrals (expectations) cannot be computed with high accuracy. The aim of th ..."
Abstract

Cited by 266 (18 self)
 Add to MetaCart
(Show Context)
In this paper we consider optimization problems where the objective function is given in a form of the expectation. A basic difficulty of solving such stochastic optimization problems is that the involved multidimensional integrals (expectations) cannot be computed with high accuracy. The aim of this paper is to compare two computational approaches based on Monte Carlo sampling techniques, namely, the Stochastic Approximation (SA) and the Sample Average Approximation (SAA) methods. Both approaches, the SA and SAA methods, have a long history. Current opinion is that the SAA method can efficiently use a specific (say linear) structure of the considered problem, while the SA approach is a crude subgradient method which often performs poorly in practice. We intend to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems. We extend the analysis to the case of convexconcave stochastic saddle point problems, and present (in our opinion highly encouraging) results of numerical experiments.
Logistic Regression, AdaBoost and Bregman Distances
, 2000
"... We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt al ..."
Abstract

Cited by 261 (44 self)
 Add to MetaCart
We give a unified account of boosting and logistic regression in which each learning problem is cast in terms of optimization of Bregman distances. The striking similarity of the two problems in this framework allows us to design and analyze algorithms for both simultaneously, and to easily adapt algorithms designed for one problem to the other. For both problems, we give new algorithms and explain their potential advantages over existing methods. These algorithms can be divided into two types based on whether the parameters are iteratively updated sequentially (one at a time) or in parallel (all at once). We also describe a parameterized family of algorithms which interpolates smoothly between these two extremes. For all of the algorithms, we give convergence proofs using a general formalization of the auxiliaryfunction proof technique. As one of our sequentialupdate algorithms is equivalent to AdaBoost, this provides the first general proof of convergence for AdaBoost. We show that all of our algorithms generalize easily to the multiclass case, and we contrast the new algorithms with iterative scaling. We conclude with a few experimental results with synthetic data that highlight the behavior of the old and newly proposed algorithms in different settings.
An iterative regularization method for total variationbased image restoration, Multiscale Model
 Simul
"... Abstract. We introduce a new iterative regularization procedure for inverse problems based on the use of Bregman distances, with particular focus on problems arising in image processing. We are motivated by the problem of restoring noisy and blurry images via variational methods by using total varia ..."
Abstract

Cited by 195 (29 self)
 Add to MetaCart
(Show Context)
Abstract. We introduce a new iterative regularization procedure for inverse problems based on the use of Bregman distances, with particular focus on problems arising in image processing. We are motivated by the problem of restoring noisy and blurry images via variational methods by using total variation regularization. We obtain rigorous convergence results and effective stopping criteria for the general procedure. The numerical results for denoising appear to give significant improvement over standard models, and preliminary results for deblurring/denoising are very encouraging.
Fixed point and Bregman iterative methods for matrix rank minimization
 MATH. PROGRAM., SER. A
, 2008
"... ..."
TENSOR RANK AND THE ILLPOSEDNESS OF THE BEST LOWRANK APPROXIMATION PROBLEM
"... There has been continued interest in seeking a theorem describing optimal lowrank approximations to tensors of order 3 or higher, that parallels the Eckart–Young theorem for matrices. In this paper, we argue that the naive approach to this problem is doomed to failure because, unlike matrices, te ..."
Abstract

Cited by 193 (13 self)
 Add to MetaCart
There has been continued interest in seeking a theorem describing optimal lowrank approximations to tensors of order 3 or higher, that parallels the Eckart–Young theorem for matrices. In this paper, we argue that the naive approach to this problem is doomed to failure because, unlike matrices, tensors of order 3 or higher can fail to have best rankr approximations. The phenomenon is much more widespread than one might suspect: examples of this failure can be constructed over a wide range of dimensions, orders and ranks, regardless of the choice of norm (or even Brègman divergence). Moreover, we show that in many instances these counterexamples have positive volume: they cannot be regarded as isolated phenomena. In one extreme case, we exhibit a tensor space in which no rank3 tensor has an optimal rank2 approximation. The notable exceptions to this misbehavior are rank1 tensors and order2 tensors (i.e. matrices). In a more positive spirit, we propose a natural way of overcoming the illposedness of the lowrank approximation problem, by using weak solutions when true solutions do not exist. For this to work, it is necessary to characterize the set of weak solutions, and we do this in the case of rank 2, order 3 (in arbitrary dimensions). In our work we emphasize the importance of closely studying concrete lowdimensional examples as a first step towards more general results. To this end, we present a detailed analysis of equivalence classes of 2 × 2 × 2 tensors, and we develop methods for extending results upwards to higher orders and dimensions. Finally, we link our work to existing studies of tensors from an algebraic geometric point of view. The rank of a tensor can in theory be given a semialgebraic description; in other words, can be determined by a system of polynomial inequalities. We study some of these polynomials in cases of interest to us; in particular we make extensive use of the hyperdeterminant ∆ on R 2×2×2.