Results 1  10
of
93
A Dual Coordinate Descent Method for Largescale Linear SVM
"... In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such largescale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1 and L2loss functi ..."
Abstract

Cited by 199 (18 self)
 Add to MetaCart
(Show Context)
In many applications, data appear with a huge number of instances as well as features. Linear Support Vector Machines (SVM) is one of the most popular tools to deal with such largescale sparse data. This paper presents a novel dual coordinate descent method for linear SVM with L1 and L2loss functions. The proposed method is simple and reaches an ɛaccurate solution in O(log(1/ɛ)) iterations. Experiments indicate that our method is much faster than state of the art solvers such as Pegasos, TRON, SVM perf, and a recent primal coordinate descent implementation. 1.
Stochastic Dual Coordinate Ascent Methods
, 2013
"... Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it ..."
Abstract

Cited by 88 (11 self)
 Add to MetaCart
Stochastic Gradient Descent (SGD) has become popular for solving large scale supervised machine learning optimization problems such as SVM, due to their strong theoretical guarantees. While the closely related Dual Coordinate Ascent (DCA) method has been implemented in various software packages, it has so far lacked good convergence analysis. This paper presents a new analysis of Stochastic Dual Coordinate Ascent (SDCA) showing that this class of methods enjoy strong theoretical guarantees that are comparable or better than SGD. This analysis justifies the effectiveness of SDCA for practical applications.
Maxmargin hidden conditional random fields for human action recognition
, 2008
"... We present a new method for classification with structured latent variables. Our model is formulated using the maxmargin formalism in the discriminative learning literature. We propose an efficient learning algorithm based on the cutting plane method and decomposed dual optimization. We apply our m ..."
Abstract

Cited by 52 (9 self)
 Add to MetaCart
(Show Context)
We present a new method for classification with structured latent variables. Our model is formulated using the maxmargin formalism in the discriminative learning literature. We propose an efficient learning algorithm based on the cutting plane method and decomposed dual optimization. We apply our model to the problem of recognizing human actions from video sequences, where we model a human action as a global root template and a constellation of several “parts”. We show that our model outperforms another similar method that uses hidden conditional random fields, and is comparable to other stateoftheart approaches. More importantly, our proposed work is quite general and can potentially be applied in a wide variety of vision problems that involve various complex, interdependent latent structures. 1.
Online EM for unsupervised models
 In Proc. of NAACL
, 2009
"... The (batch) EM algorithm plays an important role in unsupervised induction, but it sometimes suffers from slow convergence. In this paper, we show that online variants (1) provide significant speedups and (2) can even find better solutions than those found by batch EM. We support these findings on f ..."
Abstract

Cited by 49 (2 self)
 Add to MetaCart
(Show Context)
The (batch) EM algorithm plays an important role in unsupervised induction, but it sometimes suffers from slow convergence. In this paper, we show that online variants (1) provide significant speedups and (2) can even find better solutions than those found by batch EM. We support these findings on four unsupervised tasks: partofspeech tagging, document classification, word segmentation, and word alignment. 1
Stochastic Gradient Descent Training for L1regularized Loglinear Models with Cumulative Penalty
"... Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1re ..."
Abstract

Cited by 39 (0 self)
 Add to MetaCart
(Show Context)
Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to cumulative values for L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and partofspeech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a stateoftheart quasiNewton method for L1regularized loglinear models. 1
Learning Efficiently with Approximate Inference via Dual Losses
"... Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cuttingplane, subgradient methods, perceptron) repeat ..."
Abstract

Cited by 37 (8 self)
 Add to MetaCart
Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cuttingplane, subgradient methods, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with stateoftheart methods such as stochastic subgradient and cuttingplane. 1.
A Sequential Dual Method for Large Scale MultiClass Linear SVMs
, 2008
"... Efficient training of direct multiclass formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse thro ..."
Abstract

Cited by 37 (8 self)
 Add to MetaCart
(Show Context)
Efficient training of direct multiclass formulations of linear Support Vector Machines is very useful in applications such as text classification with a huge number examples as well as features. This paper presents a fast dual method for this training. The main idea is to sequentially traverse through the training set and optimize the dual variables associated with one example at a time. The speed of training is enhanced further by shrinking and cooling heuristics. Experiments indicate that our method is much faster than state of the art solvers such as bundle, cutting plane and exponentiated gradient methods.
A primaldual messagepassing algorithm for approximated large scale structured prediction
 In Advances in Neural Information Processing Systems 23
, 2010
"... In this paper we propose an approximated structured prediction framework for large scale graphical models and derive messagepassing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the logpartition function, known as ..."
Abstract

Cited by 36 (18 self)
 Add to MetaCart
(Show Context)
In this paper we propose an approximated structured prediction framework for large scale graphical models and derive messagepassing algorithms for learning their parameters efficiently. We first relate CRFs and structured SVMs and show that in CRFs a variant of the logpartition function, known as the softmax, smoothly approximates the hinge loss function of structured SVMs. We then propose an intuitive approximation for the structured prediction problem, using duality, based on a local entropy approximation and derive an efficient messagepassing algorithm that is guaranteed to converge. Unlike existing approaches, this allows us to learn efficiently graphical models with cycles and very large number of parameters. 1
Minimizing Finite Sums with the Stochastic Average Gradient
, 2013
"... We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradie ..."
Abstract

Cited by 34 (2 self)
 Add to MetaCart
We propose the stochastic average gradient (SAG) method for optimizing the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method’s iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values the SAG method achieves a faster convergence rate than blackbox SG methods. The convergence rate is improved from O(1 / √ k) to O(1/k) in general, and when the sum is stronglyconvex the convergence rate is improved from the sublinear O(1/k) to a linear convergence rate of the form O(ρ k) for ρ < 1. Further, in many cases the convergence rate of the new method is also faster than blackbox deterministic gradient methods, in terms of the number of gradient evaluations. Numerical experiments indicate that the new algorithm often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of nonuniform sampling strategies. 1
Recent Advances of Largescale Linear Classification
"... Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much ..."
Abstract

Cited by 33 (6 self)
 Add to MetaCart
(Show Context)
Linear classification is a useful tool in machine learning and data mining. For some data in a rich dimensional space, the performance (i.e., testing accuracy) of linear classifiers has shown to be close to that of nonlinear classifiers such as kernel methods, but training and testing speed is much faster. Recently, many research works have developed efficient optimization methods to construct linear classifiers and applied them to some largescale applications. In this paper, we give a comprehensive survey on the recent development of this active research area.