Dual averaging methods for regularized stochastic learning and online optimization
- In Advances in Neural Information Processing Systems 23, 2009
Cited by 28 (3 self)
We consider regularized stochastic learning and online optimization problems, where the objective function is the sum of two convex terms: one is the loss function of the learning task, and the other is a simple regularization term such as ℓ1-norm for promoting sparsity. We develop extensions of Nesterov’s dual averaging method, that can exploit the regularization structure in an online setting. At each iteration of these methods, the learning variables are adjusted by solving a simple minimization problem that involves the running average of all past subgradients of the loss function and the whole regularization term, not just its subgradient. In the case of ℓ1-regularization, our method is particularly effective in obtaining sparse solutions. We show that these methods achieve the optimal convergence rates or regret bounds that are standard in the literature on stochastic and online convex optimization. For stochastic learning problems in which the loss functions have Lipschitz continuous gradients, we also present an accelerated version of the dual averaging method.
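The ℓ1 case of the update described above can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: it assumes the auxiliary strongly convex term is (1/2)||w||² weighted by gamma/sqrt(t), and the function names and the parameters `lam` and `gamma` are hypothetical choices made for this example.

```python
import math


def update_average(g_bar, g, t):
    """Running average of subgradients after seeing the t-th one (t starts at 1)."""
    return [gb + (gi - gb) / t for gb, gi in zip(g_bar, g)]


def rda_l1_step(g_bar, t, lam, gamma):
    """Minimize <g_bar, w> + lam*||w||_1 + (gamma/sqrt(t)) * 0.5*||w||^2 in closed form.

    Assumption: the quadratic auxiliary term and its gamma/sqrt(t) weight are
    illustrative choices, not details stated in the abstract.
    """
    scale = math.sqrt(t) / gamma
    w = []
    for g in g_bar:
        if abs(g) <= lam:
            # The averaged subgradient is too small to overcome the penalty.
            w.append(0.0)
        else:
            w.append(-scale * (g - lam * math.copysign(1.0, g)))
    return w


g_bar = update_average([0.0, 0.0], [2.0, 4.0], 1)
g_bar = update_average(g_bar, [0.0, 0.0], 2)
w = rda_l1_step([0.5, -2.0, 0.05], t=4, lam=0.1, gamma=1.0)
```

Because the threshold `lam` is compared against the averaged subgradient rather than a single noisy one, a coordinate is set exactly to zero whenever its average stays small, which is one way to read the sparsity claim in the abstract.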
Efficient Online and Batch Learning using Forward Backward Splitting
Cited by 20 (1 self)
We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural datasets.
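The two phases described above can be sketched for ℓ1 regularization as a gradient step followed by soft-thresholding. This is a hedged illustration: it assumes the same step size `eta` is used in both phases, and the names `fobos_l1_step`, `eta`, and `lam` are hypothetical.

```python
def soft_threshold(v, tau):
    """Shrink v toward zero by tau, returning exactly 0.0 inside [-tau, tau]."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def fobos_l1_step(w, grad, eta, lam):
    """One iteration: unconstrained gradient step, then the l1 proximal step.

    Assumption: both phases share the step size eta.
    """
    # Phase 1: plain gradient descent step on the loss.
    half = [wi - eta * gi for wi, gi in zip(w, grad)]
    # Phase 2: stay close to `half` while paying eta * lam * ||w||_1.
    return [soft_threshold(h, eta * lam) for h in half]


w_next = fobos_l1_step([1.0, -0.2, 0.05], [0.5, 0.0, 0.0], eta=0.1, lam=1.0)
```

The second phase has a closed form per coordinate, so small weights are clipped to exactly zero instead of merely oscillating around it.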
Predicting bounce rates in sponsored search advertisements
- In SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2009
Cited by 14 (2 self)
This paper explores an important and relatively unstudied quality measure of a sponsored search advertisement: bounce rate. The bounce rate of an ad can be informally defined as the fraction of users who click on the ad but almost immediately move on to other tasks. A high bounce rate can lead to poor advertiser return on investment, and suggests search engine users may be having a poor experience following the click. In this paper, we first provide quantitative analysis showing that bounce rate is an effective measure of user satisfaction. We then address the question, can we predict bounce rate by analyzing the features of the advertisement? An affirmative answer would allow advertisers and search engines to predict the effectiveness and quality of advertisements before they are shown. We propose solutions to this problem involving large-scale learning methods that leverage features drawn from ad creatives in addition
Stochastic Methods for ℓ1 Regularized Loss Minimization Shai Shalev-Shwartz
Cited by 10 (2 self)
We describe and analyze two stochastic methods for ℓ1 regularized loss minimization problems, such as the Lasso. The first method updates the weight of a single feature at each iteration while the second method updates the entire weight vector but only uses a single training example at each iteration. In both methods, the feature or example is chosen uniformly at random. Our theoretical runtime analysis suggests that the stochastic methods should outperform state-of-the-art deterministic approaches, including their deterministic counterparts, when the size of the problem is large. We demonstrate the advantage of stochastic methods by experimenting with synthetic and natural data sets.
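The first method (one randomly chosen feature per iteration) can be illustrated on the Lasso. The sketch below is a simplification, not necessarily the paper's update rule: it assumes squared loss with objective (1/2m)·Σ(x_i·w − y_i)² + lam·||w||_1 and performs exact minimization over the sampled coordinate. The function name `scd_lasso` and its parameters are hypothetical.

```python
import random


def soft_threshold(v, tau):
    """Shrink v toward zero by tau, returning exactly 0.0 inside [-tau, tau]."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def scd_lasso(X, y, lam, n_iters, seed=0):
    """Stochastic coordinate descent for the Lasso with squared loss.

    Assumption: each step exactly minimizes the objective over one
    coordinate drawn uniformly at random.
    """
    rng = random.Random(seed)
    m, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw, and w starts at zero
    for _ in range(n_iters):
        j = rng.randrange(d)
        a = sum(X[i][j] ** 2 for i in range(m)) / m
        if a == 0.0:
            continue  # feature j is identically zero
        # Correlation of feature j with the residual that ignores w[j].
        rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(m)) / m
        new_wj = soft_threshold(rho, lam) / a
        delta = new_wj - w[j]
        if delta != 0.0:
            for i in range(m):
                residual[i] -= X[i][j] * delta
            w[j] = new_wj
    return w


w = scd_lasso([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.1], lam=0.1, n_iters=50)
```

Keeping the residual vector up to date makes each iteration cost O(m) for a dense feature column, independent of the number of features.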
Efficient learning using forward-backward splitting
- In Advances in Neural Information Processing Systems 23, 2009
Cited by 9 (0 self)
We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.
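For the ℓ2 and ℓ2² regularizers mentioned above, the second phase also has a closed form. The sketch below assumes the regularizers are lam·||w||_2 and (lam/2)·||w||_2² respectively, and that `w_half` is the result of the gradient step; these conventions and the function names are assumptions made for illustration.

```python
import math


def fobos_l2_step(w_half, eta, lam):
    """Second phase for r(w) = lam * ||w||_2: shrink the whole vector's length."""
    norm = math.sqrt(sum(v * v for v in w_half))
    tau = eta * lam
    if norm <= tau:
        # The entire vector is zeroed at once.
        return [0.0] * len(w_half)
    shrink = 1.0 - tau / norm
    return [shrink * v for v in w_half]


def fobos_l2sq_step(w_half, eta, lam):
    """Second phase for r(w) = (lam / 2) * ||w||_2^2: uniform rescaling."""
    return [v / (1.0 + eta * lam) for v in w_half]


a = fobos_l2_step([3.0, 4.0], eta=0.5, lam=2.0)
b = fobos_l2_step([0.3, 0.4], eta=0.5, lam=2.0)
c = fobos_l2sq_step([3.0, 4.0], eta=0.5, lam=2.0)
```

Under these assumptions the ℓ2 step can zero a whole vector, while the ℓ2² step only scales it and never produces exact zeros.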
Composite Objective Mirror Descent
Cited by 8 (3 self)
We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and trace-norm.
Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
Cited by 8 (0 self)
Stochastic gradient descent (SGD) uses approximate gradients estimated from subsets of the training data and updates the parameters in an online fashion. This learning framework is attractive because it often requires much less training time in practice than batch training algorithms. However, L1-regularization, which is becoming popular in natural language processing because of its ability to produce compact models, cannot be efficiently applied in SGD training, due to the large dimensions of feature vectors and the fluctuations of approximate gradients. We present a simple method to solve these problems by penalizing the weights according to the cumulative value of the L1 penalty. We evaluate the effectiveness of our method in three applications: text chunking, named entity recognition, and part-of-speech tagging. Experimental results demonstrate that our method can produce compact and accurate models much more quickly than a state-of-the-art quasi-Newton method for L1-regularized log-linear models.
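One way to read "penalizing the weights according to cumulative values" is sketched below. It is an interpretation rather than the paper's exact procedure: `u` tracks the total penalty any weight could have received so far, `q[i]` tracks the penalty weight `i` has actually received, and only weights touched by the current sparse gradient are brought up to date. All names, and the sparse-dict gradient format, are hypothetical.

```python
def apply_cumulative_penalty(w, q, u, active):
    """Apply the outstanding l1 penalty to the active weights, clipping at zero."""
    for i in active:
        z = w[i]
        if w[i] > 0:
            w[i] = max(0.0, w[i] - (u + q[i]))
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + (u - q[i]))
        # Record how much penalty weight i actually received (signed).
        q[i] += w[i] - z


def sgd_step(w, q, u, grad, eta, penalty_rate):
    """One SGD step on a sparse gradient {index: value}; returns the updated u.

    Assumption: penalty_rate is the per-step l1 strength, so u grows by
    eta * penalty_rate on every step.
    """
    u += eta * penalty_rate
    for i, g in grad.items():
        w[i] -= eta * g
    apply_cumulative_penalty(w, q, u, grad.keys())
    return u


w, q, u = [0.0, 0.0], [0.0, 0.0], 0.0
u = sgd_step(w, q, u, {0: -1.0}, eta=0.5, penalty_rate=0.2)
u = sgd_step(w, q, u, {1: 1.0}, eta=0.5, penalty_rate=0.2)
u = sgd_step(w, q, u, {0: 0.0}, eta=0.5, penalty_rate=0.2)
```

In the third step weight 0 catches up on the penalty it skipped while inactive, so the per-step cost depends on the number of active features rather than on the full dimension.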
Learning to Rank with (a Lot of) Word Features
, 2009
Cited by 2 (0 self)
In this article we present Supervised Semantic Indexing (SSI), which defines a class of nonlinear (quadratic) models that are discriminatively trained to directly map from the word content in a query-document or document-document pair to a ranking score. Like Latent Semantic Indexing (LSI), our models take account of correlations between words (synonymy, polysemy). However, unlike LSI, our models are trained from a supervised signal directly on the ranking task of interest, which we argue is the reason for our superior results. As the query and target texts are modeled separately, our approach is easily generalized to different retrieval tasks, such as cross-language retrieval or online advertising placement. Dealing with models over all pairs of word features is computationally challenging. We propose several improvements to our basic model for addressing this issue, including low-rank (but diagonal-preserving) representations, correlated feature hashing (CFH), and sparsification. We provide an empirical study of all these methods on retrieval tasks based on Wikipedia documents as well as an Internet advertisement task. We obtain state-of-the-art performance while providing realistically scalable methods.
Online Learning for Group Lasso
Cited by 2 (0 self)
We develop a novel online learning algorithm for the group lasso in order to efficiently find the important explanatory factors in a grouped manner. Unlike traditional batch-mode group lasso algorithms, which suffer from inefficiency and poor scalability, our proposed algorithm operates in an online mode and scales well: at each iteration one can update the weight vector according to a closed-form solution based on the average of previous subgradients. Therefore, the proposed online algorithm can be very efficient and scalable. This is guaranteed by its low worst-case time complexity and memory cost, both on the order of O(d), where d is the number of dimensions. Moreover, in order to achieve more sparsity at both the group level and the individual feature level, we successively extend our online system to efficiently solve a number of variants of sparse group lasso models. We also show that the online system is applicable to other group lasso models, such as the group lasso with overlap and graph lasso. Finally, we demonstrate the merits of our algorithm by experimenting with both synthetic and real-world datasets.
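A closed-form update of the kind described above can be sketched for non-overlapping groups. The sketch assumes the per-iteration problem is to minimize <u_bar, w> + lam·Σ_g sqrt(d_g)·||w_g||_2 + (gamma/sqrt(t))·(1/2)||w||², where u_bar is the average of past subgradients and d_g is the size of group g. The objective's exact form, the function name, and the parameters are assumptions for illustration.

```python
import math


def group_da_step(u_bar, groups, t, lam, gamma):
    """Closed-form group-wise update from the averaged subgradient u_bar.

    Assumptions: groups are disjoint lists of indices, each group's penalty
    is weighted by sqrt(group size), and the quadratic term has weight
    gamma / sqrt(t).
    """
    scale = math.sqrt(t) / gamma
    w = [0.0] * len(u_bar)
    for idx in groups:
        norm = math.sqrt(sum(u_bar[i] ** 2 for i in idx))
        threshold = lam * math.sqrt(len(idx))
        if norm <= threshold:
            continue  # the whole group stays at zero
        shrink = 1.0 - threshold / norm
        for i in idx:
            w[i] = -scale * shrink * u_bar[i]
    return w


w = group_da_step([3.0, 4.0, 0.1, 0.1], [[0, 1], [2, 3]], t=4, lam=1.0, gamma=2.0)
```

Each coordinate is read and written a constant number of times, which is consistent with the O(d) time and memory cost stated in the abstract.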
Convergence rates of inexact proximal-gradient methods for convex optimization. arXiv:1109.2415v2, 2011
Cited by 2 (1 self)
We consider the problem of optimizing the sum of a smooth convex function and a non-smooth convex function using proximal-gradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the non-smooth term. We show that both the basic proximal-gradient method and the accelerated proximal-gradient method achieve the same convergence rate as in the error-free case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.
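The setting above can be illustrated with a basic (non-accelerated) proximal-gradient loop in which the gradient is perturbed by an error that shrinks over time. This is a toy sketch: the error schedule 1/k², the test problem, and all names are assumptions chosen for illustration, and the proximity operator here is computed exactly.

```python
def soft_threshold(v, tau):
    """Shrink v toward zero by tau, returning exactly 0.0 inside [-tau, tau]."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def inexact_proximal_gradient(grad, prox, x0, step, n_iters, grad_error):
    """Proximal-gradient iterations with an additive error in the gradient."""
    x = list(x0)
    for k in range(1, n_iters + 1):
        g = grad(x)
        e = grad_error(k)  # error injected at iteration k
        z = [xi - step * (gi + ei) for xi, gi, ei in zip(x, g, e)]
        x = prox(z, step)
    return x


# Toy problem: minimize 0.5 * ||x - b||^2 + lam * ||x||_1,
# whose exact solution is soft_threshold(b_i, lam) per coordinate.
b = [3.0, -0.5]
lam = 1.0
x = inexact_proximal_gradient(
    grad=lambda x: [xi - bi for xi, bi in zip(x, b)],
    prox=lambda z, step: [soft_threshold(zi, step * lam) for zi in z],
    x0=[0.0, 0.0],
    step=0.5,
    n_iters=200,
    grad_error=lambda k: [1.0 / k ** 2] * len(b),
)
```

With the decaying error the iterates approach the error-free solution [2.0, 0.0]; replacing `grad_error` with a constant instead leaves a persistent bias in the nonzero coordinate.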

