Results 1–10 of 11
An Inexact Successive Quadratic Approximation Method for Convex L1 Regularized Optimization, arXiv preprint arXiv:1309.3529, 2013
"... Abstract We study a Newtonlike method for the minimization of an objective function φ that is the sum of a smooth convex function and an 1 regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadr ..."
Abstract

Cited by 10 (0 self)
Abstract We study a Newton-like method for the minimization of an objective function φ that is the sum of a smooth convex function and an ℓ1 regularization term. This method, which is sometimes referred to in the literature as a proximal Newton method, computes a step by minimizing a piecewise quadratic model q_k of the objective function φ. In order to make this approach efficient in practice, it is imperative to perform this inner minimization inexactly. In this paper, we give inexactness conditions that guarantee global convergence and that can be used to control the local rate of convergence of the iteration. Our inexactness conditions are based on a semismooth function that represents a (continuous) measure of the optimality conditions of the problem, and that embodies the soft-thresholding iteration. We give careful consideration to the algorithm employed for the inner minimization, and report numerical results on two test sets originating in machine learning.
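The semismooth optimality measure described in this abstract can be built from the soft-thresholding operator. A minimal numpy sketch of the idea (function names are ours, not the paper's):

```python
import numpy as np

def soft_threshold(x, t):
    # Componentwise soft-thresholding: the proximal operator of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def optimality_residual(x, grad, lam):
    # Continuous measure of optimality for min f(x) + lam*||x||_1:
    # it vanishes exactly at a minimizer, since a minimizer is a fixed
    # point of the soft-thresholding (ISTA) map with unit step length
    return x - soft_threshold(x - grad, lam)
```

For example, for f(x) = ½‖x − b‖² the minimizer is soft_threshold(b, λ), and the residual is zero there.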
BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
"... The `1regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under highdimensional settings. However, it requires solving a difficult nonsmooth logdeterminant program with number of param ..."
Abstract

Cited by 7 (2 self)
The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program whose number of parameters scales quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIG&QUIC, which can solve 1-million-dimensional ℓ1-regularized Gaussian MLE problems (which thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations, and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIG&QUIC can achieve super-linear or even quadratic convergence rates.
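The objective BIG&QUIC minimizes is the ℓ1-penalized Gaussian negative log-likelihood over precision matrices. A small numpy sketch of just the objective evaluation (our illustration; here all entries are penalized, though conventions on the diagonal vary):

```python
import numpy as np

def sparse_inverse_cov_objective(X, S, lam):
    # f(X) = -log det(X) + trace(S X) + lam * sum_ij |X_ij|, the
    # l1-regularized Gaussian MLE objective over precision matrices X,
    # with S the sample covariance matrix
    try:
        L = np.linalg.cholesky(X)  # also certifies X is positive definite
    except np.linalg.LinAlgError:
        return np.inf              # outside the positive definite cone
    logdet = 2.0 * np.log(np.diag(L)).sum()
    return -logdet + np.trace(S @ X) + lam * np.abs(X).sum()
```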
Communication-efficient distributed optimization of self-concordant empirical loss, arXiv preprint arXiv:1501.00263, 2015
"... Abstract We consider distributed convex optimization problems originated from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, c ..."
Abstract

Cited by 4 (1 self)
Abstract We consider distributed convex optimization problems originating from sample average approximation of stochastic optimization, or empirical risk minimization in machine learning. We assume that each machine in the distributed computing system has access to a local empirical loss function, constructed with i.i.d. data sampled from a common distribution. We propose a communication-efficient distributed algorithm to minimize the overall empirical loss, which is the average of the local empirical losses. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for distributed ridge regression, logistic regression, and binary classification with a smoothed hinge loss. In a standard setting for supervised learning, the required number of communication rounds of the algorithm does not increase with the sample size, and only grows slowly with the number of machines.
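The core step pairs conjugate gradients, which needs only Hessian-vector products (the quantities the machines can aggregate), with the damping used for self-concordant functions. A single-machine numpy sketch of that step, without the paper's preconditioner:

```python
import numpy as np

def conjugate_gradient(Hmv, g, tol=1e-10, max_iter=100):
    # Solve H d = g using only Hessian-vector products Hmv; in the
    # distributed setting each machine applies Hmv to its local data
    d = np.zeros_like(g)
    r = g.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = Hmv(p)
        alpha = rs / (p @ Hp)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

def damped_newton_step(x, grad, Hmv):
    # Inexact damped Newton step: delta estimates the Newton decrement,
    # and the factor 1/(1 + delta) is the standard damping for
    # self-concordant objectives
    d = conjugate_gradient(Hmv, grad)
    delta = np.sqrt(d @ Hmv(d))
    return x - d / (1.0 + delta)
```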
Proximal Quasi-Newton for Computationally Intensive ℓ1-regularized M-estimators
"... We consider the class of optimization problems arising from computationally intensive `1regularized Mestimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the `1regularized MLE for learning Conditional Random Fields (CRFs), which ar ..."
Abstract

Cited by 1 (0 self)
We consider the class of optimization problems arising from computationally intensive ℓ1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the ℓ1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labeling, alignment, and classification with label taxonomy. ℓ1-regularized MLEs for CRFs are particularly expensive to optimize since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive M-estimation problems, where we employ an aggressive active-set selection technique. In a key contribution of the paper, we show that the proximal quasi-Newton method is provably super-linearly convergent, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than the current state-of-the-art on the problems of sequence labeling and hierarchical classification.
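The active-set idea restricts each inner solve to coordinates that can actually move. A numpy sketch of one common selection rule (our illustration, not necessarily the paper's exact rule):

```python
import numpy as np

def active_set(w, grad, lam):
    # For min l(w) + lam*||w||_1, a zero coordinate can stay fixed when
    # |grad_i| <= lam (its subgradient optimality condition already
    # holds); the inner quasi-Newton solve is restricted to the rest
    return np.where((w != 0) | (np.abs(grad) > lam))[0]
```

Shrinking the inner problem this way is what makes each expensive gradient evaluation go further.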
IMRO: a proximal quasi-Newton method for solving the ℓ1-regularized least squares problem, 2014
"... Abstract. We present a proximal quasiNewton method in which the approximation of the Hessian has the special format of “identity minus rank one ” (IMRO) in each iteration. The proposed structure enables us to effectively recover the proximal point. The algorithm is applied to l1regularized least ..."
Abstract

Cited by 1 (0 self)
Abstract. We present a proximal quasi-Newton method in which the approximation of the Hessian has the special format of “identity minus rank one” (IMRO) in each iteration. The proposed structure enables us to effectively recover the proximal point. The algorithm is applied to the ℓ1-regularized least squares problem arising in many applications, including sparse recovery in compressive sensing, machine learning, and statistics. Our numerical experiments suggest that the proposed technique competes favourably with other state-of-the-art solvers for this class of problems. We also provide a complexity analysis for variants of IMRO, showing that it matches the best known bounds.
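With a model Hessian B = aI − uuᵀ (positive definite when ‖u‖² < a), the proximal subproblem reduces to a one-dimensional root-finding problem, which is what makes the proximal point cheap to recover. A numpy sketch of one way to do this, via bisection (our derivation; the paper's exact recovery procedure may differ):

```python
import numpy as np

def soft_threshold(x, t):
    # Componentwise prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def imro_prox(z, u, a, lam, lo=-1e6, hi=1e6, iters=200):
    # Solve min_y 0.5*(y - z)^T B (y - z) + lam*||y||_1 for the
    # "identity minus rank one" model B = a*I - u u^T (assumed
    # positive definite, i.e. ||u||^2 < a). Writing t = u^T (y - z),
    # optimality gives y(t) = soft_threshold(z + (t/a)*u, lam/a),
    # and t solves the strictly decreasing scalar equation
    #     phi(t) = u^T (y(t) - z) - t = 0,
    # which we locate by bisection.
    def phi(t):
        y = soft_threshold(z + (t / a) * u, lam / a)
        return u @ (y - z) - t

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) > 0:
            lo = mid   # phi is decreasing, so the root lies to the right
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return soft_threshold(z + (t / a) * u, lam / a)
```

When u = 0 this collapses to plain soft-thresholding, and the returned point satisfies the subgradient optimality condition B(y − z) + λ∂‖y‖₁ ∋ 0.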
Appendix: Proximal Quasi-Newton for Computationally Intensive ℓ1-regularized
"... We rewrite the objective function here, min w f(w): = λ‖w‖1 + `(w), (1) Definition 1 (Constant Nullspace Strong Convexity). An composite function (1) is said to have Constant Nullspace Strong Convexity (CNSC) restricted to space T (CNSCT) iff there is a constant vector space T s.t. `(w) depends onl ..."
Abstract
 Add to MetaCart
(Show Context)
We rewrite the objective function here:

min_w f(w) := λ‖w‖₁ + ℓ(w).  (1)

Definition 1 (Constant Nullspace Strong Convexity). A composite function (1) is said to have Constant Nullspace Strong Convexity restricted to a space T (CNSC-T) iff there is a constant vector space T such that ℓ(w) depends only on z = proj_T(w), i.e. ℓ(w) = ℓ(z), and its Hessian satisfies

m‖v‖² ≤ vᵀH(w)v ≤ M‖v‖², ∀v ∈ T, ∀w ∈ ℝᵈ  (2)

for some M ≥ m > 0, and

H(w)v = 0, ∀v ∈ T⊥, ∀w ∈ ℝᵈ,  (3)

where T⊥ is the complementary space orthogonal to T.

To exploit the CNSC-T property, we first rebuild our problem and algorithm on the reduced space Z = {z ∈ ℝ^d̂ : z = Uᵀw}, where the strong-convexity property holds. Then we prove the asymptotic super-linear convergence on Z under the condition that the inner problem is solved exactly and no shrinking strategy is applied. Finally, we prove that the objective (1) is bounded by the difference between the current iterate and the optimal solution. In Section 1.5, we provide the global convergence proof for when the shrinking strategy is applied.

1.1 Representing the problem in a reduced and compact space

Properties of the CNSC-T condition. For ℓ(w) satisfying the CNSC-T condition, we have ℓ(w) = ℓ(proj_T(w)). Define g to be the gradient of ℓ(w) and H to be the Hessian of ℓ(w). As both g and H lie in the space T, we have g(w) = UUᵀg(proj_T(w)) = g(proj_T(w)) and H(w) = UU
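The CNSC-T condition can be checked concretely for the least-squares loss ℓ(w) = ½‖Aw − b‖²: its Hessian AᵀA vanishes on the null space of A, and ℓ depends on w only through its projection onto T, the row space of A. A small numpy check (our illustration):

```python
import numpy as np

# CNSC-T check for l(w) = 0.5*||A w - b||^2 with rank-deficient A:
# the Hessian H = A^T A kills T^perp, and l sees only proj_T(w)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))     # d = 5 > n = 3, so H is rank-deficient
H = A.T @ A
# Orthonormal basis U for T = row space of A (right singular vectors)
U = np.linalg.svd(A, full_matrices=False)[2].T   # shape (5, 3)
v_perp = rng.standard_normal(5)
v_perp -= U @ (U.T @ v_perp)        # project onto T^perp
assert np.allclose(H @ v_perp, 0.0)              # property (3)
w = rng.standard_normal(5)
assert np.allclose(A @ w, A @ (U @ (U.T @ w)))   # l(w) = l(proj_T(w))
```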
unknown title
"... Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor. 17.1 Recap: Proximal Gradient Descent Proximal gradient descent operates on problems of the form min x g(x) + ..."
Abstract
 Add to MetaCart
(Show Context)
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

17.1 Recap: Proximal Gradient Descent

Proximal gradient descent operates on problems of the form

min_x g(x) + h(x),

where g is convex and smooth, and h is convex and “simple” (its proximal operator is explicitly calculable). Choose an initial x⁽⁰⁾ and repeat for k = 1, 2, 3, ...:

x⁽ᵏ⁾ = prox_{t_k}(x⁽ᵏ⁻¹⁾ − t_k ∇g(x⁽ᵏ⁻¹⁾)), where prox_t(x) = argmin_z (1/(2t))‖x − z‖² + h(z).
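For the lasso, h = λ‖·‖₁ and the prox is soft-thresholding, so the recap above becomes the classic ISTA iteration. A minimal numpy sketch with a fixed step size 1/L:

```python
import numpy as np

def soft_threshold(x, t):
    # prox of t*||.||_1, applied componentwise
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, steps=500):
    # min_x 0.5*||A x - b||^2 + lam*||x||_1 with constant step 1/L,
    # where L = largest eigenvalue of A^T A (Lipschitz constant of grad g)
    L = np.linalg.eigvalsh(A.T @ A).max()
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x
```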
ROBUST BLOCK COORDINATE DESCENT, 2014
"... Abstract. In this paper we present a novel randomized block coordinate descent method for the minimization of a convex composite objective function. The method uses (approximate) partial secondorder (curvature) information, so that the algorithm performance is more robust when applied to highly non ..."
Abstract
 Add to MetaCart
(Show Context)
Abstract. In this paper we present a novel randomized block coordinate descent method for the minimization of a convex composite objective function. The method uses (approximate) partial second-order (curvature) information, so that the algorithm's performance is more robust when applied to highly non-separable or ill-conditioned problems. We call the method Robust Coordinate Descent (RCD). At each iteration of RCD, a block of coordinates is sampled randomly, a quadratic model is formed about that block, and the model is minimized approximately/inexactly to determine the search direction. An inexpensive line search is then employed to ensure a monotonic decrease in the objective function and acceptance of large step sizes. We prove global convergence of the RCD algorithm, and we also present several results on the local convergence of RCD for strongly convex functions. Finally, we present numerical results on large-scale problems to demonstrate the practical performance of the method.
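On a quadratic f(x) = ½xᵀAx − bᵀx the block model is exact and each block step has a closed form, which makes the scheme easy to sketch; the method described above additionally uses approximate curvature and a line search. A numpy sketch of the sampled-block iteration (our simplification):

```python
import numpy as np

def rcd_quadratic(A, b, block_size=2, iters=200, seed=0):
    # Randomized block coordinate descent on f(x) = 0.5 x^T A x - b^T x:
    # sample a block of coordinates, minimize the quadratic model over
    # that block with the others held fixed (closed form here), and step
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        blk = rng.choice(n, size=block_size, replace=False)
        grad_blk = A[blk] @ x - b[blk]                 # block gradient
        d = np.linalg.solve(A[np.ix_(blk, blk)], -grad_blk)
        x[blk] += d                                    # exact block minimizer
    return x
```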
Fast Convergence of Proximal Methods under High-Dimensional Settings
"... State of the art statistical estimators for highdimensional problems take the form of regularized, and hence nonsmooth, convex programs. A key facet of these statistical estimation problems is that these are typically not strongly convex under a highdimensional sampling regime when the Hessian m ..."
Abstract
 Add to MetaCart
(Show Context)
State-of-the-art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. A key facet of these statistical estimation problems is that they are typically not strongly convex under a high-dimensional sampling regime, when the Hessian matrix becomes rank-deficient. Under vanilla convexity, however, proximal optimization methods attain only a sublinear rate. In this paper, we investigate a novel variant of strong convexity, which we call Constant Nullspace Strong Convexity (CNSC), where we require that the objective function be strongly convex only over a constant subspace. As we show, the CNSC condition is naturally satisfied by high-dimensional statistical estimators. We then analyze the behavior of proximal methods under this CNSC condition: we show global linear convergence of Proximal Gradient and local quadratic convergence of Proximal Newton Method, when the regularization function comprising the statistical estimator is decomposable. We corroborate our theory via numerical experiments, and show a qualitative difference in the convergence rates of the proximal algorithms when the loss function does satisfy the CNSC condition.