Results 1 – 8 of 8
BIG & QUIC: Sparse Inverse Covariance Estimation for a Million Variables
Abstract

Cited by 6 (2 self)
The ℓ1-regularized Gaussian maximum likelihood estimator (MLE) has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even under high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program whose number of parameters scales quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this paper, we develop an algorithm BIG&QUIC, which can solve one-million-dimensional ℓ1-regularized Gaussian MLE problems (which thus have 1000 billion parameters) using a single machine, with bounded memory. In order to do so, we carefully exploit the underlying structure of the problem. Our innovations include a novel block-coordinate descent method with the blocks chosen via a clustering scheme to minimize repeated computations, and allowing for inexact computation of specific components. In spite of these modifications, we are able to theoretically analyze our procedure and show that BIG&QUIC can achieve superlinear or even quadratic convergence rates.
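The estimator described above has a concrete closed-form objective: the graphical-lasso program f(Θ) = −log det Θ + tr(SΘ) + λ‖Θ‖₁ over positive-definite Θ. As a minimal sketch of just the objective evaluation (the function name and the choice to penalize all entries, diagonal included, are illustrative, not taken from the paper):

```python
import numpy as np

def l1_gaussian_mle_objective(theta, S, lam):
    """Objective of the l1-regularized Gaussian MLE (graphical lasso):
    -log det(Theta) + tr(S @ Theta) + lam * ||Theta||_1 (entrywise)."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        return np.inf  # Theta must be positive definite
    return -logdet + np.trace(S @ theta) + lam * np.abs(theta).sum()

# Tiny example: 3 variables, empirical covariance from random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
S = np.cov(X, rowvar=False)
val = l1_gaussian_mle_objective(np.eye(3), S, lam=0.1)
```

The quadratic parameter growth is visible here: Θ has p² entries, so at p = 10⁶ the iterate alone is the "1000 billion parameters" the abstract mentions.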
Proximal Quasi-Newton for Computationally Intensive ℓ1-regularized M-estimators
Abstract

Cited by 1 (0 self)
We consider the class of optimization problems arising from computationally intensive ℓ1-regularized M-estimators, where the function or gradient values are very expensive to compute. A particular instance of interest is the ℓ1-regularized MLE for learning Conditional Random Fields (CRFs), which are a popular class of statistical models for varied structured prediction problems such as sequence labeling, alignment, and classification with a label taxonomy. ℓ1-regularized MLEs for CRFs are particularly expensive to optimize, since computing the gradient values requires an expensive inference step. In this work, we propose the use of a carefully constructed proximal quasi-Newton algorithm for such computationally intensive M-estimation problems, where we employ an aggressive active-set selection technique. In a key contribution of the paper, we show that the proximal quasi-Newton method is provably superlinearly convergent, even in the absence of strong convexity, by leveraging a restricted variant of strong convexity. In our experiments, the proposed algorithm converges considerably faster than the current state of the art on the problems of sequence labeling and hierarchical classification.
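A standard way to realize active-set selection for ℓ1-regularized problems is to restrict updates to coordinates that are either nonzero or violate the subgradient optimality condition |∇ℓ(w)_i| ≤ λ. The sketch below shows that selection rule in isolation; it illustrates the general idea only, and the paper's "aggressive" variant may differ in details:

```python
import numpy as np

def select_active_set(w, grad, lam, tol=0.0):
    """Coordinates that are currently nonzero, or whose gradient entry
    violates the l1 subgradient optimality condition |grad_i| <= lam."""
    violated = np.abs(grad) > lam + tol
    return np.flatnonzero((w != 0) | violated)

w = np.array([0.0, 0.5, 0.0, -0.2])
grad = np.array([0.05, -1.0, 2.0, 0.3])
active = select_active_set(w, grad, lam=0.1)
# coordinates 1 and 3 are nonzero; coordinate 2 violates |grad_2| <= lam
```

The payoff in the CRF setting is that the expensive inner solves only touch the (usually small) active block rather than all coordinates.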
IMRO: a proximal quasi-Newton method for solving ℓ1-regularized least squares problems
, 2014
Abstract

Cited by 1 (0 self)
Abstract. We present a proximal quasi-Newton method in which the approximation of the Hessian has the special format of "identity minus rank one" (IMRO) in each iteration. The proposed structure enables us to effectively recover the proximal point. The algorithm is applied to the ℓ1-regularized least squares problem arising in many applications, including sparse recovery in compressive sensing, machine learning, and statistics. Our numerical experiments suggest that the proposed technique competes favourably with other state-of-the-art solvers for this class of problems. We also provide a complexity analysis for variants of IMRO, showing that it matches the best known bounds.
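One payoff of an "identity minus rank one" structure is that a Hessian approximation of the form B = τI − uuᵀ never needs to be formed explicitly: products with B cost O(n). A sketch of that observation (τ, u, and the function name here are illustrative placeholders, not the paper's actual update rule):

```python
import numpy as np

def imro_matvec(tau, u, v):
    """Product (tau*I - u u^T) v in O(n), without forming the matrix."""
    return tau * v - u * (u @ v)

n = 5
rng = np.random.default_rng(1)
u = rng.standard_normal(n)
v = rng.standard_normal(n)
tau = 2.0

fast = imro_matvec(tau, u, v)                      # O(n)
dense = (tau * np.eye(n) - np.outer(u, u)) @ v     # O(n^2) reference
```

The same low-rank structure is what makes the proximal point recoverable efficiently, which is the crux of the method described above.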
Appendix: Proximal Quasi-Newton for Computationally Intensive ℓ1-regularized M-estimators
Abstract
We rewrite the objective function here:

min_w f(w) := λ‖w‖₁ + ℓ(w). (1)

Definition 1 (Constant Nullspace Strong Convexity). A composite function (1) is said to have Constant Nullspace Strong Convexity restricted to a space T (CNSC-T) iff there is a constant vector space T such that ℓ(w) depends only on z = proj_T(w), i.e. ℓ(w) = ℓ(z), and its Hessian satisfies

m‖v‖² ≤ vᵀH(w)v ≤ M‖v‖², ∀v ∈ T, ∀w ∈ ℝᵈ, (2)

for some M ≥ m > 0, and

H(w)v = 0, ∀v ∈ T⊥, ∀w ∈ ℝᵈ, (3)

where T⊥ is the complementary space orthogonal to T.

To exploit the CNSC-T property, we first rebuild our problem and algorithm on the reduced space Z = {z ∈ ℝ^d̂ : z = Uᵀw}, where the strong-convexity property holds. Then we prove asymptotic superlinear convergence on Z under the condition that the inner problem is solved exactly and no shrinking strategy is applied. Finally we prove that the objective (1) is bounded by the difference between the current iterate and the optimal solution. In Section 1.5, we provide the global convergence proof for the case when the shrinking strategy is applied.

1.1 Representing the problem in a reduced and compact space

Properties of the CNSC-T condition. For ℓ(w) satisfying the CNSC-T condition, we have ℓ(w) = ℓ(proj_T(w)). Define g to be the gradient of ℓ(w) and H to be the Hessian of ℓ(w). As both g and H lie in the space T, we have g(w) = UUᵀg(proj_T(w)) = g(proj_T(w)) and H(w) = UUᵀH(proj_T(w))UUᵀ.
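A concrete instance of the CNSC-T condition is the least-squares loss ℓ(w) = ½‖Aw − b‖² with a rank-deficient A: the Hessian AᵀA is strongly convex on T = row(A) and annihilates T⊥ = null(A), exactly as property (3) requires. A quick numeric check of that annihilation (the example data here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient design: 4 samples, 6 features -> nontrivial null space.
A = rng.standard_normal((4, 6))
H = A.T @ A  # Hessian of the least-squares loss (constant in w)

# Orthonormal basis of null(A), i.e. of T-perp.
_, _, Vt = np.linalg.svd(A)
null_basis = Vt[4:]  # last rows of V^T span the null space

# Property (3): H v = 0 for every v in T-perp.
residual = np.linalg.norm(H @ null_basis.T)
```

On T itself, vᵀHv is bounded between the smallest and largest nonzero eigenvalues of AᵀA, giving the constants m and M of property (2).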
unknown title
Abstract
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

17.1 Recap: Proximal Gradient Descent

Proximal gradient descent operates on problems of the form

min_x g(x) + h(x),

where g is convex and smooth, and h is convex and "simple" (its proximal operator is explicitly computable). Choose an initial x^(0) and repeat for k = 1, 2, 3, …:

x^(k) = prox_{t_k}(x^(k−1) − t_k ∇g(x^(k−1))),

where prox_t(x) = argmin_z (1/(2t))‖x − z‖₂² + h(z).
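For h(x) = λ‖x‖₁ the proximal operator is entrywise soft-thresholding, and the iteration above specializes to ISTA. A minimal sketch on a least-squares g (the function names and the fixed step size t = 1/L are illustrative choices, not taken from the notes):

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(A, b, lam, n_iters=500):
    """Proximal gradient descent for g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1.
    Fixed step t = 1/L, where L = ||A||_2^2 is the Lipschitz constant of grad g."""
    L = np.linalg.norm(A, 2) ** 2
    t = 1.0 / L
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)          # gradient of the smooth part g
        x = soft_threshold(x - t * grad, t * lam)  # prox step on h
    return x

# Sparse recovery example: only 3 of 10 coefficients are nonzero.
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [2.0, -1.5, 1.0]
b = A @ x_true
x_hat = ista(A, b, lam=0.1)
```

With t ≤ 1/L this iteration monotonically decreases g + h and converges at the sublinear O(1/k) rate mentioned in the surrounding entries.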
ROBUST BLOCK COORDINATE DESCENT
, 2014
Abstract
Abstract. In this paper we present a novel randomized block coordinate descent method for the minimization of a convex composite objective function. The method uses (approximate) partial second-order (curvature) information, so that the algorithm's performance is more robust when applied to highly non-separable or ill-conditioned problems. We call the method Robust Coordinate Descent (RCD). At each iteration of RCD, a block of coordinates is sampled randomly, a quadratic model is formed about that block, and the model is minimized approximately/inexactly to determine the search direction. An inexpensive line search is then employed to ensure a monotonic decrease in the objective function and the acceptance of large step sizes. We prove global convergence of the RCD algorithm, and we also present several results on the local convergence of RCD for strongly convex functions. Finally, we present numerical results on large-scale problems to demonstrate the practical performance of the method.
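A bare-bones version of the loop just described — sample a block, minimize a quadratic model over it, backtrack for a monotone decrease — can be sketched for the smooth special case f(x) = ½‖Ax − b‖². The block size, regularization, and line-search constants below are illustrative choices, not the paper's RCD:

```python
import numpy as np

def rcd_quadratic(A, b, block_size=2, n_iters=500, seed=0):
    """Randomized block coordinate descent for f(x) = 0.5*||Ax - b||^2.
    Each iteration: sample a block, take a Newton step on the block's
    quadratic model, then backtrack to ensure a monotone decrease."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    x = np.zeros(n)
    f = lambda v: 0.5 * np.sum((A @ v - b) ** 2)
    for _ in range(n_iters):
        block = rng.choice(n, size=block_size, replace=False)
        r = A @ x - b
        grad_blk = A[:, block].T @ r                 # partial gradient
        H_blk = A[:, block].T @ A[:, block]          # partial curvature
        d = -np.linalg.solve(H_blk + 1e-8 * np.eye(block_size), grad_blk)
        alpha, f_old = 1.0, f(x)
        while alpha > 1e-10:                          # backtracking line search
            x_trial = x.copy()
            x_trial[block] += alpha * d
            if f(x_trial) <= f_old:                   # monotone decrease
                x = x_trial
                break
            alpha *= 0.5
    return x

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = A @ rng.standard_normal(5)
x_hat = rcd_quadratic(A, b)
```

Because the block subproblem is solved with the block's own curvature rather than a scalar step size, the unit step is usually accepted, which is the "acceptance of large step sizes" the abstract highlights.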
Fast Convergence of Proximal Methods under High-Dimensional Settings
Abstract
State-of-the-art statistical estimators for high-dimensional problems take the form of regularized, and hence non-smooth, convex programs. A key facet of these statistical estimation problems is that they are typically not strongly convex under a high-dimensional sampling regime, when the Hessian matrix becomes rank-deficient. Under vanilla convexity, however, proximal optimization methods attain only a sublinear rate. In this paper, we investigate a novel variant of strong convexity, which we call Constant Nullspace Strong Convexity (CNSC), where we require that the objective function be strongly convex only over a constant subspace. As we show, the CNSC condition is naturally satisfied by high-dimensional statistical estimators. We then analyze the behavior of proximal methods under this CNSC condition: we show global linear convergence of Proximal Gradient and local quadratic convergence of the Proximal Newton Method, when the regularization function comprising the statistical estimator is decomposable. We corroborate our theory via numerical experiments, and show a qualitative difference in the convergence rates of the proximal algorithms when the loss function satisfies the CNSC condition.