
## Convex Optimization with Sparsity-Inducing Norms

Citations: 78 (13 self)

### Citations

7184 | Convex Optimization
- Boyd, Vandenberghe
- 2004
Citation Context: ...). 1.1.2 Optimization Tools. The tools used in this book chapter are relatively basic and should be accessible to a broad audience. Most of them can be found in classical books on convex optimization (Boyd and Vandenberghe, 2004; Bertsekas, 1999; Borwein and Lewis, 2006; Nocedal and Wright, 2006), but for self-containedness, we present here a few of them related to non-smooth unconstrained optimization. Subgradients: Given a convex fun... |

5225 | Convex Analysis - Rockafellar - 1972 |

3220 | Numerical Optimization
- Nocedal, Wright
- 1999
Citation Context: ...hted-ℓ2 Algorithms. Approximating a nonsmooth or constrained optimization problem by a series of smooth unconstrained problems is common in optimization (see, e.g., Nesterov, 2005; Boyd and Vandenberghe, 2004; Nocedal and Wright, 2006). In the context of objective functions regularized by sparsity-inducing norms, it is natural to consider variational formulations of these norms in terms of squared ℓ2-norms, since many efficient me... |

2681 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1998
Citation Context: ... As a consequence, the vector 0 is a solution if and only if Ω*(∇f(0)) ≤ λ. These general optimality conditions can be specialized to the Lasso problem (Tibshirani, 1996), also known as basis pursuit (Chen et al., 1999): min_{w∈R^p} (1/2)‖y − Xw‖₂² + λ‖w‖₁, (1.5) where y is in R^n and X is a design matrix in R^{n×p}. From Equation (1.4), and since the ℓ∞-norm is the dual of the ℓ1-norm, we obtain that necessary and suffic... |
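The optimality condition quoted in this excerpt, Ω*(∇f(0)) ≤ λ with Ω* = ℓ∞ for the Lasso, can be checked directly. A minimal NumPy sketch (the function name is illustrative, not from the chapter): w = 0 solves the Lasso iff ‖Xᵀy‖∞ ≤ λ, since ∇f(0) = −Xᵀy.

```python
import numpy as np

def zero_is_lasso_solution(X, y, lam):
    """Optimality check for w = 0 in min_w 0.5*||y - Xw||_2^2 + lam*||w||_1.
    The dual of the l1-norm is the l_inf-norm, so w = 0 is optimal
    if and only if ||X^T y||_inf <= lam (here grad f(0) = -X^T y)."""
    return np.max(np.abs(X.T @ y)) <= lam
```

This is the standard screening test used to pick the largest useful value of λ on a regularization path: for λ above ‖Xᵀy‖∞ the solution is identically zero.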

2666 | Portfolio Selection - Markowitz - 1952 |

1126 | Model selection and estimation in regression with grouped variables
- Yuan, Lin
- 2007
Citation Context: ...y all the variables forming a group. A regularization norm exploiting explicitly this group structure can be shown to improve the prediction performance and/or interpretability of the learned models (Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and Zhang, 2009; Obozinski et al., 2009; Lounici et al., 2009). Such a norm might for instance take the form Ω(w) := Σ_{g∈G} d_g ‖w_g‖₂, (1.2) where G is a partition of {1,...,p}, (d_g)_{g∈G} are some positive weights, and w_g denotes the vector in R^{|g|} recording the coefficients o... |
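The group norm Ω(w) = Σ_g d_g‖w_g‖₂ of Equation (1.2) and its proximal operator (blockwise soft-thresholding, which sets whole groups to zero) can be sketched in a few lines of NumPy. Function names and the partition encoding are assumptions of this sketch, not the chapter's notation:

```python
import numpy as np

def group_norm(w, groups, d):
    # Omega(w) = sum_g d_g * ||w_g||_2, where `groups` is a partition of indices.
    return sum(dg * np.linalg.norm(w[g]) for g, dg in zip(groups, d))

def prox_group(w, groups, d, lam):
    # Blockwise soft-thresholding: each group is shrunk toward 0 by lam*d_g
    # and set exactly to 0 when ||w_g||_2 <= lam*d_g, hence group sparsity.
    out = np.zeros_like(w)
    for g, dg in zip(groups, d):
        norm = np.linalg.norm(w[g])
        if norm > lam * dg:
            out[g] = (1.0 - lam * dg / norm) * w[g]
    return out
```

The all-or-nothing shrinkage of each block is what makes this norm select or discard groups of variables jointly, as the excerpt describes.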

1122 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context: ...he tools used in this book chapter are relatively basic and should be accessible to a broad audience. Most of them can be found in classical books on convex optimization (Boyd and Vandenberghe, 2004; Bertsekas, 1999; Borwein and Lewis, 2006; Nocedal and Wright, 2006), but for self-containedness, we present here a few of them related to non-smooth unconstrained optimization. Subgradients: Given a convex function g : R^p → R... |

1025 | A fast iterative shrinkage-thresholding algorithm for linear inverse problems - Beck, Teboulle - 2009 |

983 | Adapting to unknown smoothness via wavelet shrinkage - Donoho, Johnstone - 1995 |

548 | Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization
- Recht, Fazel, et al.
Citation Context: ...th mentioning. In terms of norms, we did not consider regularization by the nuclear norm, also known as the trace norm, which seeks low-rank matrix solutions (Fazel et al., 2001; Srebro et al., 2005; Recht et al., 2007; Bach, 2008b). Most of the optimization techniques that we presented do however apply to this norm (with the exception of coordinate descent). In terms of algorithms, it is possible to relax the smoo... |

498 | Smooth minimization of non-smooth functions - Nesterov |

468 | Hierarchical clustering schemes
- Johnson
- 1967
Citation Context: ... and 26 classes. In addition, both datasets exhibit highly-correlated features. Inspired by Kim and Xing (2010), we build a tree-structured set of groups G by applying Ward’s hierarchical clustering (Johnson, 1967) on the gene expressions. The norm Ω built that way aims at capturing the hierarchical structure of gene expression networks (Kim and Xing, 2010). Instead of the square loss function, we consider the... |

450 | Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics - Bickel, Ritov, et al. - 2009 |

445 | On model selection consistency of lasso - Zhao, Yu - 2006 |

285 | Adaptive subgradient methods for online learning and stochastic optimization - Duchi, Hazan, et al. |

262 | A rank minimization heuristic with application to minimum order system approximation
- Fazel, Hindi, et al.
Citation Context: ...e can reformulate min_{w∈R^p} (1/2n)‖y − Xw‖₂² + λΩ(w) as min_{w₊,w₋∈R^p_+} (1/2n)‖y − Xw₊ + Xw₋‖₂² + λ(1ᵀw₊ + 1ᵀw₋), which is a quadratic program. Other problems can be similarly cast (for the trace norm, see Fazel et al., 2001; Bach, 2008b). General-purpose toolboxes can then be used to get solutions with high precision (low duality gap). However, in the context of machine learning, this is inefficient for two reasons: (1... |
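The split w = w₊ − w₋ used in this QP reformulation is easy to verify numerically: with w₊ = max(w, 0) and w₋ = max(−w, 0), the data-fit term is unchanged and 1ᵀw₊ + 1ᵀw₋ equals ‖w‖₁. A minimal check (random data is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 5, 3, 0.1
X, y = rng.standard_normal((n, p)), rng.standard_normal(n)
w = rng.standard_normal(p)

# Split w into its positive and negative parts: w = w_plus - w_minus.
w_plus, w_minus = np.maximum(w, 0), np.maximum(-w, 0)

obj_l1 = 0.5 / n * np.sum((y - X @ w) ** 2) + lam * np.sum(np.abs(w))
obj_qp = (0.5 / n * np.sum((y - X @ w_plus + X @ w_minus) ** 2)
          + lam * (w_plus.sum() + w_minus.sum()))
assert np.isclose(obj_l1, obj_qp)
```

Since w₊ and w₋ are constrained to the nonnegative orthant and the objective is quadratic in (w₊, w₋), the reformulated problem is a standard QP, as the excerpt states.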

251 | The tradeoffs of large scale learning - Bottou, Bousquet - 2008 |

233 | Multi-task feature learning
- Argyriou, Evgeniou, et al.
- 2006
Citation Context: ...lve ℓ2-regularized problems (e.g., linear system solvers for least-squares regression). In this section, we show on our motivating example of sums of ℓ2-norms of subsets how such formulations arise (see, e.g., Pontil et al., 2007; Rakotomamonjy et al., 2008; Jenatton et al., 2010b; Daubechies et al., 2010). Variational formulation for sums of ℓ2-norms: A simple application of the Cauchy-Schwarz inequality and the inequality √(ab)... |
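The variational formulation alluded to here rests on the inequality √(ab) ≤ (a+b)/2, which gives ‖w_g‖₂ = min_{η>0} (1/2)(‖w_g‖₂²/η + η), with the minimum attained at η = ‖w_g‖₂. This is what turns a sum of ℓ2-norms into a weighted squared-ℓ2 penalty and underlies reweighted-ℓ2 algorithms. A small numerical illustration (names are this sketch's, not the chapter's):

```python
import numpy as np

def variational_l2(w_g, eta):
    # Upper bound 0.5 * (||w_g||^2 / eta + eta) >= ||w_g||_2,
    # with equality exactly at eta = ||w_g||_2.
    return 0.5 * (np.dot(w_g, w_g) / eta + eta)
```

Minimizing over η on a grid recovers the ℓ2-norm itself, which is why alternating between solving the smooth reweighted problem in w and updating η_g = ‖w_g‖₂ decreases the original nonsmooth objective.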

228 | Convex Analysis and Nonlinear Optimization: Theory and Examples
- Borwein, Lewis
- 2000
Citation Context: ...on 7. Then, ι_Ω is a convex function and its conjugate is exactly the dual norm Ω*. For many objective functions, the Fenchel conjugate admits closed forms, and can therefore be computed efficiently (Borwein and Lewis, 2006). Then, it is possible to derive a duality gap for problem (1.1) from standard Fenchel duality arguments (see Borwein and Lewis, 2006), as shown below: Proposition 1.2 (Duality for Problem (1.1)). If... |
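For the Lasso instance of problem (1.1), the Fenchel duality gap mentioned in this excerpt has a closed form: rescale the residual y − Xw so that the dual point ν satisfies ‖Xᵀν‖∞ ≤ λ, then subtract the dual objective from the primal one. A sketch under those standard assumptions (the function name and the small safeguard constant are this sketch's):

```python
import numpy as np

def lasso_duality_gap(X, y, w, lam):
    """Fenchel duality gap for min_w 0.5*||y - Xw||_2^2 + lam*||w||_1.
    The residual is rescaled so that nu is dual-feasible, i.e.
    ||X^T nu||_inf <= lam; the gap is nonnegative and 0 at an optimum."""
    r = y - X @ w
    scale = min(1.0, lam / max(np.max(np.abs(X.T @ r)), 1e-12))
    nu = scale * r
    primal = 0.5 * np.sum(r ** 2) + lam * np.sum(np.abs(w))
    dual = 0.5 * np.sum(y ** 2) - 0.5 * np.sum((nu - y) ** 2)
    return primal - dual
```

Such a gap gives a rigorous stopping criterion for any of the iterative algorithms discussed in the chapter, without knowing the optimal value.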

221 | Group Lasso with overlap and graph Lasso
- Jacob, Obozinski, et al.
- 2009
Citation Context: ...ting, and structured parsimony has emerged as a natural extension, with applications to computer vision (Jenatton et al., 2010b), text processing (Jenatton et al., 2010a) or bioinformatics (Kim and Xing, 2010; Jacob et al., 2009). Structured sparsity may be achieved by regularizing by other norms than the ℓ1-norm. In this chapter, we focus primarily on norms which can be written as linear combinations of norms on subsets of ... |

202 | A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
- Negahban, Ravikumar, et al.
- 2012
Citation Context: ...is chapter focuses primarily on these. Second, it allows a fruitful theoretical analysis answering fundamental questions related to estimation consistency, prediction efficiency (Bickel et al., 2009; Negahban et al., 2009) or model consistency (Zhao and Yu, 2006; Wainwright, 2009). In particular, when the sparse model is assumed to be well-specified, regularization by the ℓ1-norm is adapted to high-dimensional problem... |

186 | Augmented Lagrangian and Operator-Splitting - Glowinski, Le Tallec - 1989 |

185 | Structured variable selection with sparsity-inducing norms. arXiv:0904.3523 - Jenatton, Audibert, et al. - 2009 |

184 | Large-scale Bayesian logistic regression for text categorization - Genkin, Lewis, et al. - 2007 |

181 | Penalized regressions: the bridge versus the Lasso - Fu - 1998 |

154 | Iteratively reweighted least squares minimization for sparse recovery - Daubechies, DeVore, et al. - 2010 |

153 | A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming B - Tseng, Yun |

143 | The composite absolute penalties family for grouped and hierarchical variable selection - Zhao, Rocha, et al. |

128 | Joint Covariate Selection and Joint Subspace Selection for Multiple Classification Problems. Stat Comput - Obozinski, Taskar, et al. - 2009 |

124 | Proximal methods for sparse hierarchical dictionary learning - Jenatton, Mairal, et al. - 2010 |

123 | Fonctions convexes duales et points proximaux dans un espace Hilbertien - Moreau - 1962 |

115 | The benefit of group sparsity - Huang, Zhang - 2010 |

110 | Tree-guided group lasso for multi-task regression with structured sparsity
- Kim, Xing
- 2010
Citation Context: ...turns out to be limiting, and structured parsimony has emerged as a natural extension, with applications to computer vision (Jenatton et al., 2010b), text processing (Jenatton et al., 2010a) or bioinformatics (Kim and Xing, 2010; Jacob et al., 2009). Structured sparsity may be achieved by regularizing by other norms than the ℓ1-norm. In this chapter, we focus primarily on norms which can be written as linear combinations of ... |

109 | Exploring large feature space with hierarchical multiple kernel learning - Bach |

107 | Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso) - Wainwright - 2009 |

105 | Coordinate descent algorithms for lasso penalized regression - Wu, Lange |

104 | A simple and efficient algorithm for gene selection using sparse logistic regression - Shevade, Keerthi - 2003 |

89 | Lectures on Stochastic Programming: Modeling and Theory - Shapiro, Dentcheva, et al. - 2009 |

83 | Fast optimization methods for ℓ1 regularization: A comparative study and two new approaches - Schmidt, Fung, et al. - 2007 |

80 | Online algorithms and stochastic approximations
- Bottou
- 1998
Citation Context: ...2009). Finally, from a broader outlook, our (a priori deterministic) optimization problem (1.1) may also be tackled with stochastic optimization approaches, which has been the focus of much research (Bottou, 1998; Bottou and LeCun, 2004; Shapiro et al., 2009). 1.9 Conclusion. We presented and compared four families of algorithms for sparse methods: proximal methods, block-coordinate descent algorithms, reweigh... |
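One common way to make the regularized problem (1.1) stochastic, as this excerpt suggests, is stochastic proximal gradient descent: sample one data point per step, take a gradient step on the smooth loss, then apply the regularizer's proximal operator. A sketch for the ℓ1 case, with illustrative function names and hyperparameters chosen for this example only:

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stochastic_prox_sgd(X, y, lam, steps=5000, lr=0.05, seed=0):
    # At each step: sample a row, take a gradient step on the smooth part
    # 0.5*(y_i - x_i^T w)^2, then apply the l1 prox with a decaying step.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for t in range(steps):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]
        step = lr / np.sqrt(t + 1)
        w = soft_threshold(w - step * g, step * lam)
    return w
```

Unlike plain stochastic subgradient descent, the prox step keeps the iterates exactly sparse, which matters when p is large.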

78 | Taking advantage of sparsity in Multi-Task learning
- Lounici, Pontil, van de Geer, et al.
- 2009
Citation Context: ... structure can be shown to improve the prediction performance and/or interpretability of the learned models (Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and Zhang, 2009; Obozinski et al., 2009; Lounici et al., 2009). Such a norm might for instance take the form Ω(w) := Σ_{g∈G} d_g ‖w_g‖₂, (1.2) where G is a partition of {1,...,p}, (d_g)_{g∈G} are some positive weights, and w_g denotes the vector in R^{|g|} recording the coefficients o... |

74 | Consistency of trace norm minimization - Bach |

72 | Large scale online learning
- Bottou, LeCun
- 2004
Citation Context: ..., from a broader outlook, our (a priori deterministic) optimization problem (1.1) may also be tackled with stochastic optimization approaches, which has been the focus of much research (Bottou, 1998; Bottou and LeCun, 2004; Shapiro et al., 2009). 1.9 Conclusion. We presented and compared four families of algorithms for sparse methods: proximal methods, block-coordinate descent algorithms, reweighted-ℓ2 algorithms and th... |

68 | A note on the group lasso and the sparse group lasso
- Friedman, Hastie, et al.
- 2010
Citation Context: ...ned ℓ1 + ℓ1/ℓq-norm (“sparse group Lasso”). The possibility of combining an ℓ1/ℓq-norm that takes advantage of sparsity at the group level with an ℓ1-norm that induces sparsity within the groups is quite natural (Friedman et al., 2010; Sprechmann et al., 2010). Such regularizations are in fact a special case of the hierarchical ℓ1/ℓq-norms presented above and the corresponding proximal operator is therefore readily computed by app... |
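The excerpt notes that the proximal operator of the combined ℓ1 + ℓ1/ℓ2 penalty is computed by composing the two prox operators: elementwise soft-thresholding first, then blockwise shrinkage. A minimal sketch for q = 2 (function names and the partition encoding are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sparse_group(w, groups, lam1, lam2):
    # Prox of lam1*||.||_1 + lam2*sum_g ||.||_2 over a partition of indices:
    # first soft-threshold every coordinate, then shrink each block,
    # zeroing groups whose remaining norm is at most lam2.
    v = soft_threshold(w, lam1)
    out = np.zeros_like(v)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > lam2:
            out[g] = (1.0 - lam2 / norm) * v[g]
    return out
```

The result is sparse at two levels at once: entire groups can vanish, and surviving groups can still contain zero coordinates.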

67 | Structured sparse principal component analysis - Jenatton, Obozinski, et al. |

56 | An O(n) algorithm for quadratic knapsack problems
- Brucker
- 1984
Citation Context: ...u) = +∞ otherwise. Proximal methods thus apply and the corresponding proximal operator is the projection on the ℓ1-ball, for which efficient pivot algorithms with linear complexity have been proposed (Brucker, 1984; Maculan and Galdino de Paula, 1989). ℓ1/ℓq-norm (“group Lasso”): If G is a partition of {1,...,p}, the dual norm of the ℓ1/ℓq-norm is the ℓ∞/ℓq′-norm, with 1/q + 1/q′ = 1. It is easy to show that t... |
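The ℓ1-ball projection mentioned here reduces to a projection onto the simplex of |v|. The sketch below uses the simple sort-based O(p log p) variant; the pivot algorithms of Brucker (1984) and Maculan and Galdino de Paula (1989) cited above achieve (expected) linear time with the same output. Function names are illustrative:

```python
import numpy as np

def project_simplex(v, z=1.0):
    # Euclidean projection onto {x : x >= 0, sum(x) = z}, sort-based version.
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)       # common shift for active coords
    return np.maximum(v - theta, 0.0)

def project_l1_ball(v, z=1.0):
    # Projection onto {x : ||x||_1 <= z} via the simplex projection of |v|.
    if np.sum(np.abs(v)) <= z:
        return v.copy()
    return np.sign(v) * project_simplex(np.abs(v), z)
```

Combined with a proximal method, this projection handles the constrained formulation min f(w) s.t. ‖w‖₁ ≤ z discussed in the chapter.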

53 | Network flow algorithms for structured sparsity - Mairal, Jenatton, et al. - 2010 |

51 | An interior-point method for large-scale `1-regularized least squares - Kim, Koh, et al. - 2007 |

28 | Convex structure learning in log-linear models: Beyond pairwise potentials
- Schmidt, Murphy
- 2010
Citation Context: ...the problem at hand), recent research has explored the setting where G can contain groups of variables that overlap (Zhao et al., 2009; Bach, 2008a; Jenatton et al., 2009; Jacob et al., 2009; Kim and Xing, 2010; Schmidt and Murphy, 2010). In this case, Ω is still a norm, and it yields sparsity in the form of specific patterns of variables. More precisely, the solutions w* of problem (1.1) can be shown to have a set of zero coeffici... |

15 | Hierarchical penalization
- Szafranski, Grandvalet, et al.
Citation Context: ...ption is met, it is easy to see that these procedures stop in a finite number of iterations. This class of algorithms takes advantage of sparsity from a computational point of view (Lee et al., 2007; Szafranski et al., 2007; Bach, 2008a; Roth and Fischer, 2008; Obozinski et al., 2009; Jenatton et al., 2009; Schmidt and Murphy, 2010), since the subproblems that need to be solved are typically much smaller than the origin... |

7 | Fixed-Point Algorithms for Inverse
- Combettes, Pesquet
- 2011
Citation Context: ...eir convergence rates (optimal for the class of first-order techniques) and their ability to deal with large nonsmooth convex problems (e.g., Nesterov 2007; Beck and Teboulle 2009; Wright et al. 2009; Combettes and Pesquet 2010). Proximal methods can be described as follows: at each iteration the function f is linearized around the current point and a problem of the form m... |
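The proximal-method template described here (linearize f at the current point, then solve the resulting prox problem in closed form) instantiates, for the Lasso, as the classical ISTA iteration. A minimal sketch with an illustrative function name; the step size 1/L uses the Lipschitz constant of ∇f:

```python
import numpy as np

def soft_threshold(v, t):
    # Closed-form proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iter=500):
    # ISTA for min_w 0.5*||y - Xw||_2^2 + lam*||w||_1: gradient step on the
    # linearized smooth part, followed by the l1 prox.
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of grad f
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```

The accelerated variant of Beck and Teboulle (2009) cited above (FISTA) adds a momentum step to the same iteration and improves the convergence rate from O(1/k) to O(1/k²).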

5 | A linear-time median-finding algorithm for projecting a vector on the simplex of R^n - Maculan, Galdino de Paula - 1989 |

3 | Sparse multinomial logistic regression: Fast algorithms and generalization bounds - Krishnapuram, Carin, et al. - 2005 |

2 | Collaborative Hierarchical Sparse Modeling. arXiv preprint arXiv:1003.0400
- Sprechmann, Ramirez, et al.
- 2010
Citation Context: ...arse group Lasso”). The possibility of combining an ℓ1/ℓq-norm that takes advantage of sparsity at the group level with an ℓ1-norm that induces sparsity within the groups is quite natural (Friedman et al., 2010; Sprechmann et al., 2010). Such regularizations are in fact a special case of the hierarchical ℓ1/ℓq-norms presented above and the corresponding proximal operator is therefore readily computed by applying first soft-threshol... |

1 | Greed is good: Algorithmic results for sparse approximation - Tropp |

1 | Sparse reconstruction by separable approximation - Wright, Nowak, et al. |