Results 1  10
of
81
Structured variable selection with sparsityinducing norms
, 2011
"... We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsityinducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1norm and the group ℓ1norm by allowing the subsets to ov ..."
Abstract

Cited by 187 (27 self)
 Add to MetaCart
(Show Context)
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsityinducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1norm and the group ℓ1norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for leastsquares linear regression in low and highdimensional settings.
Proximal Methods for Hierarchical Sparse Coding
, 2010
"... Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced treestructured sparse regularizatio ..."
Abstract

Cited by 83 (18 self)
 Add to MetaCart
(Show Context)
Sparse coding consists in representing signals as sparse linear combinations of atoms selected from a dictionary. We consider an extension of this framework where the atoms are further assumed to be embedded in a tree. This is achieved using a recently introduced treestructured sparse regularization norm, which has proven useful in several applications. This norm leads to regularized problems that are difficult to optimize, and we propose in this paper efficient algorithms for solving them. More precisely, we show that the proximal operator associated with this norm is computable exactly via a dual approach that can be viewed as the composition of elementary proximal operators. Our procedure has a complexity linear, or close to linear, in the number of atoms, and allows the use of accelerated gradient techniques to solve the treestructured sparse approximation problem at the same computational cost as traditional ones using the ℓ1norm. Our method is efficient and scales gracefully to millions of variables, which we illustrate in two types of applications: first, we consider fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we apply our optimization tools in the context of dictionary learning, where learned dictionary elements naturally organize in a prespecified arborescent structure, leading to a better performance in reconstruction of natural image patches. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models.
Convergence Rates of Inexact ProximalGradient Methods for Convex Optimization
 NIPS'11 25 TH ANNUAL CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS
, 2011
"... We consider the problem of optimizing the sum of a smooth convex function and a nonsmooth convex function using proximalgradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the nonsmooth term. We show that b ..."
Abstract

Cited by 49 (6 self)
 Add to MetaCart
(Show Context)
We consider the problem of optimizing the sum of a smooth convex function and a nonsmooth convex function using proximalgradient methods, where an error is present in the calculation of the gradient of the smooth term or in the proximity operator with respect to the nonsmooth term. We show that both the basic proximalgradient method and the accelerated proximalgradient method achieve the same convergence rate as in the errorfree case, provided that the errors decrease at appropriate rates. Using these rates, we perform as well as or better than a carefully chosen fixed error level on a set of structured sparsity problems.
Convex and network flow optimization for structured sparsity
 JMLR
, 2011
"... We consider a class of learning problems regularized by a structured sparsityinducing norm defined as the sum of ℓ2 or ℓ∞norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address ..."
Abstract

Cited by 35 (8 self)
 Add to MetaCart
We consider a class of learning problems regularized by a structured sparsityinducing norm defined as the sum of ℓ2 or ℓ∞norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞norms can be computed exactly in polynomial time by solving a quadratic mincost flow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with nonoverlapping groups, but in higher dimension and with additional constraints. We propose efficient and scalable algorithms exploiting these two strategies, which are significantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multitask learning of treestructured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches.
Trace lasso: A trace norm regularization for correlated designs
 In Advances in Neural Information Processing Systems 24
, 2011
"... Using the `1norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation ..."
Abstract

Cited by 29 (2 self)
 Add to MetaCart
Using the `1norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm, which is a convex surrogate of the rank, of the selected covariates as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted leastsquares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net. 1
Efficient Sparse Modeling with Automatic Feature Grouping
"... The grouping of features is highly beneficial in learning with highdimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent ..."
Abstract

Cited by 20 (0 self)
 Add to MetaCart
(Show Context)
The grouping of features is highly beneficial in learning with highdimensional data. It reduces the variance in the estimation and improves the stability of feature selection, leading to improved generalization. Moreover, it can also help in data understanding and interpretation. OSCAR is a recent sparse modeling tool that achieves this by using a ℓ1regularizer and a pairwise ℓ∞regularizer. However, its optimization is computationally expensive. In this paper, we propose an efficient solver based on the accelerated gradient methods. We show that its key projection step can be solved by a simple iterative group merging algorithm. It is highly efficient and reduces the empirical time complexity from O(d3 ∼ d5) for the existing solvers to just O(d), where d is the number of features. Experimental results on toy and realworld data sets demonstrate that OSCAR is a competitive sparse modeling approach with the added ability of automatic feature grouping. 1.
MULTISCALE MINING OF FMRI DATA WITH HIERARCHICAL STRUCTURED SPARSITY
, 2011
"... Abstract. Reverse inference, or “brain reading”, is a recent paradigm for analyzing functional magnetic resonance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive variables related to brain activation maps, this approach aims at decoding brain ..."
Abstract

Cited by 20 (5 self)
 Add to MetaCart
(Show Context)
Abstract. Reverse inference, or “brain reading”, is a recent paradigm for analyzing functional magnetic resonance imaging (fMRI) data, based on pattern recognition and statistical learning. By predicting some cognitive variables related to brain activation maps, this approach aims at decoding brain activity. Reverse inference takes into account the multivariate information between voxels and is currently the only way to assess how precisely some cognitive information is encoded by the activity of neural populations within the whole brain. However, it relies on a prediction function that is plagued by the curse of dimensionality, since there are far more features than samples, i.e., more voxels than fMRI volumes. To address this problem, different methods have been proposed, such as, among others, univariate feature selection, feature agglomeration and regularization techniques. In this paper, we consider a sparse hierarchical structured regularization. Specifically, the penalization we use is constructed from a tree that is obtained by spatiallyconstrained agglomerative clustering. This approach encodes the spatial structure of the data at different scales into the regularization, which makes the overall prediction procedure more robust to intersubject variability. The regularization used induces the selection of spatially coherent predictive brain regions simultaneously at different scales. We test our algorithm on real data acquired to study the mental representation of objects, and we show that the proposed algorithm not only delineates meaningful brain regions but yields as well better prediction accuracy than reference methods.
LOWRANK OPTIMIZATION WITH TRACE NORM PENALTY∗
"... Abstract. The paper addresses the problem of lowrank trace norm minimization. We propose an algorithm that alternates between fixedrank optimization and rankone updates. The fixedrank optimization is characterized by an efficient factorization that makes the trace norm differentiable in the sear ..."
Abstract

Cited by 19 (5 self)
 Add to MetaCart
(Show Context)
Abstract. The paper addresses the problem of lowrank trace norm minimization. We propose an algorithm that alternates between fixedrank optimization and rankone updates. The fixedrank optimization is characterized by an efficient factorization that makes the trace norm differentiable in the search space and the computation of duality gap numerically tractable. The search space is nonlinear but is equipped with a Riemannian structure that leads to efficient computations. We present a secondorder trustregion algorithm with a guaranteed quadratic rate of convergence. Overall, the proposed optimization scheme converges superlinearly to the global solution while maintaining complexity that is linear in the number of rows and columns of the matrix. To compute a set of solutions efficiently for a grid of regularization parameters we propose a predictorcorrector approach that outperforms the naive warmrestart approach on the fixedrank quotient manifold. The performance of the proposed algorithm is illustrated on problems of lowrank matrix completion and multivariate linear regression.
A.: Parametric local metric learning for nearest neighbor classification
 NIPS 2012 ECML/PKDD 2013 MACHINE LEARNING LAB 9/19/13 31/34 ECML/PKDD 2013
"... We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence ” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We p ..."
Abstract

Cited by 16 (1 self)
 Add to MetaCart
(Show Context)
We study the problem of learning local metrics for nearest neighbor classification. Most previous works on local metric learning learn a number of local unrelated metrics. While this ”independence ” approach delivers an increased flexibility its downside is the considerable risk of overfitting. We present a new parametric local metric learning method in which we learn a smooth metric matrix function over the data manifold. Using an approximation error bound of the metric matrix function we learn local metrics as linear combinations of basis metrics defined on anchor points over different regions of the instance space. We constrain the metric matrix function by imposing on the linear combinations manifold regularization which makes the learned metric matrix function vary smoothly along the geodesics of the data manifold. Our metric learning method has excellent performance both in terms of predictive power and scalability. We experimented with several largescale classification problems, tens of thousands of instances, and compared it with several state of the art metric learning methods, both global and local, as well as to SVM with automatic kernel selection, all of which it outperforms in a significant manner. 1