Results 1–10 of 39
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
Learning with Structured Sparsity
Abstract

Cited by 127 (15 self)
This paper investigates a new learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. Experiments demonstrate the advantage of structured sparsity over standard sparsity.
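As a rough illustration of the group-sparsity idea that structured sparsity generalizes, the sketch below applies the group soft-thresholding proximal operator, which zeroes out whole groups of coefficients jointly. This is a minimal sketch, not the paper's structured greedy algorithm; the function name and example data are hypothetical.

```python
import numpy as np

def group_soft_threshold(w, groups, lam):
    """Proximal operator of the group-lasso penalty lam * sum_g ||w_g||_2.

    `w` is a coefficient vector and `groups` is a list of index arrays
    partitioning the features. Any group whose Euclidean norm falls at
    or below `lam` is zeroed out as a whole; surviving groups are
    shrunk uniformly. Joint zeroing of groups is the structured effect
    that the paper's framework generalizes to arbitrary structures.
    """
    out = np.zeros_like(w)
    for g in groups:
        norm = np.linalg.norm(w[g])
        if norm > lam:
            out[g] = (1.0 - lam / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, 0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
# Second group's norm is below the threshold, so it is dropped jointly:
# result is [2.4, 3.2, 0.0, 0.0]
print(group_soft_threshold(w, groups, 1.0))
```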
High-dimensional additive modeling
 Annals of Statistics
Abstract

Cited by 78 (3 self)
We propose a new sparsity-smoothness penalty for high-dimensional generalized additive models. The combination of sparsity and smoothness is crucial both for the mathematical theory and for finite-sample performance. We present a computationally efficient algorithm, with provable numerical convergence properties, for optimizing the penalized likelihood. Furthermore, we provide oracle results which yield asymptotic optimality of our estimator for high-dimensional but sparse additive models. Finally, an adaptive version of our sparsity-smoothness penalized approach yields large additional performance gains.
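The general idea behind sparse additive estimators of this kind can be sketched with a simple backfitting loop in the style of SpAM: smooth each partial residual, then shrink the whole component function, so that weak components are dropped. This is a hedged illustration of the family of methods, not the authors' exact sparsity-smoothness algorithm; all names and data here are made up.

```python
import numpy as np

def spam_backfit(X, y, lam, n_iter=50, bandwidth=0.3):
    """SpAM-style backfitting sketch: for each coordinate, smooth the
    partial residual with a Nadaraya-Watson smoother, then shrink the
    fitted component toward zero; components whose empirical norm
    stays below `lam` are zeroed, giving variable selection."""
    n, d = X.shape
    f = np.zeros((n, d))
    for _ in range(n_iter):
        for j in range(d):
            r = y - f.sum(axis=1) + f[:, j]              # partial residual
            diff = X[:, j][:, None] - X[:, j][None, :]
            w = np.exp(-0.5 * (diff / bandwidth) ** 2)   # Gaussian weights
            fj = (w @ r) / w.sum(axis=1)                 # univariate smoother
            norm = np.sqrt(np.mean(fj ** 2))
            f[:, j] = max(0.0, 1.0 - lam / norm) * fj if norm > 0 else 0.0
    return f

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (200, 5))
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(200)
y = y - y.mean()
f_hat = spam_backfit(X, y, lam=0.05)
norms = np.sqrt(np.mean(f_hat ** 2, axis=0))
print(np.round(norms, 3))   # the relevant coordinates 0 and 1 should dominate
```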
Minimax-optimal rates for sparse additive models over kernel classes via convex programming
Abstract

Cited by 52 (8 self)
Sparse additive models are families of d-variate functions with the additive decomposition f* = ∑_{j∈S} f*_j, where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f*_j lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f* based on kernels combined with ℓ1-type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2(P) and L2(Pn) norms over the class F_{d,s,H} of sparse additive models with each univariate function f*_j in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2(P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity s = Ω(√n), showing that global boundedness is a significant restriction in the high-dimensional setting.
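The additive-RKHS function class underlying this abstract can be made concrete by summing one univariate kernel per coordinate and fitting by kernel ridge regression. This sketch shows only the function class; the ℓ1-type regularization across components that drives the paper's selection results is omitted, and all names and data are illustrative assumptions.

```python
import numpy as np

def additive_kernel(X1, X2, gamma=1.0):
    """Gram matrix of a sum of univariate RBF kernels, one per
    coordinate: the additive RKHS in which each component f*_j lives."""
    K = np.zeros((X1.shape[0], X2.shape[0]))
    for j in range(X1.shape[1]):
        diff = X1[:, j][:, None] - X2[:, j][None, :]
        K += np.exp(-gamma * diff ** 2)
    return K

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (50, 4))
y = np.sin(np.pi * X[:, 0]) + 0.05 * rng.standard_normal(50)
K = additive_kernel(X, X)
alpha = np.linalg.solve(K + 0.1 * np.eye(50), y)   # kernel ridge fit
pred = K @ alpha                                   # in-sample predictions
print(K.shape)
```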
Generalization bounds for learning kernels
 In ICML ’10, 2010
Abstract

Cited by 30 (3 self)
This paper presents several novel generalization bounds for the problem of learning kernels, based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L1 regularization admits only a √(log p) dependency on the number of kernels, which is tight and considerably more favorable than the previous best bound for the same problem. We also give a novel bound for learning with a non-negative combination of p base kernels with L2 regularization whose dependency on p is also tight and only in p^{1/4}. We present similar results for Lq regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds.
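To get a feel for why the √(log p) dependency is so mild compared with p^{1/4}, a tiny numeric comparison of the two factors (this is just arithmetic on the rates stated above, not taken from the paper):

```python
import math

# How the two bound factors scale with the number of base kernels p:
# the L1 factor sqrt(log p) grows very slowly, while the L2 factor
# p**(1/4) grows polynomially.
for p in (10, 1_000, 1_000_000):
    print(f"p={p:>9}  sqrt(log p)={math.sqrt(math.log(p)):.2f}  p^(1/4)={p ** 0.25:.2f}")
```

Even at a million base kernels, √(log p) stays below 4 while p^{1/4} exceeds 30.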
High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning
, 2009
Abstract

Cited by 27 (5 self)
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
Taking Advantage of Sparsity in Multi-Task Learning
Abstract

Cited by 27 (0 self)
We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning [1], we assume that the sparsity patterns of the regression vectors are included in the same set of small cardinality. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work in [3, 19]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.
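The shared-sparsity assumption above is what the Group Lasso exploits: an ℓ2,1 penalty on the rows of the coefficient matrix forces all tasks to select the same variables. A minimal proximal-gradient sketch of that estimator (not the authors' analysis; the solver, step size, and data are illustrative assumptions):

```python
import numpy as np

def multitask_group_lasso(X, Y, lam, n_iter=500):
    """Proximal gradient for
    min_W ||Y - X W||_F^2 / (2n) + lam * sum_j ||W[j, :]||_2.
    The row-wise l2,1 penalty makes the tasks (columns of Y) share one
    sparsity pattern, as assumed in the multi-task setting above."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    lr = n / (np.linalg.norm(X, 2) ** 2)        # 1/L step for the smooth part
    for _ in range(n_iter):
        W = W - lr * (X.T @ (X @ W - Y) / n)    # gradient step
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        # row-wise soft-thresholding: zero out rows with small norm
        W = np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12)) * W
    return W

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 20))
W_true = np.zeros((20, 3))
W_true[:3] = rng.standard_normal((3, 3))        # 3 rows relevant to all tasks
Y = X @ W_true + 0.01 * rng.standard_normal((100, 3))
W = multitask_group_lasso(X, Y, lam=0.1)
print(np.nonzero(np.linalg.norm(W, axis=1) > 1e-6)[0])
```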
Algorithms for Learning Kernels Based on Centered Alignment
Abstract

Cited by 18 (0 self)
This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels, in both classification and regression. Our algorithms are based on the notion of centered alignment, which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP, and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered-alignment-based algorithms in both classification and regression.
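The centered alignment that these algorithms build on is straightforward to compute: center both Gram matrices, then take their normalized Frobenius inner product. A minimal sketch (the example kernels and label vector are made up; only the alignment formula follows the centered-alignment definition):

```python
import numpy as np

def centered_alignment(K, L):
    """Centered alignment between two Gram matrices: center each with
    the matrix H = I - 11^T/n, then take the Frobenius inner product
    normalized by the Frobenius norms. Values lie in [-1, 1], and 1
    means the kernels agree perfectly after centering."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

y = np.array([1.0, 1.0, -1.0, -1.0])
L = np.outer(y, y)                          # ideal target kernel from labels
K_good = L.copy()
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
K_rand = A @ A.T                            # a random PSD kernel matrix
print(round(centered_alignment(K_good, L), 3))   # 1.0 by construction
print(round(centered_alignment(K_rand, L), 3))
```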
Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness
 The Annals of Statistics, 2013
Abstract

Cited by 8 (4 self)
We investigate the learning rate of multiple kernel learning (MKL) with ℓ1 and elastic-net regularizations. The elastic-net regularization is a composition of an ℓ1-regularizer for inducing sparsity and an ℓ2-regularizer for controlling smoothness. We focus on a sparse setting where the total number of kernels is large but the number of non-zero components of the ground truth is relatively small, and show sharper convergence rates than the learning rates previously shown for both ℓ1 and elastic-net regularizations. Our analysis shows a trade-off between sparsity and smoothness when choosing between the ℓ1 and elastic-net regularizations: if the ground truth is smooth, the elastic-net regularization is preferred; otherwise, the ℓ1 regularization is preferred.
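The elastic-net composition of an ℓ1 and an ℓ2 term has a simple closed-form proximal operator: soft-threshold for sparsity, then shrink uniformly for smoothness. A minimal sketch on a plain coefficient vector (in the MKL setting of the paper the same composition acts on kernel components; the function name and numbers here are illustrative):

```python
import numpy as np

def elastic_net_prox(w, lam1, lam2):
    """Proximal operator of lam1 * ||w||_1 + (lam2 / 2) * ||w||_2^2:
    soft-thresholding (the l1 part, inducing sparsity) followed by
    uniform shrinkage by 1 / (1 + lam2) (the l2 part, smoothing)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam1, 0.0) / (1.0 + lam2)

w = np.array([2.0, -0.5, 0.1])
print(elastic_net_prox(w, lam1=0.3, lam2=1.0))  # yields [0.85, -0.1, 0.0]
```

Setting lam2 = 0 recovers the pure ℓ1 (soft-thresholding) operator, which is the other regularization the trade-off above compares against.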