Results 1 - 10 of 39
A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers
"... ..."
Learning with Structured Sparsity
"... This paper investigates a new learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea. A gener ..."
Cited by 127 (15 self)
This paper investigates a new learning formulation called structured sparsity, which is a natural extension of the standard sparsity concept in statistical learning and compressive sensing. By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity idea. A general theory is developed for learning with structured sparsity, based on the notion of coding complexity associated with the structure. Moreover, a structured greedy algorithm is proposed to efficiently solve the structured sparsity problem. Experiments demonstrate the advantage of structured sparsity over standard sparsity.
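The abstract above mentions a structured greedy algorithm without spelling it out. As a rough, hedged illustration only (not the authors' algorithm), the sketch below performs generic block-wise greedy least-squares selection; `blocks` is a hypothetical list of integer index groups standing in for whatever structured feature sets the coding-complexity framework would supply.

```python
import numpy as np

def block_greedy_select(X, y, blocks, n_steps):
    """Minimal block-wise greedy least squares (illustrative sketch only).

    blocks : list of integer index arrays; each array is one candidate group.
    At every step, add the block that most reduces the residual sum of squares.
    """
    selected = []                        # indices into `blocks`
    support = np.array([], dtype=int)    # union of selected feature indices
    residual = y.copy()
    for _ in range(n_steps):
        best_gain, best_b = -np.inf, None
        for b, idx in enumerate(blocks):
            if b in selected:
                continue
            trial = np.concatenate([support, idx])
            coef, *_ = np.linalg.lstsq(X[:, trial], y, rcond=None)
            rss = np.sum((y - X[:, trial] @ coef) ** 2)
            gain = np.sum(residual ** 2) - rss
            if gain > best_gain:
                best_gain, best_b = gain, b
        if best_b is None:
            break
        selected.append(best_b)
        support = np.concatenate([support, blocks[best_b]])
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ coef
    return selected, support
```

In the paper, the choice of which block to add and when to stop would presumably be governed by the coding complexity of the resulting support rather than a fixed step count.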
High-dimensional additive modeling
Annals of Statistics
"... We propose a new sparsity-smoothness penalty for high-dimensional generalized additive models. The combination of sparsity and smoothness is crucial for mathematical theory as well as performance for finite-sample data. We present a computationally efficient algorithm, with provable numerical conver ..."
Cited by 78 (3 self)
We propose a new sparsity-smoothness penalty for high-dimensional generalized additive models. The combination of sparsity and smoothness is crucial for mathematical theory as well as performance for finite-sample data. We present a computationally efficient algorithm, with provable numerical convergence properties, for optimizing the penalized likelihood. Furthermore, we provide oracle results which yield asymptotic optimality of our estimator for high-dimensional but sparse additive models. Finally, an adaptive version of our sparsity-smoothness penalized approach yields large additional performance gains.
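For concreteness, here is one common way a sparsity-smoothness penalty for an additive model f(x) = Σ_j f_j(x_j) can be written; this is an illustrative form only, and the paper's exact penalty and tuning may differ.

```latex
\hat f \;=\; \arg\min_{f_1,\dots,f_p}\;
\frac{1}{n}\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{p} f_j(x_{ij})\Big)^{2}
\;+\;\lambda_{1}\sum_{j=1}^{p}\sqrt{\|f_j\|_{n}^{2}+\lambda_{2}\,I^{2}(f_j)}\,,
\qquad
\|f_j\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n} f_j(x_{ij})^{2},\quad
I^{2}(f_j)=\int \big(f_j''(x)\big)^{2}\,dx .
```

The square root couples the empirical norm and the curvature term, so an entire component f_j can be shrunk exactly to zero (sparsity), while λ2 controls how wiggly the surviving components may be (smoothness).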
Minimax-optimal rates for sparse additive models over kernel classes via convex programming
"... Sparse additive models are families of d-variate functions with the additive decomposition f ∗ = ∑ j∈S f ∗ j, where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f ∗ j lies in a reproducing kernel Hilbert space (RKHS), and ..."
Cited by 52 (8 self)
Sparse additive models are families of d-variate functions with the additive decomposition f∗ = ∑_{j∈S} f∗_j, where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f∗_j lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f∗ based on kernels combined with ℓ1-type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L2(P) and L2(Pn) norms over the class F_{d,s,H} of sparse additive models with each univariate function f∗_j in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L2(P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity s = Ω(√n), showing that global boundedness is a significant restriction in the high-dimensional setting.
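As a hedged sketch of what "kernels combined with ℓ1-type convex regularization" can look like for an additive model, one standard formulation penalizes each component both by its RKHS norm and by its empirical norm; the paper's exact estimator and regularization parameters may differ.

```latex
\hat f \;\in\; \arg\min_{f_j \in \mathcal{H}_j}\;
\frac{1}{2n}\sum_{i=1}^{n}\Big(y_i-\sum_{j=1}^{d} f_j(x_{ij})\Big)^{2}
\;+\;\lambda_n \sum_{j=1}^{d}\|f_j\|_{\mathcal{H}_j}
\;+\;\rho_n \sum_{j=1}^{d}\|f_j\|_{n},
\qquad
\|f_j\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n} f_j(x_{ij})^{2}.
```

Both sums act like ℓ1 norms over the components, so entire coordinates drop out of the fit; the RKHS term additionally keeps each selected univariate function close to the unit ball referred to in the rate statements.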
Generalization bounds for learning kernels
In ICML ’10, 2010
"... This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L1 regularization admits o ..."
Cited by 30 (3 self)
This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. Our bound for learning kernels with a convex combination of p base kernels using L1 regularization admits only a √(log p) dependency on the number of kernels, which is tight and considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a non-negative combination of p base kernels with an L2 regularization whose dependency on p is also tight and only in p^{1/4}. We present similar results for Lq regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds.
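For readers unfamiliar with the setting, the hypothesis set whose Rademacher complexity such bounds control is, roughly, the set of linear predictors over a learned convex combination of the base kernels; the exact normalization and constraint set used in the paper may differ from this hedged sketch.

```latex
K_{\mu}=\sum_{k=1}^{p}\mu_k K_k,\qquad \mu_k\ge 0,\ \ \|\mu\|_{1}\le \Lambda,
\qquad
H=\Big\{\,x\mapsto \langle w,\Phi_{K_\mu}(x)\rangle \;:\; \|w\|_{\mathcal H_{K_\mu}}\le 1,\ \ \mu \text{ as above}\Big\}.
```

The abstract's claim is that the complexity of this set grows only like √(log p) in the number of base kernels under L1-type constraints, and like p^{1/4} under L2-type constraints.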
High-Dimensional Non-Linear Variable Selection through Hierarchical Kernel Learning
2009
"... We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. T ..."
Cited by 27 (5 self)
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
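The graph-adapted norm mentioned above is not spelled out in the abstract; norms of this family typically take the following general shape, stated here as a hedged sketch rather than the paper's exact definition.

```latex
\Omega(w)\;=\;\sum_{v\in V} d_v\,\big\| (w_u)_{u\in D(v)} \big\|_2 ,
```

where V indexes the kernels arranged in the DAG, D(v) is the set of descendants of node v, and d_v > 0 are weights. Because each term penalizes a whole descendant block, a kernel can receive nonzero weight only if its ancestors do too; this hierarchical structure is what the polynomial-time selection over exponentially many kernels relies on.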
Taking Advantage of Sparsity in Multi-Task Learning
"... We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning [1], we assume that the sparsity patterns of the regression vectors are included in the same set of small cardinality. This ..."
Cited by 27 (0 self)
We study the problem of estimating multiple linear regression equations for the purpose of both prediction and variable selection. Following recent work on multi-task learning [1], we assume that the sparsity patterns of the regression vectors are included in the same set of small cardinality. This assumption leads us to consider the Group Lasso as a candidate estimation method. We show that this estimator enjoys nice sparsity oracle inequalities and variable selection properties. The results hold under a certain restricted eigenvalue condition and a coherence condition on the design matrix, which naturally extend recent work in [3, 19]. In particular, in the multi-task learning scenario, in which the number of tasks can grow, we are able to remove completely the effect of the number of predictor variables in the bounds. Finally, we show how our results can be extended to more general noise distributions, of which we only require the variance to be finite.
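As a concrete, hedged illustration of the Group Lasso estimator discussed above, the sketch below runs proximal gradient descent on a multi-task least-squares objective with a row-wise ℓ2 penalty on the coefficient matrix, so a predictor is kept or discarded jointly across all tasks. The function names are mine and this is a minimal sketch, not the paper's exact estimator or tuning.

```python
import numpy as np

def group_soft_threshold(W, tau):
    """Row-wise (block) soft-thresholding: the proximal operator of
    tau * sum_j ||W[j, :]||_2, which zeroes a predictor across all tasks."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def multitask_group_lasso(Xs, ys, lam, n_iter=500, step=None):
    """Proximal-gradient sketch for
    min_W  sum_t ||y_t - X_t w_t||^2 / (2 n_t) + lam * sum_j ||W[j, :]||_2,
    where column t of W is the coefficient vector for task t."""
    d, T = Xs[0].shape[1], len(Xs)
    W = np.zeros((d, T))
    if step is None:
        # crude step size from the largest per-task Lipschitz constant
        step = 1.0 / max(np.linalg.norm(X, 2) ** 2 / X.shape[0] for X in Xs)
    for _ in range(n_iter):
        grad = np.column_stack([
            X.T @ (X @ W[:, t] - y) / X.shape[0]
            for t, (X, y) in enumerate(zip(Xs, ys))
        ])
        W = group_soft_threshold(W - step * grad, step * lam)
    return W
```

With a common design matrix across tasks, this reduces to the standard Group Lasso on a stacked regression; the shared row-wise penalty is what encodes the common sparsity pattern assumed in the abstract.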
Algorithms for Learning Kernels Based on Centered Alignment
"... This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for le ..."
Cited by 18 (0 self)
This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression.
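The centered alignment similarity used throughout the abstract has a simple closed form. The sketch below computes it between two kernel matrices; the proportional weighting in the last helper is only a heuristic stand-in (my own, hypothetical) for the QP-based maximum-alignment algorithm the paper actually proposes.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix: Kc = H K H with H = I - (1/n) * ones."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Centered alignment <K1c, K2c>_F / (||K1c||_F ||K2c||_F), in [-1, 1]."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    denom = np.linalg.norm(K1c, "fro") * np.linalg.norm(K2c, "fro")
    return float(np.sum(K1c * K2c) / denom)

def align_weighted_combination(base_kernels, y):
    """Heuristic (not the paper's QP): weight each base kernel by its
    centered alignment with the target kernel y y^T and combine them."""
    Ky = np.outer(y, y)
    weights = np.array([max(centered_alignment(K, Ky), 0.0) for K in base_kernels])
    weights = weights / (weights.sum() + 1e-12)
    return sum(w * K for w, K in zip(weights, base_kernels))
```

The combined kernel can then be handed to any standard kernel method (e.g., an SVM or kernel ridge regression) in place of a single fixed kernel.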
Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness
The Annals of Statistics, 2013
"... We investigate the learning rate of multiple kernel leaning (MKL) with ℓ1 and elastic-net regularizations. The elastic-net regularization is a composition of an ℓ1-regularizer for inducing the sparsity and an ℓ2-regularizer for controlling the smoothness. We focus on a sparse setting where the total ..."
Cited by 8 (4 self)
We investigate the learning rate of multiple kernel learning (MKL) with ℓ1 and elastic-net regularizations. The elastic-net regularization is a composition of an ℓ1-regularizer for inducing sparsity and an ℓ2-regularizer for controlling smoothness. We focus on a sparse setting where the total number of kernels is large but the number of non-zero components of the ground truth is relatively small, and show sharper convergence rates than those previously established for both ℓ1 and elastic-net regularizations. Our analysis reveals a trade-off between sparsity and smoothness when choosing between the two regularizers: if the ground truth is smooth, the elastic-net regularization is preferred; otherwise, the ℓ1 regularization is preferred.
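As a hedged sketch of the kind of objective this analysis covers, an elastic-net MKL estimator combines an ℓ1-type penalty (sparsity over kernels) with a squared penalty (smoothness) on the component functions f_m in each RKHS H_m; the exact norms and scaling used in the paper may differ.

```latex
\min_{f_m\in\mathcal H_m}\;
\frac{1}{n}\sum_{i=1}^{n}\Big(y_i-\sum_{m=1}^{M} f_m(x_i)\Big)^{2}
\;+\;\lambda_1\sum_{m=1}^{M}\|f_m\|_{\mathcal H_m}
\;+\;\lambda_2\sum_{m=1}^{M}\|f_m\|_{\mathcal H_m}^{2}.
```

Setting λ2 = 0 recovers ℓ1-MKL; the trade-off stated in the abstract says, roughly, that a smooth ground truth favors keeping λ2 > 0, while an unsmooth one favors the pure ℓ1 penalty.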