Results 1-10 of 202
Structured variable selection with sparsity-inducing norms
, 2011
Cited by 193 (31 self)
We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1-norm and the group ℓ1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low- and high-dimensional settings.
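To make the link between groups and patterns concrete, here is a minimal Python sketch. It assumes unit group weights and uses brute-force enumeration, so it is only meant for tiny examples and is not the paper's forward or active set algorithm. The function names `structured_norm` and `allowed_nonzero_patterns` are hypothetical. The sketch relies on the fact that the zero patterns induced by such a norm are unions of groups, so the allowed nonzero patterns are their complements.

```python
import math


def structured_norm(w, groups):
    """Sum of Euclidean norms of w restricted to each group.

    Groups may overlap. Unit group weights are an assumption of this sketch.
    """
    return sum(math.sqrt(sum(w[j] ** 2 for j in g)) for g in groups)


def allowed_nonzero_patterns(p, groups):
    """Complements of all unions of groups, enumerated by brute force."""
    zero_patterns = {frozenset()}
    for g in groups:
        # Each existing union can either absorb the group or not.
        zero_patterns |= {z | frozenset(g) for z in zero_patterns}
    full = frozenset(range(p))
    return {full - z for z in zero_patterns}


# Prefix and suffix groups on three variables.
groups = [[0], [0, 1], [2], [1, 2]]
print(structured_norm([3.0, 4.0, 0.0], groups))  # 3 + 5 + 0 + 4 = 12.0
print(sorted(sorted(s) for s in allowed_nonzero_patterns(3, groups)))
```

With these prefix and suffix groups, the allowed nonzero patterns are exactly the contiguous intervals of {0, 1, 2} (plus the empty set); the non-contiguous pattern {0, 2} is excluded. With singleton groups the norm reduces to the ℓ1-norm.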
Structured Sparsity through Convex Optimization
Cited by 48 (7 self)
Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through the regularization by the ℓ1-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the ℓ1-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection. Key words and phrases: Sparsity, convex optimization.
Self-concordant analysis for logistic regression
Cited by 48 (15 self)
Most of the non-asymptotic theoretical work in regression is carried out for the square loss, where estimators can be obtained through closed-form expressions. In this paper, we use and extend tools from the convex optimization literature, namely self-concordant functions, to provide simple extensions of theoretical results for the square loss to the logistic loss. We apply the extension techniques to logistic regression with regularization by the ℓ2-norm and regularization by the ℓ1-norm, showing that new results for binary classification through logistic regression can be easily derived from corresponding results for least-squares regression.
P-values for high-dimensional regression
, 2009
Cited by 36 (2 self)
Assigning significance in high-dimensional regression is challenging. Most computationally efficient selection algorithms cannot guard against inclusion of noise variables. Asymptotically valid p-values are not available. An exception is a recent proposal by Wasserman and Roeder (2008) which splits the data into two parts. The number of variables is then reduced to a manageable size using the first split, while classical variable selection techniques can be applied to the remaining variables, using the data from the second split. This yields asymptotic error control under minimal conditions. It involves, however, a one-time random split of the data. Results are sensitive to this arbitrary choice: it amounts to a “p-value lottery” and makes it difficult to reproduce results. Here, we show that inference across multiple random splits can be aggregated, while keeping asymptotic control over the inclusion of noise variables. In addition, the proposed aggregation is shown to improve power, while reducing the number of falsely selected variables substantially. Keywords: High-dimensional variable selection, data splitting, multiple comparisons.
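The abstract does not spell out the aggregation formula, so the following is only a sketch under an assumption: a quantile rule with a fixed level `gamma`, in which the per-split p-values of one variable are divided by `gamma` and their empirical `gamma`-quantile is capped at 1. The function name `aggregate_pvalues` and the example p-values are hypothetical.

```python
import math


def aggregate_pvalues(pvals, gamma=0.5):
    """Aggregate the p-values of one variable across B random splits.

    Assumed rule: min(1, empirical gamma-quantile of {p_b / gamma}).
    With gamma = 0.5 this is roughly twice the median p-value.
    """
    if not pvals:
        raise ValueError("need at least one p-value")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    scaled = sorted(p / gamma for p in pvals)
    # Empirical quantile: smallest value with at least a gamma fraction at or below it.
    k = max(1, math.ceil(gamma * len(scaled)))
    return min(1.0, scaled[k - 1])


# Five hypothetical splits: a single unlucky split (0.9) no longer decides the result.
print(aggregate_pvalues([0.01, 0.02, 0.9, 0.03, 0.5]))
```

The point of the design is that a single split's p-value can be extreme in either direction, while a quantile over many splits is far less sensitive to any one of them.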
Estimating high-dimensional intervention effects from observational data
 THE ANN OF STAT
, 2009
Cited by 34 (7 self)
We assume that we have observational data generated from an unknown underlying directed acyclic graph (DAG) model. A DAG is typically not identifiable from observational data, but it is possible to consistently estimate the equivalence class of a DAG. Moreover, for any given DAG, causal effects can be estimated using intervention calculus. In this paper, we combine these two parts. For each DAG in the estimated equivalence class, we use intervention calculus to estimate the causal effects of the covariates on the response. This yields a collection of estimated causal effects for each covariate. We show that the distinct values in this set can be consistently estimated by an algorithm that uses only local information of the graph. This local approach is computationally fast and feasible in high-dimensional problems. We propose to use summary measures of the set of possible causal effects to determine variable importance. In particular, we use the minimum absolute value of this set, since that is a lower bound on the size of the causal effect. We demonstrate the merits of our methods in a simulation study and on a data set about riboflavin production.
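The summary measure described above can be illustrated with a short sketch. It covers only the final ranking step and assumes the sets of possible causal effects have already been estimated by some other means; the function names and the numbers are hypothetical.

```python
def min_abs_effect(effects):
    """Lower bound on the size of the causal effect: smallest absolute value in the set."""
    return min(abs(e) for e in effects)


def rank_covariates(possible_effects):
    """Order covariates by decreasing minimum absolute possible effect."""
    scores = {name: min_abs_effect(vals) for name, vals in possible_effects.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical sets of possible effects, one value per DAG in the equivalence class.
possible_effects = {
    "x1": [0.8, 1.1],
    "x2": [0.0, 2.0],
    "x3": [-0.5, -0.6, -0.7],
}
print(rank_covariates(possible_effects))  # ['x1', 'x3', 'x2']
```

Here `x2` ranks last even though one of its possible effects is the largest, because at least one DAG in the class is compatible with no effect at all.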
PAC-Bayesian bounds for sparse regression estimation with exponential weights
 Electronic Journal of Statistics
Cited by 31 (5 self)
We consider the sparse regression model where the number of parameters p is larger than the sample size n. The difficulty when considering high-dimensional problems is to propose estimators achieving a good compromise between statistical and computational performances. The BIC estimator, for instance, performs well from the statistical point of view [11] but can only be computed for values of p of at most a few tens. The Lasso estimator is the solution of a convex minimization problem, hence computable for large values of p. However, stringent conditions on the design are required to establish fast rates of convergence for this estimator. Dalalyan and Tsybakov [19] propose a method achieving a good compromise between the statistical and computational aspects of the problem. Their estimator can be computed for reasonably large p and satisfies nice statistical properties under weak assumptions on the design. However, [19] proposes sparsity oracle inequalities in expectation for the empirical excess risk only. In this paper, we propose an aggregation procedure similar to that of [19] but with improved statistical performances. Our main theoretical result is a sparsity oracle inequality in probability for the true excess risk for a version of the exponential weight estimator. We also propose an MCMC method to compute our estimator for reasonably large values of p.
Trace lasso: A trace norm regularization for correlated designs
 In Advances in Neural Information Processing Systems 24
, 2011
Cited by 27 (2 self)
Using the ℓ1-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm, which is a convex surrogate of the rank, of the selected covariates as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.
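The abstract does not write the penalty out. As we understand it, the trace Lasso is the trace norm of the design matrix with its columns rescaled by the coefficients; the notation below (X for the design matrix with unit-norm columns, w for the parameter vector) is an assumption of this sketch. Two limiting cases show how the penalty adapts to correlation.

```latex
% Trace Lasso penalty: trace (nuclear) norm of the rescaled design matrix.
\Omega(w) = \lVert X \operatorname{Diag}(w) \rVert_{*}

% Orthogonal unit-norm columns (X^\top X = I): the singular values of
% X \operatorname{Diag}(w) are |w_1|, \dots, |w_p|, so
\Omega(w) = \sum_{j=1}^{p} |w_j| = \lVert w \rVert_1

% Identical unit-norm columns (X = x \mathbf{1}^\top, \lVert x \rVert_2 = 1):
% X \operatorname{Diag}(w) = x w^\top has rank one, so
\Omega(w) = \lVert x \rVert_2 \, \lVert w \rVert_2 = \lVert w \rVert_2
```

Between these extremes the penalty interpolates between ℓ1-like behavior for uncorrelated covariates and ℓ2-like behavior for strongly correlated ones.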
Sparsistent learning of varying-coefficient models with structural changes
 In NIPS
, 2009
Cited by 19 (1 self)
To estimate the changing structure of a varying-coefficient varying-structure (VCVS) model remains an important and open problem in dynamic system modelling, which includes learning trajectories of stock prices, or uncovering the topology of an evolving gene network. In this paper, we investigate sparsistent learning of a subfamily of this model: piecewise constant VCVS models. We analyze two main issues in this problem: inferring time points where structural changes occur and estimating model structure (i.e., model selection) on each of the constant segments. We propose a two-stage adaptive procedure, which first identifies jump points of structural changes and then identifies relevant covariates to a response on each of the segments. We provide an asymptotic analysis of the procedure, showing that with the increasing sample size, number of structural changes, and number of variables, the true model can be consistently selected. We demonstrate the performance of the method on synthetic data and apply it to the brain computer interface dataset. We also consider how this applies to structure estimation of time-varying probabilistic graphical models.
Causal network inference via group sparse regularization
 IEEE Trans. Signa
, 2011
A significance test for the lasso
Cited by 17 (1 self)
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a χ²₁ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than χ²₁ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the ℓ1 penalty. Therefore the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
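Two points in this abstract lend themselves to a small illustration, sketched here under simplifying assumptions that are ours rather than the paper's: an orthonormal design, unit noise variance, and the global null, so that the RSS drop from adding predictor j is the square of a standard normal draw. The sketch does not compute the covariance test statistic itself. The function names are hypothetical.

```python
import math
import random


def exp1_pvalue(t):
    """Upper-tail probability P(E > t) for E ~ Exp(1), the asymptotic null law."""
    return math.exp(-max(t, 0.0))


def mean_rss_drop(p, reps=2000, seed=0):
    """Average RSS drop under the global null (orthonormal design, unit noise).

    Returns (fixed, adaptive): the drop for one pre-specified predictor and
    the drop for the predictor chosen greedily as the best of p.
    """
    rng = random.Random(seed)
    fixed = adaptive = 0.0
    for _ in range(reps):
        drops = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
        fixed += drops[0]       # chi-squared with 1 degree of freedom, mean 1
        adaptive += max(drops)  # greedy choice inflates the drop
    return fixed / reps, adaptive / reps


fixed, adaptive = mean_rss_drop(p=50)
print(f"fixed predictor: {fixed:.2f}, greedily chosen predictor: {adaptive:.2f}")
print(f"p-value for a statistic of 3.0 under Exp(1): {exp1_pvalue(3.0):.3f}")
```

The fixed-predictor average stays close to 1, the mean of a χ²₁ variable, while the greedily chosen one is several times larger, which is why a naive chi-squared comparison is no longer appropriate after adaptive selection.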