Results 1 - 10 of 28
Confidence intervals and hypothesis testing for high-dimensional regression. arXiv:1306.3171
"... Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely cha ..."
Abstract - Cited by 31 (3 self)
Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance such as confidence intervals or p-values. We consider here a broad class of regression problems, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a ‘de-biased’ version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. Furthermore, the proofs are remarkably simple. We test our method on a diabetes prediction problem.
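The de-biasing construction described here has a simple generic form: add a correction term (1/n)·M·Xᵀ(y − Xθ̂) to a regularized estimate θ̂. A minimal sketch, with a pseudo-inverse of the sample covariance standing in for the paper's carefully constructed M (the function name and tuning value are illustrative, and centered data are assumed):

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, alpha=0.1):
    """One-step 'de-biased' lasso sketch:
    theta_u = theta_hat + (1/n) * M @ X.T @ (y - X @ theta_hat).
    Here M is a crude plug-in pseudo-inverse of the sample covariance;
    in high dimensions the paper builds M far more carefully."""
    n, p = X.shape
    theta = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
    Sigma = X.T @ X / n
    M = np.linalg.pinv(Sigma)  # crude stand-in for the paper's construction
    return theta + M @ X.T @ (y - X @ theta) / n
```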
A significance test for the lasso
"... In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test st ..."
Abstract - Cited by 20 (2 self)
In the sparse linear regression setting, we consider testing the significance of the predictor variable that enters the current lasso model, in the sequence of models visited along the lasso solution path. We propose a simple test statistic based on lasso fitted values, called the covariance test statistic, and show that when the true model is linear, this statistic has an Exp(1) asymptotic distribution under the null hypothesis (the null being that all truly active variables are contained in the current lasso model). Our proof of this result for the special case of the first predictor to enter the model (i.e., testing for a single significant predictor variable against the global null) requires only weak assumptions on the predictor matrix X. On the other hand, our proof for a general step in the lasso path places further technical assumptions on X and the generative model, but still allows for the important high-dimensional case p > n, and does not necessarily require that the current lasso model achieves perfect recovery of the truly active variables. Of course, for testing the significance of an additional variable between two nested linear models, one typically uses the chi-squared test, comparing the drop in residual sum of squares (RSS) to a χ²₁ distribution. But when this additional variable is not fixed, and has been chosen adaptively or greedily, this test is no longer appropriate: adaptivity makes the drop in RSS stochastically much larger than χ²₁ under the null hypothesis. Our analysis explicitly accounts for adaptivity, as it must, since the lasso builds an adaptive sequence of linear models as the tuning parameter λ decreases. In this analysis, shrinkage plays a key role: though additional variables are chosen adaptively, the coefficients of lasso active variables are shrunken due to the ℓ1 penalty. Therefore the test statistic (which is based on lasso fitted values) is in a sense balanced by these two opposing properties, adaptivity and shrinkage, and its null distribution is tractable and asymptotically Exp(1).
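The covariance test compares lasso fitted values at the next knot of the path, with and without the entering variable. A rough sketch along a LARS/lasso path, assuming standardized predictors, a known noise variance, and no variables dropping out of the active set (function name and details are illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path, Lasso

def covariance_test(X, y, sigma2):
    """Covariance test statistics along the lasso path (sketch).
    T_k = (<y, X b(lam_{k+1})> - <y, X_A b_A(lam_{k+1})>) / sigma2,
    where A is the active set just before the k-th variable enters."""
    alphas, _, coefs = lars_path(X, y, method="lasso")
    stats = []
    for k in range(1, len(alphas)):
        lam_next = alphas[k]
        if lam_next <= 0:          # skip the unregularized end of the path
            break
        full = X @ coefs[:, k]                 # lasso fit at lambda_{k+1}
        A = np.flatnonzero(coefs[:, k - 1])    # active set before this step
        if A.size:
            sub = Lasso(alpha=lam_next, fit_intercept=False).fit(X[:, A], y)
            reduced = X[:, A] @ sub.coef_
        else:
            reduced = np.zeros_like(y)         # first step: empty model
        stats.append((y @ full - y @ reduced) / sigma2)
    return np.array(stats)
```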
Sparse methods for biomedical data - SIGKDD Explor. Newsl., 2012
"... Following recent technological revolutions, the investigation of massive biomedical data with growing scale, diversity, and complexity has taken a center stage in modern data analysis. Although complex, the underlying representations of many biomedical data are often sparse. For example, for a certa ..."
Abstract - Cited by 12 (2 self)
Following recent technological revolutions, the investigation of massive biomedical data with growing scale, diversity, and complexity has taken center stage in modern data analysis. Although complex, the underlying representations of many biomedical data are often sparse. For example, for a certain disease such as leukemia, even though humans have tens of thousands of genes, only a few genes are relevant to the disease; a gene network is sparse since a regulatory pathway involves only a small number of genes; many biomedical signals are sparse or compressible in the sense that they have concise representations when expressed in a proper basis. Therefore, finding sparse representations is fundamentally important for scientific discovery. Sparse methods based on the ℓ1 norm have attracted a great deal of research effort in the past decade due to their sparsity-inducing property, convenient convexity, and strong theoretical guarantees. They have achieved great success in various applications such as biomarker selection, biological network construction, and magnetic resonance imaging. In this paper, we review state-of-the-art sparse methods and their applications to biomedical data.
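As a toy illustration of the sparsity-inducing property described above, an ℓ1-penalized fit on simulated "gene expression" data recovers a handful of relevant features out of thousands (all names and numbers here are made up):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy biomarker selection: thousands of "genes", only a few truly relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2000))     # 100 samples, 2000 features
beta = np.zeros(2000)
beta[:5] = 2.0                           # only 5 features carry signal
y = X @ beta + rng.standard_normal(100)

model = LassoCV(cv=5).fit(X, y)          # l1 penalty zeroes most coefficients
selected = np.flatnonzero(model.coef_)
print(len(selected), "features selected")
```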
Uniform Post-Selection Inference for LAD Regression and Other Z-Estimation Problems. arXiv e-print, 2013
"... Abstract. We develop uniformly valid confidence regions for a regression coefficient in a high-dimensional sparse LAD (least absolute deviation or median) regression model. The setting is one where the number of regressors could be large in comparison to the sample size , but only ≪ of them are nee ..."
Abstract - Cited by 7 (5 self)
We develop uniformly valid confidence regions for a regression coefficient in a high-dimensional sparse LAD (least absolute deviation or median) regression model. The setting is one where the number of regressors p could be large in comparison to the sample size n, but only s ≪ n of them are needed to accurately describe the regression function. Our new methods are based on the instrumental LAD regression estimator that assembles the optimal estimating equation from either post-ℓ1-penalized LAD regression or ℓ1-penalized LAD regression. The estimating equation is immunized against non-regular estimation of the nuisance part of the regression function, in the sense of Neyman. We establish that in a homoscedastic regression model, under certain conditions, the instrumental LAD regression estimator of the regression coefficient is asymptotically root-n normal uniformly with respect to the underlying sparse model. The resulting confidence regions are valid uniformly with respect to the underlying model. The new inference methods outperform the naive, "oracle based" inference methods, which are known not to be uniformly valid, with the coverage property failing to hold uniformly with respect to the underlying model, even in the setting with p = 2. We also provide Monte Carlo experiments which demonstrate that standard post-selection inference breaks down over large parts of the parameter space, whereas the proposed method does not.
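A crude sketch of the three-step idea described above, assuming centered data and illustrative penalty levels; the grid search in the last step is a simplistic stand-in for properly solving the estimating equation:

```python
import numpy as np
from sklearn.linear_model import Lasso, QuantileRegressor

def instrumental_lad(y, d, X, lam_lad=0.05, lam_lasso=0.05):
    """Sketch: inference on the coefficient of a target regressor d.
    1) l1-penalized LAD of y on (d, X) to estimate nuisance coefficients;
    2) lasso of d on X; the residual z acts as an 'instrument';
    3) pick alpha so the median estimating equation is closest to zero."""
    W = np.column_stack([d, X])
    lad = QuantileRegressor(quantile=0.5, alpha=lam_lad,
                            fit_intercept=False).fit(W, y)
    beta = lad.coef_[1:]                       # nuisance part
    z = d - Lasso(alpha=lam_lasso, fit_intercept=False).fit(X, d).predict(X)

    def score(a):
        # mean((1/2 - 1{y - d*a - X@beta <= 0}) * z), the median moment
        return np.mean((0.5 - (y - d * a - X @ beta <= 0)) * z)

    grid = np.linspace(-5.0, 5.0, 2001)        # naive root search
    return grid[np.argmin([abs(score(a)) for a in grid])]
```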
Exact Post Model Selection Inference for Marginal Screening, 2014
"... We develop a framework for post model selection inference, via marginal screening, in linear regression. At the core of this framework is a result that characterizes the exact distribution of linear functions of the response y, conditional on the model being selected (“condi-tion on selection ” fram ..."
Abstract - Cited by 4 (2 self)
We develop a framework for post model selection inference, via marginal screening, in linear regression. At the core of this framework is a result that characterizes the exact distribution of linear functions of the response y, conditional on the model being selected (the “condition on selection” framework). This allows us to construct valid confidence intervals and hypothesis tests for regression coefficients that account for the selection procedure. In contrast to recent work in high-dimensional statistics, our results are exact (non-asymptotic) and require no eigenvalue-like assumptions on the design matrix X. Furthermore, the computational cost of marginal regression, constructing confidence intervals and hypothesis testing is negligible compared to the cost of linear regression, thus making our methods particularly suitable for extremely large datasets. Although we focus on marginal screening to illustrate the applicability of the condition on selection framework, this framework is much more broadly applicable. We show how to apply the proposed framework to several other selection procedures including orthogonal matching pursuit, non-negative least squares, and marginal screening+Lasso.
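The "condition on selection" idea is easiest to see in one dimension: if a quantity is tested only because it survived a screening threshold, the valid reference distribution is the Gaussian truncated to the selection event. A toy sketch of that idea (not the paper's general polyhedral machinery):

```python
from scipy.stats import norm

def selective_pvalue(y_obs, c, mu0=0.0, sigma=1.0):
    """P(Y >= y_obs | Y > c) under H0: Y ~ N(mu0, sigma^2).
    The reference distribution is the normal truncated to the selection
    event {Y > c}; the naive p-value norm.sf(...) ignores the selection
    and is anti-conservative."""
    num = norm.sf((y_obs - mu0) / sigma)
    den = norm.sf((c - mu0) / sigma)
    return num / den

# Example: screened at c = 2, observed y = 2.5.
print(selective_pvalue(2.5, 2.0))   # much larger than norm.sf(2.5)
```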
A split-and-conquer approach for analysis of extraordinarily large data. Statistica Sinica, forthcoming, 2014
"... Abstract: If there are datasets, too large to fit into a single computer or too expensive for a computationally intensive data analysis, what should we do? We propose a split-and-conquer approach and illustrate it using several computationally intensive penalized regression methods, along with a th ..."
Abstract - Cited by 3 (0 self)
If a dataset is too large to fit into a single computer, or too expensive for a computationally intensive analysis, what should we do? We propose a split-and-conquer approach and illustrate it using several computationally intensive penalized regression methods, along with theoretical support. We show that the split-and-conquer approach can substantially reduce computing time and computer memory requirements. The proposed methodology is illustrated numerically using both simulation and data examples.
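A minimal sketch of the split-and-conquer idea for penalized regression, assuming exchangeable rows and an illustrative combination rule (majority vote on the support, then averaging); the paper's actual combination step is more refined:

```python
import numpy as np
from sklearn.linear_model import Lasso

def split_and_conquer_lasso(X, y, K=10, alpha=0.1, vote=0.5):
    """Fit a lasso on each of K row-blocks, then combine: keep a
    coefficient only if a majority of blocks select it, and average
    the surviving estimates. Assumes rows are exchangeable."""
    blocks = np.array_split(np.arange(len(y)), K)
    C = np.array([Lasso(alpha=alpha, fit_intercept=False)
                  .fit(X[b], y[b]).coef_ for b in blocks])
    keep = (C != 0).mean(axis=0) > vote    # majority vote on the support
    return np.where(keep, C.mean(axis=0), 0.0)
```

Each block fit touches only n/K rows, so memory and (for many solvers) time drop roughly by a factor of K while the combined estimate remains close to the full-data fit.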
Gaussian graphical model estimation with false discovery rate control - Annals of Statistics, 2013
"... This paper studies the estimation of high dimensional Gaussian graphical model (GGM). Typically, the existing methods depend on regularization techniques. As a result, it is necessary to choose the regularized parameter. However, the precise relationship between the regularized parameter and the num ..."
Abstract - Cited by 3 (1 self)
This paper studies the estimation of high-dimensional Gaussian graphical models (GGMs). Typically, existing methods depend on regularization techniques, so it is necessary to choose a regularization parameter. However, the precise relationship between the regularization parameter and the number of false edges in GGM estimation is unclear, which makes it impossible to evaluate the performance of such methods rigorously. In this paper, we propose an alternative method based on a multiple testing procedure. Building on new test statistics for conditional dependence, we propose a simultaneous testing procedure for conditional dependence in a GGM. Our method can control the false discovery rate (FDR) asymptotically, and numerical results show that it works quite well.
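A toy version of FDR-controlled edge selection in a GGM, using Fisher-z tests on plug-in partial correlations and Benjamini-Hochberg as a generic stand-in for the paper's tailored procedure; it needs n > p, unlike the paper's high-dimensional setting:

```python
import numpy as np
from scipy.stats import norm

def ggm_edges_bh(X, q=0.1):
    """Select edges (i, j) whose partial correlation is significantly
    nonzero, controlling FDR at level q via Benjamini-Hochberg."""
    n, p = X.shape
    prec = np.linalg.inv(np.cov(X, rowvar=False))   # requires n > p
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)                  # partial correlations
    iu = np.triu_indices(p, 1)
    z = np.sqrt(n - p - 1) * np.arctanh(pcorr[iu])  # rough Fisher-z scaling
    pvals = 2 * norm.sf(np.abs(z))
    order = np.argsort(pvals)
    m = pvals.size
    passed = np.flatnonzero(pvals[order] <= q * np.arange(1, m + 1) / m)
    k = passed[-1] + 1 if passed.size else 0        # BH step-up cutoff
    return list(zip(iu[0][order[:k]], iu[1][order[:k]]))
```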
Inference in High Dimensions with the Penalized Score Test. arXiv:1401.2678v1, 2014
"... In recent years, there has been considerable theoretical development regarding variable selection consistency of penalized regression techniques, such as the lasso. However, there has been relatively little work on quantifying the uncertainty in these selection procedures. In this paper, we propose ..."
Abstract - Cited by 3 (0 self)
In recent years, there has been considerable theoretical development regarding variable selection consistency of penalized regression techniques, such as the lasso. However, there has been relatively little work on quantifying the uncertainty in these selection procedures. In this paper, we propose a new method for inference in high dimensions using a score test based on penalized regression. In this test, we perform penalized regression of an outcome on all but a single feature, and test for correlation of the residuals with the held-out feature. This procedure is applied to each feature in turn. Interestingly, when an ℓ1 penalty is used, the sparsity pattern of the lasso corresponds exactly to a decision based on the proposed test. Further, when an ℓ2 penalty is used, the test corresponds precisely to a score test in a mixed effects model, in which the effects of all but one feature are assumed to be random. We formulate the hypothesis being tested as a compromise between the null hypotheses tested in simple linear regression on each feature and in multiple linear regression on all features, and develop reference distributions for some well-known penalties. We also examine the behavior of the test on real and simulated data.
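The test loop is easy to sketch from the description above: for each feature, run a penalized regression on the remaining features and correlate the residuals with the held-out column (the penalty level and function name here are illustrative, and an ℓ1 penalty stands in for the general penalized fit):

```python
import numpy as np
from sklearn.linear_model import Lasso

def penalized_score_stats(X, y, alpha=0.1):
    """For each feature j: penalized regression of y on X_{-j}, then the
    score statistic is the correlation of the residuals with column j."""
    n, p = X.shape
    stats = np.empty(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=alpha, fit_intercept=False).fit(X[:, others], y)
        resid = y - fit.predict(X[:, others])
        stats[j] = X[:, j] @ resid / n     # held-out-feature score
    return stats
```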
Asymptotic properties of lasso+mls and lasso+ridge in sparse high-dimensional linear regression - Electronic Journal of Statistics, 2013
"... ar ..."
(Show Context)