Results 1 – 5 of 5
Rare and weak effects in large-scale inference: methods and phase diagrams
, 2014
Abstract

Cited by 2 (2 self)
Often when we deal with ‘Big Data’, the true effects we are interested in are Rare and Weak (RW). Researchers measure a large number of features, hoping to find perhaps only a small fraction of them to be relevant to the research in question; the effect sizes of the relevant features are individually small, so the true effects are not strong enough to stand out by themselves. Higher Criticism (HC) and Graphlet Screening (GS) are two classes of methods that are specifically designed for the Rare/Weak settings. HC was introduced to determine whether there are any relevant effects in all the measured features. More recently, HC was applied to classification, where it provides a method for selecting useful predictive features for trained classification rules. GS was introduced as a graph-guided multivariate screening procedure, and was used for variable selection. We develop a theoretical framework where we use an Asymptotic Rare and Weak (ARW) model simultaneously controlling the size and prevalence of useful/significant features among the useless/null bulk. At the heart of the ARW model is the so-called phase diagram, which is a way to visualize clearly the class of ARW settings where the relevant effects are so rare or weak that desired goals (signal detection, variable selection, etc.) are simply impossible to achieve. We show that HC and GS have important advantages over better-known procedures and achieve the optimal phase diagrams in a variety of ARW settings. HC and GS are flexible ideas that adapt easily to many interesting situations. We review the basics of these ideas and some of the recent extensions, discuss their connections to existing literature, and suggest some new applications of these ideas.
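The detection statistic reviewed in this abstract has a simple closed form: given sorted p-values p(1) ≤ … ≤ p(n), HC is the maximal standardized gap between the empirical distribution of the p-values and Uniform(0,1). The sketch below is our own minimal illustration of one common form of the statistic; the function name and the choice alpha0 = 0.5 for the scan range are ours, not from the paper.

```python
import numpy as np

def higher_criticism(pvalues, alpha0=0.5):
    """One common form of the HC statistic: the maximal standardized gap
    between the empirical distribution of the p-values and Uniform(0,1),
    scanned over the smallest alpha0 fraction of the sorted p-values."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    k = max(1, int(alpha0 * n))
    return float(hc[:k].max())

# Under the global null, all p-values are Uniform(0,1) and HC stays modest;
# a rare handful of strong effects drives it up sharply.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=1000)
spiked = null_p.copy()
spiked[:20] = 1e-6          # 2% of features carry real effects
```

Comparing `higher_criticism(spiked)` against `higher_criticism(null_p)` illustrates the detection use case: the statistic separates the rare/weak alternative from the null even though no single hypothesis test is decisive.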
Partial Correlation Screening for Estimating Large Precision Matrices, with Applications to Classification
Abstract

Cited by 1 (1 self)
Given n samples X1, X2,..., Xn from N(0,Σ), we are interested in estimating the p × p precision matrix Ω = Σ−1; we assume Ω is sparse in that each row has relatively few nonzeros. We propose Partial Correlation Screening (PCS) as a new row-by-row approach. To estimate the ith row of Ω, 1 ≤ i ≤ p, PCS uses a Screen step and a Clean step. In the Screen step, PCS recruits a (small) subset of indices using a stagewise algorithm, where in each stage, the algorithm updates the set of recruited indices by adding the index j that has the largest empirical partial correlation (in magnitude) with i, given the set of indices recruited so far. In the Clean step, PCS first reinvestigates all recruited indices in hopes of removing false positives, and then uses the resultant set of indices to reconstruct the ith row of Ω. PCS is computationally efficient and modest in memory use: to estimate a row of Ω, it only needs a few rows (determined sequentially) of the empirical covariance matrix. This enables PCS to execute the estimation of a large precision matrix (e.g., p = 10K)
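The Screen step described in this abstract is easy to state in code. The sketch below is our own simplified illustration, not the authors' implementation: partial correlations are read off inverted covariance submatrices, and the stopping rule (a fixed size cap plus a magnitude threshold) is a stand-in for the tuning used in the paper.

```python
import numpy as np

def partial_corr(S, i, j, given):
    """Empirical partial correlation of variables i and j given the
    variables in `given`, read off the inverted covariance submatrix."""
    idx = list(given) + [i, j]
    Om = np.linalg.inv(S[np.ix_(idx, idx)])
    return -Om[-2, -1] / np.sqrt(Om[-2, -2] * Om[-1, -1])

def screen_step(S, i, max_size=5, threshold=0.1):
    """Greedy Screen step (sketch): repeatedly recruit the index with the
    largest absolute partial correlation with i given the indices recruited
    so far, stopping at a size cap or when the best candidate is weak."""
    recruited = []
    candidates = [j for j in range(S.shape[0]) if j != i]
    while candidates and len(recruited) < max_size:
        j_star = max(candidates, key=lambda j: abs(partial_corr(S, i, j, recruited)))
        if abs(partial_corr(S, i, j_star, recruited)) < threshold:
            break
        recruited.append(j_star)
        candidates.remove(j_star)
    return recruited

# Demo on an exactly sparse precision matrix: a chain, where row i of
# Omega has nonzeros only at positions i-1, i, i+1.
p = 6
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
Sigma = np.linalg.inv(Omega)
```

On this chain example, `screen_step(Sigma, 2, max_size=2)` recruits the true neighbors {1, 3} of node 2, mirroring the abstract's point that the step only ever touches a few rows of the covariance matrix at a time.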
Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects
Abstract

Cited by 1 (1 self)
In modern high-throughput data analysis, researchers perform a large number of statistical tests, expecting to find perhaps a small fraction of significant effects against a predominantly null background. Higher Criticism (HC) was introduced to determine whether there are any nonzero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predictive features from a large body of potentially useful features, among which only a rare few will prove truly useful. In this article, we review the basics of HC in both the testing and feature selection settings. HC is a flexible idea, which adapts easily to new situations; we point out how it adapts to clique detection and bivariate outlier detection. HC, although still early in its development, is seeing increasing interest from practitioners; we illustrate this with worked examples. HC is computationally effective, which gives it a nice leverage in the increasingly more relevant “Big Data” settings we see today. We also review the underlying theoretical “ideology” behind HC. The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk. The RW model shows that HC has important advantages over better-known procedures such as False Discovery Rate (FDR) control and Familywise Error control (FwER), in particular, certain optimality properties. We discuss the rare/weak phase diagram, a way to visualize clearly the class of RW settings where the true signals are so rare or so weak that detection and feature selection are simply impossible, and a way to understand the known optimality properties of HC. Dedications. To the memory of John W. Tukey 1915–2000 and of Yuri I. Ingster 1946–2012, two pioneers in mathematical statistics. Key words. Classification; control of FDR; feature selection; Higher Criticism; large co
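In the feature-selection setting mentioned in this abstract, HC also yields a threshold: rank the features by p-value, maximize the HC objective over the ranked list, and keep every feature up to the maximizer. The sketch below is our own illustration of this idea; the function name, the alpha0 scan range, and the 1/n floor guarding against a degenerate maximum are our modeling choices, so consult the paper for the exact variant.

```python
import numpy as np
from math import erf

def hc_threshold(zscores, alpha0=0.1):
    """HC thresholding for feature selection (a sketch): rank features by
    two-sided normal p-value, maximize the HC objective over the smallest
    alpha0 fraction, and keep every feature up to the maximizer."""
    z = np.abs(np.asarray(zscores, dtype=float))
    # two-sided p-values under a standard normal null
    pv = 2.0 * (1.0 - 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2))))
    pv = np.clip(pv, 1e-300, 1.0 - 1e-16)
    order = np.argsort(pv)
    p = pv[order]
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p))
    # ignore p-values below 1/n so the maximum is not trivially attained
    # at the single smallest p-value
    hc[p < 1.0 / n] = -np.inf
    k = max(2, int(alpha0 * n))
    i_star = int(np.argmax(hc[:k]))
    return np.sort(order[: i_star + 1])

# 10 strong features among 1000: HC picks a data-driven cutoff near them.
rng = np.random.default_rng(2)
z = rng.normal(size=1000)
z[:10] += 6.0
sel = hc_threshold(z)
```

The appeal, as the abstract notes, is that no tuning parameter needs to be supplied by the user: the cutoff adapts to the data.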
Important Features PCA for high dimensional clustering
Abstract
We consider a clustering problem where we observe feature vectors Xi ∈ Rp, i = 1, 2,..., n, from K possible classes. The class labels are unknown and the main interest is to estimate them. We are primarily interested in the modern regime of p ≫ n, where classical clustering methods face challenges. We propose Important Features PCA (IFPCA) as a new clustering procedure. In IFPCA, we select a small fraction of features with the largest Kolmogorov-Smirnov (KS) scores, where the threshold is chosen by adapting the recent notion of Higher Criticism, obtain the first (K − 1) left singular vectors of the post-selection normalized data matrix, and then estimate the labels by applying the classical k-means to these singular vectors. It can be seen that IFPCA is a tuning-free clustering method. We apply IFPCA to 10 gene microarray data sets. The method has competitive clustering performance. In particular, in three of the data sets, the error rates of IFPCA are only 29% or less of the error rates of other methods. We also rediscover, on microarray data, a phenomenon on the empirical null reported in [16]. With delicate analysis, especially post-selection eigen-analysis, we derive tight probability bounds on the Kolmogorov-Smirnov statistics and show that IFPCA yields clustering consistency in a broad context. The clustering problem is connected to the problems of sparse PCA and low-rank matrix recovery, but it is different in important ways. We reveal an interesting phase transition phenomenon associated with these problems and identify the range of interest for each.
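The IFPCA recipe described in this abstract (KS scoring, feature thresholding, post-selection SVD, k-means) can be sketched end to end. The code below is our toy illustration, not the authors' implementation; in particular, we replace the Higher Criticism threshold with a fixed number of retained features, and use a bare-bones k-means rather than a library routine.

```python
import numpy as np
from math import erf

def ifpca_sketch(X, K, n_keep=50):
    """Toy sketch of the IFPCA pipeline (our simplification: the paper sets
    the feature threshold by Higher Criticism; here we keep a fixed number
    n_keep of top-scoring features)."""
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0)
    # (1) KS-type score per feature: max gap between the empirical CDF of
    # the standardized column and the standard normal CDF
    grid = np.sort(Z, axis=0)
    ecdf = np.arange(1, n + 1)[:, None] / n
    ncdf = 0.5 * (1.0 + np.vectorize(erf)(grid / np.sqrt(2)))
    ks = np.abs(ecdf - ncdf).max(0)
    # (2) keep the highest-scoring features
    keep = np.argsort(ks)[-n_keep:]
    # (3) first K-1 left singular vectors of the reduced data matrix
    U = np.linalg.svd(Z[:, keep], full_matrices=False)[0][:, :K - 1]
    # (4) plain k-means on the singular vectors
    rng = np.random.default_rng(0)
    centers = U[rng.choice(n, K, replace=False)]
    for _ in range(100):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([U[labels == k].mean(0) if np.any(labels == k)
                            else centers[k] for k in range(K)])
    return labels

# Two well-separated classes, with only 20 of 300 features informative.
rng = np.random.default_rng(1)
n, p = 200, 300
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, :20] += 6.0 * y[:, None]
labels = ifpca_sketch(X, K=2)
```

On this synthetic example the estimated labels recover the two classes up to a label swap, illustrating why screening out the 280 useless features before the spectral step helps in the p ≫ n regime.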
and Financial
, 2013
Abstract
Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This paper gives an overview of the salient features of Big Data and how these features drive changes of paradigm in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; they can lead to wrong statistical inferences and consequently wrong scientific conclusions.