Empirical Bayes estimates for large-scale prediction problems
, 2008
Classical prediction methods such as Fisher’s linear discriminant function were designed for small-scale problems, where the number of predictors N is much smaller than the number of observations n. Modern scientific devices often reverse this situation. A microarray analysis, for example, might include n = 100 subjects measured on N = 10,000 genes, each of which is a potential predictor. This paper proposes an empirical Bayes approach to large-scale prediction, where the optimum Bayes prediction rule is estimated employing the data from all the predictors. Microarray examples are used to illustrate the method. The results show a close connection with the shrunken centroids algorithm of Tibshirani et al. (2002), a frequentist regularization approach to large-scale prediction, and also with false discovery rate theory.
ON THE FALSE DISCOVERY RATE AND AN ASYMPTOTICALLY OPTIMAL REJECTION CURVE
, 903
In this paper we introduce and investigate a new rejection curve for asymptotic control of the false discovery rate (FDR) in multiple hypotheses testing problems. We first give a heuristic motivation for this new curve and propose some procedures related to it. Then we introduce a set of possible assumptions and give a unifying short proof of FDR control for procedures based on Simes’ critical values, whereby certain types of dependency are allowed. This methodology of proof is then applied to other fixed rejection curves including the proposed new curve. Among others, we investigate the problem of finding least favorable parameter configurations such that the FDR becomes largest. We then derive a series of results concerning asymptotic FDR control for procedures based on the new curve and discuss several example procedures in more detail. A main result will be an asymptotic optimality statement for various procedures based on the new curve in the class of fixed rejection curves. Finally, we briefly discuss strict FDR control for a finite number of hypotheses.
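For orientation, the familiar linear step-up rule built on Simes' critical values i·α/m can be sketched as below. This is only the classical baseline that the abstract refers to, not the new rejection curve the paper proposes; the function name `simes_step_up` and the example p-values are illustrative assumptions.

```python
def simes_step_up(p_values, alpha=0.05):
    """Linear step-up rule: reject the k smallest p-values, where k is the
    largest rank i with p_(i) <= i * alpha / m (Simes' critical values)."""
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k = rank  # keep the largest rank that satisfies the bound
    return sorted(order[:k])


# Hypothetical p-values, chosen only to illustrate the rule.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = simes_step_up(p, alpha=0.05)
print(rejected)
```

Other fixed rejection curves would replace the bound `rank * alpha / m` with a different function of the rank; the step-up search itself stays the same.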
Simultaneous Inference: When Should Hypothesis Testing Problems Be Combined?
Modern statisticians are often presented with hundreds or thousands of hypothesis testing problems to evaluate at the same time, generated from new scientific technologies such as microarrays, medical and satellite imaging devices, or flow cytometry counters. The relevant statistical literature tends to begin with the tacit assumption that a single combined analysis, for instance a False Discovery Rate assessment, should be applied to the entire set of problems at hand. This can be a dangerous assumption, as the examples in the paper show, leading to overly conservative or overly liberal conclusions within any particular subclass of the cases. A simple Bayesian theory yields a succinct description of the effects of separation or combination on false discovery rate analyses. The theory allows efficient testing within small subclasses, and has applications to “enrichment”, the detection of multicase effects. Key Words: false discovery rates, two-class model, enrichment. 1. Introduction. Modern scientific devices such as microarrays routinely provide the statistician with thousands of hypothesis testing problems to consider at the same time.
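The contrast between a combined and a separate analysis can be illustrated with a generic tail-area false discovery rate estimate, p0 · P0(Z ≥ z) / (observed fraction of z-values ≥ z). The sketch below is not the paper's theory: it assumes a standard normal null, takes the conservative choice p0 = 1, and uses made-up z-values; `tail_fdr`, `class_a`, and `class_b` are hypothetical names.

```python
import math


def tail_fdr(z_values, cutoff, p0=1.0):
    """Tail-area Fdr estimate at `cutoff`, assuming an N(0, 1) null."""
    # Null probability of exceeding the cutoff: 1 - Phi(cutoff).
    null_tail = 0.5 * math.erfc(cutoff / math.sqrt(2.0))
    # Observed fraction of cases at or beyond the cutoff.
    observed_tail = sum(z >= cutoff for z in z_values) / len(z_values)
    if observed_tail == 0:
        return float("nan")  # no cases in the tail, estimate undefined
    return min(1.0, p0 * null_tail / observed_tail)


# Made-up z-values: a small subclass rich in large values, and a larger null-like class.
class_a = [3.1, 2.8, 3.5, 0.2, -0.4]
class_b = [-1.2, 0.3, 0.8, -0.5, 1.1, -2.0, 0.0, 0.6, -0.9, 1.4,
           -0.1, 0.4, -1.6, 0.9, 0.2]

fdr_separate = tail_fdr(class_a, cutoff=2.5)
fdr_combined = tail_fdr(class_a + class_b, cutoff=2.5)
print(fdr_separate, fdr_combined)
```

In this toy data the combined estimate is larger than the estimate within `class_a`, which is the kind of overly conservative conclusion for a subclass that the abstract warns about.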
The Strength of Statistical Evidence for Composite Hypotheses: Inference to the Best Explanation
, 2010
A general function to quantify the weight of evidence in a sample of data for one hypothesis over another is derived from the law of likelihood and from a statistical formalization of inference to the best explanation. For a fixed parameter of interest, the resulting weight of evidence that favors one composite hypothesis over another is the likelihood ratio using the parameter value consistent with each hypothesis that maximizes the likelihood function over the parameter of interest. Since the weight of evidence is generally only known up to a nuisance parameter, it is approximated by replacing the likelihood function with a reduced likelihood function on the interest parameter space. Unlike the Bayes factor and unlike the p-value under interpretations that extend its scope, the weight of evidence is coherent in the sense that it cannot support a hypothesis over any hypothesis that it entails. Further, when comparing the hypothesis that the parameter lies outside a nontrivial interval to the hypothesis that it lies within the interval, the proposed method of weighing evidence almost always asymptotically favors the correct hypothesis
Inference with transposable data: modelling the effects of row and column correlations
 Journal of the Royal Statistical Society: Series B (Statistical Methodology)
, 2012
Summary. We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable, meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent due to latent variables or unknown batch effects. By modeling this matrix data using the matrix-variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: We simultaneously estimate row and column covariances and use these to sphere or decorrelate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance of the false discovery rate estimators.
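The sphering step can be sketched as follows, under the simplifying assumption that the row and column covariances are already known. The paper estimates both from the data, and that estimation is not shown here; the function name `sphere` and the small example matrices are illustrative only.

```python
import numpy as np


def sphere(X, row_cov, col_cov):
    """Decorrelate rows and columns of X: returns L_row^{-1} X L_col^{-T},
    where row_cov = L_row L_row^T and col_cov = L_col L_col^T (Cholesky)."""
    L_row = np.linalg.cholesky(row_cov)
    L_col = np.linalg.cholesky(col_cov)
    # Solve linear systems rather than forming explicit inverses.
    left = np.linalg.solve(L_row, X)          # L_row^{-1} X
    return np.linalg.solve(L_col, left.T).T   # (L_row^{-1} X) L_col^{-T}


# Hypothetical covariances; noise built so that sphering should recover Z exactly.
rng = np.random.default_rng(0)
row_cov = np.array([[1.0, 0.6],
                    [0.6, 1.0]])
col_cov = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
Z = rng.standard_normal((2, 3))               # independent entries
X = np.linalg.cholesky(row_cov) @ Z @ np.linalg.cholesky(col_cov).T
recovered = sphere(X, row_cov, col_cov)
print(np.allclose(recovered, Z))
```

With estimated rather than known covariances the recovery is only approximate, which matches the abstract's wording of "approximately independent rows and columns".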
Widespread Over-Expression of the X Chromosome in Sterile F1 Hybrid Mice
, 2010
The X chromosome often plays a central role in hybrid male sterility between species, but it is unclear if this reflects underlying regulatory incompatibilities. Here we combine phenotypic data with genome-wide expression data to directly associate aberrant expression patterns with hybrid male sterility between two species of mice. We used a reciprocal cross in which F1 males are sterile in one direction and fertile in the other direction, allowing us to associate expression differences with sterility rather than with other hybrid phenotypes. We found evidence of extensive overexpression of the X chromosome during spermatogenesis in sterile but not in fertile F1 hybrid males. Overexpression was most pronounced in genes that are normally expressed after meiosis, consistent with an X chromosome-wide disruption of expression during the later stages of spermatogenesis. This pattern was not a simple consequence of faster evolutionary divergence on the X chromosome, because X-linked expression was highly conserved between the two species. Thus, transcriptional regulation of the X chromosome during spermatogenesis appears particularly sensitive to evolutionary divergence between species. Overall, these data provide evidence for an underlying regulatory basis to reproductive isolation in house mice and
Multiple hypothesis testing, adjusting for latent variables
, 2011
In high-throughput settings we inspect a great many candidate variables (e.g. genes) searching for associations with a primary variable (e.g. a phenotype). High-throughput hypothesis testing can be made difficult by the presence of systemic effects and other latent variables. It is well known that those variables alter the level of tests and induce correlations between tests. It is less well known that dependencies can change the relative ordering of significance levels among hypotheses. Poor rankings lead to wasteful and ineffective follow-up studies. The problem becomes acute for latent variables that are correlated with the primary variable. We propose a two-stage analysis to counter the effects of latent variables on the ranking of hypotheses. Our method, called LEAPP, statistically isolates the latent variables from the primary one. In simulations it gives better ordering of hypotheses than competing methods such as SVA and EIGENSTRAT. For an illustration, we turn to data from the AGEMAP study relating gene expression to age for 16 tissues in the mouse. LEAPP generates rankings with greater consistency across tissues than the rankings attained by the other methods.
Power-Enhanced Multiple Decision Functions Controlling Family-Wise Error and False Discovery Rates
, 2009
Minimum Description Length and Empirical Bayes Methods of Identifying SNPs Associated with Disease
, 2010
Statistical validation of a global model for the distribution published in a scientific journal
 Journal of the American Society for Information Science
, 2010
A central issue in evaluative bibliometrics is the characterization of the citation distribution of papers in the scientific literature. Here, we perform a large-scale empirical analysis of journals from every field in Thomson Reuters’ Web of Science database. We find that only 30 of the 2,184 journals have citation distributions that are inconsistent with a discrete lognormal distribution at the rejection threshold that controls the false discovery rate at 0.05. We find that large, multidisciplinary journals are overrepresented in this set of 30 journals, leading us to conclude that, within a discipline, citation distributions are lognormal. Our results strongly suggest that the discrete lognormal distribution is a globally accurate model for the distribution of “eventual impact” of scientific papers published in a single-discipline journal in a single year that is removed sufficiently from the present date.