Feature selection in omics prediction problems using cat scores and false non-discovery rate control. (2010)

by M Ahdesmäki, K Strimmer
Venue: Ann. Appl. Statist.

Results 1 - 10 of 21

Gene ranking and biomarker discovery under correlation. Bioinformatics

by Verena Zuber, Korbinian Strimmer
"... Motivation: Biomarker discovery and gene ranking is a standard task in genomic high throughput analysis.Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may ..."
Abstract - Cited by 16 (3 self) - Add to MetaCart
Motivation: Biomarker discovery and gene ranking is a standard task in genomic high throughput analysis.Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. Results: We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores (“cat ” scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small sample data we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. Availability: The shrinkage cat score is implemented in the R package “st ” available from
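The construction is concrete enough to sketch in a few lines: the vector of gene-wise t-scores is premultiplied by the inverse matrix square root of the gene-gene correlation matrix, so that in the absence of correlation the cat score falls back to the ordinary t-score. A minimal base-R illustration (not the authors' "st" package; the plain sample correlation below stands in for the shrinkage estimator the abstract proposes, so it is only reasonable when n is not small relative to p):

# Correlation-adjusted t-scores (cat scores): tau_adj = R^(-1/2) %*% tau.
# Sketch only; the "st" package replaces cor() below with a shrinkage
# correlation estimate, which is essential when p >> n.
cat.scores <- function(X, y) {          # X: n x p data matrix, y: two-level factor
  g1 <- X[y == levels(y)[1], , drop = FALSE]
  g2 <- X[y == levels(y)[2], , drop = FALSE]
  n1 <- nrow(g1); n2 <- nrow(g2)
  # pooled within-group variance per gene
  v <- ((n1 - 1) * apply(g1, 2, var) + (n2 - 1) * apply(g2, 2, var)) / (n1 + n2 - 2)
  tau <- (colMeans(g1) - colMeans(g2)) / sqrt(v * (1/n1 + 1/n2))  # ordinary t-scores
  # inverse matrix square root of the pooled correlation matrix
  R <- cor(rbind(scale(g1, scale = FALSE), scale(g2, scale = FALSE)))
  e <- eigen(R, symmetric = TRUE)
  Rinvsqrt <- e$vectors %*% diag(1 / sqrt(pmax(e$values, 1e-8))) %*% t(e$vectors)
  drop(Rinvsqrt %*% tau)                # equals tau when R is the identity
}

set.seed(1)
X <- matrix(rnorm(40 * 10), 40, 10)
y <- factor(rep(c("A", "B"), each = 20))
round(cat.scores(X, y), 2)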

MALDIquant: a versatile R package for the analysis of mass spectrometry data., Bioinformatics 28

by Sebastian Gibb, Korbinian Strimmer, 2012
"... Summary: MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specif-ically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessin ..."
Abstract - Cited by 12 (5 self) - Add to MetaCart
Summary: MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specif-ically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessing, non-linear peak align-ment, and calibration. It also handles technical replicates as well as spectra with unequal resolution. Availability: MALDIquant and its associated R packages readBrukerFlexData and readMzXmlData are freely available from the R archive CRAN
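The pipeline the summary describes maps onto a short script. The sketch below runs the standard preprocessing steps on a synthetic spectrum; function names and arguments follow the MALDIquant vignette as best recalled here, so treat the exact signatures as assumptions rather than documentation:

library(MALDIquant)

# synthetic spectrum: two Gaussian peaks on a decaying baseline plus noise
m <- seq(1000, 5000, length.out = 2000)
i <- 50 * exp(-(m - 2000)^2 / 200) + 80 * exp(-(m - 3500)^2 / 300) +
     100 * exp(-m / 3000) + abs(rnorm(length(m), sd = 2))
s <- createMassSpectrum(mass = m, intensity = i)

# preprocessing: variance stabilization, smoothing, baseline removal, normalization
s <- transformIntensity(s, method = "sqrt")
s <- smoothIntensity(s, method = "SavitzkyGolay", halfWindowSize = 10)
s <- removeBaseline(s, method = "SNIP", iterations = 100)
s <- calibrateIntensity(s, method = "TIC")

# peak detection; with several spectra one would also call alignSpectra() and
# binPeaks() before assembling the intensity matrix for downstream statistics
p <- detectPeaks(s, method = "MAD", halfWindowSize = 20, SNR = 3)
mass(p)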

Citation Context

...resulting calibrated peak intensity matrix may be exported for further use in high-level statistical analysis, for instance classification and feature selection using shrinkage discriminant analysis (Ahdesmäki and Strimmer, 2010). Conclusion: MALDIquant is a versatile R package providing a flexible analysis pipeline for MALDI-TOF and other mass spectrometry data. It offers a number of distinctive features, in particular for ...

Signal identification for rare and weak features: higher criticism or false discovery rates? Biostatistics

by Bernd Klaus, Korbinian Strimmer, 2012
"... Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show th ..."
Abstract - Cited by 8 (2 self) - Add to MetaCart
Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show that the HC threshold may be viewed as an approximation to a natural class boundary (CB) in two-class discriminant analysis which in turn is expressible as FDR threshold. We demonstrate that in a rare-weak setting in the region of the phase space where signal identifi-cation is possible both thresholds are practicably indistinguishable, and thus HC thresholding is identical to using a simple local FDR cutoff. The relationship of the HC and CB thresholds and their properties are investigated both analytically and by simulations, and are further compared by application to four cancer gene expression data sets.
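For reference, the HC threshold discussed here is computed from the sorted p-values by maximizing the standardized discrepancy between the empirical and uniform distributions. A minimal sketch of the common definition (the paper's class-boundary and local-FDR variants differ in detail):

# Higher criticism threshold from p-values (Donoho & Jin, 2004):
# HC_i = sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
# maximized over the smallest alpha0 * n order statistics.
hc.threshold <- function(p, alpha0 = 0.5) {
  n  <- length(p)
  ps <- sort(p)
  ii <- seq_len(floor(alpha0 * n))
  hc <- sqrt(n) * (ii / n - ps[ii]) / sqrt(ps[ii] * (1 - ps[ii]))
  ps[which.max(hc)]            # features with p <= this value are selected
}

set.seed(7)
z <- c(rnorm(950), rnorm(50, mean = 3))   # rare, moderately weak signals
p <- 2 * pnorm(-abs(z))                   # two-sided p-values
sum(p <= hc.threshold(p))                 # number of selected features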

Citation Context

...the alternative with only little contamination by unwanted null features. Conversely, if the interest is to identify true null features then similar thresholds may be imposed on FNDR rather than FDR (Ahdesmäki and Strimmer, 2010). This is illustrated for local FDR and local FNDR in Fig. 1b, where the signal space is divided by the decision thresholds xfdr and xfndr into three distinct zones corresponding to areas where one is ...
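The three-zone decision rule described in this snippet is straightforward to reproduce with local FDR estimates. A sketch using the fdrtool package (from the same group); the 0.2 cutoffs are conventional illustrative choices, not values taken from the paper:

library(fdrtool)

set.seed(42)
z   <- c(rnorm(900), rnorm(100, mean = 3))   # mostly null, some signal
fit <- fdrtool(z, statistic = "normal", plot = FALSE)
lfdr <- fit$lfdr                             # local fdr = Prob(null | z)

# three zones: confident non-null, confident null, undecided in between
signal <- lfdr <= 0.2          # local FDR control: few false discoveries
null   <- (1 - lfdr) <= 0.2    # local FNDR control: few missed signals
table(ifelse(signal, "signal", ifelse(null, "null", "undecided")))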

High-Dimensional Regression and Variable Selection Using CAR Scores

by Verena Zuber, Korbinian Strimmer - Statistical Applications in Genetics and Molecular Biology 10
"... Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis-decorrelation of the explanatory variab ..."
Abstract - Cited by 6 (2 self) - Add to MetaCart
Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis-decorrelation of the explanatory variables. The CAR score pro-vides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. It decomposes the proportion of variance ex-plained and it is an intermediate between marginal correlation and the standardized regression coefficient. As a population quantity, any preferred inference scheme can be applied for its estimation. Using simulations we demonstrate that variable selection by CAR scores is very effective and yields prediction errors and true and false positive rates that compare favorably with modern regression techniques such as elastic net and boosting. We illustrate our approach by analyzing data concerned with diabetes progression and with the effect of aging on gene expression in the human brain. The R package "care " implementing CAR score regression is available from CRAN.
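Structurally the CAR score parallels the cat score above: the vector of marginal correlations between predictors and response is decorrelated by the inverse matrix square root of the predictor correlation matrix. A base-R sketch (the "care" package's implementation uses shrinkage estimates of both quantities; this plain version is for illustration only):

# CAR scores: omega = R^(-1/2) %*% r, where r holds the marginal correlations
# between each predictor and the response.
car.scores <- function(X, y) {
  r <- cor(X, y)                             # p x 1 marginal correlations
  e <- eigen(cor(X), symmetric = TRUE)
  Rinvsqrt <- e$vectors %*% diag(1 / sqrt(pmax(e$values, 1e-8))) %*% t(e$vectors)
  drop(Rinvsqrt %*% r)
}

set.seed(3)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X %*% c(2, -1, 0, 0, 0) + rnorm(100)
round(car.scores(X, y)^2, 3)   # squared CAR scores decompose explained variance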
(Show Context)

Citation Context

...es for continuous and CAT scores for categorical response: 1. Prescreen predictor variables using marginal correlations (or t-scores) with an adaptive threshold determined, e.g., by controlling FNDR (Ahdesmäki and Strimmer, 2010). 2. Rank the remaining variables by their squared CAR (or CAT) scores. 3. If desired, group variables and compute grouped CAR (or CAT) scores. Currently, we are studying algorithmic improvements to ...
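Read as pseudocode, the three-step recipe in this snippet chains an FNDR prescreen with a correlation-adjusted ranking. A self-contained skeleton of steps 1 and 2 (squared t-scores stand in for the squared CAT scores of the recipe, and running fdrtool directly on t-scores treats them as approximately normal; both are simplifying assumptions):

library(fdrtool)

rank.variables <- function(X, y) {
  t0   <- apply(X, 2, function(x) t.test(x ~ y)$statistic)  # marginal t-scores
  lfdr <- fdrtool(t0, statistic = "normal", plot = FALSE)$lfdr
  keep <- which(1 - lfdr > 0.2)     # step 1: drop confidently null features (FNDR)
  # step 2: rank survivors; in the actual recipe squared CAT scores
  # (see the cat-score sketch above) would replace the squared t-scores
  keep[order(t0[keep]^2, decreasing = TRUE)]
}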

Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects

by David Donoho, Jiashun Jin
"... In modern high-throughput data analysis, researchers perform a large number of sta-tistical tests, expecting to find perhaps a small fraction of significant effects against a pre-dominantly null background. Higher Criticism (HC) was introduced to determine whether there are any non-zero effects; mor ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
In modern high-throughput data analysis, researchers perform a large number of sta-tistical tests, expecting to find perhaps a small fraction of significant effects against a pre-dominantly null background. Higher Criticism (HC) was introduced to determine whether there are any non-zero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predictive features from a large body of potentially useful features, among which only a rare few will prove truly useful. In this article, we review the basics of HC in both the testing and feature selection settings. HC is a flexible idea, which adapts easily to new situations; we point out how it adapts to clique detection and bivariate outlier detection. HC, although still early in its development, is seeing increasing interest from practitioners; we illustrate this with worked examples. HC is computationally effective, which gives it a nice leverage in the increasingly more relevant “Big Data ” settings we see today. We also review the underlying theoretical “ideology ” behind HC. The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of use-ful/significant items among the useless/null bulk. The RW model shows that HC has impor-tant advantages over better known procedures such as False Discovery Rate (FDR) control and Family-wise Error control (FwER), in particular, certain optimality properties. We discuss the rare/weak phase diagram, a way to visualize clearly the class of RW settings where the true signals are so rare or so weak that detection and feature selection are simply impossible, and a way to understand the known optimality properties of HC. Dedications. To the memory of John W. Tukey 1915–2000 and of Yuri I. Ingster 1946–2012, two pioneers in mathematical statistics. Key words. Classification; control of FDR; feature selection; Higher Criticism; large co-
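The Rare/Weak model itself is compactly parametrized and easy to simulate: among n z-scores a fraction eps = n^(-beta) are non-null with common amplitude mu = sqrt(2 r log n), and the pair (beta, r) indexes the phase diagram. A small sketch using this standard parametrization from the HC literature:

# Rare/Weak model sample: rare non-null effects of calibrated strength.
rw.sample <- function(n, beta, r) {
  eps <- n^(-beta)                      # prevalence of non-null effects
  mu  <- sqrt(2 * r * log(n))           # common effect size
  nonnull <- rbinom(n, 1, eps) == 1
  list(z = rnorm(n, mean = mu * nonnull), nonnull = nonnull)
}

set.seed(11)
d <- rw.sample(n = 1e4, beta = 0.6, r = 0.3)   # a rare-and-weak regime
sum(d$nonnull)                                 # how rare the signals are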

Citation Context

...ose to 1. In [42], we studied the optimal level in an asymptotic rare/weak setting, and derived the leading asymptotics of the optimal FDR. In Section 6.2 below we give more detail. In several papers [2, 86, 87], Strimmer and collaborators compared the approach of feature selection by HCT with both that of control of the FDR and that of control of the False Non-Discovery Rate (FNDR), analytically and also wit...

Differential Protein Expression and Peak Selection in Mass Spectrometry Data by Binary Discriminant Analysis

by Sebastian Gibb, Korbinian Strimmer
"... Abstract Motivation: Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remains challenging. Results: Here, we intr ..."
Abstract - Add to MetaCart
Abstract Motivation: Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remains challenging. Results: Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach of Our approach is computationally inexpensive and shows in the analysis of a large-scale drug discovery test data set equivalent prediction accuracy as a random forest. Furthermore, we were able to identify in the analysis of mass spectrometry data from a pancreas cancer study biological relevant and statistically predictive marker peaks unrecognized in the original study.
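The two ingredients named in the abstract, dichotomization and relative-entropy ranking, can be sketched directly; the median split below is a stand-in for the paper's data-adaptive thresholding, and the authors' "binda" package implements the actual criterion:

# Binarize each feature, then rank by the symmetrized relative entropy (KL
# divergence) between the per-class Bernoulli frequencies of the binary values.
binary.ranking <- function(X, y) {
  B <- apply(X, 2, function(x) as.numeric(x > median(x)))   # simple threshold
  cl <- levels(y)
  score <- apply(B, 2, function(b) {
    p1 <- min(max(mean(b[y == cl[1]]), 1e-6), 1 - 1e-6)     # guard log(0)
    p2 <- min(max(mean(b[y == cl[2]]), 1e-6), 1 - 1e-6)
    kl <- function(q1, q2) q1 * log(q1 / q2) + (1 - q1) * log((1 - q1) / (1 - q2))
    kl(p1, p2) + kl(p2, p1)                                 # symmetrized KL
  })
  order(score, decreasing = TRUE)                           # best feature first
}

set.seed(2)
X <- matrix(rnorm(50 * 8), 50, 8); X[1:25, 1] <- X[1:25, 1] + 2
y <- factor(rep(c("A", "B"), each = 25))
binary.ranking(X, y)   # feature 1 should rank first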

Citation Context

... group probabilities we use observed frequencies πy = ny/n if n is large, and for small n the Stein-type shrinkage estimator of proportions described in Hausser and Strimmer (2009). 2.4 Variable ranking and selection: Closely tied in with prediction is the question of which variables are most important for successful assignment of a class label and, conversely, which variables are irrelevant. Especially in large-dimensional problems it is very important to remove the null features, as the build-up of random noise from these variables can substantially degrade the overall prediction accuracy (cf. Ahdesmäki and Strimmer, 2010). For ranking features in discriminant analysis with binary variables there have been many, in part contradictory, propositions. For the case of K = 2 groups the following criteria, among others, have been used:
  • the chi-square statistic of independence between response and predictors (An et al., 2013),
  • peak probability contrasts |µy1 − µy2| (Tibshirani et al., 2004),
  • Quinlan's information gain measure (Bender et al., 2004), and
  • the ratio of between-group and within-group covariance (Wilbur et al., 2002).
See Tan et al. (2004) for many other proposals for measuring associations between cate...
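The first criterion in this list is the easiest to state as code; a tiny sketch for a single binarized feature:

# Chi-square statistic of independence between a binary predictor and the class
# label (small-count warnings suppressed for this toy example).
chisq.score <- function(b, y) suppressWarnings(chisq.test(table(b, y))$statistic)

set.seed(5)
b <- rbinom(60, 1, 0.5)                       # a dichotomized feature
y <- factor(rep(c("case", "control"), each = 30))
chisq.score(b, y)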

Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems

by Kim-Anh Lê Cao et al.

Citation Context

...e. Outline of the paper: We will first discuss the number of dimensions to choose in sPLS-DA, and compare its classification performance with multivariate projection-based approaches: variants of sLDA [41], variants of SPLSDA and with SGPLS from [30]; and with five multiclass wrapper approaches (RFE, NSC, RF, OFW-cart, OFW-svm) on four public multiclass microarray data sets and one public SNP data set. All a...

THRESHOLDING METHODS FOR FEATURE SELECTION IN GENOMICS: HIGHER CRITICISM VERSUS FALSE NON-DISCOVERY RATES

by Bernd Klaus, Korbinian Strimmer
"... In high-dimensional genomic analysis it is often necessary to conduct feature selection, in order to improve prediction accuracy and to obtain interpretable classifiers. Traditionally, feature selection relies on computer-intensive procedures such as cross-validation. However, recently two approache ..."
Abstract - Add to MetaCart
In high-dimensional genomic analysis it is often necessary to conduct feature selection, in order to improve prediction accuracy and to obtain interpretable classifiers. Traditionally, feature selection relies on computer-intensive procedures such as cross-validation. However, recently two approaches have been advocated that both are computationally more efficient: False Non-Discovery Rates (FNDR) and Higher Criticism (HC). Here, we describe the rationale behind the two approaches, conduct an empirical comparison based on synthetic and real data, and discuss the respective merits of HC-based and FNDR-based feature selection. 1.

Citation Context

...ltiple testing [5; 6; 7] and has become standard in large-scale statistical analysis [8]. Feature selection in classification based on "False Non-Discovery Rates" (FNDR) has been suggested recently in [9]. Both FDR/FNDR and HC-based feature selection assume that a null model for the observed test statistics is known, e.g., a normal distribution. Subsequently, a threshold separating null from non-null featur...

Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments

by Vadim Farztdinov, Fionnuala McDyer

Abstract
Background: Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time the fold change by itself provides valuable information, and it is important to find unambiguous ways of using this information in the treatment of expression data. Results: A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped
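The naive filter that the abstract contrasts DFC against is worth seeing explicitly, since its weakness (no confidence statement) motivates the test. A sketch, with illustrative data only:

# Fold-change filter: flag genes whose ratio of average intensities between two
# conditions exceeds an arbitrary cutoff. No p-value or confidence level attaches
# to this rule, which is the gap the DFC test addresses.
fc.filter <- function(X1, X2, cutoff = 2) {
  fc <- rowMeans(X1) / rowMeans(X2)           # per-gene fold change
  which(fc > cutoff | fc < 1 / cutoff)
}

set.seed(9)
X1 <- matrix(rexp(100 * 5, rate = 1/100), 100, 5)   # condition A intensities
X2 <- matrix(rexp(100 * 5, rate = 1/100), 100, 5)   # condition B intensities
length(fc.filter(X1, X2))                           # genes passing the cutoff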

Citation Context

... calculate the AUC. This option was chosen because, for extremely high-dimensional data, estimating correlation is very difficult, and in such instances it is recommended to conduct diagonal analysis [15]. ...

Traditional Approaches for Image Recognition by ANF Methods

by K. Shirisha
"... Abstract- In this paper, we propose an integrated face recognition system that is robust against facial expressions by combining information from the computed intra person optical flow and the synthesized face image in a probabilistic framework. Making recognition more reliable under uncontrolled li ..."
Abstract - Add to MetaCart
Abstract- In this paper, we propose an integrated face recognition system that is robust against facial expressions by combining information from the computed intra person optical flow and the synthesized face image in a probabilistic framework. Making recognition more reliable under uncontrolled lighting conditions is one of the most important challenges for practical face recognition systems. We tackle this by combining the strengths of robust illumination normalization. Our experimental results show that the proposed system improves the accuracy of face recognition from expressional face images and lighting variations. we propose to develop this paper by using ANF(appearance, normalization, feature) methods.