Results 1 - 10 of 21
Gene ranking and biomarker discovery under correlation. Bioinformatics
Cited by 16 (3 self)
Motivation: Biomarker discovery and gene ranking is a standard task in genomic high-throughput analysis. Typically, the ordering of markers is based on a stabilized variant of the t-score, such as the moderated t or the SAM statistic. However, these procedures ignore gene-gene correlations, which may have a profound impact on the gene orderings and on the power of the subsequent tests. Results: We propose a simple procedure that adjusts gene-wise t-statistics to take account of correlations among genes. The resulting correlation-adjusted t-scores (“cat” scores) are derived from a predictive perspective, i.e. as a score for variable selection to discriminate group membership in two-class linear discriminant analysis. In the absence of correlation the cat score reduces to the standard t-score. Moreover, using the cat score it is straightforward to evaluate groups of features (i.e. gene sets). For computation of the cat score from small-sample data we propose a shrinkage procedure. In a comparative study comprising six different synthetic and empirical correlation structures we show that the cat score improves estimation of gene orderings and leads to higher power for fixed true discovery rate, and vice versa. Finally, we also illustrate the cat score by analyzing metabolomic data. Availability: The shrinkage cat score is implemented in the R package “st” available from
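The decorrelation idea in this abstract can be illustrated numerically. The following is a minimal sketch, not the paper's implementation: the simulated data are invented here, the small-sample shrinkage step is omitted, and `R_inv_sqrt` stands in for the inverse matrix square root of the gene-gene correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated expression matrix: 50 samples x 5 genes, two groups of 25.
n1, n2, p = 25, 25, 5
X = rng.normal(size=(n1 + n2, p))
X[:n1, 0] += 1.0  # gene 0 carries a true group difference
labels = np.array([0] * n1 + [1] * n2)

# Ordinary two-sample t-scores, one per gene.
m1 = X[labels == 0].mean(axis=0)
m2 = X[labels == 1].mean(axis=0)
v1 = X[labels == 0].var(axis=0, ddof=1)
v2 = X[labels == 1].var(axis=0, ddof=1)
t = (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

# Decorrelate: cat = R^{-1/2} t, with R the gene-gene correlation matrix.
R = np.corrcoef(X, rowvar=False)
w, V = np.linalg.eigh(R)
R_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
cat = R_inv_sqrt @ t  # reduces to t when R is the identity
```

When the genes are uncorrelated, R is close to the identity and the cat scores coincide with the ordinary t-scores, matching the reduction property stated in the abstract.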
MALDIquant: a versatile R package for the analysis of mass spectrometry data. Bioinformatics 28, 2012
Cited by 12 (5 self)
Summary: MALDIquant is an R package providing a complete and modular analysis pipeline for quantitative analysis of mass spectrometry data. MALDIquant is specifically designed with application in clinical diagnostics in mind and implements sophisticated routines for importing raw data, preprocessing, non-linear peak alignment, and calibration. It also handles technical replicates as well as spectra with unequal resolution. Availability: MALDIquant and its associated R packages readBrukerFlexData and readMzXmlData are freely available from the Comprehensive R Archive Network (CRAN).
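MALDIquant itself is an R package, so the sketch below is not its API; it is a hedged Python illustration of the kind of preprocessing routine the summary mentions (smoothing followed by noise-thresholded peak picking), on a synthetic spectrum invented here.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic spectrum: three Gaussian peaks on a noisy constant baseline.
mz = np.linspace(1000, 2000, 2000)
true_peaks = np.array([1200.0, 1500.0, 1800.0])
signal = sum(50.0 * np.exp(-0.5 * ((mz - c) / 2.0) ** 2) for c in true_peaks)
spectrum = signal + rng.normal(scale=1.0, size=mz.size) + 5.0

# Smooth with a moving average, then pick local maxima above a
# robust noise threshold (median + k * MAD), a typical simple scheme.
kernel = np.ones(9) / 9.0
smooth = np.convolve(spectrum, kernel, mode="same")
mad = np.median(np.abs(smooth - np.median(smooth)))
thresh = np.median(smooth) + 10.0 * mad
is_peak = (
    (smooth[1:-1] > smooth[:-2])
    & (smooth[1:-1] > smooth[2:])
    & (smooth[1:-1] > thresh)
)
peak_mz = mz[1:-1][is_peak]
```

Real pipelines add baseline subtraction, calibration, and alignment across spectra; this sketch only shows the detection step.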
Signal identification for rare and weak features: higher criticism or false discovery rates? Biostatistics, 2012
Cited by 8 (2 self)
Signal identification in large-dimensional settings is a challenging problem in biostatistics. Recently, the method of higher criticism (HC) was shown to be an effective means for determining appropriate decision thresholds. Here, we study HC from a false discovery rate (FDR) perspective. We show that the HC threshold may be viewed as an approximation to a natural class boundary (CB) in two-class discriminant analysis, which in turn is expressible as an FDR threshold. We demonstrate that in a rare-weak setting, in the region of the phase space where signal identification is possible, both thresholds are practically indistinguishable, and thus HC thresholding is identical to using a simple local FDR cutoff. The relationship of the HC and CB thresholds and their properties are investigated both analytically and by simulations, and are further compared by application to four cancer gene expression data sets.
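A common formulation of the HC threshold on p-values can be sketched as follows. This is not the paper's code: the search range and tie handling vary by author, and the example data are invented here.

```python
import numpy as np

def hc_threshold(pvals):
    """p-value cutoff chosen by Higher Criticism: score each sorted
    p-value p_(i) by the standardized exceedance of i/n over p_(i),
    and threshold at the maximizer (searched over the smaller half)."""
    p = np.sort(pvals)
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    k = np.argmax(hc[: n // 2])
    return p[k]

# Rare-weak style example: a few very small p-values in a uniform bulk.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=200), np.full(5, 1e-4)])
cutoff = hc_threshold(pvals)
selected = pvals <= cutoff  # analogous to a simple local FDR-style cutoff
```

In this configuration the maximizer sits at the boundary between the planted small p-values and the uniform bulk, which is the sense in which the HC threshold behaves like an FDR-type cutoff.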
High-Dimensional Regression and Variable Selection Using CAR Scores. Statistical Applications in Genetics and Molecular Biology 10
Cited by 6 (2 self)
Variable selection is a difficult problem that is particularly challenging in the analysis of high-dimensional genomic data. Here, we introduce the CAR score, a novel and highly effective criterion for variable ranking in linear regression based on Mahalanobis-decorrelation of the explanatory variables. The CAR score provides a canonical ordering that encourages grouping of correlated predictors and down-weights antagonistic variables. It decomposes the proportion of variance explained and is an intermediate between marginal correlation and the standardized regression coefficient. As a population quantity, any preferred inference scheme can be applied for its estimation. Using simulations we demonstrate that variable selection by CAR scores is very effective and yields prediction errors and true and false positive rates that compare favorably with modern regression techniques such as elastic net and boosting. We illustrate our approach by analyzing data concerned with diabetes progression and with the effect of aging on gene expression in the human brain. The R package "care" implementing CAR score regression is available from CRAN.
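The CAR score definition quoted above (marginal correlations decorrelated by the inverse matrix square root of the predictor correlation matrix) can be sketched numerically. The toy data and names are assumptions of this illustration; the `care` package's estimation details are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy regression data: 100 observations, 4 equicorrelated predictors.
n, p = 100, 4
C = np.eye(p) * 0.7 + 0.3          # pairwise correlation 0.3
X = rng.normal(size=(n, p)) @ np.linalg.cholesky(C).T
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Marginal correlations between each predictor and the response.
rho_xy = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])

# CAR scores: Mahalanobis-decorrelate the marginal correlations,
# car = R_X^{-1/2} rho_xy, then rank variables by squared CAR score.
R_X = np.corrcoef(X, rowvar=False)
w, V = np.linalg.eigh(R_X)
car = V @ np.diag(w ** -0.5) @ V.T @ rho_xy
ranking = np.argsort(car ** 2)[::-1]
```

With an identity predictor correlation matrix the CAR scores equal the marginal correlations, which is the "intermediate" behavior the abstract refers to.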
Higher Criticism for Large-Scale Inference, Especially for Rare and Weak Effects
Cited by 1 (1 self)
In modern high-throughput data analysis, researchers perform a large number of statistical tests, expecting to find perhaps a small fraction of significant effects against a predominantly null background. Higher Criticism (HC) was introduced to determine whether there are any non-zero effects; more recently, it was applied to feature selection, where it provides a method for selecting useful predictive features from a large body of potentially useful features, among which only a rare few will prove truly useful. In this article, we review the basics of HC in both the testing and feature selection settings. HC is a flexible idea, which adapts easily to new situations; we point out how it adapts to clique detection and bivariate outlier detection. HC, although still early in its development, is seeing increasing interest from practitioners; we illustrate this with worked examples. HC is computationally effective, which gives it a nice leverage in the increasingly relevant “Big Data” settings we see today. We also review the underlying theoretical “ideology” behind HC. The Rare/Weak (RW) model is a theoretical framework simultaneously controlling the size and prevalence of useful/significant items among the useless/null bulk. The RW model shows that HC has important advantages over better known procedures such as False Discovery Rate (FDR) control and Family-wise Error control (FwER), in particular, certain optimality properties. We discuss the rare/weak phase diagram, a way to visualize clearly the class of RW settings where the true signals are so rare or so weak that detection and feature selection are simply impossible, and a way to understand the known optimality properties of HC. Dedications. To the memory of John W. Tukey 1915–2000 and of Yuri I. Ingster 1946–2012, two pioneers in mathematical statistics. Key words. Classification; control of FDR; feature selection; Higher Criticism; large co-
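In the testing setting this review describes, the global HC statistic measures how far the empirical distribution of p-values departs from uniform. Below is a minimal sketch in a rare/weak configuration; the parameters (1% prevalence, 3-sigma effects) and the alpha0 search convention are assumptions chosen for illustration.

```python
import math
import numpy as np

def hc_statistic(pvals, alpha0=0.5):
    """Global HC statistic: maximum standardized exceedance of the
    empirical CDF over sorted p-values, searched over the smallest
    alpha0-fraction (one common convention among several)."""
    p = np.sort(pvals)
    n = len(p)
    i = np.arange(1, n + 1)
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1 - p) + 1e-12)
    return hc[: int(alpha0 * n)].max()

rng = np.random.default_rng(3)
n = 10_000
null_p = rng.uniform(size=n)          # global null: all p-values uniform

z = rng.normal(size=n)
z[:100] += 3.0                        # rare (1%) and weak (3 sigma) effects
rw_p = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])  # two-sided

hc_null = hc_statistic(null_p)
hc_rw = hc_statistic(rw_p)            # much larger under the RW alternative
```

Rejecting the global null when the statistic exceeds a calibrated critical value is the "are there any non-zero effects" use of HC; the feature-selection use instead thresholds at the maximizing p-value.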
Differential Protein Expression and Peak Selection in Mass Spectrometry Data by Binary Discriminant Analysis
Motivation: Proteomic mass spectrometry analysis is becoming routine in clinical diagnostics, for example to monitor cancer biomarkers using blood samples. However, differential proteomics and identification of peaks relevant for class separation remain challenging. Results: Here, we introduce a simple yet effective approach for identifying differentially expressed proteins using binary discriminant analysis. This approach works by data-adaptive thresholding of protein expression values and subsequent ranking of the dichotomized features using a relative entropy measure. Our framework may be viewed as a generalization of the 'peak probability contrast' approach. Our approach is computationally inexpensive and, in the analysis of a large-scale drug discovery test data set, achieves prediction accuracy equivalent to that of a random forest. Furthermore, in the analysis of mass spectrometry data from a pancreas cancer study, we identified biologically relevant and statistically predictive marker peaks unrecognized in the original study.
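A minimal sketch of the dichotomize-then-rank idea described above, with a median threshold and symmetrized Kullback-Leibler divergence standing in for the paper's data-adaptive choices (both are assumptions of this illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy peak-intensity matrix: 40 spectra x 6 peaks, two classes of 20.
n_per, p = 20, 6
X = rng.gamma(shape=2.0, scale=1.0, size=(2 * n_per, p))
X[n_per:, 0] *= 3.0                 # peak 0 is up-regulated in class 1
y = np.repeat([0, 1], n_per)

# Dichotomize each peak at a data-adaptive threshold (here: its median).
B = (X > np.median(X, axis=0)).astype(int)

# Rank peaks by symmetrized relative entropy between the per-class
# "peak present" frequencies.
eps = 1e-6
f0 = B[y == 0].mean(axis=0).clip(eps, 1 - eps)
f1 = B[y == 1].mean(axis=0).clip(eps, 1 - eps)

def kl(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

score = kl(f0, f1) + kl(f1, f0)
ranking = np.argsort(score)[::-1]   # peak 0 should rank first
```

The dichotomization makes the ranking robust to intensity scale, which is part of the appeal in mass spectrometry data.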
Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. Lê Cao et al.
Thresholding Methods for Feature Selection in Genomics: Higher Criticism versus False Non-Discovery Rates
In high-dimensional genomic analysis it is often necessary to conduct feature selection, in order to improve prediction accuracy and to obtain interpretable classifiers. Traditionally, feature selection relies on computer-intensive procedures such as cross-validation. However, recently two approaches have been advocated that are both computationally more efficient: False Non-Discovery Rates (FNDR) and Higher Criticism (HC). Here, we describe the rationale behind the two approaches, conduct an empirical comparison based on synthetic and real data, and discuss the respective merits of HC-based and FNDR-based feature selection.
Distributional fold change test – a statistical approach for detecting differential expression in microarray experiments. Vadim Farztdinov and Fionnuala McDyer
Background: Because of the large volume of data and the intrinsic variation of data intensity observed in microarray experiments, different statistical methods have been used to systematically extract biological information and to quantify the associated uncertainty. The simplest method to identify differentially expressed genes is to evaluate the ratio of average intensities in two different conditions and consider all genes that differ by more than an arbitrary cut-off value to be differentially expressed. This filtering approach is not a statistical test, and there is no associated value that can indicate the level of confidence in the designation of genes as differentially expressed or not differentially expressed. At the same time, the fold change by itself provides valuable information, and it is important to find unambiguous ways of using this information in expression data treatment. Results: A new method of finding differentially expressed genes, called the distributional fold change (DFC) test, is introduced. The method is based on an analysis of the intensity distribution of all microarray probe sets mapped
Traditional Approaches for Image Recognition by ANF Methods
In this paper, we propose an integrated face recognition system that is robust against facial expressions by combining information from the computed intra-person optical flow and the synthesized face image in a probabilistic framework. Making recognition more reliable under uncontrolled lighting conditions is one of the most important challenges for practical face recognition systems. We tackle this by combining the strengths of robust illumination normalization. Our experimental results show that the proposed system improves the accuracy of face recognition from expressional face images and lighting variations. We propose to develop this work using ANF (appearance, normalization, feature) methods.