Spectral Probabilities and Generating Functions of Tandem Mass Spectra: A Strike against Decoy Databases
, 2008
Cited by 43 (4 self)
A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to ∆-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of “one-hit-wonders” in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.
Rapid and accurate peptide identification from tandem mass spectra
- J. Proteome Res
Cited by 30 (12 self)
Mass spectrometry, the core technology in the field of proteomics, promises to enable scientists to identify and quantify the entire complement of proteins in a complex biological sample. Currently, the primary bottleneck in this type of experiment is computational. Existing algorithms for interpreting mass spectra are slow and fail to identify a large proportion of the given spectra. We describe a database search program called Crux that reimplements and extends the widely used database search program SEQUEST. For speed, Crux uses a peptide indexing scheme to rapidly retrieve candidate peptides for a given spectrum. For each peptide in the target database, Crux generates shuffled decoy peptides on the fly, providing a good null model and, hence, accurate false discovery rate estimates. Crux also implements two recently described postprocessing methods: a p value calculation based upon fitting a Weibull distribution to the observed scores, and a semisupervised method that learns to discriminate between target and decoy matches. Both methods significantly improve the overall rate of peptide identification. Crux is implemented in C and is distributed with source code freely to noncommercial users.
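The decoy idea in this abstract can be illustrated with a minimal sketch. It is not Crux's implementation: the abstract does not give the shuffling rules, so keeping the C-terminal residue fixed is an assumption, and the function names `shuffled_decoy` and `decoy_fdr` are hypothetical. The FDR estimate shown is the simple ratio of decoy to target matches at or above a score threshold.

```python
import random


def shuffled_decoy(peptide, rng):
    """Return a decoy by shuffling all residues except the last one.

    Keeping the C-terminal residue fixed is an assumed convention,
    not something stated in the abstract.
    """
    if len(peptide) < 3:
        return peptide
    body = list(peptide[:-1])
    rng.shuffle(body)
    return "".join(body) + peptide[-1]


def decoy_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a threshold as (#decoys >= t) / (#targets >= t)."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


rng = random.Random(0)
decoy = shuffled_decoy("PEPTIDEK", rng)
estimate = decoy_fdr([5, 4, 3, 2, 1], [3, 1, 0, 0, 0], threshold=3)
```

A shuffled decoy has the same amino acid composition, and therefore the same mass, as its target, which may be one reason it serves as a reasonable null model.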
False Discovery Rates of Protein Identifications: A Strike against the
, 2008
Cited by 23 (1 self)
Most proteomics studies attempt to maximize the number of peptide identifications and subsequently infer proteins containing two or more peptides as reliable protein identifications. In this study, we evaluate the effect of this “two-peptide” rule on protein identifications, using multiple search tools and data sets. Contrary to intuition, the “two-peptide” rule reduces the number of protein identifications in the target database more significantly than in the decoy database and results in increased false discovery rates, compared to the case when single-hit proteins are not discarded. We therefore recommend that the “two-peptide” rule should be abandoned, and instead, protein identifications should be subject to the estimation of error rates, as is the case with peptide identifications. We further extend the generating function approach (originally proposed for evaluating matches between a peptide and a single spectrum) to evaluating matches between a protein and an entire spectral data set.
2008b. Posterior error probabilities and false discovery rates: Two sides of the same coin
- J. Proteome Res
Cited by 22 (2 self)
A variety of methods have been described in the literature for assigning statistical significance to peptides identified via tandem mass spectrometry. Here, we explain how two types of scores, the q-value and the posterior error probability, are related and complementary to one another.
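One commonly stated link between the two scores is that the expected false discovery rate of an accepted set equals the mean posterior error probability (PEP) of its members. The sketch below uses that relationship to turn PEPs into q-values; it is an illustration under that assumption, not necessarily the formulation used in the paper, and `q_values_from_peps` is a hypothetical name.

```python
def q_values_from_peps(peps):
    """Compute q-values as the running mean of PEPs in ascending order.

    The running mean of a non-decreasing sequence is itself
    non-decreasing, so no extra monotonicity step is needed.
    Results are returned in the original input order.
    """
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q_values = [0.0] * len(peps)
    running_sum = 0.0
    for rank, index in enumerate(order, start=1):
        running_sum += peps[index]
        q_values[index] = running_sum / rank
    return q_values


example_q = q_values_from_peps([0.5, 0.01, 0.2])
```

The PEP describes a single match, while the q-value describes the whole set of matches scoring at least as well, which is why a match with a high PEP can still sit inside a set with an acceptable q-value.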
Modeling peptide fragmentation with dynamic Bayesian networks for peptide identification
- BIOINFORMATICS VOL. 24 ISMB 2008, PAGES I348–I356
, 2008
Improvements to the Percolator algorithm for peptide identification from shotgun proteomics data sets
- J. Proteome Res
, 2009
Cited by 12 (5 self)
Abstract: Shotgun proteomics coupled with database search software allows the identification of a large number of peptides in a single experiment. However, some existing search algorithms, such as SEQUEST, use score functions that are designed primarily to identify the best peptide for a given spectrum. Consequently, when comparing identifications across spectra, the SEQUEST score function Xcorr fails to discriminate accurately between correct and incorrect peptide identifications. Several machine learning methods have been proposed to address the resulting classification task of distinguishing between correct and incorrect peptide-spectrum matches (PSMs). A recent example is Percolator, which uses semisupervised learning and a decoy database search strategy to learn to distinguish between correct and incorrect PSMs identified by a database search algorithm. The current work describes three improvements to Percolator. (1) Percolator’s heuristic optimization is replaced with a clear objective function, with intuitive reasons behind its choice. (2) Tractable nonlinear models are used instead of linear models, leading to improved accuracy over the original Percolator. (3) A method, Q-ranker, for directly optimizing the number of identified spectra at a specified q value is proposed, which achieves further gains.
Statistical Calibration of the SEQUEST XCorr Function
- J. Proteome Res. 2009
Cited by 9 (2 self)
Abstract: Obtaining accurate peptide identifications from shotgun proteomics liquid chromatography tandem mass spectrometry (LC-MS/MS) experiments requires a score function that consistently ranks correct peptide-spectrum matches (PSMs) above incorrect matches. We have observed that, for the Sequest score function Xcorr, the inability to discriminate between correct and incorrect PSMs is due in part to spectrum-specific properties of the score distribution. In other words, some spectra score well regardless of which peptides they are scored against, and other spectra score well because they are scored against a large number of peptides. We describe a protocol for calibrating PSM score functions, and we demonstrate its application to Xcorr and the preliminary Sequest score function Sp. The protocol accounts for spectrum- and peptide-specific effects by calculating p values for each spectrum individually, using only that spectrum’s score distribution. We demonstrate that these calculated p values are uniform under a null distribution and therefore accurately measure significance. These p values can be used to estimate the false discovery rate, thereby eliminating the need for an extra search against a decoy database. In addition, we show that the p values are better calibrated than their underlying scores; consequently, when ranking top-scoring PSMs from multiple spectra, p values are better at discriminating between correct and incorrect PSMs. The calibration protocol is generally applicable to any PSM score function for which an appropriate parametric family can be identified.
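The abstract does not say which procedure converts calibrated p values into a false discovery rate. As an illustration only, the sketch below applies the standard Benjamini-Hochberg adjustment, a swapped-in technique that may differ from the authors' method; `benjamini_hochberg` is a hypothetical function name.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in input order.

    Each p value is scaled by m / rank (rank taken in ascending order),
    then a running minimum from the largest p value downward enforces
    monotonicity.
    """
    m = len(p_values)
    descending = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, index in enumerate(descending):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

An adjustment of this kind is valid only when the p values are uniform under the null, which is the property the abstract reports for the calibrated p values.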
SJ: Addressing Statistical Biases in Nucleotide-Derived Protein Databases for Proteogenomic Search Strategies
- Journal of Proteome Research
Cited by 9 (1 self)
ABSTRACT: Proteogenomics has the potential to advance genome annotation through high quality peptide identifications derived from mass spectrometry experiments, which demonstrate a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function, discovering novel genes and gene structure that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) and posterior error probability (PEP) values for groups and individual peptide spectrum matches (PSMs). These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates leading to significant differences
Comprehensive proteomic analysis of membrane proteins in Toxoplasma gondii. Molecular & cellular proteomics
- MCP
Cited by 5 (1 self)
Toxoplasma gondii (T. gondii) is an obligate intracellular protozoan parasite that is an important human and animal pathogen. Experimental information on T. gondii membrane proteins is limited, and the majority of gene predictions with predicted transmembrane motifs are of unknown function. A systematic analysis of the membrane proteome of T. gondii is important not only for understanding this parasite’s invasion mechanism(s), but also for the discovery of potential drug targets and new preventative and therapeutic strategies. Here we report a comprehensive analysis of the membrane proteome of T. gondii, employing three proteomics strategies: one-dimensional gel liquid chromatography-tandem MS analysis (one-dimensional gel electrophoresis LC-MS/MS), biotin labeling in conjunction with one-dimensional gel
Liquid chromatography mass spectrometry-based proteomics: biological and technical aspects
- Annals of Applied Statistics
, 2010