A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by . . .
, 2003
Cited by 381 (5 self)
The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. [Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.] The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa. Moreover, current probabilist...
Dirichlet Mixtures: A Method for Improving Detection of Weak but Significant Protein Sequence Homology
, 1996
Cited by 105 (20 self)
This paper presents the mathematical foundations of Dirichlet mixtures, which have been used to improve database search results for homologous sequences, when a variable number of sequences from a protein family or domain are known. We present a method for condensing the information in a protein database into a mixture of Dirichlet densities. These mixtures are designed to be combined with observed amino acid frequencies, to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model, or other statistical model. These estimates give a statistical model greater generalization capacity, such that remotely related family members can be more reliably recognized by the model. Dirichlet mixtures have been shown to outperform substitution matrices and other methods for computing these expected amino acid distributions in database search, resulting in fewer false positives and false negatives for the families tested. This paper corrects a previously p...
Human and mouse gene structure: comparative analysis and application to exon prediction
- Genome Res
, 2000
Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading
, 1996
Cited by 91 (6 self)
Attractive inter-residue contact energies for proteins have been re-evaluated with the same assumptions and approximations used originally by us in 1985, but with a significantly larger set of protein crystal structures. An additional repulsive packing energy term, operative at higher densities to prevent overpacking, has also been estimated for all 20 amino acids as a function of the number of contacting residues, based on their observed distributions. The two terms of opposite sign are intended to be used together to provide an estimate of the overall energies of inter-residue interactions in simplified proteins without atomic details. To overcome the problem of how to utilize the many homologous proteins in the Protein Data Bank, a new scheme has been devised to assign different weights to each protein, based on similarities among amino acid sequences. A total of 1168 protein structures containing 1661 subunit sequences are actually used here. After the sequence weights have been applied, these correspond to an effective number of residue–residue contacts of 113,914, or about six
PROBCONS: Probabilistic consistency-based multiple sequence alignment
- Genome Res
, 2005
Cited by 84 (5 self)
To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce probabilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on probabilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROBCONS achieves statistically significant improvement over other leading methods while maintaining practical speed. PROBCONS is publicly available as a web resource. Source code and executables are available under the GNU Public License at
Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families
- PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY
, 1993
Cited by 56 (6 self)
A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It is shown that this Bayesian method can improve the quality of HMMs produced from small training sets. Specific experiments on the EF-hand motif are reported, for which these priors are shown to produce HMMs with higher likelihood on unseen data, and fewer false positives and false negatives in a database search task.
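The estimate described in this abstract can be illustrated with the standard posterior-mean formula for a Dirichlet mixture prior: each component is weighted by how well it explains the observed counts, and the component-wise pseudocount estimates are then averaged. The following is a minimal sketch, not the authors' implementation; the function names, the 4-letter toy alphabet, and the mixture parameters are assumptions chosen for illustration.

```python
from math import exp, lgamma, log


def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def posterior_mean(counts, mix_weights, alphas):
    """Posterior mean letter probabilities under a Dirichlet mixture prior.

    counts      -- observed counts per letter in one column (hypothetical input)
    mix_weights -- prior weight q_j of each mixture component
    alphas      -- one Dirichlet parameter vector alpha_j per component
    """
    # Unnormalized log posterior weight of component j:
    # log q_j + log B(counts + alpha_j) - log B(alpha_j).
    log_w = [
        log(q) + log_beta([n + a for n, a in zip(counts, alpha)]) - log_beta(alpha)
        for q, alpha in zip(mix_weights, alphas)
    ]
    # Normalize in a numerically stable way.
    top = max(log_w)
    w = [exp(x - top) for x in log_w]
    total = sum(w)
    resp = [x / total for x in w]

    # Average the per-component pseudocount estimates.
    n_total = sum(counts)
    est = [0.0] * len(counts)
    for r, alpha in zip(resp, alphas):
        denom = n_total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            est[i] += r * (n + a) / denom
    return est


if __name__ == "__main__":
    # Toy two-component mixture over a 4-letter alphabet (assumed values).
    toy_alphas = [[1.0, 1.0, 1.0, 1.0], [5.0, 0.5, 0.5, 0.5]]
    toy_weights = [0.5, 0.5]
    print(posterior_mean([3, 0, 1, 0], toy_weights, toy_alphas))
```

With no observations the estimate reduces to the prior mean of the mixture, and with many observations it approaches the observed frequencies, which is why such priors matter most for small training sets.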
A Framework for Information Visualization Spreadsheets
, 1999
Cited by 54 (3 self)
Information has become interactive. Information visualization is the design and creation of interactive graphic depictions of information by combining principles in the disciplines of graphic design, cognitive science, and interactive computer graphics. As the volume and complexity of the data increase, users require more powerful visualization tools that allow them to explore large abstract datasets more effectively. This
A Spreadsheet Approach to Information Visualization
, 1997
Cited by 53 (6 self)
In information visualization, as the volume and complexity of the data increase, researchers require more powerful visualization tools that enable them to explore multidimensional datasets more effectively. In this paper, we discuss the general utility of a novel visualization spreadsheet framework. Just as a numerical spreadsheet enables exploration of numbers, a visualization spreadsheet enables exploration of visual forms of information. We show that the spreadsheet approach facilitates certain information visualization tasks that are more difficult using other approaches. Unlike traditional spreadsheets, which store only simple data elements and formulas in each cell, a visualization spreadsheet cell can hold an entire complex data set, selection criteria, viewing specifications, and other information needed for a full-fledged information visualization. Similarly, inter-cell operations are far more complex, stretching beyond simple arithmetic and string operations to encompass a range of domain-specific operators. We have built two prototype systems that illustrate some of these research issues. The underlying approach in our work allows domain experts to define new data types and data operations, and enables visualization experts to incorporate new visualizations, viewing parameters, and view operations.
Assessing The Performance Of Fold Recognition Methods By Means Of A Comprehensive Benchmark.
- Pac. Symp. Biocomput
, 1996
Cited by 33 (0 self)
this paper addresses. Our goal is to devise a benchmark that can aid in assessing the performance of a fold-recognition method in an objective, unbiased and thorough way. The benchmark is independent of the representation of the proteins, the compatibility definition, the search algorithm, and the ranking and significance estimation procedures used in the method being evaluated. Thus, it allows a systematic comparison of different methods. Benchmarks are routinely used to assess performance of sequence-sequence alignment (e.g.
Rapid Assessment of Extremal Statistics for Gapped Local Alignment
- Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology
, 1999
Cited by 28 (8 self)
The statistical significance of gapped local alignments is characterized by analyzing the extremal statistics of the scores obtained from the alignment of random amino acid sequences. By identifying a complete set of linked clusters, "islands," we devise a method which accurately predicts the extremal score statistics by using only one to a few pairwise alignments. The success of our method relies crucially on the link between the statistics of island scores and extremal score statistics. This link is motivated by heuristic arguments, and firmly established by extensive numerical simulations for a variety of scoring parameter settings and sequence lengths. Our approach is several orders of magnitude faster than the widely used shuffling method, since island counting is trivially incorporated into the basic Smith-Waterman alignment algorithm with minimal computational cost, and all islands are counted in a single alignment. The availability of a rapid and accurate si...
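For context, the shuffling baseline that this abstract compares against can be sketched as follows: score the real pair with Smith-Waterman, then re-score against many shuffled copies of one sequence and count how often the shuffled score reaches the observed one. This is a simplified illustration only. It uses a match/mismatch score and a linear gap penalty (assumed values) rather than an amino acid substitution matrix, and it does not implement the island-counting method of the paper; the function names are hypothetical.

```python
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0):
    """Empirical p-value of the observed score under shuffling of b."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # preserves composition, destroys order
        if sw_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    score, p = shuffle_p_value("ACGTTGCAACGT", "TTACGTTGCATT")
    print(score, p)
```

Every shuffle costs one full alignment, so the run time grows linearly with the number of shuffles; this is the cost that a method needing only one to a few alignments avoids.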

