Results 1  10
of
244
A Simple, Fast, and Accurate Algorithm to Estimate Large Phylogenies by Maximum Likelihood
, 2003
"... The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximumlikelihood principle, which clearly satisfies these requirements. The ..."
Abstract

Cited by 2182 (27 self)
 Add to MetaCart
The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximumlikelihood principle, which clearly satisfies these requirements. The core of this method is a simple hillclimbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distancebased method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximumlikelihood programs and much higher than the performance of distancebased and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximumlikelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distancebased and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page:
Combining phylogenetic and hidden Markov models in biosequence analysis
 J. Comput. Biol
, 2004
"... A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individ ..."
Abstract

Cited by 135 (13 self)
 Add to MetaCart
(Show Context)
A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction—for gene finding, for example, or prediction of secondary structure. In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higherorder states in the HMM, which allows for contextsensitive models of substitution— that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higherorder states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higherorder states being particularly pronounced.
Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models
 SYST. BIOL
, 2004
"... What does die posterior probability of a phylogenetic tree mean? This simulation study shows that Bayesian posterior probabilities have the meaning that is typically ascribed to them; the pt>sterkir probability ot'a tree is the probability that the tree is corwct, assuming th>.it the mo ..."
Abstract

Cited by 101 (7 self)
 Add to MetaCart
What does die posterior probability of a phylogenetic tree mean? This simulation study shows that Bayesian posterior probabilities have the meaning that is typically ascribed to them; the pt>sterkir probability ot'a tree is the probability that the tree is corwct, assuming th>.it the model is correct. At the same time, the BayLsian method can be sensitive to model misspecification, and the sensitivity of the Bayesian method appears to be greater than the sensitivity ot " the nonparametric bootstrap method (using maximum likelihood to estimate trees). Although the estimatLs of phylogeny obtained by use of the method of maximum likelihood or the Bayesian method are Ukely to be similar, the assessment of the uncertainty of inferred trees via either bootstriipping (t"or maximum likelihood estimates) or petsterior probabilities (for Bayesian estimates) is not likely to be the same. We suggest that the Bayesian method be implemented with the most complex models of those currently avaiiable, as tliis should reduce the chance that the metliod will concentrate too much probability on tuo few trees. [Bayesian estimation; Markov ch^iin Monte Carlo; posterior probability; prior probability.] Quantify ing the uncertainty of a phylogcneticesti mil te is at least as important a goal as obtaining the phylogenetic estimate itself. Measures of phylogenetic reliability not only point out what parts of a tree can be trusted when interpreting the evolution of a group, but can guide
Computational Identification of Evolutionarily Conserved Exons
, 2004
"... Phylogenetic hidden Markov models (phyloHMMs) have recently been proposed as a means for addressing a multispecies version of the ab initio gene prediction problem. These models allow sequence divergence,a phylogeny,patterns of substitution,and base composition all to be considered simultaneously,i ..."
Abstract

Cited by 62 (13 self)
 Add to MetaCart
Phylogenetic hidden Markov models (phyloHMMs) have recently been proposed as a means for addressing a multispecies version of the ab initio gene prediction problem. These models allow sequence divergence,a phylogeny,patterns of substitution,and base composition all to be considered simultaneously,in a single unified probabilistic model. Here,we apply phyloHMMs to a restricted version of the gene prediction problem in which individual exons are sought that are evolutionarily conserved across a diverse set of species. We discuss two new methods for improving prediction performance: (1) the use of contextdependent phylogenetic models,which capture phenomena such as a strong CpG effect in noncoding regions and a preference for synonymous rather than nonsynonymous substitutions in coding regions; and (2) a novel strategy for incorporating insertions and deletion (indels) into the statetransition structure of the model,which captures the different characteristic patterns of alignment gaps in coding and noncoding regions. We also discuss the technique,previously used in pairwise gene predictors,of explicitly modeling conserved noncoding sequence to help reduce false positive predictions. These methods have been incorporated into an exon prediction program called ExoniPhy, and tested with two large datasets. Experimental results indicate that all three methods produce significant improvements in prediction performance. In combination,they lead to prediction accuracy comparable to that of some of the best available gene predictors,despite several limitations of our current models.
GM: An expectation maximization algorithm for training hidden substitution models
 J Mol Bio
"... We derive an expectation maximization algorithm for maximumlikelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve ov ..."
Abstract

Cited by 55 (10 self)
 Add to MetaCart
(Show Context)
We derive an expectation maximization algorithm for maximumlikelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. Measuring the accuracy of multiple alignment algorithms with reference to BAliBASE (a database of structural reference alignments) our substitution matrices consistently outperform the PAM series, with the improvement steadily increasing as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.
Models of molecular evolution and phylogeny
 Genome Res
, 1998
"... Phylogenetic reconstruction is a fastgrowing field that is enriched by different statistical approaches and by findings and applications in a broad range of biological areas. Fundamental to these are the mathematical models used to describe the patterns of DNA base substitution and amino acid repla ..."
Abstract

Cited by 51 (0 self)
 Add to MetaCart
Phylogenetic reconstruction is a fastgrowing field that is enriched by different statistical approaches and by findings and applications in a broad range of biological areas. Fundamental to these are the mathematical models used to describe the patterns of DNA base substitution and amino acid replacement. These may become some of the basic models for comparative genome research. We discuss these models, including the analysis of observed DNA base and amino acid mutation patterns, the concept of site heterogeneity, and the incorporation of structural biology data, all of which have become particularly important in recent years. We also describe the use of such models in phylogenetic reconstruction and statistical methods for the comparison of different models. PCR has deeply transformed and boosted phylogenetic studies. At the same time, the statistical analysis of evolutionary relationships among species has recently revealed important biotechnological uses. For example, the understanding of viral quasispecies variation allows us to trace routes of infectious disease transmission. The analysis of the host–
The Importance of Proper Model Assumption in Bayesian Phylogenetics
, 2004
"... We studied the importance of proper model assumption in the context of Bayesian phylogenetics by examining>5,000 Bayesian analyses and six nested models of nucleotide substitution. Model misspecification can strongly bias bipartition posterior probability estimates. These biases were most pronou ..."
Abstract

Cited by 50 (4 self)
 Add to MetaCart
(Show Context)
We studied the importance of proper model assumption in the context of Bayesian phylogenetics by examining>5,000 Bayesian analyses and six nested models of nucleotide substitution. Model misspecification can strongly bias bipartition posterior probability estimates. These biases were most pronounced when rate heterogeneity was ignored. The type of bias seen at a particular bipartition appeared to be strongly influenced by the lengths of the branches surrounding that bipartition. In the Felsenstein zone, posterior probability estimates of bipartitions were biased when the assumed model was underparameterized but were unbiased when the assumed model was overparameterized. For the inverse Felsenstein zone, however, both underparameterization and overparameterization led to biased bipartition posterior probabilities, although the bias caused by overparameterization was less pronounced and disappeared with increased sequence length. Model parameter estimates were also affected by model misspecification. Underparameterization caused a bias in some parameter estimates, such as branch lengths and the gamma shape parameter, whereas overparameterization caused a decrease in the precision of some parameter estimates. We caution researchers to assure that the most appropriate model is assumed by employing both a priori model choice methods and a posteriori model adequacy tests. [Bayesian phylogenetic inference; convergence; Markov chain Monte Carlo; maximum likelihood; model choice; posterior probability.] Model choice is becoming a critical issue as the number of available models of nucleotide evolution increases rapidly. Recent studies have shown that adequate
General TimeReversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites
, 1997
"... This paper aims to explain to biologists the assumptions of these distances and to clarify some earlier misconceptions. Importantly, nearly all of the currently used distance estimates (including those of Tamura, 1992; Tamura and Nei, 1994) are special cases (restrictions) of the general timerevers ..."
Abstract

Cited by 49 (3 self)
 Add to MetaCart
This paper aims to explain to biologists the assumptions of these distances and to clarify some earlier misconceptions. Importantly, nearly all of the currently used distance estimates (including those of Tamura, 1992; Tamura and Nei, 1994) are special cases (restrictions) of the general timereversible distance (see Zharkikh, 1994; Swofford et al., 1996)
Multiplesequence functional annotation and the generalized hidden Markov phylogeny
, 2004
"... Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with featurebased functional annotation me ..."
Abstract

Cited by 37 (5 self)
 Add to MetaCart
Motivation: Phylogenetic shadowing is a comparative genomics principle that allows for the discovery of conserved regions in sequences from multiple closely related organisms. We develop a formal probabilistic framework for combining phylogenetic shadowing with featurebased functional annotation methods. The resulting model, a generalized hidden Markov phylogeny (GHMP), applies to a variety of situations where functional regions are to be inferred from evolutionary constraints. Results: We show how GHMPs can be used to predict complete shared gene structures in multiple primate sequences. We also describe shadower, our implementation of such a prediction system. We find that shadower outperforms previously reported ab initio gene finders, including comparative human–mouse approaches, on a small sample of diverse exonic regions. Finally, we report on an empirical analysis of shadower’s performance which reveals that as few as five wellchosen species may suffice to attain maximal sensitivity and specificity in exon demarcation. Availability: A Web server is available at
Phylogenetic hidden Markov models
 IN STATISTICAL METHODS IN MOLECULAR EVOLUTION
, 2005
"... Phylogenetic hidden Markov models, or phyloHMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way this process changes from one site to the next. By treating molecular evolution as a combination of tw ..."
Abstract

Cited by 37 (6 self)
 Add to MetaCart
(Show Context)
Phylogenetic hidden Markov models, or phyloHMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way this process changes from one site to the next. By treating molecular evolution as a combination of two Markov processes—one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree)—these models allow aspects of both sequence structure and sequence evolution to be captured. Moreover, as we will discuss, they permit key computations to be performed exactly and efficiently. PhyloHMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence “segmentation, ” such as gene prediction and the identification of conserved elements. PhyloHMMs were first proposed as a way of improving phylogenetic models that allow for variation among sites in the rate of substitution [8, 52]. Soon afterward, they were adapted for the problem of secondary structure