Results 11-20 of 39
Computational grand challenges in assembling the tree of life: Problems & solutions
The IEEE and ACM Supercomputing Conference 2005 (SC2005) Tutorial, 2005
Cited by 11 (1 self)
The computation of ever larger as well as more accurate phylogenetic (evolutionary) trees with the ultimate goal to compute the tree of life represents one of the grand challenges in High Performance Computing (HPC) Bioinformatics. Unfortunately, the size of trees which can be computed in reasonable time based on elaborate evolutionary models is limited by the severe computational cost inherent to these methods. There exist two orthogonal research directions to overcome this challenging computational burden: First, the development of novel, faster, and more accurate heuristic algorithms and second, the application of high performance computing techniques. The goal of this chapter is to provide a comprehensive introduction to the field of computational evolutionary biology to an audience with computing background, interested in participating in research and/or commercial applications of this field. Moreover, we will cover leading-edge technical and algorithmic developments in the field and discuss open problems and potential solutions.
New approaches to phylogenetic tree search and their application to large numbers of protein alignments
Syst Biol
Cited by 10 (2 self)
Phylogenetic tree estimation plays a critical role in a wide variety of molecular studies, including molecular systematics, phylogenetics, and comparative genomics. Finding the optimal tree relating a set of sequences using score-based (optimality criterion) methods, such as maximum likelihood and maximum parsimony, may require all possible trees to be considered, which is not feasible even for modest numbers of sequences. In practice, trees are estimated using heuristics that represent a tradeoff between topological accuracy and speed. I present a series of novel algorithms suitable for score-based phylogenetic tree reconstruction that demonstrably improve the accuracy of tree estimates while maintaining high computational speeds. The heuristics function by allowing the efficient exploration of large numbers of trees through novel hill-climbing and resampling strategies. These heuristics, and other computational approximations, are implemented for maximum likelihood estimation of trees in the program Leaphy, and its performance is compared to other popular phylogenetic programs. Trees are estimated from 4059 different protein alignments using a selection of phylogenetic programs and the likelihoods of the tree estimates are compared. Trees estimated using Leaphy are found to have equal to or better likelihoods than trees estimated using other phylogenetic programs in 4004 (98.6%) families and provide a unique best tree that no other program found in 1102 (27.1%) families. The improvement is particularly marked for larger families (80 to 100 sequences), where Leaphy finds a unique best tree in 81.7% of families. [Algorithms; evolution; phylogenetic tree inference; tree estimation heuristics.]
Searching for convergence in phylogenetic Markov chain Monte Carlo
Syst Biol
Cited by 10 (0 self)
Markov chain Monte Carlo (MCMC) is a methodology that is gaining widespread use in the phylogenetics community and is central to phylogenetic software packages such as MrBayes. An important issue for users of MCMC methods is how to select appropriate values for adjustable parameters such as the length of the Markov chain or chains, the sampling density, the proposal mechanism, and, if Metropolis-coupled MCMC is being used, the number of heated chains and their temperatures. Although some parameter settings have been examined in detail in the literature, others are frequently chosen with more regard to computational time or personal experience with other data sets. Such choices may lead to inadequate sampling of tree space or an inefficient use of computational resources. We performed a detailed study of convergence and mixing for 70 randomly selected, putatively orthologous protein sets with different sizes and taxonomic compositions. Replicated runs from multiple random starting points permit a more rigorous assessment of convergence, and we developed two novel statistics, δ and ε, for this purpose. Although likelihood values invariably stabilized quickly, adequate sampling of the posterior distribution of tree topologies took considerably longer. Our results suggest that multimodality is common for data sets with 30 or more taxa and that this results in slow convergence and mixing. However, we also found that the pragmatic approach of combining data from several short, replicated runs into a "metachain" to estimate bipartition posterior probabilities provided good approximations, and that such estimates were no worse in approximating a reference posterior distribution than those obtained using a single long run of the same length as
Increasing the Efficiency of Searches for the Maximum Likelihood Tree in a Phylogenetic Analysis of up to 150 Nucleotide Sequences
Systematic Biology 56, 2007
Cited by 9 (0 self)
Even when the maximum likelihood (ML) tree is a better estimate of the true phylogenetic tree than those produced by other methods, the result of a poor ML search may be no better than that of a more thorough search under some faster criterion. The ability to find the globally optimal ML tree is therefore important. Here, I compare a range of heuristic search strategies (and their associated computer programs) in terms of their success at locating the ML tree for 20 empirical data sets with 14 to 158 sequences and 411 to 120,762 aligned nucleotides. Three distinct topics are discussed: the success of the search strategies in relation to certain features of the data, the generation of starting trees for the search, and the exploration of multiple islands of trees. As a starting tree, there was little difference among the neighbor-joining tree based on absolute differences (including the BioNJ tree), the stepwise-addition parsimony tree (with or without nearest-neighbor-interchange (NNI) branch swapping), and the stepwise-addition ML tree. The latter produced the best ML score on average but was orders of magnitude slower than the alternatives. The BioNJ tree was second best on average. As search strategies, star decomposition and quartet puzzling were the slowest and produced the worst ML scores. The DPRml, IQPNNI, MultiPhyl, PhyML, PhyNav, and TreeFinder programs with default options produced qualitatively similar results, each locating a single tree that tended to be in an NNI suboptimum (rather than the global optimum) when the data set had low phylogenetic information. For such data sets, there were multiple tree islands with very similar ML scores. The likelihood surface only became relatively simple for data sets that contained approximately 500 aligned nucleotides for 50 sequences and 3,000
QuartetS: a fast and accurate algorithm for large-scale orthology detection
Nucleic Acids Res, 2011
Reconstructing posterior distributions of a species . . .
2006
Cited by 4 (1 self)
The desire to infer the evolutionary history of a group of species should be more viable now that a considerable amount of multilocus molecular data is available. However, the current molecular phylogenetic paradigm still reconstructs gene trees to represent the species tree. Further, commonly used methods to combine data, such as the concatenation method, the consensus tree method, or the gene tree parsimony method may be biased. In this dissertation, I propose a Bayesian hierarchical model to estimate the phylogeny of a group of species using multiple estimated gene tree distributions such as those that arise in a Bayesian analysis of DNA sequence data. The model employs substitution models used in traditional phylogenetics, but also uses coalescent theory to explain genealogical signals from species trees to gene trees and from gene trees to sequence data, thereby forming a complete stochastic model to simultaneously estimate gene trees, species trees, ancestral population sizes, and species divergence times. The proposed model is founded on the assumption that gene trees, even of unlinked loci, are correlated due to being derived from a single
Bayesian Phylogeny Analysis via Stochastic Approximation Monte
2008
Cited by 4 (3 self)
Monte Carlo methods have received much attention in the recent literature of phylogeny analysis. However, the conventional Markov chain Monte Carlo algorithms, such as the Metropolis-Hastings algorithm, tend to get trapped in a local energy minimum in simulating from the posterior distribution of phylogenetic trees, rendering the inference ineffective. In this paper, we apply an advanced Monte Carlo algorithm, the stochastic approximation Monte Carlo algorithm, to Bayesian phylogeny analysis. Our method is compared with two popular Bayesian phylogeny software packages, BAMBE and MrBayes, on simulated and real datasets. The numerical results favor our method, which tends to produce better consensus trees and more accurate estimates for the parameters of the sequence evolutionary model, while using less CPU time than the methods under comparison.
Algorithms for Phylogenetic Tree Reconstruction
In Proceedings of the 2nd International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences, 2000
Cited by 3 (0 self)
Scientists often wish to use the information contained in the DNA sequences of a collection of organisms, or taxa, to infer the evolutionary relationships among those taxa. These evolutionary relationships are generally represented by a labeled binary tree, called a phylogenetic tree. The phylogeny reconstruction problem is computationally difficult because the number of possible solutions increases rapidly with the number of taxa to be included in the tree. For example, for 30 taxa (a moderate number of taxa given the wide availability of DNA sequence data today!), there are 8.69 × 10^36 possible unrooted bifurcating trees to be considered. Thus, scientists are faced with the task of choosing from among a large number of trees that tree(s) which gives the "best" representation of the evolutionary relationships among the taxa based on the data. Often, optimality criteria are applied to allow trees to be compared with one another. In this case, the phylogenetic tree reconstruction problem becomes that of finding the particular tree(s) that optimize the criteria. Exhaustive search of the space of phylogenetic trees is generally not possible for more than 11 taxa, and so algorithms for efficiently searching the space of trees must be developed. Branch-and-bound methods can reasonably be applied for up to about 20 taxa, but no computationally efficient exact algorithms have been developed for problems larger than this. Therefore, scientists generally rely on heuristic algorithms, such as stepwise-addition and star decomposition methods. However, such algorithms generally involve a prohibitive amount of computation time for large problems and often find trees that are only locally optimal. Recently, stochastic search algorithms, such as simulated annealing, genetic al...
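The tree count quoted in the abstract above can be checked with a short sketch. It assumes the standard counting result that n labeled taxa admit (2n - 5)!! distinct unrooted bifurcating trees; the function name num_unrooted_trees is illustrative and does not come from the paper.

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Count unrooted bifurcating trees on n_taxa labeled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product when n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


# Tree-space size at the taxon counts mentioned in the abstract.
for n in (11, 20, 30):
    print(n, f"{num_unrooted_trees(n):.2e}")
```

For 30 taxa this reproduces the 8.69 × 10^36 figure given in the abstract, which shows why exhaustive enumeration stops being practical at small taxon counts.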
Building large phylogenetic trees on coarse-grained parallel machines
Algorithmica, 2006
Cited by 3 (2 self)
Phylogenetic analysis is an area of computational biology concerned with the reconstruction of evolutionary relationships between organisms, genes, and gene families. Maximum likelihood evaluation has proven to be one of the most reliable methods for constructing phylogenetic trees. The huge computational requirements associated with maximum likelihood analysis mean that it is not feasible to produce large phylogenetic trees using a single processor. We have completed a fully cross-platform coarse-grained distributed application, DPRml, which overcomes many of the limitations imposed by the current set of parallel phylogenetic programs. We have completed a set of efficiency tests that show how to maximise efficiency while using the program to build large phylogenetic trees. The software is publicly available under the terms of the GNU general public licence from the system webpage at
Efficient Tree Searches with Available Algorithms
2007
Cited by 2 (0 self)
Phylogenetic methods based on optimality criteria are highly desirable for their logical properties, but time-consuming when compared to other methods of tree construction. Traditionally, researchers have been limited to exploring tree space by using multiple replicates of Wagner addition followed by typical hill-climbing algorithms such as SPR and/or TBR branch swapping, but these methods have been shown to be insufficient for "large" data sets (or even for small data sets with a complex tree space). Here, I review different algorithms and search strategies used for phylogenetic analysis with the aim of clarifying certain aspects of this important part of the phylogenetic inference exercise. The techniques discussed here apply to both major families of methods based on optimality criteria (parsimony and maximum likelihood) and allow the thorough analysis of complex data sets with hundreds to thousands of terminal taxa. A new technique, called preprocessed searches, is proposed for reusing phylogenetic results obtained in previous analyses, to increase the applicability of the previously proposed jumpstarting phylogenetics method. This article is intended to serve as an educational and algorithmic reference for biologists interested in phylogenetic analysis.
Rationale: In phylogenetic analysis, numerical methods are preferred over other methods because of their efficiency and repeatability. Within numerical methods, those based on optimality criteria are to be preferred because they allow for hypothesis testing and tree comparisons based on objective measures. However,