Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests
, 2004
Cited by 378 (8 self)
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
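The AIC-based comparison described in this abstract can be sketched with Akaike weights. The snippet below is only an illustration, not the authors' implementation: the model names, log-likelihoods, parameter counts and feature labels are hypothetical values chosen for the example. It computes AIC = -2 lnL + 2k for each model, converts the AIC differences into weights that sum to one, and sums the weights of the models containing each parameter to obtain a relative importance.

```python
import math


def akaike_weights(models):
    """models maps a model name to (log_likelihood, number_of_free_parameters)."""
    aic = {name: -2.0 * lnl + 2.0 * k for name, (lnl, k) in models.items()}
    best = min(aic.values())
    # Relative likelihood of each model given the data: exp(-delta_AIC / 2).
    rel = {name: math.exp(-0.5 * (a - best)) for name, a in aic.items()}
    total = sum(rel.values())
    return {name: r / total for name, r in rel.items()}


def parameter_importance(weights, features):
    """Sum the Akaike weights of every model that includes a given parameter."""
    importance = {}
    for name, w in weights.items():
        for param in features[name]:
            importance[param] = importance.get(param, 0.0) + w
    return importance


# Hypothetical fits: (maximized log-likelihood, free substitution parameters).
candidate_models = {
    "JC": (-3000.0, 0),
    "HKY": (-2950.0, 4),
    "HKY+G": (-2900.0, 5),
    "GTR+G": (-2898.0, 9),
}
# Hypothetical labels for the parameters each model includes.
model_features = {
    "JC": [],
    "HKY": ["base_freqs"],
    "HKY+G": ["base_freqs", "gamma"],
    "GTR+G": ["base_freqs", "gamma", "six_rates"],
}

weights = akaike_weights(candidate_models)
importance = parameter_importance(weights, model_features)
for name in sorted(weights, key=weights.get, reverse=True):
    print(f"{name:6s} weight = {weights[name]:.4f}")
print(importance)
```

Because the weights form a distribution over the candidate set, a model-averaged estimate of any quantity can be formed in the same way, as a weight-averaged sum over the models in which that quantity is defined.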
Computing Bayes factors using thermodynamic integration
 Syst Biol
Cited by 111 (7 self)
Abstract: In the Bayesian paradigm, a common method for comparing two models is to compute the Bayes factor, defined as the ratio of their respective marginal likelihoods. In recent phylogenetic works, the numerical evaluation of marginal likelihoods has often been performed using the harmonic mean estimation procedure. In the present article, we propose to employ another method, based on an analogy with statistical physics, called thermodynamic integration. We describe the method, propose an implementation, and show on two analytical examples that this numerical method yields reliable estimates. In contrast, the harmonic mean estimator leads to a strong overestimation of the marginal likelihood, which is all the more pronounced as the model is higher dimensional. As a result, the harmonic mean estimator systematically favors more parameter-rich models, an artefact that might explain some recent puzzling observations, based on harmonic mean estimates, suggesting that Bayes factors tend to overscore complex models. Finally, we apply our method to the comparison of several alternative models of amino-acid replacement. We confirm our previous observations, indicating that modeling pattern heterogeneity across sites tends to yield better models than standard empirical matrices. [Bayes factor; harmonic mean; mixture model; path sampling; phylogeny; thermodynamic integration.] Bayesian methods have become popular in molecular phylogenetics over the recent years. The simple and intuitive interpretation of the concept of probabilities
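To make the idea of thermodynamic integration concrete, here is a minimal sketch on a toy problem that is an assumption of this example, not one of the paper's analytical cases: data drawn from a normal distribution with unknown mean, unit variance, and a normal prior on the mean. For this model the marginal likelihood is known in closed form and the power posterior (prior times likelihood raised to the power beta) can be sampled exactly, so no MCMC is needed. The log marginal likelihood is estimated as the integral over beta from 0 to 1 of the expected log-likelihood under the power posterior, using the trapezoid rule on a grid that is denser near beta = 0 (the grid exponent 5 is an arbitrary choice here). The harmonic mean estimate is computed alongside for comparison.

```python
import math
import random


def make_normal_model(ys, tau2):
    """Toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, tau2)."""
    n, s, ss = len(ys), sum(ys), sum(y * y for y in ys)

    def log_lik(theta):
        return -0.5 * n * math.log(2 * math.pi) - 0.5 * (ss - 2 * theta * s + n * theta * theta)

    def sample_power_posterior(beta, rng):
        # prior * likelihood**beta is Gaussian, so it can be drawn from directly.
        precision = 1.0 / tau2 + beta * n
        return rng.gauss(beta * s / precision, math.sqrt(1.0 / precision))

    exact_log_ml = (-0.5 * n * math.log(2 * math.pi)
                    - 0.5 * math.log(1 + n * tau2)
                    - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))
    return log_lik, sample_power_posterior, exact_log_ml


def thermodynamic_integration(log_lik, sampler, rng, n_steps=50, n_draws=2000):
    betas = [(k / n_steps) ** 5 for k in range(n_steps + 1)]
    expected = []
    for beta in betas:
        draws = (sampler(beta, rng) for _ in range(n_draws))
        expected.append(sum(log_lik(theta) for theta in draws) / n_draws)
    # Trapezoid rule for the integral of E_beta[log L] over beta in [0, 1].
    return sum(0.5 * (betas[k + 1] - betas[k]) * (expected[k] + expected[k + 1])
               for k in range(n_steps))


def harmonic_mean(log_lik, sampler, rng, n_draws=2000):
    # log of the harmonic mean of likelihoods sampled from the posterior (beta = 1).
    neg = [-log_lik(sampler(1.0, rng)) for _ in range(n_draws)]
    top = max(neg)
    return -(top + math.log(sum(math.exp(v - top) for v in neg) / n_draws))


rng = random.Random(1)
data = [rng.gauss(1.0, 1.0) for _ in range(20)]
log_lik, sampler, exact = make_normal_model(data, tau2=1.0)
ti_estimate = thermodynamic_integration(log_lik, sampler, rng)
hm_estimate = harmonic_mean(log_lik, sampler, rng)
print(f"exact {exact:.3f}  TI {ti_estimate:.3f}  HM {hm_estimate:.3f}")
```

In a real phylogenetic analysis the power posterior cannot be sampled directly, so each value of beta would require its own MCMC run or a slowly annealed chain, which is the source of the extra computational cost.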
Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions
 SYSTEMATIC BIOLOGY
, 2007
Cited by 107 (11 self)
The estimation of species trees has become popular as a considerable amount of multilocus molecular data is available for inferring the evolutionary history of species. However, the current phylogenetic paradigm, which reconstructs gene trees to represent the species tree, suggests that commonly used methods such as the concatenation method, the consensus tree method, or the gene tree parsimony method may be either inconsistent or highly biased. In this paper, we propose a Bayesian hierarchical model to estimate the phylogeny of a group of species using multiple estimated gene tree distributions such as those that arise in a Bayesian analysis of DNA sequence data. Our model employs substitution models used in traditional phylogenetics, but also uses coalescent theory to explain genealogical signals from species trees to gene trees and from gene trees to sequence data, thereby forming a stochastic model to estimate gene trees, species trees, ancestral population sizes and species divergence times simultaneously. Our model is founded on the assumption that gene trees, even of unlinked loci, are correlated due to being derived from a single species tree and therefore should be estimated jointly. We apply the method to two multilocus DNA sequence datasets. The estimates of the
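The link between species trees and gene trees under coalescent theory can be illustrated with the simplest possible case, which is not the hierarchical model of the paper but a standard textbook result. For a rooted species tree ((A,B),C) whose internal branch has length t in coalescent units (assumed here to mean generations divided by twice the effective population size for a diploid population), the lineages from A and B fail to coalesce on that branch with probability exp(-t), after which each of the three possible pairings is equally likely. The sketch below computes the resulting gene tree probabilities and checks the concordance probability by simulation; the function names are hypothetical.

```python
import math
import random


def gene_tree_probabilities(t):
    """Gene tree topology probabilities for species tree ((A,B),C), branch length t."""
    discordant = math.exp(-t) / 3.0
    return {
        "((A,B),C)": 1.0 - 2.0 * discordant,  # matches the species tree
        "((A,C),B)": discordant,
        "((B,C),A)": discordant,
    }


def simulate_concordance(t, n_loci, rng):
    """Fraction of simulated independent loci whose gene tree matches the species tree."""
    concordant = 0
    for _ in range(n_loci):
        if rng.expovariate(1.0) < t:
            # A and B coalesce on the internal branch.
            concordant += 1
        elif rng.randrange(3) == 0:
            # Three lineages reach the root population; one pairing in three is (A,B).
            concordant += 1
    return concordant / n_loci


branch_length = 1.0
analytic = gene_tree_probabilities(branch_length)
simulated = simulate_concordance(branch_length, 20000, random.Random(7))
print(analytic, simulated)
```

Short internal branches push the three topologies toward equal probability, which is why independent loci can support conflicting gene trees even when they share a single species tree.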
AWTY (Are We There Yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics
, 2007
Cited by 99 (3 self)
Summary: A key element to a successful Markov chain Monte Carlo (MCMC) inference is the programming and run performance of the Markov chain. However, the explicit use of quality assessments of the MCMC simulations—convergence diagnostics—in phylogenetics is still uncommon. Here we present a simple tool that uses the output from MCMC simulations and visualizes a number of properties of primary interest in a Bayesian phylogenetic analysis, such as convergence rates of posterior split probabilities and branch lengths. Graphical exploration of the output from phylogenetic MCMC simulations gives intuitive and often crucial information on the success and reliability of the analysis. The tool presented here complements convergence diagnostics already available in other software packages primarily designed for other applications of MCMC. Importantly, the common practice of using traceplots of a single parameter or summary statistic, such as the likelihood score of sampled trees, can be misleading for assessing the success of a phylogenetic MCMC simulation.
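One of the quantities mentioned above, the running estimate of a split's posterior probability, is easy to sketch. The code below is a toy illustration and does not reproduce AWTY's input format or interface: each sampled tree is represented, as an assumption of this example, by a set of splits, and each split by a frozenset of the taxa on one side. The function returns the cumulative frequency of a given split after each successive sample, which is the series one would plot to see whether the estimate has stabilized.

```python
from itertools import accumulate


def cumulative_split_frequency(sampled_trees, split):
    """Running posterior estimate of `split` after each successive MCMC sample."""
    hits = accumulate(1 if split in tree else 0 for tree in sampled_trees)
    return [h / (i + 1) for i, h in enumerate(hits)]


def max_split_difference(run_a, run_b, splits):
    """Largest absolute difference in final split frequencies between two runs."""
    def final(run, split):
        return sum(1 for tree in run if split in tree) / len(run)
    return max(abs(final(run_a, s) - final(run_b, s)) for s in splits)


# Toy samples: each tree is a set of splits over taxa A, B, C, D.
ab = frozenset("AB")
ac = frozenset("AC")
tree_ab = {ab}
tree_ac = {ac}
run_1 = [tree_ab, tree_ac, tree_ab, tree_ab]
run_2 = [tree_ab, tree_ab, tree_ac, tree_ac]

trace = cumulative_split_frequency(run_1, ab)
difference = max_split_difference(run_1, run_2, [ab, ac])
print(trace, difference)
```

A trace that is still drifting at the end of the run, or split frequencies that differ markedly between independent runs, would suggest that the chains have not yet converged on the same posterior distribution of trees.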
2004. Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Syst. Biol
Cited by 93 (6 self)
Abstract: What does the posterior probability of a phylogenetic tree mean? This simulation study shows that Bayesian posterior probabilities have the meaning that is typically ascribed to them; the posterior probability of a tree is the probability that the tree is correct, assuming that the model is correct. At the same time, the Bayesian method can be sensitive to model misspecification, and the sensitivity of the Bayesian method appears to be greater than the sensitivity of the nonparametric bootstrap method (using maximum likelihood to estimate trees). Although the estimates of phylogeny obtained by use of the method of maximum likelihood or the Bayesian method are likely to be similar, the assessment of the uncertainty of inferred trees via either bootstrapping (for maximum likelihood estimates) or posterior probabilities (for Bayesian estimates) is not likely to be the same. We suggest that the Bayesian method be implemented with the most complex models of those currently available, as this should reduce the chance that the method will concentrate too much probability on too few trees. [Bayesian estimation; Markov chain Monte Carlo; posterior probability; prior probability.] Quantifying the uncertainty of a phylogenetic estimate is at least as important a goal as obtaining the phylogenetic estimate itself. Measures of phylogenetic reliability not only point out what parts of a tree can be trusted when interpreting the evolution of a group, but can guide
The importance of data partitioning and the utility of Bayes factors in Bayesian phylogenetics. Syst. Biol
, 2007
Cited by 67 (6 self)
Abstract: As larger, more complex data sets are being used to infer phylogenies, accuracy of these phylogenies increasingly requires models of evolution that accommodate heterogeneity in the processes of molecular evolution. We investigated the effect of improper data partitioning on phylogenetic accuracy, as well as the type I error rate and sensitivity of Bayes factors, a commonly used method for choosing among different partitioning strategies in Bayesian analyses. We also used Bayes factors to test empirical data for the need to divide data in a manner that has no expected biological meaning. Posterior probability estimates are misleading when an incorrect partitioning strategy is assumed. The error was greatest when the assumed model was underpartitioned. These results suggest that model partitioning is important for large data sets. Bayes factors performed well, giving a 5% type I error rate, which is remarkably consistent with standard frequentist hypothesis tests. The sensitivity of Bayes factors was found to be quite high when the across-class model heterogeneity reflected that of empirical data. These results suggest that Bayes factors represent a robust method of choosing among partitioning strategies. Lastly, results of tests for the inclusion of unexpected divisions in empirical data mirrored the simulation results, although the outcome of such tests is highly dependent on accounting for rate variation among classes. We conclude by discussing other approaches for partitioning data, as well as other applications of Bayes factors. [Bayes factors; Bayesian phylogenetic inference; data partitioning; model choice; posterior probabilities.] Maximum likelihood (ML) and Bayesian methods of
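Choosing among partitioning strategies with Bayes factors reduces, once marginal likelihoods have been estimated, to comparing their logarithms. The following sketch assumes the log marginal likelihoods are already available; the strategy names and numbers are hypothetical, and the cutoff of 10 on the 2 ln BF scale is a conventional choice used here as an assumption rather than a value taken from the abstract.

```python
def two_ln_bayes_factor(log_ml_1, log_ml_0):
    """2 ln BF in favor of model 1 over model 0, from log marginal likelihoods."""
    return 2.0 * (log_ml_1 - log_ml_0)


def prefer_complex(log_ml_complex, log_ml_simple, threshold=10.0):
    """Accept the more heavily partitioned model only if 2 ln BF exceeds the threshold."""
    return two_ln_bayes_factor(log_ml_complex, log_ml_simple) > threshold


# Hypothetical log marginal likelihoods, ordered from least to most partitioned.
strategies = [
    ("unpartitioned", -12500.0),
    ("by_gene", -12350.0),
    ("by_gene_and_codon", -12346.0),
]

chosen_name, chosen_log_ml = strategies[0]
for name, log_ml in strategies[1:]:
    if prefer_complex(log_ml, chosen_log_ml):
        chosen_name, chosen_log_ml = name, log_ml
print(chosen_name)
```

With these made-up numbers the step from no partitioning to partitioning by gene is strongly supported, while the further split by codon position yields 2 ln BF = 8 and is not accepted under the assumed cutoff.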
Improving marginal likelihood estimation for Bayesian phylogenetic model selection. Syst Biol
, 2011
Cited by 38 (1 self)
Abstract: The marginal likelihood is commonly used for comparing different evolutionary models in Bayesian phylogenetics and is the central quantity used in computing Bayes Factors for comparing model fit. A popular method for estimating marginal likelihoods, the harmonic mean (HM) method, can be easily computed from the output of a Markov chain Monte Carlo analysis but often greatly overestimates the marginal likelihood. The thermodynamic integration (TI) method is much more accurate than the HM method but requires more computation. In this paper, we introduce a new method, stepping-stone sampling (SS), which uses importance sampling to estimate each ratio in a series (the “stepping stones”) bridging the posterior and prior distributions. We compare the performance of the SS approach to the TI and HM methods in simulation and using real data. We conclude that the greatly increased accuracy of the SS and TI methods argues for their use instead of the HM method, despite the extra computation needed. [Bayes factor; harmonic mean; phylogenetics, marginal likelihood;
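The stepping-stone idea can be sketched on a toy model where the answer is known. This is an illustration under stated assumptions, not the authors' implementation: the data are normal with unknown mean and unit variance, the prior on the mean is normal, and the power posterior at each beta is sampled exactly instead of by MCMC. The marginal likelihood is written as a product of ratios of normalizing constants between successive beta values, and each ratio is estimated by importance sampling from the lower of the two power posteriors. The beta grid, taken as evenly spaced quantiles of a Beta(0.3, 1) distribution, is an assumption of this sketch.

```python
import math
import random


class NormalMeanModel:
    """Toy model: y_i ~ N(theta, 1) with prior theta ~ N(0, tau2)."""

    def __init__(self, ys, tau2):
        self.n, self.s = len(ys), sum(ys)
        self.ss = sum(y * y for y in ys)
        self.tau2 = tau2

    def log_lik(self, theta):
        quad = self.ss - 2 * theta * self.s + self.n * theta * theta
        return -0.5 * self.n * math.log(2 * math.pi) - 0.5 * quad

    def sample_power_posterior(self, beta, rng):
        # prior * likelihood**beta is Gaussian for this conjugate model.
        precision = 1.0 / self.tau2 + beta * self.n
        return rng.gauss(beta * self.s / precision, math.sqrt(1.0 / precision))

    def exact_log_marginal(self):
        n, s, ss, tau2 = self.n, self.s, self.ss, self.tau2
        return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau2)
                - 0.5 * (ss - tau2 * s * s / (1 + n * tau2)))


def stepping_stone(model, rng, n_steps=50, n_draws=2000):
    betas = [(k / n_steps) ** (1 / 0.3) for k in range(n_steps + 1)]
    log_ml = 0.0
    for lo, hi in zip(betas, betas[1:]):
        # Importance-sampling estimate of Z(hi) / Z(lo) using draws at beta = lo.
        vals = [(hi - lo) * model.log_lik(model.sample_power_posterior(lo, rng))
                for _ in range(n_draws)]
        top = max(vals)  # subtract the maximum for numerical stability
        log_ml += top + math.log(sum(math.exp(v - top) for v in vals) / n_draws)
    return log_ml


rng = random.Random(3)
data = [rng.gauss(1.0, 1.0) for _ in range(20)]
model = NormalMeanModel(data, tau2=1.0)
ss_estimate = stepping_stone(model, rng)
exact = model.exact_log_marginal()
print(f"exact {exact:.3f}  SS {ss_estimate:.3f}")
```

Because every ratio only bridges two neighboring distributions, each importance-sampling step stays well behaved, which is the property that the harmonic mean estimator lacks.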
Evolutionary rates, divergence dates, and the performance of mitochondrial genes in Bayesian phylogenetic analysis. Syst. Biol
Cited by 33 (1 self)
Abstract: The mitochondrial genome is one of the most frequently used loci in phylogenetic and phylogeographic analyses, and it is becoming increasingly possible to sequence and analyze this genome in its entirety from diverse taxa. However, sequencing the entire genome is not always desirable or feasible. Which genes should be selected to best infer the evolutionary history of the mitochondria within a group of organisms, and what properties of a gene determine its phylogenetic performance? The current study addresses these questions in a Bayesian phylogenetic framework with reference to a phylogeny of plethodontid and related salamanders derived from 27 complete mitochondrial genomes; this topology is corroborated by nuclear DNA and morphological data. Evolutionary rates for each mitochondrial gene and divergence dates for all nodes in the plethodontid mitochondrial genome phylogeny were estimated in both Bayesian and maximum likelihood frameworks using multiple fossil calibrations, multiple data partitions, and a clock-independent approach. Bayesian analyses of individual genes were performed, and the resulting trees compared against the reference topology. Ordinal logistic regression analysis of molecular evolution rate, gene length, and the Γ-shape parameter α demonstrated that slower rate of evolution and longer gene length both increased the probability that a gene would perform well phylogenetically. Estimated rates of molecular evolution vary 84-fold among different mitochondrial genes and different salamander lineages, and mean rates among genes vary 15-fold. Despite having conserved amino acid sequences, cox1, cox2, cox3, and cob have the fastest mean rates of nucleotide substitution, and the greatest variation in rates, whereas rrnS and rrnL have the slowest rates. Reasons
Accurate branch length estimation in partitioned Bayesian analyses requires accommodation of among-partition rate variation and attention to branch length priors. Syst Biol
, 2006
Cited by 29 (0 self)
Molecular phylogenetic studies are making increasing use of partitioned Bayesian analyses via software tools like MrBayes, version 3 (Ronquist and Huelsenbeck, 2003). Data partitioning is important because, as long as the same topology/history underlies all of the partitions, it addresses some of the problems associated with the combination of data sets with heterogeneous rates (Bull et al., 1993) and eliminates the need to argue the validity of tests that have been used to judge data combinability (e.g., Huelsenbeck et al., 1994; Huelsenbeck
Data partitions and complex models in Bayesian analysis: the phylogeny of gymnophthalmid lizards
 Syst. Biol
, 2004
Cited by 27 (0 self)
Abstract: Phylogenetic studies incorporating multiple loci, and multiple genomes, are becoming increasingly common. Coincident with this trend in genetic sampling, model-based likelihood techniques including Bayesian phylogenetic methods continue to gain popularity. Few studies, however, have examined model fit and sensitivity to such potentially heterogeneous data partitions within combined data analyses using empirical data. Here we investigate the relative model fit and sensitivity of Bayesian phylogenetic methods when alternative site-specific partitions of among-site rate variation (with and without autocorrelated rates) are considered. Our primary goal in choosing a best-fit model was to employ the simplest model that was a good fit to the data while optimizing topology and/or Bayesian posterior probabilities. Thus, we were not interested in complex models that did not practically affect our interpretation of the topology under study. We applied these alternative models to a four-gene data set including one protein-coding nuclear gene (c-mos), one protein-coding mitochondrial gene (ND4), and two mitochondrial rRNA genes (12S and 16S) for the diverse yet poorly known lizard family Gymnophthalmidae. Our results suggest that the best-fit model partitioned among-site rate variation separately among the c-mos, ND4, and 12S + 16S gene regions. We found this model yielded identical topologies to those from analyses based on the GTR+I+G model, but significantly changed posterior probability estimates of clade support. This partitioned model also produced more precise (less variable) estimates of posterior probabilities across generations of long Bayesian runs, compared to runs employing a GTR+I+G model estimated for the combined data. We use this three-way gamma partitioning in Bayesian analyses to