Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests
, 2004
Cited by 378 (8 self)
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
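As an illustration of the AIC-based model averaging mentioned in this abstract, the following is a minimal sketch, not the authors' implementation. It uses the standard definitions AIC = -2 lnL + 2k and Akaike weights proportional to exp(-ΔAIC/2). The function names, log-likelihoods, parameter counts, and per-model estimates are all hypothetical values chosen for the example.

```python
import math

def akaike_weights(log_likelihoods, num_params):
    """Return (AIC scores, Akaike weights) for a list of candidate models."""
    # AIC = -2 * lnL + 2 * k for each model
    aic = [-2.0 * lnl + 2.0 * k for lnl, k in zip(log_likelihoods, num_params)]
    best = min(aic)
    # Relative likelihood of each model given the data: exp(-delta_AIC / 2)
    rel = [math.exp(-0.5 * (a - best)) for a in aic]
    total = sum(rel)
    return aic, [r / total for r in rel]

def model_average(estimates, weights):
    """Weighted average of one parameter estimated under every model."""
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical maximized log-likelihoods and free-parameter counts
# for three candidate substitution models (illustrative numbers only).
lnl = [-3500.0, -3480.0, -3478.5]
k = [1, 5, 9]
aic, weights = akaike_weights(lnl, k)

# Hypothetical per-model estimates of a single parameter.
averaged = model_average([2.0, 3.0, 3.4], weights)
print(aic, weights, averaged)
```

The weights sum to one, so they can be read as the relative support for each candidate model and reused to average any quantity estimated under all of them.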
Bayesian phylogenetic analysis of combined data
 Syst. Biol
, 2004
Cited by 177 (10 self)
Abstract.—The recent development of Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) techniques has facilitated the exploration of parameter-rich evolutionary models. At the same time, stochastic models have become more realistic (and complex) and have been extended to new types of data, such as morphology. Based on this foundation, we developed a Bayesian MCMC approach to the analysis of combined data sets and explored its utility in inferring relationships among gall wasps based on data from morphology and four genes (nuclear and mitochondrial, ribosomal and protein coding). Examined models range in complexity from those recognizing only a morphological and a molecular partition to those having complex substitution models with independent parameters for each gene. Bayesian MCMC analysis deals efficiently with complex models: convergence occurs faster and more predictably for complex models, mixing is adequate for all parameters even under very complex models, and the parameter update cycle is virtually unaffected by model partitioning across sites. Morphology contributed only 5% of the characters in the data set but nevertheless influenced the combined-data tree, supporting the utility of morphological data in multigene analyses. We used Bayesian criteria (Bayes factors) to show that process heterogeneity across data partitions is a significant model component, although not as important as among-site rate variation. More complex evolutionary models are associated with more topological uncertainty and less conflict between morphology and molecules. Bayes factors sometimes favor simpler models over considerably more
A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data
 Syst. Biol
, 2004
Cited by 136 (3 self)
Abstract.—We describe a general likelihood-based ‘mixture model’ for inferring phylogenetic trees from gene-sequence or other character-state data. The model accommodates cases in which different sites in the alignment evolve in qualitatively distinct ways, but does not require prior knowledge of these patterns or partitioning of the data. We call this qualitative variability in the pattern of evolution across sites “pattern-heterogeneity” to distinguish it from both a homogenous process of evolution and from one characterized principally by differences in rates of evolution. We present studies to show that the model correctly retrieves the signals of pattern-heterogeneity from simulated gene-sequence data, and we apply the method to protein-coding genes and to a ribosomal 12S data set. The mixture model outperforms conventional partitioning in both these data sets. We implement the mixture model such that it can simultaneously detect rate- and pattern-heterogeneity. The model simplifies to a homogeneous model or a rate-variability model as special cases, and therefore always performs at least as well as these two approaches, and often considerably improves upon them. We make the model available within a Bayesian Markov-chain Monte Carlo framework for phylogenetic inference, as an easy-to-use computer program. [Bayesian inference; MCMC; mixture model; phylogeny; rate-heterogeneity; secondary structure; sequence evolution] The conventional likelihood-based approach to inferring phylogenetic trees from aligned gene-sequence or other data is to apply a single substitutional model to
Computing Bayes factors using thermodynamic integration
 Syst Biol
Cited by 111 (7 self)
Abstract.—In the Bayesian paradigm, a common method for comparing two models is to compute the Bayes factor, defined as the ratio of their respective marginal likelihoods. In recent phylogenetic works, the numerical evaluation of marginal likelihoods has often been performed using the harmonic mean estimation procedure. In the present article, we propose to employ another method, based on an analogy with statistical physics, called thermodynamic integration. We describe the method, propose an implementation, and show on two analytical examples that this numerical method yields reliable estimates. In contrast, the harmonic mean estimator leads to a strong overestimation of the marginal likelihood, which is all the more pronounced as the model is higher dimensional. As a result, the harmonic mean estimator systematically favors more parameter-rich models, an artefact that might explain some recent puzzling observations, based on harmonic mean estimates, suggesting that Bayes factors tend to overscore complex models. Finally, we apply our method to the comparison of several alternative models of amino-acid replacement. We confirm our previous observations, indicating that modeling pattern heterogeneity across sites tends to yield better models than standard empirical matrices. [Bayes factor; harmonic mean; mixture model; path sampling; phylogeny; thermodynamic integration.] Bayesian methods have become popular in molecular phylogenetics over the recent years. The simple and intuitive interpretation of the concept of probabilities
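The idea behind thermodynamic integration can be sketched on a toy problem. This is an illustration under stated assumptions, not the implementation proposed in the article. It relies on the path-sampling identity ln Z = ∫₀¹ E_β[ln L] dβ, where the expectation is taken under the power posterior proportional to L^β × prior. The model is an assumption made for the example: data y_i ~ Normal(θ, 1) with prior θ ~ Normal(0, 1), chosen because the expectation has a closed form and the result can be checked against the exact marginal likelihood. In a real analysis the expectation at each β would be estimated from MCMC samples. All function names and data values are hypothetical.

```python
import math

def expected_log_lik(beta, y):
    """E[ln L] under the power posterior (likelihood**beta * prior).

    Assumed toy model: y_i ~ Normal(theta, 1), theta ~ Normal(0, 1).
    The power posterior is then Normal(mean, var) in closed form.
    """
    n = len(y)
    var = 1.0 / (1.0 + beta * n)
    mean = beta * sum(y) * var
    # E[(y_i - theta)^2] = (y_i - mean)^2 + var
    sq = sum((yi - mean) ** 2 for yi in y)
    return -0.5 * n * math.log(2.0 * math.pi) - 0.5 * (sq + n * var)

def thermodynamic_integration(y, steps=1000):
    """Trapezoid-rule estimate of ln Z = integral over beta in [0, 1]."""
    h = 1.0 / steps
    vals = [expected_log_lik(i * h, y) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def exact_log_marginal(y):
    """Exact ln Z for the same toy model, used only as a check."""
    n = len(y)
    s = sum(y)
    ss = sum(yi * yi for yi in y)
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * math.log(1.0 + n)
            - 0.5 * (ss - s * s / (1.0 + n)))

y = [0.5, 1.2, -0.3, 0.8]  # hypothetical observations
log_z_ti = thermodynamic_integration(y)
log_z_exact = exact_log_marginal(y)
print(log_z_ti, log_z_exact)
```

A Bayes factor between two models would then be the difference of two such ln Z estimates, one per model.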
The Importance of Proper Model Assumption in Bayesian Phylogenetics
, 2004
Cited by 49 (4 self)
We studied the importance of proper model assumption in the context of Bayesian phylogenetics by examining >5,000 Bayesian analyses and six nested models of nucleotide substitution. Model misspecification can strongly bias bipartition posterior probability estimates. These biases were most pronounced when rate heterogeneity was ignored. The type of bias seen at a particular bipartition appeared to be strongly influenced by the lengths of the branches surrounding that bipartition. In the Felsenstein zone, posterior probability estimates of bipartitions were biased when the assumed model was underparameterized but were unbiased when the assumed model was overparameterized. For the inverse Felsenstein zone, however, both underparameterization and overparameterization led to biased bipartition posterior probabilities, although the bias caused by overparameterization was less pronounced and disappeared with increased sequence length. Model parameter estimates were also affected by model misspecification. Underparameterization caused a bias in some parameter estimates, such as branch lengths and the gamma shape parameter, whereas overparameterization caused a decrease in the precision of some parameter estimates. We caution researchers to assure that the most appropriate model is assumed by employing both a priori model choice methods and a posteriori model adequacy tests. [Bayesian phylogenetic inference; convergence; Markov chain Monte Carlo; maximum likelihood; model choice; posterior probability.] Model choice is becoming a critical issue as the number of available models of nucleotide evolution increases rapidly. Recent studies have shown that adequate
Data exploration in phylogenetic inference: scientific, heuristic, or neither
 CLADISTICS
, 2003
Does choice in model selection affect maximum likelihood analysis
 Systematic Biology
Cited by 28 (0 self)
Abstract.—In order to have confidence in model-based phylogenetic analysis, the model of nucleotide substitution adopted must be selected in a statistically rigorous manner. Several model-selection methods are applicable to maximum likelihood (ML) analysis, including the hierarchical likelihood-ratio test (hLRT), Akaike information criterion (AIC), Bayesian information criterion (BIC), and decision theory (DT), but their performance relative to empirical data has not been investigated thoroughly. In this study, we use 250 phylogenetic data sets obtained from TreeBASE to examine the effects that choice in model selection has on ML estimation of phylogeny, with an emphasis on optimal topology, bootstrap support, and hypothesis testing. We show that the use of different methods leads to the selection of two or more models for ∼80% of the data sets and that the AIC typically selects more complex models than alternative approaches. Although ML estimation with different best-fit models results in incongruent tree topologies ∼50% of the time, these differences are primarily attributable to alternative resolutions of poorly supported nodes. Furthermore, topologies and bootstrap values estimated with ML using alternative statistically supported models are more similar to each other than to topologies and bootstrap values estimated with ML under the Kimura two-parameter (K2P) model or maximum parsimony (MP). In addition, Swofford-Olsen-Waddell-Hillis (SOWH) tests indicate that ML trees estimated with alternative best-fit models are usually not significantly different from each other when evaluated with the same model. However, ML trees estimated with statistically supported models are often significantly suboptimal to ML trees made with the K2P model when both are evaluated with
Data partitions and complex models in Bayesian analysis: the phylogeny of gymnophthalmid lizards
 Syst. Biol
, 2004
Cited by 27 (0 self)
Abstract.—Phylogenetic studies incorporating multiple loci, and multiple genomes, are becoming increasingly common. Coincident with this trend in genetic sampling, model-based likelihood techniques including Bayesian phylogenetic methods continue to gain popularity. Few studies, however, have examined model fit and sensitivity to such potentially heterogeneous data partitions within combined data analyses using empirical data. Here we investigate the relative model fit and sensitivity of Bayesian phylogenetic methods when alternative site-specific partitions of among-site rate variation (with and without autocorrelated rates) are considered. Our primary goal in choosing a best-fit model was to employ the simplest model that was a good fit to the data while optimizing topology and/or Bayesian posterior probabilities. Thus, we were not interested in complex models that did not practically affect our interpretation of the topology under study. We applied these alternative models to a four-gene data set including one protein-coding nuclear gene (c-mos), one protein-coding mitochondrial gene (ND4), and two mitochondrial rRNA genes (12S and 16S) for the diverse yet poorly known lizard family Gymnophthalmidae. Our results suggest that the best-fit model partitioned among-site rate variation separately among the c-mos, ND4, and 12S + 16S gene regions. We found this model yielded identical topologies to those from analyses based on the GTR+I+G model, but significantly changed posterior probability estimates of clade support. This partitioned model also produced more precise (less variable) estimates of posterior probabilities across generations of long Bayesian runs, compared to runs employing a GTR+I+G model estimated for the combined data. We use this three-way gamma partitioning in Bayesian analyses to
Optimal Data Partitioning and a Test Case for Ray-Finned Fishes (Actinopterygii) Based on Ten Nuclear Loci
 Syst Biol 57(4
, 2008
Cited by 27 (3 self)
Analysis and visualization of tree space
 Systematic Biology
, 2005
Cited by 24 (2 self)
Abstract.—We explored the use of multidimensional scaling (MDS) of tree-to-tree pairwise distances to visualize the relationships among sets of phylogenetic trees. We found the technique to be useful for exploring “tree islands” (sets of topologically related trees among larger sets of near-optimal trees), for comparing sets of trees obtained from bootstrapping and Bayesian sampling, for comparing trees obtained from the analysis of several different genes, and for comparing multiple Bayesian analyses. The technique was also useful as a teaching aid for illustrating the progress of a Bayesian analysis and as an exploratory tool for examining large sets of phylogenetic trees. We also identified some limitations to the method, including distortions of the multidimensional tree space into two dimensions through the MDS technique, and the definition of the MDS-defined space based on a limited sample of trees. Nonetheless, the technique is a useful approach for the analysis of large sets of phylogenetic trees. [Bayesian analysis; multidimensional scaling; phylogenetic analysis; tree space; visualization.] Systematists are often faced with the need to analyze a large collection of phylogenetic trees. These trees may represent a collection of equally parsimonious solutions to a phylogenetic problem, or a set of trees of similar likelihood, or a sampled set of trees from a Markov
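The input to MDS in such an analysis is a matrix of pairwise tree-to-tree distances. The sketch below shows one way to build that matrix. It assumes the Robinson-Foulds (symmetric-difference) distance on rooted trees, which this abstract does not specify. It also assumes trees are encoded as nested tuples of leaf names, a hypothetical representation chosen for the example. The resulting matrix would then be passed to any MDS routine.

```python
def clades(tree):
    """Return the set of nontrivial clades of a rooted nested-tuple tree."""
    found = set()

    def walk(node):
        # A leaf is a string; an internal node is a tuple of children.
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found

def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))

def distance_matrix(trees):
    """Symmetric matrix of pairwise distances, suitable as MDS input."""
    return [[rf_distance(a, b) for b in trees] for a in trees]

# Three hypothetical five-taxon trees.
trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    (("A", "B"), ("C", ("D", "E"))),
]
matrix = distance_matrix(trees)
print(matrix)  # [[0, 2, 2], [2, 0, 4], [2, 4, 0]]
```

Trees that share most clades end up close together in the matrix, which is what lets MDS place topologically related trees near one another in the plot.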