Results 1 - 10 of 1,826
Bayesian Data Analysis
, 1995
"... I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng and Ste ..."
Abstract - Cited by 2194 (63 self)
I actually own a copy of Harold Jeffreys’s Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin and Rousseau as a platform for further discussion of foundational issues. In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys’s principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys’s preference for simplicity; and (3) a key generalization of Jeffreys’s ideas is to explicitly include model checking in the process of data analysis.
Bayesian Model Selection in Social Research (with Discussion by Andrew Gelman & Donald B. Rubin, and Robert M. Hauser, and a Rejoinder)
- SOCIOLOGICAL METHODOLOGY 1995, EDITED BY PETER V. MARSDEN, CAMBRIDGE, MASS.: BLACKWELLS.
, 1995
"... It is argued that P-values and the tests based upon them give unsatisfactory results, especially in large samples. It is shown that, in regression, when there are many candidate independent variables, standard variable selection procedures can give very misleading results. Also, by selecting a singl ..."
Abstract - Cited by 585 (21 self)
It is argued that P-values and the tests based upon them give unsatisfactory results, especially in large samples. It is shown that, in regression, when there are many candidate independent variables, standard variable selection procedures can give very misleading results. Also, by selecting a single model, they ignore model uncertainty and so underestimate the uncertainty about quantities of interest. The Bayesian approach to hypothesis testing, model selection, and accounting for model uncertainty is presented. Implementing this is straightforward using the simple and accurate BIC approximation, and it can be done using the output from standard software. Specific results are presented for most of the types of model commonly used in sociology. It is shown that this approach overcomes the difficulties with P-values and standard model selection procedures based on them. It also allows easy comparison of non-nested models, and permits the quantification of the evidence for a null hypothesis...
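The BIC shortcut the abstract refers to is easy to demonstrate. The following is a minimal sketch, not the paper's code: it assumes statsmodels, uses synthetic data, and the model names M0/M1 are illustrative. Raftery's approximation takes the difference in BIC between two models as roughly twice the log Bayes factor.

```python
# Hedged sketch: approximate a Bayes factor from BIC, per the abstract's claim
# that BIC-based comparison needs only standard regression output.
# Assumptions: statsmodels, synthetic data, illustrative model names M0/M1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=(2, n))
y = 1.0 + 0.5 * x1 + rng.normal(size=n)          # x2 is irrelevant by design

m0 = sm.OLS(y, sm.add_constant(np.column_stack([x1]))).fit()       # M0: x1
m1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()   # M1: x1+x2

# With BIC = -2 log L + k log n (as statsmodels reports it),
# 2 * log(Bayes factor of M0 over M1) is approximately BIC(M1) - BIC(M0).
print(f"BIC(M0)={m0.bic:.1f}  BIC(M1)={m1.bic:.1f}  "
      f"2*logBF(M0 vs M1) ~ {m1.bic - m0.bic:.2f}")
```

A positive difference favors the simpler model M0, which also quantifies evidence for a null hypothesis, as the abstract notes.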
Model-Based Clustering, Discriminant Analysis, and Density Estimation
- JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
, 2000
"... Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures and most clustering methods available in commercial software are also of this type. However, there is little ..."
Abstract - Cited by 573 (29 self)
Cluster analysis is the automated search for groups of related observations in a data set. Most clustering done in practice is based largely on heuristic but intuitively reasonable procedures, and most clustering methods available in commercial software are also of this type. However, there is little systematic guidance associated with these methods for solving important practical questions that arise in cluster analysis, such as "How many clusters are there?", "Which clustering method should be used?" and "How should outliers be handled?". We outline a general methodology for model-based clustering that provides a principled statistical approach to these issues. We also show that this can be useful for other problems in multivariate analysis, such as discriminant analysis and multivariate density estimation. We give examples from medical diagnosis, minefield detection, cluster recovery from noisy data, and spatial density estimation. Finally, we mention limitations of the methodology...
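The "how many clusters?" question from the abstract can be sketched with off-the-shelf tools. This is a minimal illustration under stated assumptions, not the authors' own software: it uses scikit-learn's GaussianMixture on synthetic two-dimensional data and compares BIC over a candidate range of 1 to 6 components.

```python
# Hedged sketch: choose the number of clusters by BIC over Gaussian mixtures.
# Assumptions: scikit-learn (not the authors' implementation), synthetic data
# with three true clusters, candidate range 1..6.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 4])])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 7)}
best_k = min(bic, key=bic.get)   # scikit-learn's BIC: lower is better
print(bic, "-> chosen number of clusters:", best_k)
```

Running the same comparison across covariance structures (scikit-learn's covariance_type parameter) mirrors the paper's joint answer to "how many clusters?" and "which method?".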
Genomic control for association studies
, 1999
"... A dense set of single nucleotide polymorphisms (SNP) covering the genome and an efficient method to assess SNP genotypes are expected to be available in the near future. An outstanding question is how to use these technologies efficiently to identify genes affecting liability to complex disorders. ..."
Abstract - Cited by 480 (13 self)
A dense set of single nucleotide polymorphisms (SNPs) covering the genome and an efficient method to assess SNP genotypes are expected to be available in the near future. An outstanding question is how to use these technologies efficiently to identify genes affecting liability to complex disorders. To achieve this goal, we propose a statistical method that has several optimal properties: It can be used with case-control data and yet, like family-based designs, controls for population heterogeneity; it is insensitive to the usual violations of model assumptions, such as cases failing to be strictly independent; and, by using Bayesian outlier methods, it circumvents the need for Bonferroni correction for multiple tests, leading to better performance in many settings while still constraining risk for false positives. The performance of our genomic control method is quite good for plausible effects of liability genes, which bodes well for future genetic analyses of complex disorders.
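The core adjustment behind genomic control can be shown compactly. The sketch below is an illustration under stated assumptions (scipy, simulated 1-df chi-square association statistics inflated by a constant factor): it estimates the inflation factor lambda from the median statistic and rescales, and it omits the paper's Bayesian outlier machinery.

```python
# Hedged sketch of the genomic-control rescaling: estimate lambda from the
# median of 1-df chi-square statistics and deflate them. Assumptions: scipy,
# simulated statistics with constant inflation standing in for stratification.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
chi2_obs = 1.3 * rng.chisquare(df=1, size=5000)   # inflated null statistics

median_null = stats.chi2.ppf(0.5, df=1)           # about 0.4549
lam = max(np.median(chi2_obs) / median_null, 1.0) # lambda > 1 means inflation
p_adj = stats.chi2.sf(chi2_obs / lam, df=1)       # corrected p-values
print(f"estimated inflation factor lambda = {lam:.3f}")
```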
How many clusters? Which clustering method? Answers via model-based cluster analysis
- THE COMPUTER JOURNAL
, 1998
"... ..."
Has the U.S. Economy Become More Stable? A Bayesian Approach Based on a Markov-Switching Model of the Business Cycle
, 1999
"... We hope to be able to provide answers to the following questions: 1) Has there been a structural break in postwar U.S. real GDP growth toward more stabilization? 2) If so, when would it have been? 3) What's the nature of the structural break? For this purpose, we employ a Bayesian approach to d ..."
Abstract - Cited by 426 (15 self)
We hope to be able to provide answers to the following questions: 1) Has there been a structural break in postwar U.S. real GDP growth toward more stabilization? 2) If so, when would it have been? 3) What is the nature of the structural break? For this purpose, we employ a Bayesian approach to dealing with a structural break at an unknown changepoint in a Markov-switching model of the business cycle. Empirical results suggest that there has been a structural break in U.S. real GDP growth toward more stabilization, with the posterior mode of the break date around 1984:1. Furthermore, we find that a narrowing gap between growth rates during recessions and booms is at least as important as a decline in the volatility of shocks.
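To make "a structural break at an unknown changepoint" concrete, here is a deliberately simplified sketch: a single variance break in a Gaussian series stands in for the paper's full Bayesian Markov-switching model, with a flat prior over break dates and segment parameters profiled out by maximum likelihood (an illustrative shortcut, not the authors' sampler).

```python
# Heavily simplified sketch: posterior over an unknown break date for a
# variance reduction in a Gaussian series. Assumptions: synthetic data, flat
# prior on the break date, ML plug-in for segment parameters; this is not the
# paper's Markov-switching Gibbs sampler.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(0.8, 1.2, 150),    # volatile regime
                    rng.normal(0.8, 0.5, 100)])   # stabilized regime

def seg_loglik(seg):
    return stats.norm.logpdf(seg, seg.mean(), seg.std()).sum()

taus = np.arange(20, len(y) - 20)                 # candidate break dates
ll = np.array([seg_loglik(y[:t]) + seg_loglik(y[t:]) for t in taus])
post = np.exp(ll - ll.max())
post /= post.sum()                                # normalized over candidates
print("posterior mode of break date:", taus[post.argmax()])
```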
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract - Cited by 408 (0 self)
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique...
Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests
, 2004
"... Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the sel ..."
Abstract - Cited by 407 (8 self)
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical, and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or non-nested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
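The model-averaging step rests on Akaike weights, which are simple to compute. A minimal sketch follows; the AIC values are made-up placeholders, not results from the paper's Ohomopterus analysis.

```python
# Hedged sketch of Akaike weights for model averaging: each candidate model
# gets weight exp(-delta_i / 2), normalized over the set. The AIC values
# below are illustrative placeholders.
import numpy as np

aic = np.array([2041.3, 2039.8, 2044.1, 2040.2])  # candidate substitution models
delta = aic - aic.min()                           # AIC differences
w = np.exp(-0.5 * delta)
w /= w.sum()                                      # Akaike weights sum to 1
for i, (d, wi) in enumerate(zip(delta, w)):
    print(f"model {i}: delta AIC = {d:.2f}, Akaike weight = {wi:.3f}")
# A model-averaged estimate of a parameter is then sum_i w[i] * theta_hat[i],
# with theta_hat[i] the estimate under model i.
```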
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
"... Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he ..."
Abstract - Cited by 373 (28 self)
Scoring rules assess the quality of probabilistic forecasts by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ≠ F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to cross-validation, and propose a novel form of cross-validation known as random-fold cross-validation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile...
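Two of the scores named in the abstract have convenient closed forms for a Gaussian predictive distribution. The sketch below, assuming scipy and made-up forecast values, computes the logarithmic score and the CRPS for N(mu, sigma^2); both are written here in negative orientation, so smaller is better.

```python
# Hedged sketch: logarithmic score and closed-form CRPS for a Gaussian
# predictive distribution N(mu, sigma^2). Assumptions: scipy; mu, sigma, and
# the observation are made-up values. Scores are negatively oriented here.
import numpy as np
from scipy import stats

def log_score(mu, sigma, y):
    return -stats.norm.logpdf(y, mu, sigma)       # smaller is better

def crps_gaussian(mu, sigma, y):
    # Closed form for the normal case, with z the standardized observation.
    z = (y - mu) / sigma
    return sigma * (z * (2 * stats.norm.cdf(z) - 1)
                    + 2 * stats.norm.pdf(z) - 1 / np.sqrt(np.pi))

y_obs = 1.7
print(f"log score: {log_score(0.0, 1.0, y_obs):.4f}")
print(f"CRPS:      {crps_gaussian(0.0, 1.0, y_obs):.4f}")
```

Propriety means a forecaster minimizes the expected value of either score only by reporting his or her true predictive distribution.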