Results 1-10 of 39
B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis
, 2002
Cited by 37 (5 self)
Abstract: In this paper we discuss both the theoretical design principles underlying the B-Course tool, and the pragmatic methods adopted in the implementation of the software.
On discriminative Bayesian network classifiers and logistic regression
 Machine Learning
Cited by 24 (1 self)
Abstract: Discriminative learning of the parameters in the naive Bayes model is known to be equivalent to a logistic regression problem. Here we show that the same fact holds for much more general Bayesian network models, as long as the corresponding network structure satisfies a certain graph-theoretic property. The property holds for naive Bayes but also for more complex structures such as tree-augmented naive Bayes (TAN) as well as for mixed diagnostic-discriminative structures. Our results imply that for networks satisfying our property, the conditional likelihood cannot have local maxima, so that the global maximum can be found by simple local optimization methods. We also show that if this property does not hold, then in general the conditional likelihood can have local, non-global maxima. We illustrate our theoretical results by empirical experiments with local optimization in a conditional naive Bayes model. Furthermore, we provide a heuristic strategy for pruning the number of parameters and relevant features in such models. For many data sets, we obtain good results with heavily pruned submodels containing many fewer parameters than the original naive Bayes model.
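The naive Bayes end of this equivalence is easy to verify numerically: with Bernoulli features, the class posterior is exactly a sigmoid of a linear function of the feature values. A minimal sketch (the parameter values below are made up for illustration, not taken from the paper):

```python
import math

# Hypothetical two-class naive Bayes model with two Bernoulli features;
# theta[c][i] = P(x_i = 1 | C = c).
p_c1 = 0.3                       # class prior P(C = 1)
theta = {0: [0.2, 0.7], 1: [0.6, 0.9]}

def nb_posterior(x):
    """P(C = 1 | x) by direct application of Bayes' rule."""
    def joint(c):
        p = p_c1 if c == 1 else 1.0 - p_c1
        for t, xi in zip(theta[c], x):
            p *= t if xi == 1 else 1.0 - t
        return p
    return joint(1) / (joint(0) + joint(1))

def logistic_posterior(x):
    """The same quantity as a sigmoid of a linear function of x:
    log-odds = bias + one weight per feature."""
    z = math.log(p_c1 / (1.0 - p_c1))
    for t1, t0, xi in zip(theta[1], theta[0], x):
        if xi == 1:
            z += math.log(t1 / t0)
        else:
            z += math.log((1.0 - t1) / (1.0 - t0))
    return 1.0 / (1.0 + math.exp(-z))

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    assert abs(nb_posterior(x) - logistic_posterior(x)) < 1e-12
```

The paper's contribution is the converse direction at scale: characterizing which network structures beyond naive Bayes keep the conditional likelihood free of local maxima.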
On supervised selection of Bayesian networks
 In UAI-99
, 1999
Cited by 22 (6 self)
Abstract: Given a set of possible models (e.g., Bayesian network structures) and a data sample, in the unsupervised model selection problem the task is to choose the most accurate model with respect to the domain joint probability distribution. In contrast to this, in supervised model selection it is a priori known that the chosen model will be used in the future for prediction tasks involving more "focused" predictive distributions. Although focused predictive distributions can be produced from the joint probability distribution by marginalization, in practice the best model in the unsupervised sense does not necessarily perform well in supervised domains. In particular, the standard marginal likelihood score is a criterion for the unsupervised task, and, although frequently used for supervised model selection also, it does not perform well in such tasks. In this paper we study the performance of the marginal likelihood score empirically in supervised Bayesian network selection tasks by using a large number of publicly available classification data sets, and compare the results to those obtained by alternative model selection criteria, including empirical cross-validation methods, an approximation of a supervised marginal likelihood measure, and a supervised version of Dawid's prequential (predictive sequential) principle. The results demonstrate that the marginal likelihood score does not perform well for supervised model selection, while the best results are obtained by using Dawid's prequential approach.
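Dawid's prequential principle scores a model by the product of its one-step-ahead predictive probabilities. A sketch of the Bernoulli case (made-up data, not from the paper) shows why the criteria can only come apart in the supervised setting: for the joint, unsupervised score the sequential predictions multiply out to exactly the marginal likelihood, whereas the paper's supervised variant accumulates only the class predictions.

```python
from math import comb

# Made-up binary sample, scored under a Bernoulli model with uniform prior.
data = [1, 0, 1, 1, 0, 1, 1]

def prequential_score(xs):
    """Product of one-step-ahead Bayesian predictive probabilities."""
    p, ones, zeros = 1.0, 0, 0
    for x in xs:
        # Laplace's rule of succession = Bayesian predictive, uniform prior
        pred1 = (ones + 1) / (ones + zeros + 2)
        p *= pred1 if x == 1 else 1.0 - pred1
        ones += x
        zeros += 1 - x
    return p

def marginal_likelihood(xs):
    # Closed form of the integral of theta^k (1-theta)^(n-k) over [0, 1]
    n, k = len(xs), sum(xs)
    return 1.0 / ((n + 1) * comb(n, k))

assert abs(prequential_score(data) - marginal_likelihood(data)) < 1e-12
```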
Efficient Computation of Stochastic Complexity
 Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics
, 2003
Cited by 18 (11 self)
Abstract: Stochastic complexity of a data set is defined as the shortest possible code length for the data obtainable by using some fixed set of models. This measure is of great theoretical and practical importance as a tool for tasks such as model selection or data clustering. Unfortunately, computing the modern version of stochastic complexity, defined as the Normalized Maximum Likelihood (NML) criterion, requires computing a sum with an exponential number of terms. Therefore, in order to be able to apply the stochastic complexity measure in practice, it must in most cases be approximated. In this paper, we show that for some interesting and important cases with multinomial data sets, the exponentiality can be removed without loss of accuracy. We also introduce a new computationally efficient approximation scheme based on analytic combinatorics and assess its accuracy, together with that of earlier approximations, by comparing them to the exact form.
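The "exponential number of terms" is the sum of maximized likelihoods over all possible data sets of a given size. A sketch of the simplest reduction of this kind (a single Bernoulli variable, not the paper's algorithm): the naive sum ranges over all 2^n binary sequences, but sequences with the same count of ones share one maximized likelihood, so n + 1 terms suffice with no loss of accuracy.

```python
from itertools import product
from math import comb

def nml_normalizer(n):
    """NML normalizing sum for a Bernoulli model, grouped by counts."""
    total = 0.0
    for k in range(n + 1):
        # ML probability of any sequence with k ones; 0**0 == 1 in Python
        total += comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
    return total

def nml_normalizer_naive(n):
    """Same sum, term by term over all 2^n sequences."""
    total = 0.0
    for seq in product((0, 1), repeat=n):  # exponentially many terms
        k = sum(seq)
        total += (k / n) ** k * ((n - k) / n) ** (n - k)
    return total

assert abs(nml_normalizer(8) - nml_normalizer_naive(8)) < 1e-9
```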
BAYDA: Software for Bayesian Classification and Feature Selection
 Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98)
, 1998
Cited by 14 (9 self)
Abstract: BAYDA is a software package for flexible data analysis in predictive data mining tasks. The mathematical model underlying the program is based on a simple Bayesian network, the Naive Bayes classifier. It is well known that the Naive Bayes classifier performs well in predictive data mining tasks when compared to approaches using more complex models. However, the model makes strong independence assumptions that are frequently violated in practice. For this reason, the BAYDA software also provides a feature selection scheme which can be used for analyzing the problem domain, and for improving the prediction accuracy of the models constructed by BAYDA. The scheme is based on a novel Bayesian feature selection criterion introduced in this paper. The suggested criterion is inspired by the Cheeseman-Stutz approximation for computing the marginal likelihood of Bayesian networks with hidden variables. The empirical results with several widely used data sets demonstrate that the automated Bayesian...
When Ignorance is Bliss
 UAI 2004
, 2004
Cited by 13 (4 self)
Abstract: It is commonly accepted wisdom that more information is better, and that information should never be ignored. Here we argue, using both a Bayesian and a non-Bayesian analysis, that in some situations you are better off ignoring information if your uncertainty is represented by a set of probability measures. These include situations in which the information is relevant for the prediction task at hand. In the non-Bayesian analysis, we show how ignoring information avoids dilation, the phenomenon that additional pieces of information sometimes lead to an increase in uncertainty. In the Bayesian analysis, we show that for small sample sizes and certain prediction tasks, the Bayesian posterior based on a noninformative prior yields worse predictions than simply ignoring the given information.
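Dilation is easy to exhibit concretely. The following is a standard example (not necessarily the one analyzed in the paper): X is a fair coin, Z an independent coin with unknown bias q, and we observe Y = X XOR Z; uncertainty about q is represented by a set of measures, one per value of q.

```python
def p_x1_given_y1(q):
    # P(X=1, Y=1) = P(X=1) P(Z=0) = 0.5 (1-q);  P(Y=1) = 0.5 for every q
    return 0.5 * (1.0 - q) / 0.5

qs = [i / 10 for i in range(11)]           # grid over the set of measures
prior = {0.5}                              # P(X=1) = 1/2 under every q
posterior = {p_x1_given_y1(q) for q in qs}

# Observing Y dilates the point prior {1/2} to the whole interval [0, 1]:
assert min(posterior) == 0.0 and max(posterior) == 1.0
```

Before seeing Y every measure in the set agrees that P(X=1) = 1/2; after seeing Y they disagree maximally, so ignoring Y leaves you with a strictly tighter probability interval.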
Bayes Optimal Instance-Based Learning
 Machine Learning: ECML-98, Proceedings of the 10th European Conference, Volume 1398 of Lecture ...
, 1998
Cited by 9 (2 self)
Abstract: In this paper we present a probabilistic formalization of the instance-based learning approach. In our Bayesian framework, moving from the construction of an explicit hypothesis to a data-driven instance-based learning approach is equivalent to averaging over all the (possibly infinitely many) individual models. The general Bayesian instance-based learning framework described in this paper can be applied with any set of assumptions defining a parametric model family, and to any discrete prediction task where the number of simultaneously predicted attributes is small, which includes for example all classification tasks prevalent in the machine learning literature. To illustrate the use of the suggested general framework in practice, we show how the approach can be implemented in the special case with the strong independence assumptions underlying the so-called Naive Bayes classifier. The resulting Bayesian instance-based classifier is validated empirically with public domain data sets...
Calculating the normalized maximum likelihood distribution for Bayesian forests
 in Proc. IADIS International Conference on Intelligent Systems and Agents
, 2007
Cited by 8 (6 self)
Abstract: When learning Bayesian network structures from sample data, an important issue is how to evaluate the goodness of alternative network structures. Perhaps the most commonly used model (class) selection criterion is the marginal likelihood, which is obtained by integrating over a prior distribution for the model parameters. However, the problem of determining a reasonable prior for the parameters is a highly controversial issue, and no completely satisfying Bayesian solution has yet been presented in the noninformative setting. The normalized maximum likelihood (NML), based on Rissanen's information-theoretic MDL methodology, offers an alternative, theoretically solid criterion that is objective and noninformative, and requires no parameter prior. It has been previously shown that for discrete data, this criterion can be computed in linear time for Bayesian networks with no arcs, and in quadratic time for the so-called Naive Bayes network structure. Here we extend the previous results by showing how to compute the NML criterion in polynomial time for tree-structured Bayesian networks. The order of the polynomial depends on the number of values of the variables, but neither on the number of variables itself, nor on the sample size.
Asymptotic log-loss of prequential maximum likelihood codes
 In Conference on Learning Theory (COLT 2005)
, 2005
Cited by 8 (6 self)
Abstract: We analyze the Dawid-Rissanen prequential maximum likelihood codes relative to one-parameter exponential family models M. If data are i.i.d. according to an (essentially) arbitrary P, then the redundancy grows at rate (c/2) ln n. We show that c = σ₁²/σ₂², where σ₁² is the variance of P, and σ₂² is the variance of the distribution M* ∈ M that is closest to P in KL divergence. This shows that prequential codes behave quite differently from other important universal codes such as the two-part MDL, Shtarkov, and Bayes codes, for which c = 1. This behavior is undesirable in an MDL model selection setting.
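The c = σ₁²/σ₂² rate can be sanity-checked in the simplest misspecified setting. This is a back-of-envelope sketch (expected values only, made-up variances, not the paper's proof): a Gaussian location model with fixed variance σ₂² is fed i.i.d. data of true variance σ₁²; the plug-in code predicts N(running mean, σ₂²), and its expected excess log-loss at step i over the best fixed mean works out to σ₁² / (2 σ₂² (i-1)) nats, so the cumulative redundancy is (c/2) times a harmonic sum, i.e. roughly (c/2) ln n.

```python
import math

sigma1_sq, sigma2_sq = 3.0, 1.5   # made-up data and model variances
c = sigma1_sq / sigma2_sq

def expected_redundancy(n):
    """Sum of per-step expected excess log-losses of the plug-in code.
    Step 1 has no estimate yet, so the sum starts at i = 2."""
    return sum(sigma1_sq / (2.0 * sigma2_sq * (i - 1)) for i in range(2, n + 1))

n = 100_000
rate = expected_redundancy(n) / ((c / 2.0) * math.log(n))
assert 0.9 < rate < 1.1   # harmonic sum ~ ln n, so the ratio tends to 1
```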