| Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63--73. |
....50.53 44.25 93.25 86.78 84.80 average 81.16 75.67 75.44 83.38 83.09 83.77 From a different point of view one can also argue that the LADTree and AdaBoost.MH methods are the first direct induction methods for multiclass option trees, a hitherto unsolved problem. Previous attempts [4, 12] were plagued by the need to specify multiple parameters, and also seemed to contradict each other in their conclusion of why and where in a tree options (i.e. alternatives) were beneficial. Contrary to these attempts, the LADTree and AdaBoost.MH methods have only a single parameter, the final ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....models, organized by conditional independence relationships. Examples of classification regression models that produce probabilistic outputs include linear regression, generalized linear regression, probabilistic neural networks (e.g. MacKay, 1992a, 1992b), probabilistic decision trees (e.g. Buntine, 1993; Friedman and Goldszmidt, 1996), kernel density estimation methods (Book, 1994), and dictionary methods (Friedman, 1995). In principle, any of these forms can be used to learn probabilities in a Bayesian network; and, in most cases, Bayesian techniques for learning are available. Nonetheless, the ....
Buntine, W. (1993). Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and statistics III. Chapman and Hall, New York.
....data may contain errors (Often called noisy training data) Have training data may contain missing attribute values Decision tree learners can accommodate for data that has missing attribute values. For a detailed comparison of the appropriateness of several types of tree classifiers see [26]. 12 2.4 Association rules Association rule discovery [12] methods learn relations between variables within a dataset. An association rule consists of two sets of items called the antecedent and consequent. It indicates that a relationship exists between the two sets, such that the occurrence ....
Buntine, W. Learning Classification Trees. in Artificial Intelligence Frontiers in Statistics. 1993. London: Chapman & Hall.
....in the Philosophy of Science is concerned with the scientific method and considers the basis for accepting some theories and rejecting others. Each of these disciplines has its own terminology for describing the problem. I attempt to summarize some of the terms in tables 1.1, 1.2, and 1.3 [86, 33, 22, 77]. Note that terminology varies with sub disciplines of each of these fields. Even the names of the sub disciplines are changing. For example, we now have data mining and knowledge discovery in databases (KDD). In this thesis, I use terms from these fields interchangeably. For example, I use the ....
W. Buntine. Learning classification trees. In D.J. Hand, editor, Artificial Intelligence frontiers in statistics, pages 182--201. Chapman & Hall,London, 1993.
....k 2 ) Det # #I # k 1 # by taking product of diagonal terms when I # k 1 is taken as a submatrix element. Notation The symbols listed here are used throughout the thesis. Less obvious symbols are also explained when first used. Some of the following definitions are based on those in [32]. k : This function of n and k is pronounced n choose k and is a commonly occurring function in combinatorics. It is given by k (n k) # : The parameter, or hypothesis, space. It can be multidimensional. E x (c(x) The expected value of c(x) according to the default distribution for ....
W. Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....models, organized by conditional independence relationships. Examples of classification regression models that produce probabilistic outputs include linear regression, generalized linear regression, probabilistic neural networks (e.g. MacKay, 1992a, 1992b), probabilistic decision trees (e.g. Buntine, 1993; Friedman and Goldszmidt, 1996), kernel density estimation methods (Book, 1994), and dictionary methods (Friedman, 1995). In principle, any of these forms can be used to learn probabilities in a Bayesian network; and, in most cases, Bayesian techniques for learning are available. Nonetheless, the ....
Buntine, W. (1993). Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and statistics III. Chapman and Hall, New York.
....node until a stopping condition is satisfied and the node is declared as a leaf node on a majority vote. In building a decision tree classifier, there is a risk of memorizing the training data, in the sense that nodes near the bottom of the tree represent the noise in the sample. As mentioned in [3], some methods were employed to make better classification. We used two methods [12] to eliminate data over fitting in decision tree classifier. To further improve the zone classification result, we want to make use of context constraint in some zone set. We model context constraint as a Markov ....
W. Buntine. Learning classification trees. Statistics and Computing journal, pages 63--76, 1992.
....tree, where a class is assigned to it. 7.6.2 Eliminating Data Over fitting in Decision Tree Classifier In building a decision tree classifier, there is a risk of memorizing the training data, in the sense that nodes near the bottom of the tree represent the noise in the sample. As mentioned in [8], some methods were employed to make better class probability, such as building multiple trees and use the benefits of averaging, approximate significance tests, etc. We used two simple methods to reduce the data over fitting in the trained decision tree. In Figure 7.7, there is a node with its ....
W. Buntine. Learning classification trees. Statistics and Computing journal, pages 63-76, 1992.
....how boosting can be used as a method for learning ADTrees from data (Kearns and Mansour in [9] analyze decision tree learning algorithms as boosting algorithms. Their work suggests an algorithm similar to the one presented here). ADTrees are similar to option trees first described by Buntine in [3] and further developed by Kohavi et al. in [10]. Option trees were shown to provide significant improvements in classification error compared to single decision trees. The results reported in [10] are comparable to bagged decision trees. Our goal here is to learn a structure similar to option ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63-73, 1992.
....induction methods, focusing again on global accuracy improvements. This has led to variations on the mechanism used to generate alternative trees and on the schemes used to aggregate their predictions. The first well known work in this context concerns the Bayesian option trees proposed by Buntine [12], where several trees are maintained in a compact data structure, and a Bayesian scheme is used to determine a posteriori probabilities in order to weight the predictions of these trees. More recently, so called tree bagging and boosting methods were proposed respectively by Breiman [5] and Freund ....
W. Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
.... bound techniques to convert decision tables to optimal trees, see [338] Tree construction using partial or exhaustive lookahead has been considered in statistics [139, 122] in pattern recognition [197] for tree structured vector quantizers [410] for Bayesian class probability trees [62], for neural trees [102] and in machine learning [365,403, 354] Most of these studies indicate that lookahead does not cause considerable improvements over greedy induction. Murthy and Salzberg [354] argued that one level lookahead does not help build significantly better trees, and that ....
....On the contrary, class probability trees assign a probability distribution for all classes at the terminal nodes. Breiman et al. ([44], Chapter 4) proposed a method for building class probability trees. Quinlan discussed methods of extracting probabilities from decision trees in [397]. Buntine [62] described Bayesian methods for building, smoothing and averaging class probability trees. Smoothing in the context of tree structured vector quantizers is described in [17]. An approach, which refines the class probability estimates in a greedily induced decision tree using local kernel ....
[Article contains additional citation context not shown here]
WRAY BUNTINE. Learning classification trees. Statistics and Computing, 2:63-73, 1992.
....implementing them listed below. 3.1 Decision Trees Decision trees are perhaps the most widely studied inductive learning models in the machine learning community. The literature abounds with papers proposing new models or variations of existing models and case studies using decision trees ([14, 21, 22, 25, 30, 34, 22 40, 43, 49, 50, 51, 53, 89, 93, 98, 99, 100, 101, 102, 104, 105, 106, 107, 109, 110, 111, 112, 113, 114, 118, 120, 123, 126, 129, 130, 131, 133, 134, 136]) For this case study, we use decision tree software from Quinlan and Buntine. Quinlan introduces decision trees and illustrates the use of his C4.5 software for decision trees (c4.5tree) and production rules derived therefrom (c4.5rule) in [105] Several decision tree algorithms (cart, id3, c4, ....
....from Quinlan and Buntine. Quinlan introduces decision trees and illustrates the use of his C4.5 software for decision trees (c4.5tree) and production rules derived therefrom (c4.5rule) in [105]. Several decision tree algorithms (cart, id3, c4, minimum message length) are described by Buntine in [26, 27, 29, 30]. The use of the Buntine Caruana IND v2.1 decision tree software available from NASA's Cosmic facility is explained in [31]. 3.2 Nearest Neighbor Nearest neighbor models or instance based learners are described in [1, 2, 3, 4, 5, 7, 9, 10, 13, 16, 18, 41, 54, 70, 71, 72, 81]. David Aha provided ....
Wray Buntine (1992). Learning Classification Trees. Statistics and Computing 2. 63-73.
....tree that does not over fit the data [17, 9]. For trees with probabilities at leaves, an alternative is to construct a weighted mixture of the subtrees of the original over fit tree. It is possible to construct a concise representation of a weighting over exponentially many different subtrees [3, 15, 8]. Weighted model mixtures are also widely used in constructing algorithms with on line guarantees. In particular, the weighted majority algorithm and its variants can be proved to compete well with the best expert [11, 4, 6, 5]. The weighting used in on line weighted majority algorithms is ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....is reached. Several methods have been proposed for inducing decision trees, including the CART algorithm of Breiman et al. [15], Quinlan's ID3 algorithm [129], and its more recent incarnation as C4.5 [130]. Bayesian methods for constructing decision trees have also been proposed by Buntine [16]. Our treatment here focuses on the C4.5 algorithm since it is the most widely used tree induction method in the machine learning community. We also use this algorithm directly in some of our subsequent experiments. C4.5 induces a decision tree in a greedy divide and conquer fashion. As the tree ....
Buntine, W. Learning classification trees. Statistics and Computing 2 (June 1992), 63--73.
....tree that does not over fit the data [19, 10] For trees with probabilities at leaves, an alternative is to construct a weighted mixture of the subtrees of the original over fit tree. It is possible to construct a concise representation of a weighting over exponentially many different subtrees [3, 17, 9]. This paper is about stochastic model selection algorithms that stochastically select a model according to a posterior distribution on the models. Stochastic model selection seems intermediate between model selection and model averaging like model averaging it is based on a posterior ....
....(continuous) concept classes. Even for countable classes theorem 1 can lead to a better guarantee than lemma 1 if the posterior Q is spread over exponentially many different models having similar empirical error rates. This might occur, for example, in mixtures of decision trees as constructed in [3, 17, 9]. The second main result of this paper is that the posterior distribution minimizing the error rate bound given in theorem 1 is a Gibbs distribution. For any value of fi 0 we define Q fi to be the posterior distribution defined as follows where Z is a normalizing constant. dQ fi (c) 1 Z ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....model uncertainty and has been often used in some form or another in the machine learning community. In [16] BMA is applied to Naive Bayes, and it is shown that it improves both classification accuracy and the quality of the probability estimates. In [1, 5] it is applied to rule induction and in [3] to decision tree induction, in both cases leading to good results. 4.2 Local Bayesian Model Averaging In practice, the usage of BMA presents some problems, coming from: • The computational cost of calculating Equation 9. • The difficulty in the specification of P(M), the prior ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....First, we apply Cesa-Bianchi et al.'s [3] results on predicting using expert advice (where we view each pruning as an expert) to obtain an algorithm that has provably low prediction loss, but that is computationally infeasible. Next, we generalize and apply a method developed by Buntine [2, 1] and Willems, Shtarkov and Tjalkens [18, 19] to derive a very efficient implementation of this procedure. 1 Introduction Many algorithms for inferring a decision tree from data, such as C4.5 [11], involve a two step process: In the first step, a very large decision tree is grown to match the ....
....regarding the source of the data that is being observed. Thus, the resulting algorithm is very robust. A naive implementation of this procedure would require computation time linear in the number of prunings of T; obviously, this is infeasible. However, we show how techniques used by Buntine [2, 1] and Willems, Shtarkov and Tjalkens [18, 19] can be generalized and applied to our setting, yielding a very efficient implementation requiring computation time at each trial t that is linear in the length of the path defined by the instance x_t in the tree T (and therefore is bounded by the ....
[Article contains additional citation context not shown here]
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....the machine learning community and represent two completely different approaches to learning, hence we hope that our results are of a general nature and will generalize to other induction algorithms. Decision trees have been well documented in Quinlan (1993), Breiman et al. (1984), Fayyad (1991), Buntine (1992), and Moret (1982); hence we will not describe them in detail. The Naive Bayes algorithm is explained below. The specific details are not essential for the rest of the paper. The C4.5 algorithm (Quinlan 1993) is a descendant of ID3 (Quinlan 1986) which builds decision trees top-down and prunes ....
....resort to heuristic search. Recently, aggregation techniques, sometimes called stacking, have been advocated by many people in machine learning, neural networks, and Statistics (Wolpert 1992b, Breiman 1994, Freund Schapire 1995, Schapire 1990, Freund 1990, Perrone 1993, Krogh Vedelsby 1995, Buntine 1992, Kwok Carter 1990) It is possible to build many models, each one with a different parameter setting or with a different feature subset, and let them vote on the class. Aggregation techniques reduce the variance of the models by aggregating them, but they make it extremely hard to interpret the ....
Buntine, W. (1992), "Learning classification trees", Statistics and Computing 2(2), June, pp. 63--73.
....in statistical inference is multinomial estimation: Given a past history of observations independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in more complex statistical models, such as prediction trees [1, 14, 13], hidden Markov models [11] and Bayesian networks [3, 6]. The roots of multinomial estimation go back to Laplace's work in the 18th century [9]. In Bayesian theory, the classic approach to multinomial estimation is via the use of the Dirichlet distribution (see for instance [4]). Laplace's law ....
W. Buntine. Learning classification trees. In Artificial Intelligence Frontiers in Statistics. Chapman & Hall, 1993.
....rests with the choice of the prior P(model): the user is faced with defining the probability of every model before considering the data. This is often done using priors where the prior probability of a model is a convenient function of its syntactic description, with a bias towards simpler models [8, 2]. Continuous model parameters are often given non informative priors. In (2) the prior P(class) can be often estimated easily and accurately by simply counting the proportion of examples of each class in the data. Here it is the class conditional likelihoods P(example | class) that are ....
Wray Buntine. Learning classification trees. In D.J. Hand, editor, Artificial Intelligence Frontiers in Statistics: AI and Statistics III, chapter 15, pages 182--201. Chapman & Hall, London, 1993.
....INTEGRATED INSTANCE BASED LEARNING ALGORITHM 23 . C4.5 (Quinlan 1993) an inductive decision tree algorithm. Zarndt also reported results for ID3 (Quinlan 1986) C4, C4.5 using induced rules (Quinlan 1993) Cart (Breiman et al. 1984) and two decision tree algorithms using minimum message length (Buntine 1992). CN2 (using ordered lists) Clark Niblett 1989) which combines aspects of the AQ rule inducing algorithm (Michalski 1969) and the ID3 decision tree algorithm (Quinlan 1986) Zarndt also reported results for CN2 using unordered lists (Clark Niblett 1989) a naive Bayesian classifier ....
Buntine, Wray. 1992. Learning Classification Trees. Statistics and Computing, vol. 2, pp. 63-73.
....1 6 ffl m MARGINALIZE( x; y) M ; p; K; S; h) 7 (dx; y) M (x; y) M with K features removed using p Output: dx; y) M ; ffl fm 5. Empirical Results We now demonstrate FeatureBoosting of artificial neural nets (ANN) k nearest neighbor (KNN) and decision trees (DT) For DT we used IND (Buntine, 1992). For ANN we use three layer backprop nets with 5 hidden units, conjugate gradient descent, and early stopping with hold out sets. For KNN we use unweighted Euclidean distance with k = 2. We will contrast FeatureBoost with the meta learning algorithms MIXTURE (a simple mixture of experts) and AD ....
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63--73.
....and each observation is the score assigned by the recogniser to the handwritten word. 3 Parameter Estimation: Dirichlet priors We compare two approaches to the task of parameter estimation. The first approach attempts to incorporate expert knowledge in the form of informative Dirichlet priors [4, 1]. The second approach consists of maximum likelihood estimations of optimal parameters for the model, and is a special case of the former. The approach of Maximum a posteriori (MAP) Estimation involves incorporating linguistic intuitions into the estimation framework. We introduce ff terms that, ....
Buntine, Wray L., "Learning classification trees", in Artificial Intelligence & Statistics III, ed. D.J. Hand, Chapman&Hall, 1992.
....some way so as to get a smaller tree that does not over fit the data [11, 5] An alternative to pruning is to construct a weighted mixture of the subtrees of the original over fit tree. It is possible to construct a concise representation of a weighting over exponentially many different subtrees [3, 9, 4]. This paper proves a new PAC Bayesian theorem giving a bound on the generalization error of weighted mixtures. A weighted mixture which gives too much weight to models with low prior probability will over fit the data in the sense that its provable loss bound is large. A weighting that puts too ....
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
.... conditional probabilities for high dimensional data sets, for an enormous number of examples will be required to correctly assess conditional probabilities of the type P(C | x). Limited practical derivations of Bayes rule exist, which include linear discrimination, and more recently, Bayes tree [5]. Bayes tree requires knowledge of prior class probabilities (empirically derivable from class proportions in the training data). Associated with a tree is a posterior probability of correct classification. The decision to grow a tree from a node is based upon increasing the posterior ....
W. Buntine. Learning Classification Trees. Statistics and Computing, 2:63--73, 1992.
....m=6 n=8, m=12 Figure 2: Posterior distribution for learning a probability are P(heads) = 0 when n = 0 and P(heads) = 1 when m = 0. Note that if the prior isn't very biased, it soon gets dominated by the data. Bayesian learning has been applied to many representations including decision trees [3], neural networks [21], Bayesian networks [10] and unsupervised learning [6]. All we need is a way to specify what a particular decision tree, neural network, Bayesian network, or logic program predicts (this is well defined by the definition of the representation) as well as a prior probability ....
....if there is noise in the data, a more detailed decision tree can always be made to fit the data better, but usually has worse predictive properties on unseen examples. A prior probability on decision trees provides a bias that lets us trade off fitting the training data with simplicity of the trees [3]. Bayesian learning is closely related to the minimum description length (MDL) principle. If we were to choose the most likely hypothesis given the data (called the maximum a posteriori probability, or MAP, hypothesis) we can use: $\arg\max_h P(h \mid e) = \arg\max_h \frac{P(e \mid h)\, P(h)}{P(e)} = \arg$ ....
[Article contains additional citation context not shown here]
W. Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....produced initially and then this tree is pruned for the purpose of obtaining a better predictor. A pruning is produced by deleting some nodes and with them all their successors. Although there are exponentially many prunings, a recent method developed in coding theory [WST95] and machine learning [Bun92] makes it possible to (implicitly) maintain one weight per pruning. In particular Helmbold and Schapire [HS97] use this method to design an elegant algorithm that is guaranteed to predict nearly as well as the best pruning of a decision tree. Pereira and Singer [PS97] modify this algorithm to the ....
W. Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....of $X_i$ in $G$. Using Bayes rule, it follows that $\Pr(G^h, L^h_G \mid D) \propto \Pr(D \mid L^h_G, G^h)\, \Pr(L^h_G \mid G^h)\, \Pr(G^h)$. The specification of priors on local structures presents no additional complications other than the specification of priors for the structure of the network $G^h$. Buntine (1991a, 1993), for example, suggests several possible priors on decision trees. A natural prior over local structures is defined via the MDL description length, by setting $\Pr(L^h_G \mid G^h) \propto 2^{-\sum_i DL_{\text{local-struct}}(L_i)}$. For the term $\Pr(D \mid L^h_G, G^h)$ we make an assumption of ....
W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence Frontiers in Statistics, number III in AI and Statistics. Chapman & Hall, London, 1993.
....are a special kind of parameterized graph structure. Unsurprisingly, many aspects of the priors discussed in this section can be found in Bayesian approaches to the induction of graph based models in other domains (e.g. Bayesian networks (Cooper Herskovits 1992; Buntine 1991) and decision trees (Buntine 1992)) 3.4.1 Structural vs. parameter priors An HMM can be described in two stages: 1. A model structure or topology is specified as a set of states, transitions and emissions. Transitions and emissions represent discrete choices as to which paths and outputs can have non zero probability in the ....
....weights $\alpha_i$ determine the bias embodied in the prior: the prior expectation of $\theta_i$ is $\alpha_i / \alpha_0$, where $\alpha_0 = \sum_i \alpha_i$ is the total prior weight. One important reason for the use of the Dirichlet prior in the case of multinomial parameters (Cheeseman et al. 1988; Cooper Herskovits 1992; Buntine 1992) is its mathematical expediency. It is a conjugate prior, i.e. of the same functional form as the likelihood function for the multinomial. The likelihood for a sample from the multinomial with total observed outcomes $c_1, \ldots, c_n$ is given by $P(c_1, \ldots, c_n \mid \theta) = \prod_{i=1}^{n}$ ....
[Article contains additional citation context not shown here]
Buntine, Wray. 1992. Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and Statistics III , ed. by D. J. Hand. Chapman & Hall.
....problems in statistical inference is multinomial estimation: Given a past history of observations independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in more complex statistical models, such as prediction trees [1, 15, 14], hidden Markov models [12] and Bayesian networks [3, 7]. The roots of multinomial estimation go back to Laplace's work in the 18th century [10]. In Bayesian theory, the classic approach to multinomial estimation is via the use of the Dirichlet distribution (see for instance [4]). Laplace's law ....
W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence Frontiers in Statistics, number III in AI and Statistics. Chapman & Hall, London, 1993.
....to further simplify them at the expense of some lost amount of ensemble quality. But the only principled way of choosing a good upper bound for ensemble sizes seems to be a (costly) wrapper like approach based on cross validation. A comparison to Bayesian approaches as described in [Oliver 95, Buntine 91] is clearly necessary for judging the merits of the presented ideas. So in summary the ideas described above can be seen as first successful steps of a much larger endeavour aimed at getting back interpretable structure of ensemble generating procedures while at least retaining some of these ....
Buntine W.L.: Learning Classification Trees, Statistics and Computing, Vol.2:63-73, 1991.
....84] This framework, however, is known to mix search bias (introduced when the algorithm decides on the order in which attributes are to be used in splitting) with hypotheses space bias. To avoid being trapped by this bias, several researchers have suggested averaging over multiple trees (e.g. Buntine, 91] In this paper, still within a recursive partitioning framework, we propose using an alternative data structure called SE tree [Rymon, 92] On one hand, since the new framework shares many of the features of decision tree based algorithms, we should be able to adopt many sub techniques ....
Buntine, W., Learning Classification Trees. Technical Report, NASA Ames Research Center, 1991.
....denoted by QU0 and QU1, respectively. The corresponding trees with linear combination splits are denoted by QL0 and QL1, respectively. The results in this paper are based on version 1.7.10 of the program. The software is obtained from http://www.stat.wisc.edu/~loh/quest.html. IND: This is due to Buntine (1992). We use version 2.1 with the default settings. IND comes with several standard predefined styles. We compare four Bayesian styles in this paper: bayes, bayes opt, mml, and mml opt (denoted by IB, IBO, IM, and IMO, respectively). The opt methods extend the non opt methods by growing several ....
Buntine, W. (1992). Learning classification trees, Statistics and Computing 2: 63--73.
.... branch and bound techniques to convert decision tables to optimal trees, see [338]. Tree construction using partial or exhaustive lookahead has been considered in statistics [139, 122], in pattern recognition [197], for tree structured vector quantizers [410], for Bayesian class probability trees [62], for neural trees [102], and in machine learning [365, 403, 354]. Most of these studies indicate that lookahead does not cause considerable improvements over greedy induction. Murthy and Salzberg [354] argued that one level lookahead does not help build significantly better trees, and that ....
....On the contrary, class probability trees assign a probability distribution for all classes at the terminal nodes. Breiman et al. ([44], Chapter 4) proposed a method for building class probability trees. Quinlan discussed methods of extracting probabilities from decision trees in [397]. Buntine [62] described Bayesian methods for building, smoothing and averaging class probability trees. Smoothing in the context of tree structured vector quantizers is described in [17]. An approach, which refines the class probability estimates in a greedily induced decision tree using local kernel ....
[Article contains additional citation context not shown here]
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....networks and Bayesian networks formalisms [12] • Mathematically rigorous approach to learning based on Bayesian statistics; The virtue of Bayesian approach to learning has been recognized by machine learning community. Bayesian algorithms have been developed to learn classification trees [13], to perform unsupervised classification [17], to train different types of neural networks [15, 16]. This promising line of research in machine learning needs further investigation. • Challenging real world application demanding Bayesian network based representation; The stuck problem of ....
Buntine, W. (1991) Learning classification trees. In Hand, D. (Ed.), Artificial Intelligence Frontiers in Statistics, London : Chapman and Hall, pp. 182-201.
....bits required to encode M S , e.g. by listing all transitions and emissions. The prior probabilities for M , on the other hand, are assigned using a Dirichlet distribution for each of the transition and emission multinomial parameters, similar to the Bayesian decision tree induction method of Buntine (1992). The parameter prior effectively spreads the posterior probability as if a certain number of evenly distributed virtual samples had been observed for each transition and emission. For convenience we assume that the parameters associated with each state are a priori independent. There are ....
BUNTINE, WRAY. 1992. Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and Statistics III, ed. by D. J. Hand. Chapman & Hall.
....First, we apply Cesa-Bianchi et al.'s [4] results on predicting using expert advice (where we view each pruning as an expert) to obtain an algorithm that has provably low prediction loss, but that is computationally infeasible. Next, we generalize and apply a method developed by Buntine [3, 2] and Willems, Shtarkov and Tjalkens [20, 21] to derive a very efficient implementation of this procedure. 1. Introduction Many algorithms for inferring a decision tree from data, such as C4.5 [13], involve a two step process: In the first step, a very large decision tree is grown to match the ....
....regarding the source of the data that is being observed. Thus, the resulting algorithm is very robust. A naive implementation of this procedure would require computation time linear in the number of prunings of T; obviously, this is infeasible. However, we show how techniques used by Buntine [3, 2] and Willems, Shtarkov and Tjalkens [20, 21] can be generalized and applied to our setting, yielding a very efficient implementation requiring computation time at each trial t that is linear in the length of the path defined by the instance x_t in the tree T (and therefore is bounded by the ....
[Article contains additional citation context not shown here]
Wray Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
....i;pa(X i ) X i ; Pa(X i ) Q i;pa(X i ) Again, it is easy to verify that such a model is separable: each combination of legal choices of Q i;pa(X i ) results in a probability distribution. Other examples of separable factored models include multinets [14] mixture models [6] decision trees [5], decision graphs, and the combination of the latter two representations with belief networks [4, 13, 8] An example of a class of models that are factored in a non trivial sense but are not separable are non chordal Markov networks [22] The probability distribution defined by such networks has a ....
....O(n) modifications that involve further changes to the parts of the model that were changed in the previous iteration. Another example of a search procedure that exploits the same factorization properties is the standard divide and conquer approach for learning decision trees, see for example [5]. A decision tree is a factored model where each factor corresponds to a leaf of the tree. If we replace a leaf by subtree, or replace a subtree by a leaf, all of the other factors in the model remain unchanged. This formal property justifies independent search for the structure of each subtree ....
W. Buntine. Learning classification trees. In D. J. Hand, ed., AI & Stats 3, 1993.
....5 Constructing Decision Trees Non backtracking decision trees are an attractive technology for classification, since they promise a dramatic trade off of time for space. A theoretical drawback is that inferring optimal trees, under various criteria, appears to be computationally infeasible[Bun87]; however, suboptimal heuristics often build roughly balanced, strongly pruning trees[CN84, WS87] Most such heuristics are greedy: given an incomplete tree, they choose the next split (of a leaf) that is locally most promising: for example, it may, among all possible single next splits, maximize ....
W. Buntine, "Learning Classification Trees," Statistics and Computing, vol. 2, 1992, pp. 63--73.
No context found.
Buntine, W. (1992). Learning classification trees.
No context found.
Buntine, W. (1992). Learning classification trees. Statistics and Computing, 2, 63--73.
No context found.
Buntine, W. (1991). Learning classification trees. NASA Ames Technical Report FIA-90-12-19-01, Moffett Field, CA.
No context found.
W. Buntine. Learning classification trees. Statistics and Computing, 2:63--73, 1992.
No context found.
W. Buntine. Learning classification trees. In D. J. Hand, editor, Artificial Intelligence frontiers in statistics, pages 182--201. Chapman & Hall, London, 1993.
No context found.
W. Buntine, "Learning classification tree," Statist. Comput., vol. 2, pp. 63--73, 1992.
No context found.
W. Buntine, "Learning Classification Trees," Statistics and Computing, vol. 2, pp. 63--73, 1992.
No context found.
BUNTINE,WRAY. 1992. Learning classification trees. In Artificial Intelligence Frontiers in Statistics: AI and Statistics III, ed. by D. J. Hand. Chapman & Hall.
No context found.
Buntine, W. (1992a), `Learning classification trees', Statistics and Computing 2(2), 63--73.
No context found.
Buntine, W. (1992) Learning Classification Trees. Statistics and Computing 2:63-73.
CiteSeer.IST - Copyright Penn State and NEC