| P. Smyth. Clustering Using Monte Carlo Cross-Validation. In Proceedings Knowledge Discovery and Data Mining, pages 126--133, 1996. |
....concave but not always circular clusters, found different indices were better on different data sets showing their shape dependence. In a few cases, Clustering Validation approaches have been integrated into clustering algorithms giving a relatively automatic clustering process. Smyth presented MCCV [25], the Monte Carlo Cross Validation algorithm though this is intended for data sets where a likelihood function such as Gaussian mixture models can be defined. We have developed TURN [5] and TURN [6] which handle arbitrary shapes, noise, and very large data sets in a fast and efficient way. TURN is ....
Smyth P. (1996) Clustering using Monte Carlo cross-validation. Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining.
....of K, the number of classes, is fixed prior to performing prioritization. There are at least two methods that we could use to determine a suitable value for K. First, the number of classes to use could be determined using the data in an unsupervised manner, e.g. via cross validation techniques [14]. The second method of determining the number of classes would be to use manual intervention. After the rover sends down initial images, the number of classes could be determined by experts on the ground. This value then would be uplinked and the onboard system updated. An example of ....
P. Smyth, "Clustering Using Monte-Carlo Cross-Validation," Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, AAAI Press, Portland, Oregon, pp. 126--133, 1996.
....addresses either how to identify the optimal number of clusters K or how to select feature subsets while determining the correct number of clusters. The latter problem is more difficult because of the interdependency between the number of clusters and the feature subsets used to form the clusters [49]. To this point, most research on unsupervised model selection has considered the problem of identifying the right number of clusters using all available features [38,49]. Other researchers [1,51] have studied feature selection and clustering together. In particular, Devaney and Ram [17] ....
.... is more difficult because of the interdependency between the number of clusters and the feature subsets used to form the clusters [49]. To this point, most research on unsupervised model selection has considered the problem of identifying the right number of clusters using all available features [38,49]. Other researchers [1,51] have studied feature selection and clustering together. In particular, Devaney and Ram [17] combined a sequential forward and backward search algorithm with two concept learning algorithms, COBWEB [21] and AICC, an improved variant of COBWEB. The category utility ....
P. Smyth, Clustering using Monte Carlo cross-validation, in: Proc. of 2nd Int'l Conf. on Knowledge Discovery and Data Mining (KDD-96), 1996, pp. 126--133.
....efficiency. 3 The MMDL Criterion It was shown in [25] that MDL BIC (although simpler) performs comparably with EBB and MML, although it sometimes slightly underestimates the true k. A similar conclusion can be obtained from the many (20) tests described in [20]. It was also reported in [11] and [29] that MDL BIC tends to slightly underestimate the true order. In order to overcome this problem, let us look again at the MDL criterion in Eq. (11). The meaning of the MDL cost function is the total code length of a two part code for the observed data y and the parameter estimate Theta (k) ....
P.Smyth. Clustering using Monte-Carlo cross-validation. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 126--133. AAAI Press, Menlo Park, CA, 1996.
....parameter space (the logarithra GTP space in the stick pulling experiment) Each cluster is believed to be a small group of specialists, and essentially, the more clusters we have, the more diverse the system is. However, finding the right number of clusters for a data set is often ill posed [24]. Depending on different criteria, one number may or may not be better than another number. Another difficulty resides in the a priori parameters required by many clustering algorithms existing in the literature. Such parameters, including the number of initial centers, or the maximal ....
Smyth, P., Clustering using Monte Carlo cross-validation. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96), (Portland, Oregon, 1996), AAAI Press, pp. 126-133.
....without asymptotically derived complexity term. 2) Cross validation is a general and well understood technique for addressing overfitting. In particular, Monte Carlo cross validation, a variant of the standard v fold cross validation, has been successfully applied to model dimension selection [14, 15]. The generalized likelihood ratio test [2] is an example of a test on the increase in the model likelihood, compared to an assumed underlying distribution. There are techniques that try to locate a knee in the error or likelihood curve (cf. Figure 1, middle). The broken stick model [10] is widely ....
....were used to estimate the p values. Finally, all results with the synthetic data sets were averaged over 100 similarly generated random data sets. The prediction accuracy obtained by Pete was evaluated against the penalized likelihood based BIC score [12] and Monte Carlo cross validation [14, 15]. We evaluated the performance of BIC in two ways: (i) in a realistic setting, in which the variance of noise was estimated from the residuals (maximum likelihood), and (ii) in an unrealistic setting in which the true variance of noise was given as a parameter (we call this oracle BIC to ....
P. Smyth. Clustering using Monte Carlo cross-validation. In Second International Conference on Knowledge Discovery and Data Mining, pages 126--133, Portland, Oregon, USA, 1996.
....efficiency. 3 The MMDL Criterion It was shown in [25] that MDL BIC (although simpler) performs comparably with EBB and MML, although it sometimes slightly underestimates the true k. A similar conclusion can be obtained from the many (20) tests described in [20]. It was also reported in [11] and [29] that MDL BIC tends to slightly underestimate the true order. In order to overcome this problem, let us look again at the MDL criterion in Eq. (11). The meaning of the MDL cost function is the total code length of a two part code for the observed data y obs and the parameter estimate b Theta (k) ....
P. Smyth. Clustering using Monte-Carlo cross-validation. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 126--133. AAAI Press, Menlo Park, CA, 1996.
.... et al. 2000]. Analysis techniques such as k means clustering, clustering by principal components, average linkage clustering [Jain and Dubes, 1988], self organizing maps [Golub et al. 1999], agglomerative and hierarchical algorithms [Eisen et al. 1998, Reymond et al. 2000], Bayesian methods [Smyth, 1996], plaid models [Lazzeroni and Owen, 2000] and support vector machines [Brown et al. 2000] have been featured in a majority of published research. A commonly accepted dichotomy of analysis techniques distinguishes between supervised and unsupervised methods. Unsupervised ....
Smyth, P. (1996). Clustering using Monte-Carlo cross-validation. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining.
....efforts [Dave96, GG89, RLR98, ThK99]. Although several validity indices have been introduced, in practice most of the clustering methods do not use any of them. Furthermore, formal methods in data base and data mining applications for finding the best partitioning of a data set are very few [Smyth96]. iv. the resulting rules may hide knowledge. A rule is an implication of the form A → B, where A, B are groups of attributes or groups of categories (each attribute includes more than one category). Then all the sets of values belong to the categories denoted by A, B equally contribute to the ....
Padhraic Smyth. "Clustering using Monte Carlo Cross-Validation". KDD 1996, 126-133.
....k) H(f_0; f_k). We then choose k to maximize T_k. There is a tradeoff here. If n_1 is small, then the estimates f_k are based on little data and will not be accurate. If n_2 is small, then T_k will be a poor estimate of H(f_0; f_k): it is unbiased but will have a large variance. Smyth (1996), building on work of Burman (1989) and Shao (1993), suggests one possible implementation of this method. Split the data randomly into two equal parts and compute T_k as above. Now repeat this many times, choosing the splits at random, then average the values of T_k and choose k from these ....
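The repeated random-split procedure described in this context can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Smyth's implementation: the data are one-dimensional, the candidate models are Gaussian mixtures fitted by a small hand-written EM routine, and each k is scored by the average held-out log-likelihood per point. The function names (fit_gmm_1d, avg_loglik, mccv_select) and all default values are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k, iters=50, seed=0):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative, not tuned)."""
    rng = random.Random(seed)
    means = rng.sample(xs, k)  # initialise means at k distinct data points
    grand_mean = sum(xs) / len(xs)
    var_all = max(sum((x - grand_mean) ** 2 for x in xs) / len(xs), 1e-3)
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            s = sum(p) or 1e-300
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means, and variances (with a variance floor)
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-3)
    return weights, means, variances


def avg_loglik(xs, params):
    """Average log-likelihood per point of xs under the fitted mixture."""
    weights, means, variances = params
    total = 0.0
    for x in xs:
        p = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
        total += math.log(max(p, 1e-300))
    return total / len(xs)


def mccv_select(xs, candidates, M=20, beta=0.5, seed=0):
    """Average held-out log-likelihood over M random splits; return (best k, scores)."""
    rng = random.Random(seed)
    n_test = int(round(beta * len(xs)))
    totals = {k: 0.0 for k in candidates}
    for m in range(M):
        idx = list(range(len(xs)))
        rng.shuffle(idx)  # a fresh random split on every repetition
        test = [xs[i] for i in idx[:n_test]]
        train = [xs[i] for i in idx[n_test:]]
        for k in candidates:
            totals[k] += avg_loglik(test, fit_gmm_1d(train, k, seed=m))
    scores = {k: totals[k] / M for k in candidates}
    return max(scores, key=scores.get), scores
```

Calling mccv_select(data, [1, 2, 3]) on a list of floats returns the selected k together with the per-k average scores. Equal halves correspond to beta = 0.5.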
Smyth, P. (1996). Clustering using Monte Carlo Cross-Validation. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. AAAI Press.
....(Duda and Hart, 1973), information based criteria (Cutler and Windham, 1994) and more. Another approach to cluster validity includes some variant of cross validation (Fukunaga, 1990). Such methods were introduced both in the context of hard clustering (Jain and Moreau, 1986) and fuzzy clustering (Smyth, 1996). The approach presented here falls into this category. In this paper we present a method to help select which clustering result is more reliable. The method can be used to compare different algorithms, but it is most suitable to identify, within the same algorithm, those partitions that can be ....
Smyth, P. (1996). Clustering using monte carlo cross-validation. KDD-96 Proceedings, Second International Conference on Knowledge Discovery and Data Mining, pages 126--133.
....the EM algorithm and comparing performance with available alternatives. We do not address the issue of choosing initial model parameters (see [BF98,FRB98,MH98] for the problem of initial models), nor do we address the issue of setting the number of clusters k (an open research problem, e.g. [CS96,S96]). The goal is to study scalability properties and performance for a given k and set of initial conditions. Comparing against alternatives is based on quality of obtained solutions. It is an established fact in the statistical literature [PE96,GMPS97,CS96] that EM modeling results in better ....
P. Smyth. Clustering using Monte Carlo Cross-Validation. In Proc. 2nd Intl. Conf. on Knowledge Discovery and Data Mining (KDD96), AAAI Press, 1996.
....adjustable model parameters. Postulating too few parameters leads to a poor fit whereas too many parameters can lead to overfitting, thus distorting the density of the underlying data [16]. Recent efforts defining the model selection problem as one of estimating the number of clusters have emerged [12, 20]. One aspect that has been ignored, however, is that of clustering in different feature spaces. It is easy to see, particularly in applications with a large number of features, that various choices of feature subsets will reveal different structures underlying the data. It is our contention that ....
....for future work. 2.0 Model Selection In Clustering Model selection approaches in clustering have primarily concentrated on the problem of determining the number of components clusters. These attempts include Bayesian approaches [12], MDL based approaches [16], and cross validation techniques [20]. As noticed in [20, page 14, last paragraph], however, the optimal number of clusters is dependent on the feature space in which the clustering is performed. 2.1 A Generalized Model for Clustering Let be a data set consisting of patterns We will assume that the patterns are represented in D d ....
Smyth, P., Clustering using Monte Carlo cross-validation, KDD, 1996.
....There are also no theoretical results that would allow a mimic of binary search in model search. For example, one would like to be able to train models only sparsely and to especially avoid training of large number of very complex models. • Cross validation (CV) is an empirical approach [Smyth, 1996; Smyth et al. 1997] where one simulates what would happen if there were a large amount of data. In cross validation, one randomly splits the available data into two distinct subsets, training and test. Each of the models from the model universe is trained (fitted) on the training subset using ....
Smyth, P., "Clustering using Monte-Carlo cross validation," in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp.126--133, 1996.
Smyth, P., `Clustering using Monte-Carlo cross validation,' in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA: AAAI Press, pp.126--133, 1996.
....between this and the v fold method is that each datapoint may be used as a test point more than once. 10 fold cross validation was found to be much less reliable than MCCV (β = 0.5) in terms of choosing the correct number of components on a set of both simulated and real mixture modeling problems (Smyth, 1996). In general, there appears to be no obvious systematic method for automatically determining the best value of β to use for a particular problem when the true structure is unknown, although the choice of β = 0.5 appears to be reasonably robust across a variety of problems (Smyth, 1996). In ....
.... problems (Smyth, 1996). In general, there appears to be no obvious systematic method for automatically determining the best value of β to use for a particular problem when the true structure is unknown, although the choice of β = 0.5 appears to be reasonably robust across a variety of problems (Smyth, 1996). In terms of choosing the number of different partitions M, the larger the value of M the less the variability in the log likelihood estimates. In practice, values of M between 20 and 50 appear adequate for most applications. Finally, it is worth noting that there is an extra computational cost ....
P. Smyth. Clustering using Monte Carlo cross-validation. In E. Simoudis, J. Han, and U. M. Fayyad, eds., Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 126--133. AAAI Press, 1996.
P. Smyth. Clustering using Monte Carlo cross-validation. In Second International Conference on Knowledge Discovery and Data Mining, 1996.
P. Smyth. Clustering using Monte Carlo cross-validation. In E. Simoudis, J. Han, and U. M. Fayyad, eds., Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 126-133. AAAI Press, 1996.
P. Smyth. "Clustering using Monte Carlo Cross-Validation". KDD-96, pp. 126-133, 1996.