Results 1-10 of 12
Entropy-Based Criterion in Categorical Clustering
 Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Abstract

Cited by 35 (4 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
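The expected-entropy criterion this abstract refers to can be sketched in a few lines: the quality of a categorical clustering is the cluster-size-weighted sum of per-cluster attribute entropies (lower is better). This is a minimal illustration under that reading, not the paper's implementation; the function names and toy data are ours:

```python
from collections import Counter
from math import log2

def cluster_entropy(rows):
    """Sum of per-attribute Shannon entropies within one cluster."""
    h = 0.0
    for col in zip(*rows):          # iterate over attributes (columns)
        counts = Counter(col)
        n = len(col)
        h -= sum(c / n * log2(c / n) for c in counts.values())
    return h

def expected_entropy(data, labels):
    """Entropy criterion: cluster-size-weighted mean of cluster entropies."""
    clusters = {}
    for row, k in zip(data, labels):
        clusters.setdefault(k, []).append(row)
    n = len(data)
    return sum(len(rows) / n * cluster_entropy(rows)
               for rows in clusters.values())

data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(expected_entropy(data, [0, 0, 1, 1]))  # pure clusters -> 0.0
print(expected_entropy(data, [0, 1, 0, 1]))  # mixed clusters -> 2.0
```

Pure clusters give zero entropy on every attribute, while maximally mixed clusters are penalized, which is the intuition behind using this criterion as a clustering objective.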
A unified view on clustering binary data
 Machine Learning
Abstract

Cited by 12 (3 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data have long occupied a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
The "best K" for entropy-based categorical data clustering
 In Inter. Conf. on Scien. and Stat. Database Management
, 2005
Abstract

Cited by 6 (0 self)
With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed an important problem for categorical clustering: how can we determine the best number of clusters K for a categorical dataset? Since categorical data does not have an inherent distance function as the similarity measure, traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose a BkPlot method for determining a set of candidate "best Ks". This method is implemented with a hierarchical clustering algorithm, HierEntro. The experimental results show that our approach can effectively identify significant clustering structures.
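One way to picture the BkPlot idea from this abstract: track how the expected entropy drops as K grows, and look at the second-order difference of the incremental entropy, where pronounced peaks suggest candidate "best Ks". This is a simplified reading of the method, not the paper's exact definition, and the toy entropy curve is ours:

```python
def bkplot_candidates(entropies):
    """Given expected-entropy values H[K] for K = 1..Kmax (a decreasing
    curve), return the second-order difference of the incremental entropy
    I(K) = H(K) - H(K+1). Peaks in this sequence mark Ks where adding one
    more cluster stops paying off sharply -- candidate 'best Ks'."""
    inc = [entropies[k] - entropies[k + 1] for k in range(len(entropies) - 1)]
    return [inc[k] - inc[k + 1] for k in range(len(inc) - 1)]

# Hypothetical entropy curve with a sharp flattening after K = 4:
H = [4.0, 3.0, 2.5, 1.0, 0.5, 0.25]
print(bkplot_candidates(H))  # -> [0.5, -1.0, 1.0, 0.25]
```

The largest value in the output singles out the K at which the entropy curve bends most, which is the kind of structure a BkPlot is meant to surface.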
On clustering binary data
 Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05)
, 2005
Abstract

Cited by 4 (2 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain a "bag of words". The contribution of the paper is twofold. First, a new clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.
A General Model for Clustering Binary Data
Abstract
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain a "bag of words". The contribution of the paper is threefold. First, a general binary data clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. We characterize several variations of the general model with different optimization procedures. Second, we establish the connections between our clustering model and other existing clustering methods. Third, we discuss the problem of determining the number of clusters for binary clustering. Experimental results show the effectiveness of the proposed clustering model.
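The symmetric treatment of data points and features described in this abstract can be sketched as a co-clustering loop: rows and columns each carry a cluster label, and the two sides are reassigned alternately against block means. This is a simplified alternating block-means sketch, not the paper's exact alternating least-squares procedure; all names and the toy matrix are ours:

```python
import random

def cocluster(X, k, iters=20, seed=0):
    """Jointly cluster rows (data points) and columns (features) of a
    binary matrix X by alternating least-squared-error reassignment
    against the current block means. A sketch, not the paper's method."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    r = [rng.randrange(k) for _ in range(n)]   # row assignments
    c = [rng.randrange(k) for _ in range(m)]   # feature assignments
    for _ in range(iters):
        # block means mu[a][b] = average of X over rows in a, cols in b
        mu = [[0.0] * k for _ in range(k)]
        cnt = [[0] * k for _ in range(k)]
        for i in range(n):
            for j in range(m):
                mu[r[i]][c[j]] += X[i][j]
                cnt[r[i]][c[j]] += 1
        for a in range(k):
            for b in range(k):
                if cnt[a][b]:
                    mu[a][b] /= cnt[a][b]
        # reassign each row to the block-mean profile with least squared error
        for i in range(n):
            r[i] = min(range(k), key=lambda a, i=i: sum(
                (X[i][j] - mu[a][c[j]]) ** 2 for j in range(m)))
        # reassign each feature symmetrically
        for j in range(m):
            c[j] = min(range(k), key=lambda b, j=j: sum(
                (X[i][j] - mu[r[i]][b]) ** 2 for i in range(n)))
    return r, c

X = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
rows, cols = cocluster(X, 2)
```

On this block-diagonal toy matrix, identical rows (and identical columns) always end up with the same label, illustrating how data assignments and feature assignments are produced together by one objective.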
A Combinatorial Introduction to Cluster Analysis
 Classification Society of North America Short Course
Geometric Methods for Mining Large and Possibly Private Datasets
, 2006
Abstract
To my parents, and my wife. ACKNOWLEDGEMENTS. First of all, I would like to acknowledge the overwhelming contribution and endless hours invested in me by my advisor, Prof. Ling Liu. Without her, I would not even be close to being done. Apart from her research advisory help, I also thank her for all the useful life tips I have harvested over the past years. I wish to thank the faculty members on my committee, Prof. Elisa Bertino, Prof. Chinhui Lee, Prof. Shamkant Navathe, and Prof. Edward Omiecinski, for their help in reviewing the final thesis and giving valuable comments. Prof. Elisa Bertino, Prof. Edward Omiecinski, Prof. Calton Pu, and Dr. Gordon Sun also helped me a lot in my job search. I would like to thank them all. I am grateful to many people at Georgia Tech for their guidance and support during my Ph.D. study. My colleagues in the Distributed Data Intensive Systems Lab (DISL) were great friends and often provided valuable advice. I especially want to thank Prof. Calton Pu, who had tried to involve me in proposal writing. The DISL group meetings have been a great source of information, pizza,
"Best K": Critical Clustering Structures in Categorical Datasets
 Under consideration for publication in Knowledge and Information Systems
Abstract
The demand for cluster analysis of categorical data has continued to grow over the last decade. A well-known problem in categorical clustering is to determine the best number of clusters K. Although several categorical clustering algorithms have been developed, surprisingly, none has satisfactorily addressed the problem of the best K for categorical clustering. Since categorical data does not have an inherent distance function as the similarity measure, traditional cluster validation techniques based on geometric shapes and density distributions are not appropriate for categorical data. In this paper, we study the entropy property between the clustering results of categorical data with different numbers of clusters K, and propose the BKPlot method to address three important cluster validation problems: 1) How can we determine whether there is significant clustering structure in a categorical dataset? 2) If there is significant clustering structure, what is the set of candidate "best Ks"? 3) If the dataset is large, how can we efficiently and reliably determine the best Ks?
THAT DO NOT DEPEND ON THE MARGINAL DISTRIBUTIONS
, 2008
Abstract
We discuss properties that association coefficients may have in general, e.g., zero value under statistical independence, and we examine coefficients for 2 × 2 tables with respect to these properties. Furthermore, we study a family of coefficients that are linear transformations of the observed proportion of agreement given the marginal probabilities. This family includes the phi coefficient and Cohen’s kappa. The main result is that the linear transformations that set the value under independence at zero and the maximum value at unity, transform all coefficients in this family into the same underlying coefficient. This coefficient happens to be Loevinger’s H.
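The abstract's main result can be checked numerically on a toy 2 × 2 table: rescaling a coefficient linearly (given the margins) so that independence maps to 0 and the maximum-agreement table maps to 1 reproduces Loevinger's H, for both the phi coefficient and Cohen's kappa. The helper names below are ours:

```python
def phi(t):
    """Phi coefficient of a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = t
    return (a * d - b * c) / ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5

def kappa(t):
    """Cohen's kappa: agreement corrected for chance agreement."""
    (a, b), (c, d) = t
    n = a + b + c + d
    p = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p - pe) / (1 - pe)

def loevinger_h(t):
    """Loevinger's H: (P - E) / (Pmax - E), where P is observed agreement,
    E its value under independence, Pmax its maximum given the margins."""
    (a, b), (c, d) = t
    n = a + b + c + d
    p1, q1 = (a + b) / n, (a + c) / n
    p = (a + d) / n
    e = p1 * q1 + (1 - p1) * (1 - q1)
    pmax = 1 - abs(p1 - q1)
    return (p - e) / (pmax - e)

def rescale(coef, t):
    """Linear rescaling (margins fixed) sending the independence table to 0
    and the maximum-agreement table to 1; per the abstract, this collapses
    every coefficient in the family onto the same underlying value."""
    (a, b), (c, d) = t
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    ind = [[r1 * c1 / n, r1 * c2 / n], [r2 * c1 / n, r2 * c2 / n]]
    amax = min(r1, c1)                      # max agreement given the margins
    mx = [[amax, r1 - amax], [c1 - amax, n - r1 - c1 + amax]]
    return (coef(t) - coef(ind)) / (coef(mx) - coef(ind))

t = [[40, 10], [5, 45]]
# rescale(phi, t), rescale(kappa, t), and loevinger_h(t) all agree
print(rescale(phi, t), rescale(kappa, t), loevinger_h(t))
```

Even though phi and kappa disagree on the raw table, both land exactly on Loevinger's H after the rescaling, which is the paper's main result in miniature.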