Results 1-10 of 16
Document clustering via adaptive subspace iteration
In SIGIR, 2004
Cited by 36 (7 self)
Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm, ASI, which explicitly models the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated by the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existing clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of the ASI algorithm.
Entropy-Based Criterion in Categorical Clustering
Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Cited by 35 (4 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
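To make the idea of an entropy-type heterogeneity measure concrete, here is a minimal sketch. It assumes one common form of the criterion (the size-weighted sum, over clusters, of the per-attribute Shannon entropies); the paper's exact formulation may differ, and the names `column_entropy` and `expected_entropy` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def expected_entropy(rows, labels):
    """Size-weighted sum of per-cluster, per-attribute entropies.

    Assumed form of the criterion, not necessarily the paper's:
    lower values mean more homogeneous clusters.
    """
    n = len(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    total = 0.0
    for members in clusters.values():
        # zip(*members) transposes the rows into attribute columns.
        within = sum(column_entropy(col) for col in zip(*members))
        total += (len(members) / n) * within
    return total


rows = [("red", "small"), ("red", "small"),
        ("blue", "large"), ("blue", "large")]
pure = expected_entropy(rows, [0, 0, 1, 1])   # identical rows grouped together
mixed = expected_entropy(rows, [0, 1, 0, 1])  # each cluster mixes both kinds
```

On this toy data the partition that groups identical rows scores lower than the mixed one, which is the behaviour a heterogeneity measure should show.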
A Robust Method for Clustering Analysis
The Annals of Statistics, 2000
Cited by 25 (6 self)
We develop a robust clustering method which unites Rousseeuw's minimum covariance determinant method and the determinant criterion of clustering analysis.
A unified view on clustering binary data
Machine Learning
Cited by 12 (3 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data occupy a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
Detecting the change of clustering structure in categorical data streams
SIAM Data Mining Conference, 2006
Cited by 7 (3 self)
Analyzing clustering structures in data streams can provide critical information for making decisions in real time. Most research has focused on clustering algorithms for data streams. We argue that, more importantly, we need to monitor the change of clustering structure online. In this paper, we present a framework for detecting the change of critical clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework extends the work on determining the best K for static datasets (the BkPlot method) to categorical data streams with the help of a Hierarchical Entropy Tree structure (HETree). HETree can efficiently capture the entropy property of the categorical data streams and allows us to draw precise clustering information from the data stream for high-quality BkPlots. The experiments show that with the combination of HETree and the BkPlot method we are able to efficiently and precisely detect the change of critical clustering structure in categorical data streams.
The “best k” for entropy-based categorical data clustering
In Inter. Conf. on Scien. and Stat. Database Management, 2005
Cited by 6 (0 self)
With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed an important problem for categorical clustering: how can we determine the best number of clusters K for a categorical dataset? Since categorical data do not have an inherent distance function as the similarity measure, the traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose a BkPlot method for determining a set of candidate “best Ks”. This method is implemented with a hierarchical clustering algorithm, HierEntro. The experimental results show that our approach can effectively identify the significant clustering structures.
On clustering binary data
Proceedings of the 2005 SIAM International Conference on Data Mining (SDM’05), 2005
Cited by 4 (2 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market basket datasets, where the transactions contain items, and for document datasets, where the documents contain a “bag of words”. The contribution of the paper is twofold. First, a new clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.
HETree: A Framework for Detecting Changes in Clustering Structure for Categorical Data Streams
Cited by 3 (0 self)
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few works have proposed monitoring the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HETree) to capture the entropy characteristics of clusters in a data stream, and detects the change of Best K in combination with our previously developed BKPlot method. The HETree can efficiently summarize the entropy property of a categorical data stream and allows us to draw precise clustering information from the data stream for generating high-quality BKPlots. We also develop the time-decaying HETree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental results show that with the combination of the HETree and the BKPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams.
A Spectral Based Clustering Algorithm for Categorical Data with Maximum Modularity
In this paper we propose a spectral-based clustering algorithm to maximize an extended Modularity measure for categorical data. First, we establish the connection with the Relational Analysis criterion. Second, the maximization of the extended modularity is shown to be a trace maximization problem. A spectral-based algorithm is then presented to search for the partitions maximizing the extended Modularity criterion. Experimental results indicate that the new algorithm is efficient and effective at finding a good clustering across a variety of real-world data sets.
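As background for the trace-maximization step, the classical graph modularity (not the extended categorical measure proposed in this paper, whose definition is not given in the abstract) can be written in trace form. The symbols below are assumptions for illustration: $A$ is a symmetric adjacency matrix, $k$ its degree vector, $m$ the number of edges, $c_i$ the cluster of node $i$, and $S \in \{0,1\}^{n \times K}$ the cluster indicator matrix with $S_{ic} = 1$ exactly when node $i$ belongs to cluster $c$.

```latex
\begin{aligned}
Q &= \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2m} \right) \delta(c_i, c_j) \\
  &= \frac{1}{2m} \sum_{i,j} B_{ij} \sum_{c=1}^{K} S_{ic} S_{jc},
  \qquad B = A - \frac{k k^{\top}}{2m} \\
  &= \frac{1}{2m} \operatorname{tr}\!\left( S^{\top} B S \right)
\end{aligned}
```

The second line uses $\delta(c_i, c_j) = \sum_{c} S_{ic} S_{jc}$, which holds because each row of $S$ has exactly one nonzero entry. Spectral methods of this kind typically relax the binary constraint on $S$ and work with leading eigenvectors of $B$.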
Modularity and Spectral Co-Clustering for Categorical Data
To tackle the co-clustering problem on categorical data, we consider a spectral approach. We first define a generalized modularity measure for the co-clustering task. Then, we reformulate its maximization as a trace maximization problem. Finally, we develop a spectral-based co-clustering algorithm performing this maximization. The proposed algorithm is then able to cluster rows and columns simultaneously. Experimental results on synthetic and real data sets confirm the good performance of our algorithm.