Results 1–10 of 12
On combining multiple clusterings: an overview and a new perspective
Cited by 3 (0 self)
Many problems can be reduced to the problem of combining multiple clusterings. In this paper, we first summarize different application scenarios of combining multiple clusterings and provide a new perspective of viewing the problem as a categorical clustering problem. We then show the connections between various consensus and clustering criteria and discuss the complexity results of the problem. Finally, we propose a new method to determine the final clustering. Experiments on kinship terms and on clustering popular music from heterogeneous feature sets show the effectiveness of combining multiple clusterings.
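As a rough illustration of the setting this abstract describes, each base clustering can be treated as one categorical attribute of every object; stacking the label vectors gives a categorical data matrix on which any categorical clustering method can operate. The co-association similarity below is one standard ingredient of consensus methods (a minimal sketch; the paper's actual method may differ):

```python
# Sketch: co-association similarity for combining multiple clusterings.
# sim[i][j] = fraction of base clusterings that place objects i and j
# in the same cluster. Labels need not be aligned across clusterings.

def co_association(clusterings):
    """clusterings: list of label lists, one per base clustering."""
    n = len(clusterings[0])
    m = len(clusterings)
    sim = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0 / m
    return sim

# Three base clusterings of 4 objects (illustrative data):
base = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],   # the same partition under different label names
    [0, 0, 0, 1],
]
sim = co_association(base)
```

A consensus clustering can then be obtained by clustering the objects using `sim` as a similarity matrix.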
A Binary Matrix Factorization Algorithm for Protein Complex Prediction
Cited by 1 (1 self)
We propose a binary matrix factorization (BMF) algorithm under Bayesian Ying-Yang (BYY) harmony learning to detect protein complexes by clustering proteins that share similar interactions, through factorizing the binary adjacency matrix of the protein-protein interaction (PPI) network. The proposed BYY-BMF algorithm automatically determines the cluster number, whereas this number must usually be specified for most existing BMF algorithms. Moreover, BYY-BMF's clustering results do not depend on any parameters or thresholds, unlike the Markov Cluster Algorithm (MCL), which relies on a so-called inflation parameter. On synthetic PPI networks, predictions evaluated against the known annotated complexes indicate that BYY-BMF is more robust than MCL in most cases. BYY-BMF also obtains better-balanced prediction accuracy than MCL and a spectral analysis method on real PPI networks from the MIPS and DIP databases.
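The core idea behind complex detection via BMF can be sketched as follows: if W is a binary protein-to-complex membership matrix, the Boolean product W W^T marks exactly the protein pairs that share a complex, so BMF searches for a W whose Boolean product approximates the observed PPI adjacency matrix. This is only the representational idea; BYY-BMF's actual learning rule, which also selects the number of complexes, is not reproduced here:

```python
import numpy as np

def boolean_reconstruct(W):
    """Boolean product W o W^T: entry (i, j) is 1 iff proteins i and j
    share at least one complex under membership matrix W."""
    return (W @ W.T > 0).astype(int)

# Illustrative data: 5 proteins, 2 complexes.
# Proteins 0-2 belong to complex 0; proteins 3-4 belong to complex 1.
W = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])
A = boolean_reconstruct(W)  # block-structured adjacency implied by W
```

Fitting proceeds in the other direction: given an observed adjacency matrix A, the algorithm seeks a binary W minimizing the mismatch between A and the reconstruction.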
HACS: Heuristic Algorithm for Clustering Subsets
The term consideration set is used in marketing to refer to the set of items a customer thought about purchasing before making a choice. While consideration sets are not directly observable, finding common ones is useful for market segmentation and choice prediction. We approach the problem of inducing common consideration sets as a clustering problem on the space of possible item subsets. Our algorithm combines ideas from binary clustering and itemset mining, and differs from other clustering methods by reflecting the inherent structure of subset clusters. Experiments on both real and simulated datasets show that our algorithm clusters effectively and efficiently, even for sparse datasets. In addition, a novel evaluation method is developed to compare clusters found by our algorithm with known ones.
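To make the "clustering on the space of item subsets" framing concrete, consideration sets can be represented as frozensets and compared with Jaccard similarity. The toy greedy pass below only illustrates that representation; the threshold and function names are illustrative and this is not the HACS algorithm itself:

```python
# Sketch: grouping item subsets (consideration sets) by Jaccard similarity.

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

def greedy_subset_clusters(sets, threshold=0.5):
    """Assign each subset to the first cluster whose seed it resembles;
    otherwise start a new cluster. clusters: list of (seed, members)."""
    clusters = []
    for s in sets:
        for seed, members in clusters:
            if jaccard(s, seed) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters

# Illustrative data: four observed "baskets" over items a, b, c, x, y, z.
baskets = [frozenset("ab"), frozenset("abc"), frozenset("xy"), frozenset("xyz")]
clusters = greedy_subset_clusters(baskets)
```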
Data mining: a tool for detecting cyclical disturbances in supply networks
, 1771
Disturbances in supply chains may be either exogenous or endogenous. The ability to automatically detect, diagnose, and distinguish between the causes of disturbances is of prime importance to decision makers seeking to avoid uncertainty. The spectral principal component analysis (SPCA) technique has been used to distinguish between real and rogue disturbances in a steel supply network. The data set used was collected from four different business units in the network and consists of 43 variables, each described by 72 data points. The present paper uses the same data set to test an alternative approach to SPCA for detecting the disturbances. The new approach employs statistical data preprocessing, clustering, and classification learning techniques to analyse the supply network data. In particular, the incremental k-means clustering and RULES-6 classification rule-learning algorithms, developed by the present authors' team, have been applied to identify important patterns in the data set. Results show that the proposed approach can automatically detect and characterize network-wide cyclical disturbances and generate hypotheses about their root cause.
A Comparison of Categorical Attribute Data Clustering Methods
Clustering data in Euclidean space has a long tradition, and considerable attention has been paid to analyzing several different cost functions. Unfortunately, these results rarely generalize to clustering of categorical attribute data. Instead, the simple k-modes heuristic is the most commonly used method despite its modest performance. In this study, we model clusters by their empirical distributions and use expected entropy as the objective function. A novel clustering algorithm is designed based on local search for this objective function and compared against six existing algorithms on well-known data sets. The proposed method provides better clustering quality than the other iterative methods at the cost of higher time complexity.
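The expected-entropy objective mentioned above can be written down directly: each cluster is modeled by the empirical distribution of each categorical attribute, and the objective is the size-weighted sum of per-cluster, per-attribute entropies, which a local-search procedure would then minimize. A minimal sketch (the paper's search algorithm itself is not reproduced):

```python
import math
from collections import Counter

def expected_entropy(clusters):
    """clusters: list of clusters; each cluster is a list of equal-length
    tuples of categorical values. Returns the size-weighted sum of the
    empirical entropy of every attribute within every cluster."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        for attr in range(len(c[0])):
            counts = Counter(row[attr] for row in c)
            h = -sum((k / len(c)) * math.log2(k / len(c))
                     for k in counts.values())
            total += (len(c) / n) * h
    return total

# Illustrative data: a pure partition scores 0; a mixed one scores higher.
pure  = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
mixed = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]
```

Local search would move objects between clusters whenever doing so lowers this total.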
Mining Projected Clusters in High-Dimensional Spaces
, 2008
Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. To address this problem, a number of projected clustering algorithms have been proposed. However, most of them encounter difficulties when clusters hide in subspaces of very low dimensionality. These challenges motivate our effort to propose a robust partitional distance-based projected clustering algorithm. The algorithm consists of three phases. The first phase performs attribute relevance analysis by detecting dense and sparse regions and their location in each attribute. Starting from the results of the first phase, the second phase eliminates outliers, while the third phase discovers clusters in different subspaces. The clustering process is based on the k-means algorithm, with the computation of distance restricted to subsets of attributes where object values are dense. Our algorithm is capable of detecting projected clusters of low dimensionality embedded in a high-dimensional space and avoids computing distances in the full-dimensional space. The suitability of our proposal has been demonstrated through an empirical study using synthetic and real datasets. Index Terms: Data mining, clustering, high dimensions, projected clustering.
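The key step the abstract describes, a k-means-style assignment with distance restricted to each cluster's own subset of relevant attributes, can be sketched as below. The first two phases (attribute relevance analysis and outlier elimination) are omitted, so the relevant dimensions are simply given as input here; all names and data are illustrative:

```python
# Sketch: projected k-means assignment. Each cluster k has its own set
# of relevant dimensions; distance to centroid k is computed only over
# those dimensions (and normalized by their count, since subspace sizes
# differ), never in the full-dimensional space.

def assign(points, centroids, relevant_dims):
    """Return the index of the nearest centroid, in that cluster's own
    subspace, for every point."""
    labels = []
    for p in points:
        best, best_d = None, float("inf")
        for k, (c, dims) in enumerate(zip(centroids, relevant_dims)):
            d = sum((p[j] - c[j]) ** 2 for j in dims) / len(dims)
            if d < best_d:
                best, best_d = k, d
        labels.append(best)
    return labels

points = [(0.0, 0.1, 9.0), (0.1, 0.0, -3.0),   # near origin in dims 0, 1
          (7.0, 5.0, 5.1), (-2.0, 4.0, 5.0)]   # near 5.0 in dim 2
centroids = [(0.0, 0.0, 0.0), (0.0, 0.0, 5.0)]
relevant = [(0, 1), (2,)]                      # cluster 0 -> dims 0,1; cluster 1 -> dim 2
labels = assign(points, centroids, relevant)
```

A full iteration would alternate this assignment step with recomputing each centroid over its members, exactly as in standard k-means.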
Research Overview
, 2008
My research explores two related topics in learning from data: how to efficiently discover useful patterns and how to effectively retrieve information. My interests lie broadly in data mining, machine learning, information retrieval, and bioinformatics, studying both algorithmic and application issues. I focus strongly on research challenges grounded in real-world problems and work to validate my research in this context. I have received an NSF CAREER Award, two IBM Faculty Research Awards, an IBM Shared University Research (SUR) award, and a Xerox University Affairs Committee (UAC) award for my work on data mining and its applications. All these awards are highly competitive and recognize the quality and importance of my work. My research output so far is: 22 papers in peer-reviewed journals, 2 book chapters, 72 papers