Extensions to the K-means algorithm for clustering large data sets with categorical values (1998)

by Z Huang
Venue: Data Mining Knowl. Discov.
Results 1 - 10 of 264

Survey of clustering algorithms

by Rui Xu, Donald Wunsch II - IEEE Transactions on Neural Networks, 2005
Abstract - Cited by 499 (4 self)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, proximity measure, and cluster validation, are also discussed.

Survey of clustering data mining techniques

by Pavel Berkhin, 2002
Abstract - Cited by 408 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique

Citation Context

...lusters. Gonzales [Gon85] used the maximum of intra-cluster variances instead of the sum. Almost every industrial implementation of k-means somehow resolves the issue of categorical attributes. Huang [Hua98] described a possible generalization from k-means to k-prototypes that incorporates categorical attributes. In some applications (in particular to high dimensional data), k-means results in clusters o...

Clustering binary data streams with K-means

by Carlos Ordonez - In Proc. ACM SIGMOD Data Mining and Knowledge Discovery Workshop, 2003
Abstract - Cited by 67 (10 self)
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant that finds higher quality solutions in less time. Higher quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.

Citation Context

...oup are similar to each other according to some similarity metric [9]. Most clustering algorithms work with numeric data [3, 6, 14, 26], but there has been work on clustering categorical data as well [12, 15, 18, 23]. The problem definition of clustering categorical data is not as clear as the problem of clustering numeric data [9]. There has been extensive database research on clustering large and high dimension...

Similarity Measures for Categorical Data: A Comparative Evaluation

by Shyam Boriah, Varun Chandola, Vipin Kumar, 2008
Abstract - Cited by 57 (3 self)
Measuring similarity or distance between two entities is a key step for several data mining and knowledge discovery tasks. The notion of similarity for continuous data is relatively well-understood, but for categorical data, the similarity computation is not straightforward. Several data-driven similarity measures have been proposed in the literature to compute the similarity between two categorical data instances but their relative performance has not been evaluated. In this paper we study the performance of a variety of similarity measures in the context of a specific data mining task: outlier detection. Results on a variety of data sets show that while no one measure dominates others for all types of problems, some measures are able to have consistently high performance.

Citation Context

...data that have been proposed recently. Some of them use notions of similarity which are neighborhood-based [15, 4, 8, 26, 1, 22], or incorporate the similarity computation into the learning algorithm [13, 17, 12]. Neighborhood-based approaches use some notion of similarity (usually the overlap measure) to define the neighborhood of a data instance, while the measures we study in this paper are directly used t...

A fuzzy k-modes algorithm for clustering categorical data

by Zhexue Huang, Michael K. Ng - IEEE Transactions on Fuzzy Systems, 1999
Abstract - Cited by 57 (5 self)
©1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Citation Context

...g Kong, Hong Kong. Publisher Item Identifier S 1063-6706(99)06705-3. To tackle the problem of clustering large categorical data sets in data mining, the k-modes algorithm has recently been proposed in [7]. The k-modes algorithm extends the k-means algorithm by using a simple matching dissimilarity measure for categorical objects, modes instead of means for clusters, and a frequency-based method to updat...
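
The three ingredients named in this context (a simple matching dissimilarity, modes in place of means, and a frequency-based update) can be illustrated with a minimal Python sketch. The function names `matching_dissimilarity`, `cluster_mode`, and `k_modes` are hypothetical, and the update rule shown (take the most frequent value of each attribute within a cluster) is an assumption, since the quoted passage is truncated; this is not the cited paper's implementation.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Assumed frequency-based update: most frequent value per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def k_modes(records, initial_modes, max_iter=10):
    """Alternate assignment and mode update until the modes stop changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = []
    for _ in range(max_iter):
        # Assign each record to the nearest mode (ties go to the lower index).
        labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        # Recompute each mode; an empty cluster keeps its previous mode.
        new_modes = []
        for j, old in enumerate(modes):
            members = [r for r, lab in zip(records, labels) if lab == j]
            new_modes.append(cluster_mode(members) if members else old)
        if new_modes == modes:
            break
        modes = new_modes
    return labels, modes

records = [("red", "small"), ("red", "large"),
           ("blue", "large"), ("blue", "large")]
labels, modes = k_modes(records, [records[0], records[2]])
```

Seeding the modes with two of the records is only for illustration; how initial modes are chosen is not stated in the quoted context.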

Efficient k-anonymization using clustering techniques

by Ji-won Byun, Elisa Bertino, Ninghui Li - In DASFAA, 2007
Abstract - Cited by 48 (7 self)
k-anonymization techniques have been the focus of intense research in the last few years. An important requirement for such techniques is to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications. In this paper we propose an approach that uses the idea of clustering to minimize information loss and thus ensure good data quality. The key observation here is that data records that are naturally similar to each other should be part of the same equivalence class. We thus formulate a specific clustering problem, referred to as k-member clustering problem. We prove that this problem is NP-hard and present a greedy heuristic, the complexity of which is in O(n^2). As part of our approach we develop a suitable metric to estimate the information loss introduced by generalizations, which works for both numeric and categorical data.

Citation Context

...e problem of partitioning a set of objects into groups such that objects in the same group are more similar to each other than objects in other groups with respect to some defined similarity criteria [5]. Intuitively, an optimal solution of the k-anonymization problem is indeed a set of equivalence classes such that records in the same equivalence class are very similar to each other, thus requiring ...

Recent advances in clustering: A brief survey

by S. B. Kotsiantis, P. E. Pintelas - WSEAS Trans. Inform. Sci. Appl
Abstract - Cited by 40 (0 self)
Unsupervised learning (clustering) deals with instances, which have not been pre-classified in any way and so do not have a class attribute associated with them. The scope of applying clustering algorithms is to discover useful but unknown classes of items. Unsupervised learning is an approach of learning where instances are automatically placed into meaningful groups based on their similarity. This paper introduces the fundamental concepts of unsupervised learning while it surveys the recent clustering algorithms. Moreover, recent advances in unsupervised learning, such as ensembles of clustering algorithms and distributed clustering, are described.

Citation Context

...tribute values for categorical attributes can be replaced by the mode value for that attribute across all training instances. Comparisons of various methods for dealing with missing data are found in [20]. Usually, from statistical point of view, instances with many irrelevant input attributes provide little information. Hence, in practical applications, it is wise to carefully choose which attributes...

Improving the Accuracy and Efficiency of the k-means Clustering Algorithm

by K. A. Abdul Nazeer, M. P. Sebastian
Abstract - Cited by 37 (0 self)
Emergence of modern techniques for scientific data collection has resulted in large scale accumulation of data pertaining to diverse fields. Conventional database querying methods are inadequate to extract useful information from huge data banks. Cluster analysis is one of the major data analysis methods and the k-means clustering algorithm is widely used for many practical applications. But the original k-means algorithm is computationally expensive and the quality of the resulting clusters heavily depends on the selection of initial centroids. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. This paper proposes a method for making the algorithm more effective and efficient, so as to get better clustering with reduced complexity.

Citation Context

...number of data items, number of clusters and the number of iterations. III. RELATED WORK Several attempts were made by researchers to improve the effectiveness and efficiency of the k-means algorithm [4, 5, 12]. A variant of the k-means algorithm is the k-modes [2, 5] method which replaces the means of clusters with modes. Like the k-means method, the k-modes algorithm also produces locally optimal solution...

Entropy-Based Criterion in Categorical Clustering

by Tao Li, Sheng Ma, Mitsunori Ogihara - Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Abstract - Cited by 35 (4 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.

Citation Context

...of such include the country of origin and the color of eyes in demographic data. Many algorithms have been developed for clustering categorical data, e.g., (Barbara et al., 2002; Gibson et al., 1998; Huang, 1998; Ganti et al., 1999; Guha et al., 2000; Gyllenberg et al., 1997). Entropy-type measures for similarity among objects have been used from early on. In this paper, we show that the entropy-based cluste...

Automated variable weighting in k-means type clustering

by Joshua Zhexue Huang, Michael K. Ng, Hongqiang Rong, Zichen Li - IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
Abstract - Cited by 31 (2 self)
This paper proposes a k-means type clustering algorithm that can automatically calculate variable weights. A new step is introduced to the k-means clustering process to iteratively update variable weights based on the current partition of data and a formula for weight calculation is proposed. The convergence theorem of the new clustering process is given. The variable weights produced by the algorithm measure the importance of variables in clustering and can be used in variable selection in data mining applications where large and complex real data are often involved. Experimental results on both synthetic and real data have shown that the new algorithm outperformed the standard k-means type algorithms in recovering clusters in data. Index Terms: Clustering, data mining, mining methods and algorithms, feature evaluation and selection.

Citation Context

...objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria. The k-means type clustering algorithms [1], [2] are widely used in real world applications such as marketing research [3] and data mining to cluster very large data sets due to their efficiency and ability to handle numeric and categorical va...
