Survey of clustering algorithms
IEEE Transactions on Neural Networks, 2005
Cited by 499 (4 self)
Abstract: Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. This diversity, on the one hand, equips us with many tools; on the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive effort. Several closely related topics, proximity measures and cluster validation, are also discussed.
Approximate kernel k-means: Solution to large-scale kernel clustering
In Proceedings of the International Conference on Knowledge Discovery and Data Mining
Cited by 27 (6 self)
Abstract: The digital data explosion mandates the development of scalable tools to organize data in a meaningful and easily accessible form. Clustering is a commonly used tool for data organization. However, many clustering algorithms designed to handle large data sets assume linear separability of the data and hence do not perform well on real-world data sets. While kernel-based clustering algorithms can capture the nonlinear structure in data, they do not scale well in terms of speed and memory requirements when the number of objects to be clustered exceeds tens of thousands. We propose an approximation scheme for kernel k-means, termed approximate kernel k-means, that reduces both the computational complexity and the memory requirements by employing a randomized approach. We show both analytically and empirically that the performance of approximate kernel k-means is similar to that of the kernel k-means algorithm, but with dramatically reduced runtime complexity and memory requirements.
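The randomized approach described above restricts the cluster centers to the span of a small random sample of points, so only an n-by-m block of the kernel matrix is ever computed. A minimal sketch of that idea, using a Nystrom-style embedding built from the sampled kernel block (the RBF kernel, the sample size m, and the farthest-point initialization here are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def approx_kernel_kmeans(X, k, m=50, n_iter=20, gamma=1.0, seed=0):
    """Sketch: embed points via an m-sample Nystrom approximation of the
    kernel matrix, then run ordinary Lloyd's k-means on the embedding."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(m, n), replace=False)
    K_nm = rbf_kernel(X, X[idx], gamma)            # n x m kernel block
    K_mm = K_nm[idx]                               # m x m sampled block
    U, s, _ = np.linalg.svd(K_mm)
    r = max(int((s > 1e-8 * s[0]).sum()), 1)       # drop numerically null dims
    feat = K_nm @ U[:, :r] / np.sqrt(s[:r])        # Nystrom feature map
    # Farthest-point initialization, then plain Lloyd iterations.
    centers = [feat[rng.integers(n)]]
    for _ in range(k - 1):
        d2 = ((feat[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(feat[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = ((feat[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = feat[labels == j].mean(0)
    return labels
```

Because only the n-by-m block is formed, memory grows linearly in the number of objects rather than quadratically.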
Gaussian Processes for Active Data Mining of Spatial Aggregates
In Proceedings of the SIAM International Conference on Data Mining, 2005
Cited by 19 (0 self)
Abstract: We present an active data mining mechanism for qualitative analysis of spatial datasets, integrating identification and analysis of structures in spatial data with targeted collection of additional samples. The mechanism is designed around the spatial aggregation language (SAL) for qualitative spatial reasoning, and seeks to uncover high-level spatial structures from only a sparse set of samples. This approach is important for applications in domains such as aircraft design, wireless system simulation, fluid dynamics, and sensor networks. The mechanism employs Gaussian processes, a formal mathematical model for reasoning about spatial data, to build surrogate models from sparse data, reason about the uncertainty of estimation at unsampled points, and formulate objective criteria for closing the loop between data collection and data analysis. It optimizes sample selection using entropy-based functionals defined over spatial aggregates, instead of the traditional approach of sampling to minimize estimated variance. We apply this mechanism to a global optimization benchmark comprising a test bank of 2D functions, as well as to data from wireless system simulations. The results reveal that the proposed sampling strategy makes more judicious use of data points by selecting locations that clarify high-level structures in the data, rather than choosing points that merely improve the quality of function approximation.
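The paper contrasts its entropy-based functionals over spatial aggregates with the traditional strategy of sampling where the estimated variance is highest. That baseline closing-the-loop behavior can be sketched in one dimension (the RBF kernel, length scale, and toy target function below are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def gp_posterior(X_train, y_train, X_test, ell=0.3, noise=1e-6):
    """Gaussian-process regression with an RBF kernel; returns the
    posterior mean and variance at the test points (minimal sketch)."""
    def k(A, B):
        return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2 / ell**2)
    K = k(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = k(X_test, X_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 0.0)

# Active sampling loop: repeatedly query the pool point of highest
# posterior variance (the variance-minimizing baseline, not the
# paper's aggregate-entropy criterion).
rng = np.random.default_rng(0)
f = lambda x: np.sin(6 * x)            # unknown target (toy assumption)
X_pool = np.linspace(0.0, 1.0, 200)
X_tr = rng.uniform(0.0, 1.0, 3)
y_tr = f(X_tr)
for _ in range(5):
    _, var = gp_posterior(X_tr, y_tr, X_pool)
    x_next = X_pool[var.argmax()]      # most uncertain location
    X_tr = np.append(X_tr, x_next)
    y_tr = np.append(y_tr, f(x_next))
```

The paper's mechanism replaces the `var.argmax()` criterion with entropy-based functionals computed over SAL spatial aggregates, so samples go where they clarify high-level structure.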
Using Trees to Depict a Forest
PVLDB
Cited by 15 (0 self)
Abstract: When a database query has a large number of results, the user can only be shown one page of results at a time. One popular approach is to rank results so that the "best" results appear first. However, standard database query results comprise a set of tuples with no associated ranking. It is typical to allow users to sort results on selected attributes, but no actual ranking is defined. An alternative approach for the first page is not to try to show the best results, but instead to help users learn what is available in the whole result set and direct them to what they need. In this paper, we demonstrate through a user study that a page comprising one representative from each of k clusters (generated through a k-medoid clustering) is superior to multiple alternative candidate methods for generating representatives of a data set. Users often refine query specifications based on returned results, and traditional clustering may lead to completely new representatives after a refinement step. Furthermore, clustering can be computationally expensive. We propose a tree-based method for efficiently generating the representatives and smoothly adapting them under query refinement. Experiments show that our algorithms outperform the state of the art in both result quality and efficiency.
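The first page described above contains one representative per cluster of a k-medoid clustering, so every displayed item is an actual tuple of the result set. A minimal alternating k-medoids pass over a precomputed distance matrix (a generic sketch, not the paper's tree-based method) looks like:

```python
import numpy as np

def k_medoids(D, k, n_iter=50, seed=0):
    """Alternate between assigning points to the nearest medoid and
    re-electing each cluster's medoid: the member with minimum total
    distance to the rest of its cluster. D is an n x n distance matrix."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(1)
        new = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size:
                inner = D[np.ix_(members, members)]
                new[j] = members[inner.sum(0).argmin()]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids
```

The returned indices point at real tuples, which is what makes medoids (rather than k-means centroids, which are synthetic averages) suitable as a displayable first page.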
On Effective Presentation of Graph Patterns: A Structural Representative Approach
In Proc. 2008 ACM Conf. on Information and Knowledge Management (CIKM '08), 2008
Cited by 11 (1 self)
Abstract: In the past, quite a few fast algorithms have been developed to mine frequent patterns over graph data, covering a large spectrum of variants of the problem. However, the real bottleneck for knowledge discovery on graphs is neither efficiency nor scalability, but the usability of the patterns that are mined. Currently, state-of-the-art techniques produce a lengthy list of exact patterns, which is undesirable in two respects: (1) on the micro side, due to inherent noise and data diversity, exact patterns are often not very useful in real applications; and (2) on the macro side, the rigid structural requirement often generates an excessive number of patterns that differ only slightly from each other, which easily overwhelms users. In this paper, we study the presentation problem for graph patterns, where structural representatives are the key mechanism that makes the whole strategy effective. To fill the usability gap, we adopt a two-step smoothing-clustering framework: the first step adds error tolerance to individual patterns (the micro side), and the second step reduces output cardinality by collapsing multiple structurally similar patterns into one representative (the macro side). This integrative approach, not attempted in previous studies, essentially raises our attention to a more appropriate level that no longer examines every minute detail. The framework is general; it applies under various settings and admits many extensions. Empirical studies indicate that a compact group of informative delegates can be achieved on real datasets, and that the proposed algorithms are both efficient and scalable.
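The macro-side step above collapses structurally similar patterns into one representative. As a toy illustration, under an assumed pairwise structural distance (here, the size of the symmetric difference of edge sets, standing in for a real graph distance), a greedy covering pass can absorb near-duplicates:

```python
def select_representatives(patterns, distance, delta):
    """Greedy sketch: scan patterns (e.g. by descending support) and keep
    one representative per group, absorbing any pattern that lies within
    structural distance delta of an already chosen representative."""
    reps = []
    for p in patterns:
        if all(distance(p, r) > delta for r in reps):
            reps.append(p)
    return reps

# Toy "graph patterns" as frozensets of edges; distance = edge-set
# symmetric difference (an assumed stand-in for structural distance).
pats = [frozenset({(1, 2), (2, 3)}),
        frozenset({(1, 2), (2, 3), (3, 4)}),   # one extra edge: absorbed
        frozenset({(5, 6)})]                    # structurally different: kept
d = lambda a, b: len(a ^ b)
reps = select_representatives(pats, d, delta=1)
```

The paper's actual framework additionally smooths each pattern for error tolerance before clustering; this sketch shows only the cardinality-reduction step.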
Antipole Tree indexing to support range search and k-nearest neighbor search in metric spaces
IEEE TKDE, 2005
Cited by 11 (0 self)
Abstract: Range and k-nearest neighbor searching are core problems in pattern recognition. Given a database S of objects in a metric space M and a query object q in M, in a range searching problem the target is to find the objects of S within some threshold distance of q, whereas in a k-nearest neighbor searching problem, the k elements of S closest to q must be produced. These problems can obviously be solved with a linear number of distance calculations, by comparing the query object against every object in the database; the goal, however, is to solve them much faster. We combine and extend ideas from the M-Tree, the Multi-Vantage-Point structure, and the FQ-Tree to create a new structure in the "bisector tree" class, called the Antipole Tree. Bisection is based on proximity to an "Antipole" pair of elements generated by a suitable linear randomized tournament. The final winners a and b of such a tournament are far enough apart to approximate the diameter of the splitting set. If dist(a, b) is larger than the chosen cluster diameter threshold, the cluster is split. The proposed data structure is an indexing scheme suitable for exact and approximate best-match searching on generic metric spaces. The Antipole Tree compares very well with existing structures such as List of Clusters, M-Trees, and others, and in many cases achieves better results.
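The linear randomized tournament behind the Antipole pair can be sketched as follows: small random groups play off, each group keeping only its farthest-apart pair, until a single pair survives. The group arity and the toy metric below are illustrative assumptions, not the paper's exact parameters.

```python
import itertools
import random

def tournament_antipole(points, dist, arity=3, seed=0):
    """Linear randomized tournament: split into groups of `arity`, keep
    each group's farthest pair, repeat until two elements remain. The
    surviving pair approximates the diameter of the set."""
    rng = random.Random(seed)
    pts = list(points)
    rng.shuffle(pts)
    while len(pts) > 2:
        winners = []
        for i in range(0, len(pts), arity):
            group = pts[i:i + arity]
            if len(group) < 2:
                winners.extend(group)  # leftover singleton advances unchallenged
            else:
                winners.extend(max(itertools.combinations(group, 2),
                                   key=lambda p: dist(p[0], p[1])))
        pts = winners
    return pts[0], pts[1]
```

In the Antipole Tree itself, a cluster is split whenever the distance between the surviving pair exceeds the chosen cluster diameter threshold, and each comparison costs only one distance evaluation, keeping the tournament linear overall.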
Efficient kernel clustering using random Fourier features
In Proceedings of ICDM '12, 2012
Cited by 9 (1 self)
Abstract: Kernel clustering algorithms can capture the nonlinear structure inherent in many real-world data sets and thereby achieve better clustering performance than Euclidean-distance-based clustering algorithms. However, their quadratic computational complexity renders them non-scalable to large data sets. In this paper, we employ random Fourier maps, originally proposed for large-scale classification, to accelerate kernel clustering. The key idea behind the use of random Fourier maps for clustering is to project the data into a low-dimensional space where the inner product of the transformed data points approximates the kernel similarity between them. An efficient linear clustering algorithm can then be applied to the points in the transformed space. We also propose an improved scheme which uses the top singular vectors of the transformed data matrix to perform clustering, and yields a better approximation of kernel clustering under appropriate conditions. Our empirical studies demonstrate that the proposed schemes can be efficiently applied to large data sets containing millions of data points, while achieving accuracy similar to that of state-of-the-art kernel clustering algorithms.
Keywords: kernel clustering, kernel k-means, random Fourier features, scalability
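The random Fourier map at the heart of this approach follows the Rahimi-Recht construction: frequencies drawn from the kernel's spectral density yield features whose inner products approximate the kernel. A minimal sketch for the RBF kernel (the kernel parameterization below is an assumption about the kernel in use):

```python
import numpy as np

def rff_features(X, n_features=200, gamma=1.0, seed=0):
    """Random Fourier feature map z such that z(x) . z(y) approximates
    the RBF kernel exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # For this RBF parameterization the spectral density is N(0, 2*gamma*I).
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)
```

Ordinary k-means on the rows of the transformed matrix then approximates kernel k-means; the paper's improved scheme instead clusters the top singular vectors of that matrix.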