Results 1  10
of
303
Survey of clustering data mining techniques
, 2002
"... Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in math ..."
Abstract

Cited by 400 (0 self)
 Add to MetaCart
(Show Context)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Subspace clustering for high dimensional data: a review
 ACM SIGKDD Explorations Newsletter
, 2004
"... Subspace clustering for high dimensional data: ..."
(Show Context)
Finding Generalized Projected Clusters in High Dimensional Spaces
"... High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projec ..."
Abstract

Cited by 195 (8 self)
 Add to MetaCart
High dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that in high dimensional data, even the concept of proximity or clustering may not be meaningful. We discuss very general techniques for projected clustering which are able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than currently available techniques which limit the method to only projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to rede ne clustering for high dimensional applications by searching for hidden subspaces with clusters which are created by interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable, and are likely to tradeoff with better accuracy.
Clustering by Pattern Similarity in Large Data Sets
 In SIGMOD
"... Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, ..."
Abstract

Cited by 181 (19 self)
 Add to MetaCart
(Show Context)
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. Ecommerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Abstract

Cited by 154 (4 self)
 Add to MetaCart
(Show Context)
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms. 1
Horting Hatches an Egg: A New GraphTheoretic Approach to Collaborative Filtering
 In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge discovery and data mining
, 1999
"... This paper introduces a new and novel approach to ratingbased collaborative filtering. The new technique is most appropriate for ecommerce merchants offering one or more groups of relatively homogeneous items such as compact disks, videos, books, software and the like. In contrast with other known ..."
Abstract

Cited by 137 (1 self)
 Add to MetaCart
This paper introduces a new and novel approach to ratingbased collaborative filtering. The new technique is most appropriate for ecommerce merchants offering one or more groups of relatively homogeneous items such as compact disks, videos, books, software and the like. In contrast with other known collaborative filtering techniques, the new algorithm is graphtheoretic, based on the twin new concepts of horting and predictability. As is demonstrated in this paper, the technique is fast, scalable, accurate, and requires only a modest learning curve. It makes use of a hierarchical classification scheme in order to introduce context into the rating process, and uses socalled creative links in order to find surprising and atypical items to recommend, perhaps even items which cross the group boundaries. The new technique is one of the key engines of the Intelligent Recommendation Algorithm (IRA) project, now being developed at IBM Research. In addition to several other recommendation engines, IRA contains a situation analyzer to determine the most appropriate mix of engines for a particular ecommerce merchant, as well as an engine for optimizing the placement of advertisements.
Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces
, 2000
"... Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current stateoftheart technique to indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then in ..."
Abstract

Cited by 120 (3 self)
 Add to MetaCart
Many emerging application domains require database systems to support efficient access over highly multidimensional datasets. The current stateoftheart technique to indexing high dimensional data is to first reduce the dimensionality of the data using Principal Component Analysis and then indexing the reduced dimensionality space using a multidimensional index structure. The above technique, referred to as global dimensionality reduction (GDR), works well when the data set is globally correlated, i.e. most of the variation in the data can be captured by a few dimensions. In practice, datasets are often not globally correlated. In such cases, reducing the data dimensionality using GDR causes significant loss of distance information resulting in a large number of false positives and hence a high query cost. Even when a global correlation does not exist, there may exist subsets of data that are locally correlated. In this paper, we propose a technique called Local Dime...
A monte carlo algorithm for fast projective clustering
 In Proceedings of the 2002 ACM SIGMOD International conference on Management of data
, 2002
"... We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good ..."
Abstract

Cited by 102 (1 self)
 Add to MetaCart
(Show Context)
We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images. 1. PROJECTIVE CLUSTERING Clustering is a widely used technique for data mining, indexing, and classification. Many practical methods proposed in the last few years, such as CLARANS [11], BIRCH [15], DBSCAN [5, 6], and
Matrix Approximation and Projective Clustering via Volume Sampling
, 2006
"... Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a lowrank approximation D that minimizes A−DF within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we genera ..."
Abstract

Cited by 92 (3 self)
 Add to MetaCart
Frieze, Kannan, and Vempala (JACM 2004) proved that a small sample of rows of a given matrix A spans the rows of a lowrank approximation D that minimizes A−DF within a small additive error, and the sampling can be done efficiently using just two passes over the matrix. In this paper, we generalize this result in two ways. First, we prove that the additive error drops exponentially by iterating the sampling in an adaptive manner (adaptive sampling). Using this result, we give a passefficient algorithm for computing a lowrank approximation with reduced additive error. Our second result is that there exist k rows of A whose span contains the rows of a multiplicative (k + 1)approximation to the best rankk matrix; moreover, this subset can be found by sampling ksubsets of rows from a natural distribution (volume sampling). Combining volume sampling with adaptive sampling yields the existence of a set of k + k(k + 1)/ε rows whose span contains the rows of a multiplicative (1 + ε)approximation. This leads to a PTAS for the following NPhard
iDistance: An Adaptive B+tree Based Indexing Method for Nearest Neighbor Search
"... In this paper, we present an efficient B+tree based indexing method, called iDistance, for Knearest neighbor (KNN) search in a highdimensional metric space. iDistance partitions the data based on a space or datapartitioning strategy, and selects a reference point for each partition. The data po ..."
Abstract

Cited by 90 (10 self)
 Add to MetaCart
In this paper, we present an efficient B+tree based indexing method, called iDistance, for Knearest neighbor (KNN) search in a highdimensional metric space. iDistance partitions the data based on a space or datapartitioning strategy, and selects a reference point for each partition. The data points in each partition are transformed into a single dimensional value based on their similarity with respect to the reference point. This allows the points to be indexed using a B +tree structure and KNN search to be performed using onedimensional range search. The choice of partition and reference point adapt the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.