Results 1-10 of 195
Automatic Subspace Clustering of High Dimensional Data
 Data Mining and Knowledge Discovery
, 2005
Cited by 724 (12 self)
Data mining applications place special requirements on clustering algorithms including: the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records. We present CLIQUE, a clustering algorithm that satisfies each of these requirements. CLIQUE identifies dense clusters in subspaces of maximum dimensionality. It generates cluster descriptions in the form of DNF expressions that are minimized for ease of comprehension. It produces identical results irrespective of the order in which input records are presented and does not presume any specific mathematical form for data distribution. Through experiments, we show that CLIQUE efficiently finds accurate clusters in large high dimensional datasets.
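The idea of finding dense regions in subspaces can be illustrated with a small grid-based sketch. This is not the authors' implementation: it assumes attribute values are already scaled to [0, 1], and the names `dense_units`, `dense_subspaces`, `xi` (intervals per dimension) and `tau` (density threshold) are hypothetical. It only shows the bottom-up step in which a subspace is examined when all of its lower-dimensional projections already contain dense cells; merging cells into clusters and producing DNF descriptions are omitted.

```python
from collections import Counter
from itertools import combinations


def dense_units(points, xi, tau, dims):
    """Return grid cells in subspace `dims` holding more than tau * n points.

    Assumes every coordinate lies in [0, 1]; each dimension is cut into
    `xi` equal-width intervals.
    """
    n = len(points)
    counts = Counter(
        tuple(min(int(p[d] * xi), xi - 1) for d in dims) for p in points
    )
    return {cell for cell, c in counts.items() if c > tau * n}


def dense_subspaces(points, xi=4, tau=0.5, max_dim=2):
    """Bottom-up search: map each subspace (tuple of dims) to its dense cells."""
    n_dims = len(points[0])
    result = {}
    for k in range(1, max_dim + 1):
        found_any = False
        for dims in combinations(range(n_dims), k):
            # Prune: every (k-1)-dimensional projection must already be dense.
            if k > 1 and not all(
                sub in result for sub in combinations(dims, k - 1)
            ):
                continue
            units = dense_units(points, xi, tau, dims)
            if units:
                result[dims] = units
                found_any = True
        if not found_any:
            break
    return result
```

In this sketch the pruning test is what keeps the search from enumerating every subspace: a dimension with no dense interval is never combined with others.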
Survey of clustering data mining techniques
, 2002
Cited by 400 (0 self)
Accrue Software, Inc. Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Subspace clustering for high dimensional data: a review
 ACM SIGKDD Explorations Newsletter
, 2004
Clustering by Pattern Similarity in Large Data Sets
 In SIGMOD
Cited by 181 (19 self)
Clustering is the process of grouping a set of objects into classes of similar objects. Although definitions of similarity vary from one clustering model to another, in most of these models the concept of similarity is based on distances, e.g., Euclidean distance or cosine distance. In other words, similar objects are required to have close values on at least a set of dimensions. In this paper, we explore a more general type of similarity. Under the pCluster model we proposed, two objects are similar if they exhibit a coherent pattern on a subset of dimensions. For instance, in DNA microarray analysis, the expression levels of two genes may rise and fall synchronously in response to a set of environmental stimuli. Although the magnitude of their expression levels may not be close, the patterns they exhibit can be very much alike. Discovery of such clusters of genes is essential in revealing significant connections in gene regulatory networks. E-commerce applications, such as collaborative filtering, can also benefit from the new model, which captures not only the closeness of values of certain leading indicators but also the closeness of (purchasing, browsing, etc.) patterns exhibited by the customers. Our paper introduces an effective algorithm to detect such clusters, and we perform tests on several real and synthetic data sets to show its effectiveness.
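A coherent pattern on a subset of dimensions can be checked with a simple pairwise test. The sketch below is an assumption about how such a test might look, not the paper's algorithm: for every pair of objects and every pair of dimensions it compares the difference of differences against a tolerance. The names `is_coherent` and `delta` are hypothetical, and the abstract does not state this formula.

```python
from itertools import combinations


def is_coherent(data, rows, cols, delta):
    """Check whether objects `rows` shift together on dimensions `cols`.

    For each pair of objects (x, y) and each pair of dimensions (a, b),
    the change from a to b must differ by at most `delta` between x and y.
    Magnitudes may be far apart; only the shape of the pattern matters.
    """
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            change_x = data[x][a] - data[x][b]
            change_y = data[y][a] - data[y][b]
            if abs(change_x - change_y) > delta:
                return False
    return True
```

Two rows such as `[1, 5, 3]` and `[11, 15, 13]` pass this test with `delta = 0` even though their values are far apart under Euclidean distance, which is the situation the abstract describes for gene expression levels.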
A Monte Carlo algorithm for fast projective clustering
 In Proceedings of the 2002 ACM SIGMOD International conference on Management of data
, 2002
Cited by 102 (1 self)
We propose a mathematical formulation for the notion of optimal projective cluster, starting from natural requirements on the density of points in subspaces. This allows us to develop a Monte Carlo algorithm for iteratively computing projective clusters. We prove that the computed clusters are good with high probability. We implemented a modified version of the algorithm, using heuristics to speed up computation. Our extensive experiments show that our method is significantly more accurate than previous approaches. In particular, we use our techniques to build a classifier for detecting rotated human faces in cluttered images. 1. PROJECTIVE CLUSTERING Clustering is a widely used technique for data mining, indexing, and classification. Many practical methods proposed in the last few years, such as CLARANS [11], BIRCH [15], DBSCAN [5, 6], and
Density-connected subspace clustering for high-dimensional data
 In: Proc. SDM
, 2004
Cited by 67 (14 self)
Several application domains such as molecular biology and geography produce a tremendous amount of data which can no longer be managed without the help of efficient and effective data mining methods. One of the primary data mining tasks is clustering. However, traditional clustering algorithms often fail to detect meaningful clusters because most real-world data sets are characterized by a high dimensional, inherently sparse data space. Nevertheless, the data sets often contain interesting clusters which are hidden in various subspaces of the original feature space. Therefore, the concept of subspace clustering has recently been addressed, which aims at automatically identifying subspaces of the feature space in which clusters exist. In this paper, we introduce SUBCLU (density-connected Subspace Clustering), an effective and efficient approach to the subspace clustering problem. Using the concept of density-connectivity underlying the algorithm DBSCAN [EKSX96], SUBCLU is based on a formal clustering notion. In contrast to existing grid-based approaches, SUBCLU is able to detect arbitrarily shaped and positioned clusters in subspaces. The monotonicity of density-connectivity is used to efficiently prune subspaces in the process of generating all clusters in a bottom-up way. While not examining any unnecessary subspaces, SUBCLU delivers for each subspace the same clusters DBSCAN would have found, when applied to this subspace separately.
Clustering binary data streams with K-means
 In Proc. ACM SIGMOD Data Mining and Knowledge Discovery Workshop
, 2003
Cited by 63 (9 self)
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include Online K-means, Scalable K-means, and Incremental K-means, a newly proposed variant that finds higher quality solutions in less time. Higher-quality solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained online. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
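To make the role of sufficient statistics concrete, here is a minimal sketch of a generic one-pass (online) K-means over binary vectors. It is not any of the three variants from the paper: it assumes that each cluster keeps only a count and a per-dimension sum, that each seed counts as one pseudo-point, and that squared Euclidean distance is used. The function name `online_kmeans` and the seeding scheme are hypothetical.

```python
def online_kmeans(stream, seeds):
    """Assign each binary point to its nearest centroid in a single pass.

    Each cluster is summarized by two sufficient statistics: the number of
    points seen (`counts`) and the per-dimension sum of those points
    (`sums`). The centroid is sums / counts, so no points are stored.
    Assumption: every seed is treated as one pseudo-point.
    """
    sums = [list(map(float, s)) for s in seeds]
    counts = [1] * len(seeds)
    labels = []
    for point in stream:
        best, best_dist = 0, float("inf")
        for j, (s, n) in enumerate(zip(sums, counts)):
            # Squared Euclidean distance to the current centroid s / n.
            dist = sum((x - v / n) ** 2 for x, v in zip(point, s))
            if dist < best_dist:
                best, best_dist = j, dist
        # Update the winning cluster's statistics incrementally.
        counts[best] += 1
        sums[best] = [v + x for v, x in zip(sums[best], point)]
        labels.append(best)
    centroids = [[v / n for v in s] for s, n in zip(sums, counts)]
    return labels, centroids, counts
```

For binary data each centroid coordinate is simply the fraction of assigned points with that bit set, which is why a count and a sum per cluster are enough to maintain a summary online.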
Clustering Through Decision Tree Construction
 In SIGMOD '00
, 2000
Cited by 62 (0 self)
In this paper, we propose a novel clustering technique, which is based on a supervised learning technique called decision tree construction. The new technique is able to overcome many of these shortcomings. The key idea is to use a decision tree to partition the data space into cluster and empty (sparse) regions at different levels of detail. The technique is able to find "natural" clusters in large high dimensional spaces efficiently. It is suitable for clustering in the full dimensional space as well as in subspaces. It also provides comprehensible descriptions of clusters. Experimental results on both synthetic data and real-life data show that the technique is effective and also scales well for large high dimensional datasets.
OPCluster: Clustering by Tendency in High Dimensional Space
, 2003
Cited by 60 (5 self)
Clustering is the process of grouping a set of objects into classes of similar objects. Because the hidden patterns in the data sets are unknown, the definition of similarity is very subtle. Until recently, similarity measures have typically been based on distances, e.g., Euclidean distance and cosine distance. In this paper, we propose a flexible yet powerful clustering model, namely OPCluster (Order Preserving Cluster). Under this new model, two objects are similar on a subset of dimensions if the values of these two objects induce the same relative order of those dimensions. Such a cluster might arise when the expression levels of (co-regulated) genes rise or fall synchronously in response to a sequence of environment stimuli. Hence, discovery of OPClusters is essential in revealing significant gene regulatory networks. A deterministic algorithm is designed and implemented to discover all the significant OPClusters. A set of extensive experiments has been done on several real biological data sets to demonstrate its effectiveness and efficiency in detecting co-regulated patterns.
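The order-preserving notion of similarity can be sketched in a few lines. This is only an illustration of the definition quoted above, not the paper's mining algorithm: it assumes strict ordering with ties broken by dimension index, and the names `induced_order` and `order_preserving` are hypothetical.

```python
def induced_order(values, dims):
    """Return `dims` sorted by the object's value on each dimension.

    Assumption: ties are broken by dimension index (Python's sort is
    stable and `dims` is first put in ascending order).
    """
    return tuple(sorted(sorted(dims), key=lambda d: values[d]))


def order_preserving(obj_a, obj_b, dims):
    """True if both objects induce the same relative order on `dims`."""
    return induced_order(obj_a, dims) == induced_order(obj_b, dims)
```

With this test, `[1, 5, 3]` and `[10, 40, 20]` are similar on dimensions 0, 1 and 2 because both rank the dimensions as 0, 2, 1, regardless of how far apart the raw values are.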
A Multimodal Learning Interface for Grounding Spoken Language in Sensory Perceptions
 ACM TRANSACTIONS ON APPLIED PERCEPTION
, 2004
Cited by 57 (5 self)
Most speech interfaces are based on natural language processing techniques that use predefined symbolic representations of word meanings and process only linguistic information. To understand and use language like their human counterparts in multimodal human-computer interaction, computers need to acquire spoken language and map it to other sensory perceptions. This paper presents a multimodal interface that learns to associate spoken language with perceptual features by being situated in users' everyday environments and sharing user-centric multisensory information. The learning interface is trained in unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals in concert with multisensory information from non-speech modalities, such as the user's perspective video, gaze positions, head directions and hand movements. The system first estimates users' focus of attention from eye and head cues. Attention, as represented by gaze fixation, is used for spotting the target object of user interest. Attention switches are calculated and used to segment an action sequence into action units, which are then categorized by mixture hidden Markov models. A multimodal learning algorithm is developed to spot words from continuous speech and then associate them with perceptually grounded meanings extracted from visual perception and action. Successful learning has been demonstrated in experiments on three natural tasks: "unscrewing a jar", "stapling a letter" and "pouring water".