Results 1–10 of 90
Subspace clustering for high dimensional data: a review
ACM SIGKDD Explorations Newsletter, 2004
A divide-and-merge methodology for clustering
ACM Transactions on Database Systems, 2005
Cited by 71 (9 self)
We present a divide-and-merge methodology for clustering a set of objects that combines a top-down “divide” phase with a bottom-up “merge” phase. In contrast, previous algorithms use either top-down or bottom-up methods for constructing a hierarchical clustering or produce a flat clustering using local search (e.g. k-means). Our divide phase produces a tree whose leaves are the elements of the set. For this phase, we suggest an efficient spectral algorithm. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, e.g., k-means, min-diameter, min-sum, correlation clustering, etc. We present a meta-search engine that clusters results from web searches. We also give empirical results on text-based data where the algorithm performs better than or competitively with existing clustering algorithms.
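The merge phase described in this abstract can be illustrated as a dynamic program over the divide-phase tree. The sketch below is a hypothetical rendering for the k-means objective only, not the authors' implementation; the tree encoding and all function names are invented for the example.

```python
# Hypothetical sketch of a "merge" phase: given a binary hierarchy over
# points, find the cheapest k-way partition that respects the tree under
# the k-means objective, via bottom-up dynamic programming.

def leaves(tree):
    """Collect the points stored at the leaves of the tree."""
    if tree[0] == 'leaf':
        return [tree[1]]
    return leaves(tree[1]) + leaves(tree[2])

def kmeans_cost(points):
    """Sum of squared distances of the points to their centroid."""
    d = len(points[0])
    centroid = [sum(p[i] for p in points) / len(points) for i in range(d)]
    return sum(sum((p[i] - centroid[i]) ** 2 for i in range(d)) for p in points)

def merge_phase(tree, k):
    """Return a dict j -> (cost, clusters): the best tree-respecting
    partition of the leaves into j parts, for j = 1..k."""
    if tree[0] == 'leaf':
        return {1: (0.0, [[tree[1]]])}
    _, lt, rt = tree
    L, R = merge_phase(lt, k), merge_phase(rt, k)
    pts = leaves(tree)                      # recomputed for clarity, not speed
    best = {1: (kmeans_cost(pts), [pts])}   # j = 1: the node is one cluster
    for j in range(2, k + 1):               # j >= 2: split across the children
        for jl in range(1, j):
            jr = j - jl
            if jl in L and jr in R:
                cost = L[jl][0] + R[jr][0]
                if j not in best or cost < best[j][0]:
                    best[j] = (cost, L[jl][1] + R[jr][1])
    return best

# Demo: two tight pairs of 2-D points under a balanced hierarchy.
tree = ('node',
        ('node', ('leaf', (0.0, 0.0)), ('leaf', (0.1, 0.0))),
        ('node', ('leaf', (5.0, 5.0)), ('leaf', (5.1, 5.0))))
cost, clusters = merge_phase(tree, 2)[2]    # best 2-way partition
```

The paper's merge phase supports several objectives (min-diameter, min-sum, correlation clustering); only k-means is shown here.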
Entropy-Based Criterion in Categorical Clustering
Proc. of Intl. Conf. on Machine Learning (ICML), 2004
Cited by 33 (4 self)
Entropy-type measures for the heterogeneity of clusters have been used for a long time. This paper studies the entropy-based criterion in clustering categorical data. It first shows that the entropy-based criterion can be derived in the formal framework of probabilistic clustering models and establishes the connection between the criterion and the approach based on dissimilarity coefficients.
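As a concrete illustration of an entropy-based criterion of this kind, the expected entropy of a candidate clustering can be computed as the size-weighted sum of per-attribute empirical entropies within each cluster. This is a generic sketch with invented names, not code from the paper:

```python
import math
from collections import Counter

def cluster_entropy(cluster):
    """Sum, over attributes, of the empirical entropy (in bits) of that
    attribute's values within the cluster. `cluster` is a list of
    equal-length tuples of categorical values."""
    n, m = len(cluster), len(cluster[0])
    total = 0.0
    for a in range(m):
        counts = Counter(row[a] for row in cluster)
        total -= sum((c / n) * math.log2(c / n) for c in counts.values())
    return total

def expected_entropy(clusters):
    """Entropy-based clustering criterion: size-weighted average of the
    cluster entropies. Lower is better; pure clusters contribute zero."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Demo: perfectly pure clusters score 0; a maximally mixed one scores 1 bit.
pure = [[('a', 'x'), ('a', 'x')], [('b', 'y'), ('b', 'y')]]
mixed = [[('a',), ('b',)]]
```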
Scalable clustering of categorical data
In EDBT, 2004
Cited by 28 (4 self)
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples, and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory-bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
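The IB-style distance that agglomerative algorithms like LIMBO build on can be illustrated by the information loss incurred when two clusters are merged, which works out to a weighted Jensen–Shannon divergence between their conditional value distributions. The sketch below is a simplified, hypothetical rendering; the function names and exact normalization are assumptions, not the paper's code:

```python
import math

def kl(p, q):
    """Kullback–Leibler divergence (in bits) between two distributions
    given as aligned lists of probabilities."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q, w1, w2):
    """Jensen–Shannon divergence of p and q with mixture weights w1, w2."""
    m = [w1 * pi + w2 * qi for pi, qi in zip(p, q)]
    return w1 * kl(p, m) + w2 * kl(q, m)

def info_loss(n1, p1, n2, p2, n):
    """Information (in bits) lost by merging two clusters of sizes n1, n2
    (out of n tuples total) with conditional value distributions p1, p2.
    An IB-style agglomeration would pick the merge minimizing this."""
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)
    return (n1 + n2) / n * js(p1, p2, w1, w2)

# Demo: merging identical clusters loses nothing; merging two disjoint
# single-value clusters of equal size loses a full bit.
same = info_loss(1, [0.5, 0.5], 1, [0.5, 0.5], 2)
disjoint = info_loss(1, [1.0, 0.0], 1, [0.0, 1.0], 2)
```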
Game Theoretic Approach to Threat Prediction and Situation Awareness
Cited by 20 (10 self)
The strategy of data fusion has been applied to threat prediction and situation awareness, and the terminology has been standardized by the Joint Directors of Laboratories (JDL) in the form of the so-called JDL Data Fusion Model, currently called the DFIG model. Higher levels of the DFIG model call for prediction of future developments and awareness of the development of a situation. It is known that a Bayesian Network is an insightful approach to determining optimal strategies against an asymmetric adversarial opponent. However, it lacks the essential adversarial decision-process perspective. In this paper, a highly innovative data-fusion framework for asymmetric-threat detection and prediction based on an advanced knowledge infrastructure and stochastic (Markov) game theory is proposed. In particular, asymmetric and adaptive threats are detected and grouped by an intelligent agent and Hierarchical Entity Aggregation in Level 2, and their intents are predicted by a decentralized Markov (stochastic) game model with deception in Level 3. We have verified that our proposed algorithms are scalable, stable, and perform satisfactorily according to the situation awareness performance metric.
LIMBO: Scalable clustering of categorical data
In 9th Int’l Conf. on Extending Database Technology, 2004
Cited by 19 (5 self)
Clustering is a problem of great practical importance in numerous applications. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. We introduce LIMBO, a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. As a hierarchical algorithm, LIMBO has the advantage that it can produce clusterings of different sizes in a single execution. We use the IB framework to define a distance measure for categorical tuples, and we also present a novel distance measure for categorical attribute values. We show how the LIMBO algorithm can be used to cluster both tuples and values. LIMBO handles large data sets by producing a memory-bounded summary model for the data. We present an experimental evaluation of LIMBO, and we study how clustering quality compares to other categorical clustering algorithms. LIMBO supports a trade-off between efficiency (in terms of space and time) and quality. We quantify this trade-off and demonstrate that LIMBO allows for substantial improvements in efficiency with negligible decrease in quality.
CLICKS: An Effective Algorithm for Mining Subspace Clusters in Categorical Datasets
2005
Cited by 17 (4 self)
We present a novel algorithm called Clicks that finds clusters in categorical datasets based on a search for k-partite maximal cliques. Unlike previous methods, Clicks mines subspace clusters. It uses a selective vertical method to guarantee a complete search. Clicks outperforms previous approaches by over an order of magnitude and scales better than any of the existing methods for high-dimensional datasets. These results are demonstrated in a comprehensive performance study on real and synthetic datasets.
CLICK: Clustering Categorical Data Using K-partite Maximal Cliques
2004
Cited by 14 (0 self)
Clustering is one of the central data mining problems and numerous approaches have been proposed in this field. However, few of these methods focus on categorical data. The categorical techniques that do exist have significant shortcomings in terms of performance, the clusters they detect, and their ability to locate clusters in subspaces.
A unified view on clustering binary data
 Machine Learning
Cited by 12 (3 self)
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. Binary data occupy a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
On a Recursive Spectral Algorithm for Clustering from Pairwise Similarities
2003
Cited by 11 (1 self)
We present a practical implementation of the clustering algorithm described in [20]. The clustering algorithm is given either an implicit or an explicit representation of the pairwise similarities between n objects and produces a complete hierarchical clustering of the n objects. The implementation runs in O(M log n) time per cluster, where M is the number of nonzero entries in the “document-term” matrix, a common implicit representation of similarities between data objects. We perform a thorough experimental evaluation of the algorithm in practice. The results show that the algorithm is better than or competitive with existing clustering algorithms (e.g., k-means [21], ROCK [18], pQR [37]).
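The recursive spectral scheme this abstract describes can be sketched in a few lines: repeatedly bipartition via the second eigenvector of the degree-normalized similarity matrix, then recurse on each side. The median-split heuristic and all names below are assumptions for illustration, not the paper's exact algorithm:

```python
import numpy as np

def spectral_split(S):
    """Bipartition indices 0..n-1 of similarity matrix S by thresholding
    the second eigenvector of the normalized matrix at its median."""
    d = np.maximum(S.sum(axis=1), 1e-12)        # degrees (guarded against 0)
    Dinv = np.diag(1.0 / np.sqrt(d))
    M = Dinv @ S @ Dinv                         # normalized similarity
    _, vecs = np.linalg.eigh(M)                 # eigenvalues in ascending order
    v = Dinv @ vecs[:, -2]                      # second-largest eigenvector
    med = np.median(v)
    left = np.where(v >= med)[0]
    right = np.where(v < med)[0]
    if len(right) == 0:                         # guard: ties at the median
        left, right = left[:-1], left[-1:]
    return left, right

def recursive_cluster(S, idx=None, min_size=2):
    """Complete hierarchical clustering as a nested list of leaf indices."""
    if idx is None:
        idx = np.arange(S.shape[0])
    if len(idx) <= min_size:
        return list(idx)
    l, r = spectral_split(S[np.ix_(idx, idx)])
    return [recursive_cluster(S, idx[l], min_size),
            recursive_cluster(S, idx[r], min_size)]

# Demo: a similarity matrix with two obvious blocks splits cleanly.
S = np.array([[1.00, 1.00, 0.01, 0.01],
              [1.00, 1.00, 0.01, 0.01],
              [0.01, 0.01, 1.00, 1.00],
              [0.01, 0.01, 1.00, 1.00]])
left, right = spectral_split(S)
```

With a sparse similarity matrix and an iterative eigensolver, each split costs time proportional to the nonzeros, in the spirit of the O(M log n)-per-cluster bound cited above.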