Results 1–10 of 31
A unified view on clustering binary data
Machine Learning
Cited by 12 (3 self)
Abstract:
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data, which has long occupied a special place in the domain of data analysis. A unified view of binary data clustering is presented by examining the connections among various clustering criteria. Experimental studies are conducted to empirically verify the relationships.
Detecting the change of clustering structure in categorical data streams
SIAM Data Mining Conference, 2006
Cited by 7 (3 self)
Abstract:
Analyzing clustering structures in data streams can provide critical information for making decisions in real time. Most research has focused on clustering algorithms for data streams. We argue that, more importantly, we need to monitor the change of clustering structure online. In this paper, we present a framework for detecting the change of critical clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework extends the work on determining the best K for static datasets (the BkPlot method) to categorical data streams with the help of a Hierarchical Entropy Tree structure (HETree). The HETree can efficiently capture the entropy property of categorical data streams and allows us to draw precise clustering information from the data stream for high-quality BkPlots. The experiments show that with the combination of the HETree and the BkPlot method we are able to efficiently and precisely detect the change of critical clustering structure in categorical data streams.
Pseudo-Bound Optimization for Binary Energies
Cited by 5 (2 self)
Abstract:
High-order and non-submodular pairwise energies are important for image segmentation, surface matching, deconvolution, tracking, and other computer vision problems. Minimization of such energies is generally NP-hard. One standard approximation approach is to optimize an auxiliary function, an upper bound of the original energy across the entire solution space. This bound must be amenable to fast global solvers. Ideally, it should also closely approximate the original functional, but it is very difficult to find such upper bounds in practice. Our main idea is to relax the upper-bound condition for an auxiliary function and to replace it with a family of pseudo-bounds, which can better approximate the original energy. We use a fast polynomial parametric max-flow approach to explore all global minima for our family of submodular pseudo-bounds. The best solution is guaranteed to decrease the original energy because the family includes at least one auxiliary function. Our Pseudo-Bound Cuts algorithm improves the state of the art in many applications: appearance entropy minimization, target distribution matching, curvature regularization, image deconvolution, and interactive segmentation.
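The auxiliary-function principle this abstract relaxes can be illustrated with a toy one-dimensional example (a minimal sketch, not the paper's graph-cut machinery; the energy `f`, the bound constant `L`, and the quadratic bound are all illustrative assumptions): repeatedly minimizing an upper bound that touches the current solution guarantees the original energy never increases.

```python
import math

def f(x):
    # toy non-convex energy to minimize
    return math.cos(x) + x * x / 10.0

def df(x):
    return -math.sin(x) + x / 5.0

def bound_optimize(x0, L=1.2, iters=50):
    """Auxiliary-function (bound) optimization: at each step minimize the
    quadratic upper bound g(x) = f(xt) + df(xt)*(x - xt) + (L/2)*(x - xt)**2,
    which touches f at xt and dominates f because |f''(x)| <= 1.2 everywhere.
    Its minimizer is x = xt - df(xt)/L, and since
    f(x_new) <= g(x_new) <= g(xt) = f(xt), the energy can never increase."""
    x = x0
    trace = [f(x)]
    for _ in range(iters):
        x = x - df(x) / L
        trace.append(f(x))
    return x, trace

x_star, trace = bound_optimize(3.0)
# the descent guarantee: each bound minimization lowers (or keeps) the energy
assert all(b <= a + 1e-12 for a, b in zip(trace, trace[1:]))
```

The paper's pseudo-bounds drop the "dominates f everywhere" requirement for most members of the bound family, which is why an extra guarantee (at least one true auxiliary function in the family) is needed to preserve this monotone-descent property.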
The "best K" for entropy-based categorical data clustering
International Conference on Scientific and Statistical Database Management, 2005
Cited by 5 (0 self)
Abstract:
With the growing demand for cluster analysis of categorical data, a handful of categorical clustering algorithms have been developed. Surprisingly, to our knowledge, none has satisfactorily addressed an important problem for categorical clustering: how can we determine the best number of clusters K for a categorical dataset? Since categorical data do not have an inherent distance function as the similarity measure, the traditional cluster validation techniques based on geometric shape and density distribution cannot be applied to answer this question. In this paper, we investigate the entropy property of categorical data and propose the BkPlot method for determining a set of candidate "best Ks". This method is implemented with a hierarchical clustering algorithm, HierEntro. The experimental results show that our approach can effectively identify the significant clustering structures.
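The expected-entropy criterion underlying this line of work can be sketched as follows (a minimal illustration, not the HierEntro algorithm or the BkPlot construction; the toy records and attribute layout are invented for the example): a good categorical clustering is one whose clusters have low within-cluster attribute entropy.

```python
from collections import Counter
from math import log2

def expected_entropy(clusters):
    """Expected entropy of a partition of categorical records: for each
    cluster, sum the Shannon entropies of its attributes, then weight
    each cluster by its relative size. Lower values mean purer clusters."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        h = 0.0
        for j in range(len(c[0])):  # one entropy term per attribute column
            counts = Counter(rec[j] for rec in c)
            h -= sum((k / len(c)) * log2(k / len(c)) for k in counts.values())
        total += (len(c) / n) * h
    return total

# two nearly pure clusters score lower than one mixed cluster
a = [("x", "p"), ("x", "p"), ("x", "q")]
b = [("y", "q"), ("y", "q"), ("y", "p")]
assert expected_entropy([a, b]) < expected_entropy([a + b])
```

The BkPlot method then examines how this quantity changes as K varies, looking for the values of K where the incremental change is distinctive.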
On clustering binary data
Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), 2005
Cited by 4 (2 self)
Abstract:
Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. This paper studies the problem of clustering binary data. This is the case for market-basket datasets, where the transactions contain items, and for document datasets, where the documents contain "bags of words". The contribution of the paper is twofold. First, a new clustering model is presented. The model treats the data and features equally, based on their symmetric association relations, and explicitly describes the data assignments as well as the feature assignments. An iterative alternating least-squares procedure is used for optimization. Second, a unified view of binary data clustering is presented by examining the connections among various clustering criteria.
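The alternating assign/update pattern mentioned above can be sketched with a generic procedure on binary rows (an illustrative stand-in, not the paper's symmetric data/feature model; `binary_kmeans` and the toy matrix are assumptions for the example):

```python
import random

def binary_kmeans(X, k, iters=20, seed=0):
    """Generic alternating minimization on binary data: assign each row
    to the nearest centroid under squared error, then recompute each
    centroid as the mean of its cluster. Each half-step cannot increase
    the total squared error, so the procedure converges."""
    rng = random.Random(seed)
    cent = [list(X[i]) for i in rng.sample(range(len(X)), k)]
    assign = [0] * len(X)
    for _ in range(iters):
        for i, row in enumerate(X):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(row, cent[c])))
        for c in range(k):
            members = [X[i] for i in range(len(X)) if assign[i] == c]
            if members:  # keep the old centroid if a cluster empties out
                cent[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

X = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
labels = binary_kmeans(X, 2)
assert labels[0] == labels[1] and labels[2] == labels[3] and labels[0] != labels[2]
```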
SCALE: A Scalable Framework for Efficiently Clustering Transactional Data
2009
Cited by 4 (0 self)
Abstract:
This paper presents SCALE, a fully automated transactional clustering framework. The SCALE design highlights three unique features. First, we introduce the concept of Weighted Coverage Density as a categorical similarity measure for efficient clustering of transactional datasets. The concept of weighted coverage density is intuitive, and it allows the weight of each item in a cluster to be changed dynamically according to the occurrences of items. Second, we develop the weighted-coverage-density-based clustering algorithm, a fast, memory-efficient, and scalable clustering algorithm for analyzing transactional data. Third, we introduce two clustering validation metrics and show that these domain-specific clustering evaluation metrics are critical to capturing the transactional semantics in clustering analysis. Our SCALE framework combines the weighted coverage density measure for clustering over a sample dataset with self-configuring methods. These self-configuring methods can automatically tune the two important parameters of our clustering algorithms: (1) the candidates for the best number K of clusters; and (2) the application of two domain-specific cluster validity measures to find the best result from the set of clustering results. We have conducted extensive experimental evaluation using both synthetic and real datasets, and our results show that the weighted coverage density approach powered by the SCALE framework can efficiently generate high-quality clustering results in a fully automated manner.
Keywords: transactional data clustering, cluster assessment, cluster validation, frequent itemset mining, weighted coverage density
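The coverage-density idea can be sketched on toy transaction clusters (a hypothetical formalization: the abstract does not give the paper's exact weighting scheme, so `weighted_coverage_density` below simply weights each item by its relative occurrence frequency within the cluster, as the abstract's description suggests):

```python
from collections import Counter

def coverage_density(cluster):
    """Unweighted coverage density: the fraction of filled cells in the
    cluster's transaction-by-item matrix."""
    items = set().union(*cluster)
    occ = sum(len(t) for t in cluster)
    return occ / (len(cluster) * len(items))

def weighted_coverage_density(cluster):
    """Hypothetical weighted variant: each item contributes its relative
    occurrence frequency times its per-transaction frequency, so items
    that recur across the cluster's transactions dominate the score."""
    counts = Counter(i for t in cluster for i in t)
    total = sum(counts.values())
    return sum((c / total) * (c / len(cluster)) for c in counts.values())

dense = [{"a", "b"}, {"a", "b"}, {"a", "b"}]    # the same items recur
sparse = [{"a", "b"}, {"c", "d"}, {"e", "f"}]   # no item overlap at all
assert coverage_density(dense) > coverage_density(sparse)
assert weighted_coverage_density(dense) > weighted_coverage_density(sparse)
```

Either way, a clustering that groups transactions sharing many items scores higher, which is the intuition behind using coverage density as a transactional similarity measure.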
Non-redundant clustering
2005
Cited by 4 (0 self)
Abstract:
Data mining and knowledge discovery attempt to reveal concepts, patterns, relationships, and structures of interest in data. Typically, data may have many such structures. Most existing data mining techniques allow the user little say in which structure will be returned from the search. Those techniques which do allow the user control over the search typically require supervised information in the form of knowledge about a target solution. In the spirit of exploratory data mining, we consider the setting where the user does not have information about a target solution. Instead, we suppose the user can provide information about solutions which are not desired. These undesired solutions may have been obtained previously from data mining algorithms, or they may be known to the user a priori. The goal is then to discover novel structure in the dataset which is not redundant with respect to the known structure. Techniques should guide the search away from this known structure and towards novel, interesting structures. We describe and formally define the task of non-redundant clustering. Three different algorithmic approaches are derived for non-redundant clustering. Their performance is experimentally evaluated on data sets containing multiple clusterings. We explore how these techniques may be extended to systematically enumerate clusterings in a data set. Finally, we also investigate whether non-redundant approaches may be incorporated to enhance state-of-the-art supervised techniques.
HETree: A Framework for Detecting Changes in Clustering Structure for Categorical Data Streams
Cited by 3 (0 self)
Abstract:
Analyzing clustering structures in data streams can provide critical information for real-time decision making. Most research in this area has focused on clustering algorithms for numerical data streams, and very few have proposed to monitor the change of clustering structure. Most surprisingly, to our knowledge, no work has been proposed on monitoring clustering structure for categorical data streams. In this paper, we present a framework for detecting the change of primary clustering structure in categorical data streams, which is indicated by the change of the best number of clusters (Best K) in the data stream. The framework uses a Hierarchical Entropy Tree structure (HETree) to capture the entropy characteristics of clusters in a data stream and detects the change of Best K in combination with our previously developed BkPlot method. The HETree can efficiently summarize the entropy property of a categorical data stream and allows us to draw precise clustering information from the stream for generating high-quality BkPlots. We also develop a time-decaying HETree structure to make the monitoring more sensitive to recent changes of clustering structure. The experimental results show that with the combination of the HETree and the BkPlot method we are able to promptly and precisely detect the change of clustering structure in categorical data streams.
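The time-decaying idea can be sketched for a single categorical attribute (a minimal illustration, not the HETree data structure itself; the `DecayedEntropy` class and the decay factor are assumptions for the example): exponentially decayed counts let an entropy summary track recent records while old ones fade.

```python
from math import log2

class DecayedEntropy:
    """Time-decaying entropy summary for one categorical attribute in a
    stream: all counts are multiplied by a decay factor before each
    update, so old records fade and the entropy reflects the recent
    distribution, in the spirit of the time-decaying structure above."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = {}

    def update(self, value):
        for v in self.counts:
            self.counts[v] *= self.decay
        self.counts[value] = self.counts.get(value, 0.0) + 1.0

    def entropy(self):
        total = sum(self.counts.values())
        return -sum((c / total) * log2(c / total)
                    for c in self.counts.values())

s = DecayedEntropy()
for _ in range(200):
    s.update("a")          # a long, pure run: entropy stays at 0
low = s.entropy()
for _ in range(200):
    s.update("b")          # distribution shift: old "a" mass decays away
assert low < 1e-9
assert 0.0 < s.entropy() < 1.0  # entropy rose at the shift, then fell again
```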
Online Entropy-based Model of Lexical Category Acquisition
Cited by 2 (2 self)
Abstract:
Children learn a robust representation of lexical categories at a young age. We propose an incremental model of this process which efficiently groups words into lexical categories based on their local context using an information-theoretic criterion. We train our model on a corpus of child-directed speech from CHILDES and show that the model learns a fine-grained set of intuitive word categories. Furthermore, we propose a novel evaluation approach by comparing the efficiency of our induced categories against other category sets (including traditional part-of-speech tags) in a variety of language tasks. We show that the categories induced by our model typically …