Results 1–10 of 406
Co-clustering documents and words using Bipartite Spectral Graph Partitioning, 2001
Information-Theoretic Co-Clustering
In KDD, 2003
Cited by 346 (12 self)
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log, and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
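The objective this abstract states can be made concrete with a minimal sketch: given a joint distribution p(x, y) and row/column cluster assignments, compute the mutual information I(X̂; Ŷ) between the clustered variables that the optimal co-clustering maximizes. All names and the toy table are illustrative, not from the paper.

```python
import numpy as np

def clustered_mutual_information(p_xy, row_map, col_map, k, l):
    """I(X_hat; Y_hat) for a co-clustering (row_map, col_map) of p(x, y)."""
    # Aggregate the joint distribution into a k x l cluster-level table.
    q = np.zeros((k, l))
    for x in range(p_xy.shape[0]):
        for y in range(p_xy.shape[1]):
            q[row_map[x], col_map[y]] += p_xy[x, y]
    qx = q.sum(axis=1, keepdims=True)   # marginal of row clusters
    qy = q.sum(axis=0, keepdims=True)   # marginal of column clusters
    mask = q > 0                        # skip zero cells (0 log 0 = 0)
    return float((q[mask] * np.log(q[mask] / (qx @ qy)[mask])).sum())

# Toy 4x4 contingency table with a perfect 2x2 block structure.
counts = np.array([[4, 4, 0, 0],
                   [4, 4, 0, 0],
                   [0, 0, 4, 4],
                   [0, 0, 4, 4]], dtype=float)
p = counts / counts.sum()
mi = clustered_mutual_information(p, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
```

For this block-diagonal table the block-respecting co-clustering recovers the full mutual information, log 2 nats; any assignment that mixes the blocks scores lower.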
A Probabilistic Framework for Semi-Supervised Clustering, 2004
Cited by 277 (14 self)
Unsupervised clustering can be significantly improved using supervision in the form of pairwise constraints, i.e., pairs of instances labeled as belonging to the same or different clusters. In recent years, a number of algorithms have been proposed for enhancing clustering quality by employing such supervision. Such methods use the constraints either to modify the objective function or to learn the distance measure. We propose a probabilistic model for semi-supervised clustering based on Hidden Markov Random Fields (HMRFs) that provides a principled framework for incorporating supervision into prototype-based clustering. The model generalizes a previous approach that combines constraints and Euclidean distance learning, and allows the use of a broad range of clustering distortion measures, including Bregman divergences (e.g., Euclidean distance and I-divergence) and directional similarity measures (e.g., cosine similarity). We present an algorithm that performs partitional semi-supervised clustering of data by minimizing an objective function derived from the posterior energy of the HMRF model. Experimental results on several text data sets demonstrate the advantages of the proposed framework.
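The kind of objective minimized here can be sketched as cluster distortion plus penalties for violated must-link and cannot-link constraints. This is a simplified, squared-Euclidean stand-in for the HMRF posterior energy, not the paper's exact formulation; all data are toy values.

```python
import numpy as np

def hmrf_objective(X, labels, centroids, must_link, cannot_link, w=1.0):
    """Cluster distortion plus unit penalties for each violated pairwise
    constraint (a simplified stand-in for the HMRF posterior energy)."""
    distortion = sum(float(np.sum((X[i] - centroids[labels[i]]) ** 2))
                     for i in range(len(X)))
    ml_violations = sum(labels[i] != labels[j] for i, j in must_link)
    cl_violations = sum(labels[i] == labels[j] for i, j in cannot_link)
    return distortion + w * (ml_violations + cl_violations)

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
centroids = np.array([[0., 0.5], [5., 5.5]])
good = hmrf_objective(X, [0, 0, 1, 1], centroids, [(0, 1)], [(0, 2)])
bad = hmrf_objective(X, [0, 1, 1, 1], centroids, [(0, 1)], [(0, 2)])
```

The labeling that keeps nearby points together and respects both constraints scores far lower than one that splits the must-linked pair.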
Evaluation of Hierarchical Clustering Algorithms for Document Datasets
In Data Mining and Knowledge Discovery, 2002
Cited by 258 (6 self)
Fast and high-quality document clustering algorithms play an important role in providing intuitive navigation and browsing mechanisms by organizing large amounts of information into a small number of meaningful clusters. In particular, hierarchical clustering solutions provide a view of the data at different levels of granularity, making them ideal for people to visualize and interactively explore large document collections.
Criterion Functions for Document Clustering: Experiments and Analysis, 2002
Cited by 202 (13 self)
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and high-quality document clustering algorithms play an important role towards this goal, as they have been shown both to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters, and to greatly improve retrieval performance via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms with relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
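One commonly studied criterion of the kind this abstract describes sums, over clusters, the 2-norm of each cluster's composite (sum) vector; for unit-length document vectors this equals summing each document's cosine similarity to its cluster's mean direction. A minimal sketch with toy unit vectors (names and data are illustrative, not from the paper):

```python
import numpy as np

def i2_criterion(X, labels, k):
    """Sum over clusters of the 2-norm of each cluster's composite vector;
    a criterion to *maximize*: coherent clusters have long composite
    vectors, mixed clusters suffer cancellation and score lower."""
    return sum(float(np.linalg.norm(X[labels == c].sum(axis=0)))
               for c in range(k))

docs = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # unit vectors
tight = i2_criterion(docs, np.array([0, 0, 1, 1]), 2)  # coherent clusters
mixed = i2_criterion(docs, np.array([0, 1, 0, 1]), 2)  # mixed clusters
```

The coherent partition scores 4.0 (every document perfectly aligned with its centroid), while the mixed partition scores only 2√2.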
Active Semi-Supervision for Pairwise Constrained Clustering
In Proc. 4th SIAM Intl. Conf. on Data Mining (SDM-2004)
Cited by 137 (9 self)
Semi-supervised clustering uses a small amount of supervised data to aid unsupervised learning. One typical approach specifies a limited number of must-link and cannot-link constraints between pairs of examples. This paper presents a pairwise constrained clustering framework and a new method for actively selecting informative pairwise constraints to get improved clustering performance. The clustering and active learning methods are both easily scalable to large datasets and can handle very high-dimensional data. Experimental and theoretical results confirm that this active querying of pairwise constraints significantly improves the accuracy of clustering when given a relatively small amount of supervision.
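One way to seed such active constraint queries, assuming Euclidean distance, is a greedy farthest-first traversal that picks well-separated points to ask the oracle about first. This is a hedged sketch of that general idea, not the paper's exact selection procedure; data and names are illustrative.

```python
import numpy as np

def farthest_first(X, k, start=0):
    """Greedy farthest-first traversal: each new pick maximizes the
    distance to the closest already-picked point, giving well-separated
    candidates for the first pairwise-constraint queries."""
    picks = [start]
    d = np.linalg.norm(X - X[start], axis=1)   # distance to nearest pick
    while len(picks) < k:
        nxt = int(np.argmax(d))
        picks.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return picks

# Three well-separated pairs of points; the traversal visits all three groups.
pts = np.array([[0., 0.], [0.1, 0.],
                [10., 0.], [10.1, 0.],
                [5., 8.], [5.1, 8.]])
picks = farthest_first(pts, 3)
```

Querying constraints among such picks tends to reveal cluster structure quickly, since nearby duplicates add little information.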
A data-clustering algorithm on distributed memory multiprocessors
In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, 2000
Cited by 134 (1 self)
To cluster the increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message-passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scale-up of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scale-up in the size of the data set and in the number of clusters desired. For a 2-gigabyte test data set, our implementation drives the 16-node SP2 at more than 1.8 gigaflops.
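The data-parallel pattern described here can be sketched sequentially: each shard plays the role of one worker, computing partial per-cluster sums and counts (the quantities a message-passing allreduce would combine), and the reduced totals yield the new centroids. A minimal single-process sketch, not the paper's MPI implementation:

```python
import numpy as np

def parallel_kmeans_step(shards, centroids):
    """One data-parallel k-means iteration: per-shard assignment and
    partial reduction, then a global combine to form new centroids."""
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for shard in shards:                                  # one "worker" each
        dist2 = ((shard[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dist2.argmin(axis=1)
        for c in range(k):                                # partial reduction
            sums[c] += shard[assign == c].sum(axis=0)
            counts[c] += (assign == c).sum()
    return sums / np.maximum(counts, 1)[:, None]          # global combine

shards = [np.array([[0., 0.], [0., 2.]]),    # worker 0's points
          np.array([[10., 0.], [10., 2.]])]  # worker 1's points
new_centroids = parallel_kmeans_step(shards, np.array([[1., 1.], [9., 1.]]))
```

Because each worker only ships k centroid sums and counts per iteration, communication is independent of the number of data points, which is what makes the near-linear speedups possible.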
Empirical and theoretical comparisons of selected criterion functions for document clustering
In Machine Learning
Cited by 117 (6 self)
This paper evaluates the performance of different criterion functions in the context of partitional clustering algorithms for document datasets. Our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. We present a comprehensive experimental evaluation involving 15 different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. Our experimental results show that a set of criterion functions consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. Our theoretical analysis shows that the relative performance of the criterion functions depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters.
Orthogonal nonnegative matrix tri-factorizations for clustering
In SIGKDD, 2006
Cited by 117 (22 self)
Currently, most research on nonnegative matrix factorization (NMF) focuses on the 2-factor X = FG^T factorization. We provide a systematic analysis of 3-factor X = FSG^T NMF. While unconstrained 3-factor NMF is equivalent to unconstrained 2-factor NMF, constrained 3-factor NMF brings new features to constrained 2-factor NMF. We study the orthogonality constraint because it leads to a rigorous clustering interpretation. We provide new rules for updating F, S, and G and prove the convergence of these algorithms. Experiments on 5 datasets and a real-world case study are performed to show the capability of bi-orthogonal 3-factor NMF to simultaneously cluster the rows and columns of the input data matrix. We provide a new approach to evaluating the quality of clustering on words using the class-aggregate distribution and the multi-peak distribution. We also provide an overview of various NMF extensions and examine their relationships.
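The 3-factor model X ≈ FSG^T can be illustrated with the standard multiplicative updates for the unconstrained, Frobenius-loss case. Note this is the generic variant only: the paper's orthogonal version uses different update rules that preserve F^T F = I and G^T G = I. Sizes and data are toy values.

```python
import numpy as np

def trifactor_nmf(X, k, l, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates for X ~= F S G^T with all factors
    nonnegative (generic unconstrained variant, Frobenius loss).
    Each update rescales a factor by gradient-numerator / gradient-
    denominator, which keeps every entry nonnegative."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    F, S, G = rng.random((m, k)), rng.random((k, l)), rng.random((n, l))
    for _ in range(iters):
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
    return F, S, G

rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))   # exactly rank-2, nonnegative
F, S, G = trifactor_nmf(X, 2, 2)
err = np.linalg.norm(X - F @ S @ G.T)
```

In the clustering reading, rows of F indicate row (document) clusters, rows of G indicate column (word) clusters, and S captures the association between the two.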
Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models
Submission to IEEE Trans. on Pattern Analysis and Machine Intelligence, 2007
Cited by 116 (16 self)
We propose a novel unsupervised learning framework to model activities and interactions in crowded and complicated scenes. Under our framework, hierarchical Bayesian models are used to connect three elements in visual surveillance: low-level visual features, simple “atomic” activities, and interactions. Atomic activities are modeled as distributions over low-level visual features, and multi-agent interactions are modeled as distributions over atomic activities. These models are learnt in an unsupervised way. Given a long video sequence, moving pixels are clustered into different atomic activities and short video clips are clustered into different interactions. In this paper, we propose three hierarchical Bayesian models: a Latent Dirichlet Allocation (LDA) mixture model, a Hierarchical Dirichlet Process (HDP) mixture model, and a two-dimensional HDP (2D-HDP) model. They advance existing language models such as LDA [1] and HDP [2]. Directly using existing LDA and HDP models under our framework, only moving pixels can be clustered into atomic activities; our models can cluster both moving pixels and video clips into atomic activities and interactions. The LDA mixture model assumes that it is already known how many different types of atomic activities and interactions occur in the scene. The HDP mixture model automatically decides the number of categories of atomic activities. The 2D-HDP model automatically decides the numbers of categories of both atomic activities and interactions.