Consistency of spectral clustering
2004
Cited by 572 (15 self)
Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about the consistency of most clustering algorithms. In this paper we investigate the consistency of a popular family of spectral clustering algorithms, which cluster the data with the help of eigenvectors of graph Laplacian matrices. We show that one of the two major classes of spectral clustering (normalized clustering) converges under some very general conditions, while the other (unnormalized) is only consistent under strong additional assumptions, which, as we demonstrate, are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering in practical applications. We believe that the methods used in our analysis will provide a basis for future exploration of Laplacian-based methods in a statistical setting.
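The two Laplacians contrasted in this abstract can be sketched in a few lines. This is only an illustration, not the paper's analysis: it assumes the common definitions L = D - W (unnormalized) and L_sym = I - D^{-1/2} W D^{-1/2} (symmetric normalized), and the function names `graph_laplacians` and `two_way_split` are hypothetical.

```python
import numpy as np

def graph_laplacians(W):
    """Return (unnormalized, symmetric normalized) Laplacians of a similarity matrix W.

    Assumes W is symmetric, non-negative, and that every node has positive degree.
    """
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W                      # L = D - W
    inv_sqrt = 1.0 / np.sqrt(degrees)
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return L, L_sym

def two_way_split(laplacian):
    """Split nodes by the sign of the eigenvector of the second-smallest eigenvalue."""
    _, eigenvectors = np.linalg.eigh(laplacian)   # eigenvalues come back in ascending order
    return eigenvectors[:, 1] > 0
```

On a toy similarity matrix with two tightly connected groups and weak links between them, both Laplacians recover the same split; the abstract's claim concerns how the two behave as the sample grows, which this sketch does not demonstrate.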
Biclustering algorithms for biological data analysis: a survey.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004
Cited by 481 (15 self)
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of applying standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually called biclustering, seeks to find submatrices, that is, subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. This type of algorithm has also been proposed and used in other fields, such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search and the target applications.
Survey of clustering data mining techniques
2002
Cited by 408 (0 self)
Accrue Software, Inc.
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification. It models data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields such as statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large datasets with very many attributes of different types. This imposes unique
Information-Theoretic Co-Clustering
In KDD, 2003
Cited by 346 (12 self)
Two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, weblog and market-basket data analysis. A basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columns. A novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory: the optimal co-clustering maximizes the mutual information between the clustered random variables subject to constraints on the number of row and column clusters.
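The objective described in this abstract can be illustrated with a small sketch. It only evaluates the mutual information retained by a given co-clustering and does not implement the paper's optimization algorithm. The function names, the use of base-2 logarithms, and the encoding of a co-clustering as integer label arrays are assumptions made for illustration.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2-D table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                   # normalize counts to probabilities
    px = joint.sum(axis=1, keepdims=True)         # row marginal, shape (m, 1)
    py = joint.sum(axis=0, keepdims=True)         # column marginal, shape (1, n)
    mask = joint > 0                              # treat 0 * log 0 as 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

def coclustered_joint(joint, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Aggregate a joint distribution over row clusters and column clusters."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    R = np.eye(n_row_clusters)[np.asarray(row_labels)]   # one-hot row assignments
    C = np.eye(n_col_clusters)[np.asarray(col_labels)]   # one-hot column assignments
    return R.T @ joint @ C
```

Comparing `mutual_information(coclustered_joint(...))` across candidate labelings with a fixed number of row and column clusters shows which labeling retains more of the original mutual information.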
On Clusterings: Good, Bad and Spectral
2003
Cited by 332 (11 self)
We motivate and develop a natural bicriteria measure for assessing the quality of a clustering which avoids the drawbacks of existing measures. A simple recursive heuristic is shown to have polylogarithmic worst-case guarantees under the new measure. The main result of the paper is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists.
Transductive Learning via Spectral Graph Partitioning
In ICML, 2003
Cited by 237 (0 self)
We present a new method for transductive learning, which can be seen as a transductive version of the k nearest-neighbor classifier.
Criterion Functions for Document Clustering: Experiments and Analysis
2002
Cited by 202 (13 self)
In recent years, we have witnessed a tremendous growth in the volume of text documents available on the Internet, digital libraries, news sources, and company-wide intranets. This has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. Fast and high-quality document clustering algorithms play an important role towards this goal, as they have been shown to provide an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters, as well as to greatly improve retrieval performance via cluster-driven dimensionality reduction, term-weighting, or query expansion. This ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of novel algorithms with different complexity-quality trade-offs. Among them, a class of clustering algorithms with relatively low computational requirements treats the clustering problem as an optimization process that seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution.
Classification in Networked Data: A toolkit and a univariate case study
2006
Cited by 200 (10 self)
This paper is about classifying entities that are interlinked with entities for which the class is known. After surveying prior work, we present NetKit, a modular toolkit for classification in networked data, and a case study of its application to networked data used in prior machine learning research. NetKit is based on a node-centric framework in which classifiers comprise a local classifier, a relational classifier, and a collective inference procedure. Various existing node-centric relational learning algorithms can be instantiated with appropriate choices for these components, and new combinations of components realize new algorithms. The case study focuses on univariate network classification, for which the only information used is the structure of class linkage in the network (i.e., only links and some class labels). To our knowledge, no prior work has systematically evaluated the power of class-linkage alone for classification in machine learning benchmark data sets. The results demonstrate that very simple network-classification models perform quite well, well enough that they should be used regularly as baseline classifiers for studies of learning with networked data. The simplest method (which performs remarkably well) highlights the close correspondence between several existing methods introduced for different purposes, i.e., Gaussian-field classifiers, Hopfield networks, and relational-neighbor classifiers. The case study also shows that there are two sets of techniques that are preferable in different situations, namely when few versus many labels are known initially. We also demonstrate that link selection plays an important role similar to traditional feature selection.
Detecting Unusual Activity in Video
2004
Cited by 182 (0 self)
We present an unsupervised technique for detecting unusual activity in a large video set using many simple features. No complex activity models and no supervised feature selections are used. We divide the video into equal-length segments and classify the extracted features into prototypes, from which a prototype-segment co-occurrence matrix is computed. Motivated by a similar problem in document-keyword analysis, we seek a correspondence relationship between prototypes and video segments which satisfies the transitive closure constraint. We show that an important subfamily of correspondence functions can be reduced to co-embedding prototypes and segments to N-D Euclidean space. We prove that an efficient, globally optimal algorithm exists for the co-embedding problem. Experiments on various real-life videos have validated our approach.
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
In KDD, 2004
Cited by 135 (29 self)
Co-clustering is a powerful data mining technique with varied applications such as text clustering, microarray analysis and recommender systems. Recently, an information-theoretic co-clustering approach applicable to empirical joint probability distributions was proposed. In many situations, co-clustering of more general matrices is desired. In this paper, we present a substantially generalized co-clustering framework wherein any Bregman divergence can be used in the objective function, and various conditional-expectation-based constraints can be considered based on the statistics that need to be preserved. Analysis of the co-clustering problem leads to the minimum Bregman information principle, which generalizes the maximum entropy principle, and yields an elegant meta algorithm that is guaranteed to achieve local optimality. Our methodology yields new algorithms and also encompasses several previously known clustering and co-clustering algorithms based on alternate minimization.
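For readers unfamiliar with the term, a Bregman divergence is generated by a strictly convex function phi as d_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)>. The sketch below uses that standard definition and shows two well-known instances, squared Euclidean distance and KL divergence. It is not the paper's co-clustering algorithm, and all function names are hypothetical.

```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)> for a convex generator phi."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))

# Generator phi(v) = ||v||^2 yields the squared Euclidean distance.
def squared_norm(v):
    return float(np.dot(v, v))

def squared_norm_grad(v):
    return 2.0 * v

# Generator phi(v) = sum v_i log v_i (negative entropy) yields the generalized
# KL divergence; it assumes strictly positive entries.
def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

def neg_entropy_grad(v):
    return np.log(v) + 1.0
```

When both arguments are probability vectors, the negative-entropy case reduces to the ordinary KL divergence, which is the sense in which a Bregman framework can cover the information-theoretic setting mentioned in the abstract.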