Results 1 - 10 of 15
A scalable framework for discovering coherent coclusters in noisy data
 In ICML ’08
Cited by 19 (4 self)
PAC-Bayesian Analysis of Co-clustering and Beyond
Abstract
Cited by 14 (7 self)
We derive PAC-Bayesian generalization bounds for supervised and unsupervised learning models based on clustering, such as co-clustering, matrix tri-factorization, graphical models, graph clustering, and pairwise clustering. We begin with the analysis of co-clustering, which is a widely used approach to the analysis of data matrices. We distinguish between two tasks in matrix data analysis: discriminative prediction of the missing entries in data matrices and estimation of the joint probability distribution of row and column variables in co-occurrence matrices. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that were absent in the previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance in the MovieLens collaborative filtering task. Our co-clustering model can also be seen as matrix tri-factorization, and the results provide generalization bounds, regularization ...
Approximation Algorithms for Tensor Clustering
 INTERNATIONAL CONFERENCE ON ALGORITHMIC LEARNING THEORY
, 2009
Abstract
Cited by 5 (0 self)
We present the first (to our knowledge) approximation algorithm for tensor clustering—a powerful generalization of basic 1D clustering. Tensors are increasingly common in modern applications dealing with complex heterogeneous data, and clustering them is a fundamental tool for data analysis and pattern discovery. Akin to their 1D cousins, common tensor clustering formulations are NP-hard to optimize. But, unlike the 1D case, no approximation algorithms seem to be known. We address this imbalance and build on recent co-clustering work to derive a tensor clustering algorithm with approximation guarantees, allowing metrics and divergences (e.g., Bregman) as objective functions. Therewith, we answer two open questions by Anagnostopoulos et al. (2008). Our analysis yields a constant approximation factor independent of data size; a worst-case example shows this factor to be tight for Euclidean co-clustering. However, empirically the approximation factor is observed to be conservative, so our method can also be used in practice.
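The Bregman divergences referenced in this abstract share one convenient property: the arithmetic mean is the optimal cluster representative for every Bregman divergence, so only the assignment step depends on the divergence chosen. A hedged sketch of one such clustering iteration (helper names are ours; this illustrates plain Bregman k-means, not the paper's tensor algorithm):

```python
import numpy as np

def sq_euclidean(x, c):
    # Squared Euclidean distance: the Bregman divergence of phi(x) = ||x||^2.
    return ((x - c) ** 2).sum(axis=-1)

def gen_kl(x, c, eps=1e-12):
    # Generalized KL divergence: the Bregman divergence of phi(x) = sum x log x.
    x, c = x + eps, c + eps
    return (x * np.log(x / c) - x + c).sum(axis=-1)

def bregman_kmeans_step(X, centers, divergence):
    """One iteration: assign each point to its divergence-nearest center,
    then update each center to the arithmetic mean of its points, which
    is optimal for every Bregman divergence."""
    dists = np.stack([divergence(X, c) for c in centers])  # shape (K, n)
    labels = dists.argmin(axis=0)
    new_centers = np.array([
        X[labels == k].mean(axis=0) if (labels == k).any() else centers[k]
        for k in range(len(centers))
    ])
    return labels, new_centers
```

Swapping `sq_euclidean` for `gen_kl` changes only the assignment geometry; the centroid update is untouched, which is what makes divergence-agnostic analyses of this kind possible.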
Robust overlapping co-clustering
 Dept. of ECE, Univ. of Texas at Austin, IDEALTR09, Downloadable from http://www.lans.ece.utexas.edu/papers/techreports/deodhar08ROCC.pdf
, 2008
Abstract
Cited by 5 (4 self)
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. On such datasets, in order to accurately identify meaningful clusters, both non-informative data points and non-discriminative features need to be discarded. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently detecting dense, arbitrarily positioned, possibly overlapping co-clusters in a dataset. ROCC works with a large variety of distance measures and different co-cluster definitions, making it applicable to a wide range of real life datasets. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches proposed for this task. We also point out other interesting applications of the proposed framework in solving challenging ...
Sparse Biclustering of Transposable Data
, 2013
Abstract
Cited by 2 (0 self)
We consider the task of simultaneously clustering the rows and columns of a large transposable data matrix. We assume that the matrix elements are normally distributed with a bicluster-specific mean term and a common variance, and perform biclustering by maximizing the corresponding log likelihood. We apply an ℓ1 penalty to the means of the biclusters in order to obtain sparse and interpretable biclusters. Our proposal amounts to a sparse, symmetrized version of k-means clustering. We show that k-means clustering of the rows and of the columns of a data matrix can be seen as special cases of our proposal, and that a relaxation of our proposal yields the singular value decomposition. In addition, we propose a framework for biclustering based on the matrix-variate normal distribution. The performances of our proposals are demonstrated in a simulation study and on a gene expression data set. This article has supplementary material online.
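The ℓ1 penalty on bicluster means described in this abstract has a simple closed-form effect for a single fixed bicluster: the penalized mean is the sample mean passed through a soft-thresholding operator. A small sketch under that simplification (function names are ours; the full procedure in the paper alternates this update with row and column reassignment):

```python
import numpy as np

def soft_threshold(z, t):
    # S(z, t) = sign(z) * max(|z| - t, 0): the proximal operator of t * |.|
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def penalized_bicluster_mean(values, lam):
    """Solve argmin_mu 0.5 * sum_i (x_i - mu)^2 + lam * |mu| over the
    n entries of one bicluster: the sample mean shrunk toward zero.
    Any bicluster whose sample mean has magnitude below lam / n is
    zeroed out entirely, which makes the fitted mean matrix sparse."""
    values = np.asarray(values, dtype=float)
    return float(soft_threshold(values.mean(), lam / values.size))
```

With `lam = 0` this reduces to the ordinary mean update of k-means; increasing `lam` trades fit for sparsity and interpretability.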
Hunting for Coherent Co-clusters in High Dimensional and Noisy Datasets
 2008 IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS
, 2008
Abstract
Cited by 1 (1 self)
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications. Through extensive experimentation we show that our approach is significantly more accurate in identifying biologically meaningful co-clusters in microarray data as compared to several other prominent approaches that have been applied to this task. We also point out other interesting applications of the proposed framework in solving difficult clustering problems.
Information Theoretic Methods for Clustering with Applications to Microarray Data
 PhD, School of Electrical Engineering and Telecommunications, The University of New South
, 2010
A PAC-Bayesian Approach to Formulation of Clustering Objectives
Abstract
Clustering is a widely used tool for exploratory data analysis. However, the theoretical understanding of clustering is very limited. We still do not have a well-founded answer to the seemingly simple question of “how many clusters are present in the data?”, and furthermore a formal comparison of clusterings based on different optimization objectives is far beyond our abilities. The lack of good theoretical support gives rise to multiple heuristics that confuse the practitioners and stall development of the field. We suggest that the ill-posed nature of clustering problems is caused by the fact that clustering is often taken out of its subsequent application context. We argue that one does not cluster the data just for the sake of clustering it, but rather to facilitate the solution of some higher level task. By evaluation of the clustering’s contribution to the solution of the higher level task it is possible to compare different clusterings, even those obtained by different optimization objectives. In the preceding work it was shown that such an approach can be applied to evaluation and design of co-clustering solutions. Here we suggest that this approach can be extended to other settings where clustering is applied.
A Scalable Framework for Discovering Coherent Co-clusters in Noisy Data
Abstract
Clustering problems often involve datasets where only a part of the data is relevant to the problem, e.g., in microarray data analysis only a subset of the genes show cohesive expressions within a subset of the conditions/features. The existence of a large number of non-informative data points and features makes it challenging to hunt for coherent and meaningful clusters from such datasets. Additionally, since clusters could exist in different subspaces of the feature space, a co-clustering algorithm that simultaneously clusters objects and features is often more suitable as compared to one that is restricted to traditional “one-sided” clustering. We propose Robust Overlapping Co-clustering (ROCC), a scalable and very versatile framework that addresses the problem of efficiently mining dense, arbitrarily positioned, possibly overlapping co-clusters from large, noisy datasets. ROCC has several desirable properties that make it extremely well suited to a number of real life applications.
PAC-Bayesian Analysis of Co-clustering with Extensions to Matrix Tri-factorization, Graph Clustering, Pairwise Clustering, and Graphical Models
 JOURNAL OF MACHINE LEARNING RESEARCH
Abstract
This paper promotes a novel point of view on unsupervised learning. We argue that the goal of unsupervised learning is to facilitate a solution of some higher level task, and that it should be evaluated in terms of its contribution to the solution of this task. We present an example of such an analysis for the case of co-clustering, which is a widely used approach to the analysis of data matrices. This paper identifies two possible high-level tasks in matrix data analysis: discriminative prediction of the missing entries and estimation of the joint probability distribution of row and column variables. We derive PAC-Bayesian generalization bounds for the expected out-of-sample performance of co-clustering-based solutions for these two tasks. The analysis yields regularization terms that have not been part of previous formulations of co-clustering. The bounds suggest that the expected performance of co-clustering is governed by a trade-off between its empirical performance and the mutual information preserved by the cluster variables on row and column IDs. We derive an iterative projection algorithm for finding a local optimum of this trade-off for discriminative prediction tasks. This algorithm achieved state-of-the-art performance ...