Results 1-10 of 478
Survey of clustering algorithms
IEEE Transactions on Neural Networks, 2005
Cited by 483 (4 self)
Data analysis plays an indispensable role for understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. The diversity, on one hand, equips us with many tools. On the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications in some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive efforts. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
Graph mining: laws, generators, and algorithms
ACM Comput. Surv. (CSUR), 2006
Cited by 130 (7 self)
How does the Web look? How could we tell an abnormal social network from a normal one? These and similar questions are important in many fields where the data can intuitively be cast as a graph; examples range from computer networks to sociology to biology and many more. Indeed, any M:N relation in database terminology can be represented as a graph. A lot of these questions boil down to the following: “How can we generate synthetic but realistic graphs?” To answer this, we must first understand what patterns are common in real-world graphs and can thus be considered a mark of normality/realism. This survey gives an overview of the incredible variety of work that has been done on these problems. One of our main contributions is the integration of points of view from physics, mathematics, sociology, and computer science. Further, we briefly describe recent advances on some related and interesting graph problems.
Nonsmooth nonnegative matrix factorization (nsNMF)
IEEE Transactions on ..., 2006
Cited by 64 (4 self)
We propose a novel nonnegative matrix factorization model that aims at finding localized, part-based representations of nonnegative multivariate data items. Unlike the classical nonnegative matrix factorization (NMF) technique, this new model, denoted “nonsmooth nonnegative matrix factorization” (nsNMF), corresponds to the optimization of an unambiguous cost function designed to explicitly represent sparseness, in the form of nonsmoothness, which is controlled by a single parameter. In general, this method produces a set of basis and encoding vectors that are not only capable of representing the original data but also extract highly localized patterns, which generally lend themselves to improved interpretability. The properties of this new method are illustrated with several data sets. Comparisons to previously published methods show that the new nsNMF method has some advantages in keeping faithfulness to the data, in achieving a high degree of sparseness for both the estimated basis and the encoding vectors, and in better interpretability of the factors. Index Terms: nonnegative matrix factorization, constrained optimization, data mining, mining methods and algorithms, pattern analysis, feature extraction or construction, sparse, structured, and very large systems.
Multiway distributional clustering via pairwise interactions
In ICML, 2005
Cited by 62 (10 self)
We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering, we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 Newsgroups (20NG) and the Enron email collection. Our multiway distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information-theoretic clustering algorithms.
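The objective mentioned in this abstract is built from mutual information between cluster variables. The following is a minimal sketch of that single ingredient for one pair of types, not the MDC algorithm itself: it collapses a document-by-word co-occurrence table according to given cluster labels and measures the mutual information that remains. The function names, the toy counts, and the cluster labels are all assumptions made for illustration.

```python
import math


def aggregate(counts, row_labels, col_labels, n_row_clusters, n_col_clusters):
    """Sum co-occurrence counts within each (row cluster, column cluster) block."""
    table = [[0] * n_col_clusters for _ in range(n_row_clusters)]
    for i, row in enumerate(counts):
        for j, value in enumerate(row):
            table[row_labels[i]][col_labels[j]] += value
    return table


def mutual_information(counts):
    """Mutual information (in nats) of the joint distribution given by a count table."""
    total = sum(sum(row) for row in counts)
    row_marginal = [sum(row) / total for row in counts]
    col_marginal = [sum(col) / total for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, value in enumerate(row):
            if value > 0:  # zero cells contribute nothing
                p = value / total
                mi += p * math.log(p / (row_marginal[i] * col_marginal[j]))
    return mi


# Hypothetical toy data: 4 documents x 4 words, two clusters on each side.
counts = [[2, 1, 0, 0],
          [1, 2, 0, 0],
          [0, 0, 3, 1],
          [0, 0, 1, 3]]
doc_labels = [0, 0, 1, 1]
word_labels = [0, 0, 1, 1]

clustered = aggregate(counts, doc_labels, word_labels, 2, 2)
mi_original = mutual_information(counts)
mi_clustered = mutual_information(clustered)
```

Clustering can only lose information, so `mi_clustered` never exceeds `mi_original`; a good pair of clusterings keeps the gap small.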
DisCo: Distributed co-clustering with MapReduce
ICDM, 2008
Cited by 50 (1 self)
Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from preprocessing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, MapReduce has been widely embraced by both academia and industry. In database terms, MapReduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying MapReduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bioinformatics, and graph mining. We propose the Distributed Co-clustering (DisCo) framework, which introduces practical approaches for distributed data preprocessing and co-clustering. We develop DisCo using Hadoop, an open source MapReduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.
Model-based overlapping clustering
In KDD, 2005
Cited by 41 (7 self)
While the vast majority of clustering algorithms are partitional, many real-world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of the 20Newsgroups and EachMovie datasets.
The discrete basis problem
2005
Cited by 38 (13 self)
We consider the Discrete Basis Problem, which can be described as follows: given a collection of Boolean vectors, find a collection of k Boolean basis vectors such that the original vectors can be represented using disjunctions of these basis vectors. We show that the decision version of this problem is NP-complete and that the optimization version cannot be approximated within any finite ratio. We also study two variations of this problem, where the Boolean basis vectors must be mutually orthogonal. We show that the other variation is closely related to the well-known Metric k-median Problem in Boolean space. To solve these problems, two algorithms will be presented. One is designed for the variations mentioned above, and it is solely based on solving the k-median problem, while another is a heuristic intended to solve the general Discrete Basis Problem. We will also study the results of extensive experiments made with these two algorithms with both synthetic and real-world data. The results are twofold: with the synthetic data, the algorithms did rather well, but with the real-world data the results were not as good.
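To make the problem statement in this abstract concrete, here is a minimal sketch of how a candidate solution could be scored: each original vector is approximated by the disjunction (OR) of the basis vectors assigned to it, and the error counts the positions that disagree. It only evaluates a given basis and does not implement either algorithm from the paper. The function names, the `usage` assignment matrix, and the toy data are assumptions made for illustration.

```python
def boolean_product(usage, basis):
    """Row i of the result is the OR of the basis vectors selected by usage[i]."""
    n_cols = len(basis[0])
    result = []
    for row in usage:
        combined = [0] * n_cols
        for selected, vector in zip(row, basis):
            if selected:
                combined = [a | b for a, b in zip(combined, vector)]
        result.append(combined)
    return result


def reconstruction_error(data, usage, basis):
    """Number of entries where the Boolean reconstruction differs from the data."""
    approx = boolean_product(usage, basis)
    return sum(a != b
               for data_row, approx_row in zip(data, approx)
               for a, b in zip(data_row, approx_row))


# Hypothetical toy instance with k = 2 basis vectors.
basis = [[1, 1, 0, 0],
         [0, 0, 1, 1]]
usage = [[1, 0],   # first vector uses basis vector 0 only
         [0, 1],   # second vector uses basis vector 1 only
         [1, 1]]   # third vector uses the OR of both
data = [[1, 1, 0, 0],
        [0, 0, 1, 1],
        [1, 1, 1, 0]]

error = reconstruction_error(data, usage, basis)
```

Because combination is by OR rather than addition, an extra basis vector can only turn zeros into ones, which is one reason the problem behaves differently from real-valued matrix factorization.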
MJ: TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data
In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international
Cited by 36 (2 self)
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene×sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in the real microarray datasets.
Multiway clustering on relation graphs
In Proc. of the 7th SIAM Intl. Conf. on Data Mining, 2006
Cited by 35 (3 self)
A number of real-world domains such as social networks and e-commerce involve heterogeneous data that describes relations between multiple classes of entities. Understanding the natural structure of this type of heterogeneous relational data is essential both for exploratory analysis and for performing various predictive modeling tasks. In this paper, we propose a principled multiway clustering framework for relational data, wherein different types of entities are simultaneously clustered based not only on their intrinsic attribute values, but also on the multiple relations between the entities. To achieve this, we introduce a relation graph model that describes all the known relations between the different entity classes, in which each relation between a given set of entity classes is represented in the form of a multimodal tensor over an appropriate domain. Our multiway clustering formulation is driven by the objective of capturing the maximal “information” in the original relation graph, i.e., accurately approximating the set of tensors corresponding to the various relations. This formulation is applicable to all Bregman divergences (a broad family of loss functions that includes squared Euclidean distance and KL-divergence), and also permits analysis of mixed data types using convex combinations of appropriate Bregman loss functions. Furthermore, we present a large family of structurally different multiway clustering schemes that preserve various linear summary statistics of the original data. We accomplish the above generalizations by extending a recently proposed key theoretical result, namely the minimum Bregman information principle [1], to the relation graph setting. We also describe an efficient multiway clustering algorithm based on alternate minimization that generalizes a number of other recently proposed clustering methods. Empirical results on datasets obtained from real-world domains (e.g., movie recommendations, newsgroup articles) demonstrate the generality and efficacy of our framework.
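The abstract's remark that Bregman divergences include squared Euclidean distance and KL-divergence can be checked with a short sketch. It uses the standard definition d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for a convex generator phi; the function names and test vectors are assumptions made for illustration, and nothing here reproduces the paper's clustering algorithm.

```python
import math


def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


# Generator phi(v) = ||v||^2 yields the squared Euclidean distance.
def sq_norm(v):
    return sum(a * a for a in v)


def sq_norm_grad(v):
    return [2 * a for a in v]


# Generator phi(v) = sum v_i log v_i (negative entropy) yields the
# KL-divergence when x and y are probability vectors with positive entries.
def neg_entropy(v):
    return sum(a * math.log(a) for a in v)


def neg_entropy_grad(v):
    return [math.log(a) + 1 for a in v]


x = [0.2, 0.8]
y = [0.5, 0.5]

d_euclid = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)

# Direct formulas for comparison.
direct_euclid = sum((a - b) ** 2 for a, b in zip(x, y))
direct_kl = sum(a * math.log(a / b) for a, b in zip(x, y))
```

Swapping the generator is the only change needed to move between loss functions, which is the property that lets a single formulation cover the whole family.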
Predictive discrete latent factor models for large scale dyadic data
In KDD ’07, 2007
Cited by 35 (2 self)
We propose a novel statistical method to predict large-scale dyadic response variables in the presence of covariate information. Our approach simultaneously incorporates the effect of covariates and estimates local structure that is induced by interactions among the dyads through a discrete latent factor model. The discovered latent factors provide a predictive model that is both accurate and interpretable. We illustrate our method by working in a framework of generalized linear models, which include commonly used regression techniques like linear regression, logistic regression and Poisson regression as special cases. We also provide scalable generalized EM-based algorithms for model fitting using both "hard" and "soft" cluster assignments. We demonstrate the generality and efficacy of our approach through large-scale simulation studies and analysis of datasets obtained from certain real-world movie recommendation and internet advertising applications.