Results 1  10
of
34
Data Clustering: 50 Years Beyond KMeans
, 2008
"... Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and m ..."
Abstract

Cited by 274 (6 self)
 Add to MetaCart
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc.). Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, Kmeans, was first published in 1955. In spite of the fact that Kmeans was proposed over 50 years ago and thousands of clustering algorithms have been published since then, Kmeans is still widely used. This speaks to the difficulty of designing a general purpose clustering algorithm and the illposed problem of clustering. We provide a brief overview of clustering, summarize well known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semisupervised clustering, ensemble clustering, simultaneous feature selection, and data clustering and large scale data clustering.
Relational Learning via Collective Matrix Factorization
, 2008
"... Relational learning is concerned with predicting unknown values of a relation, given a database of entities and observed relations among entities. An example of relational learning is movie rating prediction, where entities could include users, movies, genres, and actors. Relations would then encode ..."
Abstract

Cited by 127 (4 self)
 Add to MetaCart
(Show Context)
Relational learning is concerned with predicting unknown values of a relation, given a database of entities and observed relations among entities. An example of relational learning is movie rating prediction, where entities could include users, movies, genres, and actors. Relations would then encode users ’ ratings of movies, movies ’ genres, and actors ’ roles in movies. A common prediction technique given one pairwise relation, for example a #users × #movies ratings matrix, is lowrank matrix factorization. In domains with multiple relations, represented as multiple matrices, we may improve predictive accuracy by exploiting information from one relation while predicting another. To this end, we propose a collective matrix factorization model: we simultaneously factor several matrices, sharing parameters among factors when an entity participates in multiple relations. Each relation can have a different value type and error distribution; so, we allow nonlinear relationships between the parameters and outputs, using Bregman divergences to measure error. We extend standard alternating projection algorithms to our model, and derive an efficient Newton update for the projection. Furthermore, we propose stochastic optimization methods to deal with large, sparse matrices. Our model generalizes several existing matrix factorization methods, and therefore yields new largescale optimization algorithms for these problems. Our model can handle any pairwise relational schema and a
Rankingbased clustering of heterogeneous information networks with star network schema
 In: Proc. 2009 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2009
, 2009
"... A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on ..."
Abstract

Cited by 84 (30 self)
 Add to MetaCart
(Show Context)
A heterogeneous information network is an information network composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in each cluster. However, although clustering on homogeneous networks has been studied over decades, clustering on heterogeneous networks has not been addressed until recently. A recent study proposed a new algorithm, RankClus, for clustering on bityped heterogeneous networks. However, a realworld network may consist of more than two types, and the interactions among multityped objects play a key role at disclosing the rich semantics that a network carries. In this paper, we study clustering of multityped heterogeneous networks with a star network schema and propose a novel algorithm, NetClus, that utilizes links across multityped objects to generate highquality netclusters. An iterative enhancement method is developed that leads to effective rankingbased clustering in such heterogeneous networks. Our experiments on DBLP data show that NetClus generates more accurate clustering results than the baseline topic model algorithm PLSA and the recently proposed algorithm, RankClus. Further, NetClus generates informative clusters, presenting good ranking and cluster membership information for each attribute object in each netcluster.
A Unified View of Matrix Factorization Models
"... Abstract. We present a unified view of matrix factorization that frames the differences among popular methods, such as NMF, Weighted SVD, EPCA, MMMF, pLSI, pLSIpHITS, Bregman coclustering, and many others, in terms of a small number of modeling choices. Many of these approaches can be viewed as m ..."
Abstract

Cited by 57 (0 self)
 Add to MetaCart
(Show Context)
Abstract. We present a unified view of matrix factorization that frames the differences among popular methods, such as NMF, Weighted SVD, EPCA, MMMF, pLSI, pLSIpHITS, Bregman coclustering, and many others, in terms of a small number of modeling choices. Many of these approaches can be viewed as minimizing a generalized Bregman divergence, and we show that (i) a straightforward alternating projection algorithm can be applied to almost any model in our unified view; (ii) the Hessian for each projection has special structure that makes a Newton projection feasible, even when there are equality constraints on the factors, which allows for matrix coclustering; and (iii) alternating projections can be generalized to simultaneously factor a set of matrices that share dimensions. These observations immediately yield new optimization algorithms for the above factorization methods, and suggest novel generalizations of these methods such as incorporating row and column biases, and adding or relaxing clustering constraints. 1
Allatonce Optimization for Coupled Matrix and Tensor Factorizations
, 1105
"... Joint analysis of data from multiple sources has the potential to improve our understanding of the underlying structures in complex data sets. For instance, in restaurant recommendation systems, recommendations can be based on rating histories of customers. In addition to rating histories, customers ..."
Abstract

Cited by 29 (3 self)
 Add to MetaCart
(Show Context)
Joint analysis of data from multiple sources has the potential to improve our understanding of the underlying structures in complex data sets. For instance, in restaurant recommendation systems, recommendations can be based on rating histories of customers. In addition to rating histories, customers ’ social networks (e.g., Facebook friendships) and restaurant categories information (e.g., Thai or Italian) can also be used to make better recommendations. The task of fusing data, however, is challenging since data sets can be incomplete and heterogeneous, i.e., data consist of both matrices, e.g., the person by person social network matrix or the restaurant by category matrix, and higherorder tensors, e.g., the “ratings ” tensor of the form restaurant by meal by person. In this paper, we are particularly interested in fusing data sets with the goal of capturing their underlying latent structures. We formulate this problem as a coupled matrix and tensor factorization (CMTF) problem where heterogeneous data sets are modeled by fitting outerproduct models to higherorder tensors and matrices in a coupled manner. Unlike traditional approaches solving this problem using alternating algorithms, we propose an allatonce optimization approach called CMTFOPT (CMTFOPTimization), which is a gradientbased optimization approach for joint analysis of matrices and higherorder tensors. We also extend the algorithm to handle coupled incomplete data sets. Using numerical experiments, we demonstrate that the proposed allatonce approach is more accurate than the alternating least squares approach.
Graph regularized transductive classification on heterogeneous information networks
 In ECML PKDD
, 2010
"... Abstract. A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that stronglytyped heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for some objects. Learning fr ..."
Abstract

Cited by 25 (9 self)
 Add to MetaCart
(Show Context)
Abstract. A heterogeneous information network is a network composed of multiple types of objects and links. Recently, it has been recognized that stronglytyped heterogeneous information networks are prevalent in the real world. Sometimes, label information is available for some objects. Learning from such labeled and unlabeled data via transductive classification can lead to good knowledge extraction of the hidden network structure. However, although classification on homogeneous networks has been studied for decades, classification on heterogeneous networks has not been explored until recently. In this paper, we consider the transductive classification problem on heterogeneous networked data which share a common topic. Only some objects in the given network are labeled, and we aim to predict labels for all types of the remaining objects. A novel graphbased regularization framework, GNetMine, is proposed to model the link structure in information networks with arbitrary network schema and arbitrary number of object/link types. Specifically, we explicitly respect the type differences by preserving consistency over each relation graph corresponding to each type of links separately. Efficient computational schemes are then introduced to solve the corresponding optimization problem. Experiments on the DBLP data set show that our algorithm significantly improves the classification accuracy over existing stateoftheart methods. 1
iLink: Search and Routing in Social Networks
"... The growth of Web 2.0 and fundamental theoretical breakthroughs have led to an avalanche of interest in social networks. This paper focuses on the problem of modeling how social networks accomplish tasks through peer production style collaboration. We propose a general interaction model for the unde ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
The growth of Web 2.0 and fundamental theoretical breakthroughs have led to an avalanche of interest in social networks. This paper focuses on the problem of modeling how social networks accomplish tasks through peer production style collaboration. We propose a general interaction model for the underlying social networks and then a specific model (iLink) for social search and message routing. A key contribution here is the development of a general learning framework for making such online peer production systems work at scale. The iLink model has been used to develop a system for FAQ generation in a social network (FAQtory), and experience with its application in the context of a fullscale learningdriven workflow application (CALO) is reported. We also discuss methods of adapting iLink technology for use in military knowledge sharing portals and a other message routing systems. Finally, the paper shows the connection of iLink to SQM, a theoretical model for social search that is a generalization of Markov Decision Processes and the popular Pagerank model.
Unifying Dependent Clustering and Disparate Clustering for Nonhomogeneous Data
"... Modern data mining settings involve a combination of attributevalued descriptors over entities as well as specified relationships between these entities. We present an approach to cluster such nonhomogeneous datasets by using the relationships to impose either dependent clustering or disparate clus ..."
Abstract

Cited by 16 (8 self)
 Add to MetaCart
(Show Context)
Modern data mining settings involve a combination of attributevalued descriptors over entities as well as specified relationships between these entities. We present an approach to cluster such nonhomogeneous datasets by using the relationships to impose either dependent clustering or disparate clustering constraints. Unlike prior work that views constraints as boolean criteria, we present a formulation that allows constraints to be satisfied or violated in a smooth manner. This enables us to achieve dependent clustering and disparate clustering using the same optimization framework by merely maximizing versus minimizing the objective function. We present results on both synthetic data as well as several realworld datasets.
A Classification for Community Discovery Methods in Complex Networks
, 2011
"... Many realworld networks are intimately organized according to a community structure. Much research effort has been devoted to develop methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature on what is called today community detection. S ..."
Abstract

Cited by 16 (6 self)
 Add to MetaCart
Many realworld networks are intimately organized according to a community structure. Much research effort has been devoted to develop methods and algorithms that can efficiently highlight this hidden structure of a network, yielding a vast literature on what is called today community detection. Since network representation can be very complex and can contain different variants in the traditional graph model, each algorithm in the literature focuses on some of these properties and establishes, explicitly or implicitly, its own definition of community. According to this definition, each proposed algorithm then extracts the communities, which typically reflect only part of the features of real communities. The aim of this survey is to provide a ‘user manual’ for the community discovery problem. Given a meta definition of what a community in a social network is, our aim is to organize the main categories of community discovery methods based on the definition of community they adopt. Given a desired definition of community and the features of a problem (size of network, direction of edges, multidimensionality, and so on) this review paper is designed to provide a set of approaches that researchers could focus on. The proposed classification of community discovery methods is also useful for putting into perspective the many open
Approximation Algorithms for Bregman Clustering Coclustering and Tensor Clustering
, 2008
"... The Euclidean Kmeans problem is fundamental to clustering and over the years it has been intensely investigated. More recently, generalizations such as Bregman kmeans [8], coclustering [10], and tensor (multiway) clustering [40] have also gained prominence. A wellknown computational difficulty ..."
Abstract

Cited by 6 (2 self)
 Add to MetaCart
The Euclidean Kmeans problem is fundamental to clustering and over the years it has been intensely investigated. More recently, generalizations such as Bregman kmeans [8], coclustering [10], and tensor (multiway) clustering [40] have also gained prominence. A wellknown computational difficulty encountered by these clustering problems is the NPHardness of the associated optimization task, and commonly used methods guarantee at most local optimality. Consequently, approximation algorithms of varying degrees of sophistication have been developed, though largely for the basic Euclidean Kmeans (or ℓ1norm Kmedian) problem. In this paper we present approximation algorithms for several Bregman clustering problems by building upon the recent paper of Arthur and Vassilvitskii [5]. Our algorithms obtain objective values within a factor O(log K) for Bregman kmeans, Bregman coclustering, Bregman tensor clustering, and weighted kernel kmeans. To our knowledge, except for some special cases, approximation algorithms have not been considered for these general clustering problems. There are several important implications of our work: (i) under the same assumptions as Ackermann et al. [2] it yields a much faster algorithm (nonexponential in K, unlike [2]) for informationtheoretic clustering, (ii) it answers several open problems posed by [4], including generalizations to Bregman coclustering, and tensor clustering, (iii) it provides practical and easy to implement methods—in contrast to several other common approximation approaches.