Results 1–10 of 90
Semi-supervised graph clustering: a kernel approach
, 2008
Abstract

Cited by 94 (3 self)
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with nonlinear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.
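The abstract's key construction, folding pairwise constraints into a kernel so that weighted kernel k-means optimizes the constrained objective, can be sketched roughly as follows. This is a simplified illustration, not the paper's exact HMRF formulation: the linear base kernel, the uniform point weights, and the single penalty weight `w` are all assumptions made here for brevity.

```python
import numpy as np

def constrained_kernel(X, must_link, cannot_link, w=1.0):
    """Fold pairwise constraints into a kernel: start from the linear
    kernel K = X X^T, then reward must-links and penalize cannot-links
    by +/- w. A simplified stand-in for the HMRF-to-kernel construction;
    the paper derives these weights from the constrained objective."""
    K = X @ X.T
    for i, j in must_link:
        K[i, j] += w
        K[j, i] += w
    for i, j in cannot_link:
        K[i, j] -= w
        K[j, i] -= w
    return K

def kernel_kmeans(K, k, n_iter=100):
    """Plain (uniform-weight) kernel k-means. The feature-space distance
    from point i to cluster c's mean expands to
    K[i,i] - 2*mean_{j in c} K[i,j] + mean_{j,l in c} K[j,l]."""
    n = K.shape[0]
    labels = np.arange(n) % k          # simple deterministic init
    for _ in range(n_iter):
        dist = np.zeros((n, k))
        for c in range(k):
            idx = np.where(labels == c)[0]
            if len(idx) == 0:          # empty cluster: never assign to it
                dist[:, c] = np.inf
                continue
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, idx].mean(axis=1)
                          + K[np.ix_(idx, idx)].mean())
        new = dist.argmin(axis=1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

Because only the kernel changes, the same clustering loop serves both the unconstrained and constrained cases, which is the point of the reduction.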
Measuring constraint-set utility for partitional clustering algorithms
 In: Proceedings of the Tenth European Conference on Principles and Practice of Knowledge Discovery in Databases
, 2006
Abstract

Cited by 49 (5 self)
Abstract. Clustering with constraints is an active area of machine learning and data mining research. Previous empirical work has convincingly shown that adding constraints to clustering improves the performance of a variety of algorithms. However, in most of these experiments, results are averaged over different randomly chosen constraint sets from a given set of labels, thereby masking interesting properties of individual sets. We demonstrate that constraint sets vary significantly in how useful they are for constrained clustering; some constraint sets can actually decrease algorithm performance. We create two quantitative measures, informativeness and coherence, that can be used to identify useful constraint sets. We show that these measures can also help explain differences in performance for four particular constrained clustering algorithms.
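The informativeness measure described above can be approximated in a few lines: run the unconstrained algorithm, then count the fraction of constraints its output violates, since violated constraints carry information the algorithm could not infer on its own. This is a hedged sketch of one of the two measures; the paper's exact definition and normalization may differ.

```python
def informativeness(labels, must_link, cannot_link):
    """Fraction of constraints violated by an unconstrained clustering
    `labels`. A must-link (i, j) is violated when i and j land in
    different clusters; a cannot-link when they land in the same one."""
    violated = sum(labels[i] != labels[j] for i, j in must_link)
    violated += sum(labels[i] == labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return violated / total if total else 0.0
```

A constraint set scoring near zero tells the algorithm little it did not already know, matching the paper's observation that some sets add no value.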
Agglomerative Hierarchical Clustering with Constraints: Theoretical and Empirical Results
 Lecture notes in computer science
, 2005
Abstract

Cited by 47 (3 self)
Abstract. We explore the use of instance and cluster-level constraints with agglomerative hierarchical clustering. Though previous work has illustrated the benefits of using constraints for non-hierarchical clustering, their application to hierarchical clustering is not straightforward for two primary reasons. First, some constraint combinations make the feasibility problem (Does there exist a single feasible solution?) NP-complete. Second, some constraint combinations, when used with traditional agglomerative algorithms, can cause the dendrogram to stop prematurely in a dead-end solution even though there exist other feasible solutions with a significantly smaller number of clusters. When constraints lead to efficiently solvable feasibility problems and standard agglomerative algorithms do not give rise to dead-end solutions, we empirically illustrate the benefits of using constraints to improve cluster purity and average distortion. Furthermore, we introduce the new γ constraint and use it in conjunction with the triangle inequality to considerably improve the efficiency of agglomerative clustering.
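The dead-end phenomenon the abstract describes is easy to see in code: at each agglomerative step, only merges that keep every cannot-link satisfied are allowed, and the dendrogram dies when that set becomes empty before the target number of clusters is reached. A minimal sketch, assuming must-links have already been collapsed by transitive closure:

```python
def feasible_merges(clusters, cannot_link):
    """Return the pairs of cluster indices that can be merged without
    placing two cannot-linked points in the same cluster. An empty
    result while more merges are still wanted is exactly the dead-end
    solution the paper analyzes."""
    pairs = []
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            merged = clusters[a] | clusters[b]
            if not any(i in merged and j in merged for i, j in cannot_link):
                pairs.append((a, b))
    return pairs
```

For example, two clusters {0, 1} and {2, 3} with a cannot-link between 0 and 2 admit no feasible merge at all, even though a two-cluster feasible solution exists.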
COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity
 in ICDM, 2006
Abstract

Cited by 38 (4 self)
Cluster analysis has long been a fundamental task in data mining and machine learning. However, traditional clustering methods concentrate on producing a single solution, even though multiple alternative clusterings may exist. It is thus difficult for the user to validate whether the given solution is in fact appropriate, particularly for large and complex datasets. In this paper we explore the critical requirements for systematically finding a new clustering, given that an already known clustering is available, and we also propose a novel algorithm, COALA, to discover this new clustering. Our approach is driven by two important factors: dissimilarity and quality. These are especially important for finding a new clustering which is highly informative about the underlying structure of the data, but is at the same time distinctively different from the provided clustering. We undertake an experimental analysis and show that our method is able to outperform existing techniques, for both synthetic and real datasets.
Partially supervised coreference resolution for opinion summarization through structured rule learning
 In Proceedings of EMNLP
, 2006
Abstract

Cited by 20 (2 self)
Combining fine-grained opinion information to produce opinion summaries is important for sentiment analysis applications. Toward that end, we tackle the problem of source coreference resolution – linking together source mentions that refer to the same entity. The partially supervised nature of the problem leads us to define and approach it as the novel problem of partially supervised clustering. We propose and evaluate a new algorithm for the task of source coreference resolution that outperforms competitive baselines.
Identifying and generating easy sets of constraints for clustering
 In AAAI ’06: Proceedings, The Twenty-First National Conference on Artificial Intelligence and the Eighteenth Innovative Applications of Artificial Intelligence Conference
, 2006
Abstract

Cited by 18 (5 self)
Clustering under constraints is a recent innovation in the artificial intelligence community that has yielded significant practical benefit. However, recent work has shown that for some negative forms of constraints the associated subproblem of just finding a feasible clustering is NP-complete. These worst-case results for the entire problem class say nothing of where and how prevalent easy problem instances are. In this work, we show that there are large pockets within these problem classes where clustering under constraints is easy and that using easy sets of constraints yields better empirical results. We then illustrate several sufficient conditions from graph theory to identify a priori where these easy problem instances are and present algorithms to create large and easy-to-satisfy constraint sets.
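One concrete example of the kind of graph-theoretic sufficient condition the abstract alludes to: if every point participates in fewer than k cannot-link constraints, a greedy k-coloring of the cannot-link graph always finds a feasible assignment. The sketch below illustrates this notion of an "easy" instance; it is an illustration of the idea, not the paper's specific algorithms.

```python
def greedy_feasible(n, k, cannot_link):
    """Greedy k-coloring of the cannot-link graph over points 0..n-1.
    When every vertex has degree < k, a free color always exists, so
    the instance is feasible and trivially solvable. Returns a
    point -> cluster map, or None if the greedy pass gets stuck
    (the instance may still be feasible in that case)."""
    adj = {i: set() for i in range(n)}
    for i, j in cannot_link:
        adj[i].add(j)
        adj[j].add(i)
    color = {}
    for v in range(n):
        used = {color[u] for u in adj[v] if u in color}
        free = [c for c in range(k) if c not in used]
        if not free:
            return None
        color[v] = free[0]
    return color
```

A path of cannot-links is easy for k = 2, while a triangle of cannot-links with k = 2 is infeasible, matching the NP-complete boundary the abstract references.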
Unifying Dependent Clustering and Disparate Clustering for Nonhomogeneous Data
Abstract

Cited by 16 (8 self)
Modern data mining settings involve a combination of attribute-valued descriptors over entities as well as specified relationships between these entities. We present an approach to cluster such non-homogeneous datasets by using the relationships to impose either dependent clustering or disparate clustering constraints. Unlike prior work that views constraints as Boolean criteria, we present a formulation that allows constraints to be satisfied or violated in a smooth manner. This enables us to achieve dependent clustering and disparate clustering using the same optimization framework by merely maximizing versus minimizing the objective function. We present results on both synthetic data as well as several real-world datasets.
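The sign-flip idea, one smooth objective term serving both regimes, can be illustrated with a toy relational term over cross-dataset links. This is a heavy simplification of the paper's formulation, and all names here are assumptions:

```python
def relational_term(labels_a, labels_b, links, dependent=True):
    """Smooth constraint term over links between two clusterings:
    the fraction of linked pairs placed in agreeing clusters.
    Maximizing it pushes toward dependent clustering; flipping the
    sign (and minimizing) pushes toward disparate clustering."""
    agree = sum(labels_a[i] == labels_b[j] for i, j in links) / len(links)
    return agree if dependent else -agree
```

Because the term is a fraction rather than a hard pass/fail test, an optimizer can trade partial constraint satisfaction against clustering quality, which is the "smooth" behavior the abstract contrasts with Boolean constraints.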
Constrained Locally Weighted Clustering
 In Proc. of the International Conference on Very Large Data Bases (VLDB)
, 2008
Abstract

Cited by 15 (1 self)
Data clustering is a difficult problem due to the complex and heterogeneous nature of multidimensional data. To improve clustering accuracy, we propose a scheme to capture the local correlation structures: associate each cluster with an independent weighting vector and embed it in the subspace spanned by an adaptive combination of the dimensions. Our clustering algorithm takes advantage of known pairwise instance-level constraints. The data points in the constraint set are divided into groups through inference, and each group is assigned to the feasible cluster which minimizes the sum of squared distances between all the points in the group and the corresponding centroid. Our theoretical analysis shows that the probability of points being assigned to the correct clusters is much higher with the new algorithm than with conventional methods. This is confirmed by our experimental results, indicating that our design indeed produces clusters which are closer to the ground truth than clusters created by current state-of-the-art algorithms.
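The group-assignment step described in the abstract, assigning each inferred constraint group to the feasible cluster minimizing total squared distance, might look like this in outline. The function name and the `infeasible` parameter are assumptions for illustration, and the paper additionally applies per-cluster dimension weights, which are omitted here:

```python
import numpy as np

def assign_group(X, group, centroids, infeasible=()):
    """Assign a must-linked group of points to the feasible cluster
    minimizing the sum of squared distances from every point in the
    group to that cluster's centroid. `infeasible` lists cluster ids
    ruled out by cannot-link constraints."""
    best, best_cost = None, float("inf")
    for c, mu in enumerate(centroids):
        if c in infeasible:
            continue
        cost = sum(np.sum((X[i] - mu) ** 2) for i in group)
        if cost < best_cost:
            best, best_cost = c, cost
    return best
```

Assigning whole groups at once, rather than individual points, is what lets the constraints steer the assignment even when a single point sits closer to the wrong centroid.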
ORIGAMI: A Novel and Effective Approach for Mining Representative Orthogonal Graph Patterns
, 2008
Abstract

Cited by 14 (6 self)
In this paper, we introduce the concept of α-orthogonal patterns to mine a representative set of graph patterns. Intuitively, two graph patterns are α-orthogonal if their similarity is bounded above by α. Each α-orthogonal pattern is also a representative for those patterns that are at least β similar to it. Given user-defined α, β ∈ [0, 1], the goal is to mine an α-orthogonal, β-representative set that minimizes the set of unrepresented patterns. We present ORIGAMI, an effective algorithm for mining the set of representative orthogonal patterns. ORIGAMI first uses a randomized algorithm to randomly traverse the pattern space, seeking previously unexplored regions, to return a set of maximal patterns. ORIGAMI then extracts an α-orthogonal, β-representative set from the mined maximal patterns. We show the effectiveness of our algorithm on a number of real and synthetic datasets. In particular, we show that our method is able to extract high-quality patterns even in cases where existing enumerative graph mining methods fail to do so.
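The extraction step, keeping only patterns whose pairwise similarity stays below α, can be sketched as a greedy pass. This is a simplification of ORIGAMI's procedure (the paper uses local search over candidate sets, not a single greedy scan), and the Jaccard similarity in the example is an assumption:

```python
def alpha_orthogonal(patterns, sim, alpha):
    """Greedily keep patterns whose similarity to every already-chosen
    pattern is bounded above by alpha. `sim(p, q)` is any similarity
    in [0, 1]; for graph patterns this would be a structural measure."""
    chosen = []
    for p in patterns:
        if all(sim(p, q) <= alpha for q in chosen):
            chosen.append(p)
    return chosen

def jaccard(p, q):
    """Example similarity for set-shaped patterns."""
    return len(p & q) / len(p | q)
```

Each rejected pattern is, by construction, more than α similar to some chosen one, which is how the chosen set comes to "represent" its neighborhood.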
Efficient incremental constrained clustering
 In Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining
, 2007
Abstract

Cited by 13 (0 self)
Clustering with constraints is an emerging area of data mining research. However, most work assumes that the constraints are given as one large batch. In this paper we explore the situation where the constraints are given incrementally: after seeing a clustering, the user can provide positive and negative feedback via constraints to critique the solution. We consider the problem of efficiently updating a clustering to satisfy the new and old constraints rather than re-clustering the entire data set. We show that the problem of incremental clustering under constraints is NP-hard in general, but identify several sufficient conditions which lead to efficiently solvable versions. These translate into a set of rules on the types of constraints that can be added and constraint-set properties that must be maintained. We demonstrate that this approach is more efficient than re-clustering the entire data set and has several other advantages.
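A toy version of "update rather than re-cluster": when a new must-link arrives and is violated, try moving one endpoint into the other's cluster, falling back to full re-clustering only when cannot-links block both moves. The function and its repair rule are illustrative assumptions, far simpler than the rule set the paper develops:

```python
def apply_must_link(labels, i, j, cannot_link):
    """Incrementally repair `labels` for a new must-link (i, j): if
    violated, move one endpoint into the other's cluster unless a
    cannot-link with a current member blocks it. Returns the updated
    labels and a success flag; False means fall back to re-clustering."""
    if labels[i] == labels[j]:
        return labels, True
    def blocked(p, target):
        # moving p into cluster `target` would violate some cannot-link
        return any((a == p and labels[b] == target) or
                   (b == p and labels[a] == target)
                   for a, b in cannot_link)
    for src, dst in ((i, j), (j, i)):
        if not blocked(src, labels[dst]):
            labels = list(labels)
            labels[src] = labels[dst]
            return labels, True
    return labels, False
```

The success flag mirrors the paper's dichotomy: constraints satisfying the sufficient conditions admit cheap local updates, while the rest force the expensive global path.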