Results 1  10
of
119
Aggregating inconsistent information: ranking and clustering
, 2005
"... We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc ..."
Abstract

Cited by 221 (17 self)
 Add to MetaCart
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a longstanding conjecture of BangJensen and Thomassen and show that unless NP⊆BPP, there is no polynomial time algorithm for the problem of minimum feedback arc set in tournaments.
Clustering aggregation
 in ICDE 2005, 2005
"... We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each cat ..."
Abstract

Cited by 109 (1 self)
 Add to MetaCart
(Show Context)
We consider the following problem: given a set of clusterings, find a clustering that agrees as much as possible with the given clusterings. This problem, clustering aggregation, appears naturally in various contexts. For example, clustering categorical data is an instance of the problem: each categorical variable can be viewed as a clustering of the input rows. Moreover, clustering aggregation can be used as a metaclustering method to improve the robustness of clusterings. The problem formulation does not require apriori information about the number of clusters, and it gives a natural way for handling missing values. We give a formal statement of the clusteringaggregation problem, we discuss related work, and we suggest a number of algorithms. For several of the methods we provide theoretical guarantees on the quality of the solutions. We also show how sampling can be used to scale the algorithms for large data sets. We give an extensive empirical evaluation demonstrating the usefulness of the problem and of the solutions. 1
On the Hardness of Approximating Multicut and SparsestCut
 In Proceedings of the 20th Annual IEEE Conference on Computational Complexity
, 2005
"... We show that the MULTICUT, SPARSESTCUT, and MIN2CNF ≡ DELETION problems are NPhard to approximate within every constant factor, assuming the Unique Games Conjecture of Khot [STOC, 2002]. A quantitatively stronger version of the conjecture implies inapproximability factor of Ω(log log n). 1. ..."
Abstract

Cited by 99 (5 self)
 Add to MetaCart
(Show Context)
We show that the MULTICUT, SPARSESTCUT, and MIN2CNF ≡ DELETION problems are NPhard to approximate within every constant factor, assuming the Unique Games Conjecture of Khot [STOC, 2002]. A quantitatively stronger version of the conjecture implies inapproximability factor of Ω(log log n). 1.
Semisupervised graph clustering: a kernel approach
, 2008
"... Semisupervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semisupervised clustering algorithms are designed for data represented as vectors. In this ..."
Abstract

Cited by 94 (3 self)
 Add to MetaCart
Semisupervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semisupervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vectorbased and graphbased approaches. We first show that a recentlyproposed objective function for semisupervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel kmeans objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel kmeans and several graph clustering objectives enables us to perform semisupervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semisupervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with nonlinear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current stateoftheart semisupervised algorithms on both vectorbased and graphbased data sets.
A divideandmerge methodology for clustering
 ACM Transactions on Database Systems
, 2005
"... We present a divideandmerge methodology for clustering a set of objects that combines a topdown “divide ” phase with a bottomup “merge ” phase. In contrast, previous algorithms use either topdown or bottomup methods for constructing a hierarchical clustering or produce a flat clustering using ..."
Abstract

Cited by 69 (7 self)
 Add to MetaCart
(Show Context)
We present a divideandmerge methodology for clustering a set of objects that combines a topdown “divide ” phase with a bottomup “merge ” phase. In contrast, previous algorithms use either topdown or bottomup methods for constructing a hierarchical clustering or produce a flat clustering using local search (e.g. kmeans). Our divide phase produces a tree whose leaves are the elements of the set. For this phase, we suggest an efficient spectral algorithm. The merge phase quickly finds the optimal partition that respects the tree for many natural objective functions, e.g., kmeans, mindiameter, minsum, correlation clustering, etc. We present a metasearch engine that clusters results from web searches. We also give empirical results on textbased data where the algorithm performs better than or competitively with existing clustering algorithms. 1
Quadratic forms on graphs
 Invent. Math
, 2005
"... We introduce a new graph parameter, called the Grothendieck constant of a graph G = (V, E), which is defined as the least constant K such that for every A: E → R, ..."
Abstract

Cited by 50 (10 self)
 Add to MetaCart
(Show Context)
We introduce a new graph parameter, called the Grothendieck constant of a graph G = (V, E), which is defined as the least constant K such that for every A: E → R,
Correlation clustering: Maximizing agreements via semidefinite programming
 In SODA
, 2004
"... Abstract We consider the Correlation Clustering problem introducedin [2]. Given a graph G = (V, E) where each edge is labeled ..."
Abstract

Cited by 48 (0 self)
 Add to MetaCart
Abstract We consider the Correlation Clustering problem introducedin [2]. Given a graph G = (V, E) where each edge is labeled
Largescale deduplication with constraints using dedupalog
 in: Proceedings of the 25th International Conference on Data Engineering (ICDE
"... Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication v ..."
Abstract

Cited by 47 (3 self)
 Add to MetaCart
(Show Context)
Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication venue”; iftwo paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalogstyle language with precise semantics. Most previous work on deduplication either ignore constraints or use them in an adhoc domainspecific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over realworld data demonstrating the utility of our framework for highquality and scalable deduplication. I.
Clustering Partially Observed Graphs via Convex Optimization
"... This paper considers the problem of clustering a partially observed unweighted graph – i.e. one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nod ..."
Abstract

Cited by 42 (12 self)
 Add to MetaCart
(Show Context)
This paper considers the problem of clustering a partially observed unweighted graph – i.e. one where for some node pairs we know there is an edge between them, for some others we know there is no edge, and for the remaining we do not know whether or not there is an edge. We want to organize the nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and sparse across clusters. We take a novel yet natural approach to this problem, by focusing on finding the clustering that minimizes the number of ”disagreements ”i.e. the sum of the number of (observed) missing edges within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) lowrank matrix and an (unknown) sparse matrix from their partially observed sum. We show that our algorithm succeeds under certain natural assumptions on the optimal clustering and its disagreements. Our results significantly strengthen existing matrix splitting results for the special case of our clustering problem. Our results directly enhance solutions to the problem of Correlation Clustering (Bansal et al., 2002) with partial observations.
Correlation Clustering in General Weighted Graphs
 Theoretical Computer Science
, 2006
"... We consider the following general correlationclustering problem [1]: given a graph with real nonnegative edge weights and a 〈+〉/〈− 〉 edge labeling, partition the vertices into clusters to minimize the total weight of cut 〈+ 〉 edges and uncut 〈− 〉 edges. Thus, 〈+ 〉 edges with large weights (represen ..."
Abstract

Cited by 39 (0 self)
 Add to MetaCart
(Show Context)
We consider the following general correlationclustering problem [1]: given a graph with real nonnegative edge weights and a 〈+〉/〈− 〉 edge labeling, partition the vertices into clusters to minimize the total weight of cut 〈+ 〉 edges and uncut 〈− 〉 edges. Thus, 〈+ 〉 edges with large weights (representing strong correlations between endpoints) encourage those endpoints to belong to a common cluster while 〈− 〉 edges with large weights encourage the endpoints to belong to different clusters. In contrast to most clustering problems, correlation clustering specifies neither the desired number of clusters nor a distance threshold for clustering; both of these parameters are effectively chosen to be the best possible by the problem definition. Correlation clustering was introduced by Bansal, Blum, and Chawla [1], motivated by both document clustering and agnostic learning. They proved NPhardness and gave constantfactor approximation algorithms for the special case in which the graph is complete (full information) and every edge has the same weight. We give an O(log n)approximation algorithm for the general case based on a linearprogramming rounding and the “regiongrowing ” technique. We also prove that this linear program has a gap of Ω(log n), and therefore our approximation is tight under this approach. We also give an O(r 3)approximation algorithm for Kr,rminorfree graphs. On the other hand, we show that the problem is equivalent to minimum multicut, and therefore APXhard and difficult to approximate better than Θ(logn). 1