Results 1-10 of 44
Aggregating inconsistent information: ranking and clustering
, 2005
Abstract

Cited by 226 (17 self)
We address optimization problems in which we are given contradictory pieces of input information and the goal is to find a globally consistent solution that minimizes the number of disagreements with the respective inputs. Specifically, the problems we address are rank aggregation, the feedback arc set problem on tournaments, and correlation and consensus clustering. We show that for all these problems (and various weighted versions of them), we can obtain improved approximation factors using essentially the same remarkably simple algorithm. Additionally, we almost settle a long-standing conjecture of Bang-Jensen and Thomassen and show that unless NP⊆BPP, there is no polynomial-time algorithm for the problem of minimum feedback arc set in tournaments.
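To make the objective concrete for rank aggregation, the sketch below counts pairwise disagreements between a candidate ranking and a set of input rankings. It assumes that "disagreements" means item pairs ordered differently (the Kendall tau distance), which the abstract does not spell out; the function names are hypothetical and the code is an illustration, not the authors' algorithm.

```python
from itertools import combinations


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the item pairs that the two rankings order differently.

    Assumes both rankings are lists over the same set of distinct items.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y])
    )


def aggregation_cost(candidate, input_rankings):
    """Total pairwise disagreements between the candidate and every input."""
    return sum(kendall_tau_distance(candidate, r) for r in input_rankings)


# Three voters ranking three items (made-up data for illustration).
votes = [["a", "b", "c"], ["a", "c", "b"], ["b", "a", "c"]]
cost = aggregation_cost(["a", "b", "c"], votes)  # 0 + 1 + 1 = 2
```

A rank aggregation algorithm would search for the candidate with the smallest such cost; this snippet only evaluates a given candidate.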
Aggregation of partial rankings, p-ratings and top-m lists
 ACM-SIAM Symposium on Discrete Algorithms (SODA)
, 2007
Abstract

Cited by 37 (5 self)
We study the problem of aggregating partial rankings. This problem is motivated by applications such as meta-searching and information retrieval, search engine spam fighting, e-commerce, learning from experts, analysis of population preference sampling, committee decision making and more. We improve recent constant-factor approximation algorithms for aggregation of full rankings and generalize them to partial rankings. Our algorithms improve the constant-factor approximation with respect to all metrics discussed in Fagin et al.’s recent important work on comparing partial rankings. We pay special attention to two important types of partial rankings: the well-known top-m lists and the more general p-ratings, which we define. We provide first evidence for hardness of aggregating them for constant m, p.
Deterministic pivoting algorithms for constrained ranking and clustering problems
, 2007
Abstract

Cited by 34 (4 self)
We consider ranking and clustering problems related to the aggregation of inconsistent information, in particular, rank aggregation, (weighted) feedback arc set in tournaments, consensus and correlation clustering, and hierarchical clustering. Ailon, Charikar, and Newman [4], Ailon and Charikar [3], and Ailon [2] proposed randomized constant-factor approximation algorithms for these problems, which recursively generate a solution by choosing a random vertex as “pivot” and dividing the remaining vertices into two groups based on the pivot vertex. In this paper, we answer an open question in these works by giving deterministic approximation algorithms for these problems. The analysis of our algorithms is simpler than the analysis of the randomized algorithms in [4], [3] and [2]. In addition, we consider the problem of finding minimum-cost rankings and clusterings which must obey certain constraints (e.g. an input partial order in the case of ranking problems), which were introduced by Hegde and Jain [25] (see also [34]). We show that the first type of algorithms we propose can also handle these constrained problems. In addition, we show that in the case of a rank aggregation or consensus clustering problem, if the input rankings or clusterings obey the constraints, then we can always ensure that the output of
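The randomized pivoting scheme described in this abstract (pick a random pivot, split the remaining vertices into two groups, recurse) can be sketched for the ranking case as follows. This is a minimal illustration of the randomized scheme attributed to the earlier works, not the deterministic algorithm of this paper; `pivot_rank` and the `beats` predicate are hypothetical names, and the input is assumed to be a tournament over distinct vertices.

```python
import random


def pivot_rank(vertices, beats, rng=None):
    """Order the vertices of a tournament by recursive random pivoting.

    beats(u, v) must return True when the tournament has the arc u -> v,
    meaning u is preferred to v. Exactly one of beats(u, v) and
    beats(v, u) is assumed to hold for every pair of distinct vertices.
    """
    if rng is None:
        rng = random.Random()
    vertices = list(vertices)
    if len(vertices) <= 1:
        return vertices
    pivot = rng.choice(vertices)
    # Vertices that beat the pivot go to its left, all others to its right.
    left = [v for v in vertices if v != pivot and beats(v, pivot)]
    right = [v for v in vertices if v != pivot and not beats(v, pivot)]
    return pivot_rank(left, beats, rng) + [pivot] + pivot_rank(right, beats, rng)


# On a transitive tournament the procedure reduces to quicksort.
order = pivot_rank([3, 1, 2, 5, 4], lambda u, v: u < v, random.Random(0))
```

When the tournament is transitive, every pivot choice yields the same sorted order; on tournaments with cycles, different pivots can give different rankings, which is where the approximation analysis comes in.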
Consensus Clustering Algorithms: Comparison and Refinement
Abstract

Cited by 24 (0 self)
Consensus clustering is the problem of reconciling clustering information about the same data set coming from different sources or from different runs of the same algorithm. Cast as an optimization problem, consensus clustering is known as median partition, and has been shown to be NP-complete. A number of heuristics have been proposed as approximate solutions, some with performance guarantees. In practice, the problem is apparently easy to approximate, but guidance is necessary as to which heuristic to use depending on the number of elements and clusterings given. We have implemented a number of heuristics for the consensus clustering problem, and here we compare their performance, independent of data size, in terms of efficacy and efficiency, on both simulated and real data sets. We find that, based on the underlying algorithms and their behavior in practice, the heuristics can be categorized into two distinct groups, with ramifications as to which one to use in a given situation, and that a hybrid solution is the best bet in general. We have also developed a refined consensus clustering heuristic for the occasions when the given clusterings may be too disparate, and their consensus may not be representative of any one of them, and we show that in practice the refined consensus clusterings can be much superior to the general consensus clustering.
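As a rough illustration of the median partition objective, the sketch below measures the distance between two clusterings as the number of element pairs that one puts together and the other separates, then picks the input clustering closest to all the others. The pairwise distance and the pick-the-best-input baseline are assumptions made for this example; the abstract does not say which distance or heuristics the authors use, and all function names are hypothetical.

```python
from itertools import combinations


def partition_distance(labels_a, labels_b):
    """Count element pairs co-clustered in one clustering but not the other.

    Each clustering is a list of cluster labels, one per element, and both
    lists are assumed to cover the same elements in the same order.
    """
    n = len(labels_a)
    return sum(
        1
        for i, j in combinations(range(n), 2)
        if (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
    )


def total_distance(candidate, clusterings):
    """Sum of distances from the candidate to every input clustering."""
    return sum(partition_distance(candidate, c) for c in clusterings)


def best_input_clustering(clusterings):
    """Baseline: the input clustering with the smallest total distance."""
    return min(clusterings, key=lambda c: total_distance(c, clusterings))


# Three clusterings of four elements (made-up data for illustration).
clusterings = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
consensus = best_input_clustering(clusterings)
```

Restricting the search to the inputs keeps the example short; a true median partition may be a clustering that none of the sources produced.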
Fitting tree metrics: Hierarchical clustering and phylogeny
 In Proceedings of the Symposium on Foundations of Computer Science
, 2005
Abstract

Cited by 23 (3 self)
Given dissimilarity data on pairs of objects in a set, we study the problem of fitting a tree metric to this data so as to minimize additive error (i.e. some measure of the difference between the tree metric and the given data). This problem arises in constructing an M-level hierarchical clustering of objects (or an ultrametric on objects) so as to match the given dissimilarity data, a basic problem in statistics. Viewed in this way, the problem is a generalization of the correlation clustering problem (which corresponds to M = 1). We give a very simple randomized combinatorial algorithm for the M-level hierarchical clustering problem that achieves an approximation ratio of M + 2. This is a generalization of a previous factor 3 algorithm for correlation clustering on complete graphs. The problem of fitting tree metrics also arises in phylogeny, where the objective is to learn the evolution tree by fitting a tree to dissimilarity data on taxa. The quality of the fit is measured by taking the ℓ_p norm of the difference between the tree metric constructed and the given data. Previous results obtained a factor 3 approximation for finding the closest tree metric under the ℓ_∞ norm. No nontrivial approximation for general ℓ_p norms was known before. We present a novel LP formulation for this problem and obtain an O((log n log log n)^{1/p}) approximation using this. En route, we obtain an O((log n log log n)^{1/p}) approximation for the closest ultrametric under the ℓ_p norm. Our techniques are based on representing and viewing an ultrametric as a hierarchy of clusterings, and may be useful in other contexts.
Heterogeneous Data Integration with the Consensus Clustering Formalism
 Proceedings of Data Integration in the Life Sciences
, 2004
Abstract

Cited by 17 (3 self)
Meaningfully integrating massive multi-experimental genomic data sets is becoming critical for the understanding of gene function. We have recently proposed methodologies for integrating large numbers of microarray data sets based on consensus clustering. Our methods combine gene clusters into a unified representation, or a consensus, that is insensitive to misclassifications in the individual experiments. Here we extend their utility to heterogeneous data sets and focus on their refinement and improvement. First of all, we compare our best heuristic to the popular majority rule consensus clustering heuristic, and show that the former yields tighter consensuses. We propose a refinement to our consensus algorithm by clustering of the source-specific clusterings as a step before finding the consensus between them, thereby improving our original results and increasing their biological relevance. We demonstrate our methodology on three data sets of yeast with biologically interesting results. Finally, we show that our methodology can deal successfully with missing experimental values.
Heterogeneous Source Consensus Learning via Decision Propagation and Negotiation
Abstract

Cited by 13 (6 self)
Nowadays, enormous amounts of data are continuously generated not only in massive scale, but also from different, sometimes conflicting, views. Therefore, it is important to consolidate different concepts for intelligent decision making. For example, to predict the research areas of some people, the best results are usually achieved by combining and consolidating predictions obtained from the publication network, co-authorship network and the textual content of their publications. Multiple supervised and unsupervised hypotheses can be drawn from these information sources, and negotiating their differences and consolidating decisions usually yields a much more accurate model due to the diversity and heterogeneity of these models. In this paper, we address the problem of “consensus learning” among competing hypotheses, which either rely on outside knowledge (supervised learning) or internal structure (unsupervised clustering). We argue that consensus learning is an NP-hard problem and thus propose to solve it by an efficient heuristic method. We construct a belief graph to first propagate predictions from supervised models to the unsupervised, and then negotiate and reach consensus among them. Their final decision is further consolidated by calculating each model’s weight based on its degree of consistency with other models. Experiments are conducted on 20 Newsgroups data, Cora research papers, DBLP author-conference network, and Yahoo! Movies datasets, and the results show that the proposed method improves the classification accuracy and the clustering quality measure (NMI) over the best base model by up to 10%. Furthermore, it runs in time proportional to the number of instances, which is very efficient for large-scale data sets.
Deterministic approximation algorithms for ranking and clusterings
, 2005
Abstract

Cited by 9 (0 self)
We give deterministic versions of randomized approximation algorithms for several ranking and clustering problems that were proposed by Ailon, Charikar and Newman [1]. We show that under a reasonable extension of the triangle inequality in clustering problems, we can resolve Ailon et al.’s open question whether there is an approximation algorithm for weighted correlation clustering with weights satisfying the triangle inequality.
Correlation Clustering Revisited: The “True” Cost of Error Minimization Problems
Abstract

Cited by 6 (2 self)
Correlation Clustering was defined by Bansal, Blum, and Chawla as the problem of clustering a set of elements based on a possibly inconsistent binary similarity function between element pairs. Their setting is agnostic in the sense that a ground truth clustering is not assumed to exist, and the only reasonable way to measure the cost of a solution is by comparing it with the input similarity function. This problem has been studied in theory and application and has been subsequently proven to be APX-hard. In this work we assume that there does exist an unknown correct clustering of the data. This is the case in applications such as record linkage in databases. In this setting, we argue that it is more reasonable to measure accuracy of the output clustering against the unknown underlying true clustering. This corresponds to the intuition that in real life an action is penalized or rewarded based on reality and not on our noisy perception thereof. The traditional combinatorial optimization version of the problem only offers an indirect solution to our revisited version via a triangle inequality argument applied to the distances between the output clustering, the input similarity function and the underlying ground truth. In the revisited version, we show that it is possible to shortcut the traditional optimization detour and obtain a factor 2 approximation. This factor could not have possibly been obtained by using a solution