Results 1-10 of 69
Rank Aggregation Methods for the Web
, 2010
Cited by 478 (6 self)
We consider the problem of combining ranking results from various sources. In the context of the Web, the main applications include building metasearch engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. We develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. A primary goal of our work is to design rank aggregation techniques that can effectively combat "spam," a serious problem in Web searches. Experiments show that our methods are simple, efficient, and effective.
Model selection and accounting for model uncertainty in graphical models using Occam's window
, 1993
Cited by 370 (47 self)
We consider the problem of model selection and accounting for model uncertainty in high-dimensional contingency tables, motivated by expert system applications. The approach most used currently is a stepwise strategy guided by tests based on approximate asymptotic P-values leading to the selection of a single model; inference is then conditional on the selected model. The sampling properties of such a strategy are complex, and the failure to take account of model uncertainty leads to underestimation of uncertainty about quantities of interest. In principle, a panacea is provided by the standard Bayesian formalism which averages the posterior distributions of the quantity of interest under each of the models, weighted by their posterior model probabilities. Furthermore, this approach is optimal in the sense of maximising predictive ability. However, this has not been used in practice because computing the posterior model probabilities is hard and the number of models is very large (often greater than 10^11). We argue that the standard Bayesian formalism is unsatisfactory and we propose an alternative Bayesian approach that, we contend, takes full account of the true model uncertainty by averaging over a much smaller set of models. An efficient search algorithm is developed for finding these models. We consider two classes of graphical models that arise in expert systems: the recursive causal models and the decomposable
Data Clustering: 50 Years Beyond K-Means
, 2008
Cited by 294 (7 self)
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into taxonomic ranks: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of algorithms and methods for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is exploratory in nature: to find structure in data. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. In spite of the fact that K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, K-means is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and the ill-posed problem of clustering. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during data clustering, and large scale data clustering.
Comparing top k lists
 In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms
, 2003
Cited by 272 (4 self)
Motivated by several applications, we introduce various distance measures between “top k lists.” Some of these distance measures are metrics, while others are not. For each of these latter distance measures, we show that they are “almost” a metric in the following two seemingly unrelated aspects: (i) they satisfy a relaxed version of the polygonal (hence, triangle) inequality, and (ii) there is a metric with positive constant multiples that bound our measure above and below. This is not a coincidence: we show that these two notions of almost being a metric are formally identical. Based on the second notion, we define two distance measures to be equivalent if they are bounded above and below by constant multiples of each other. We thereby identify a large and robust equivalence class of distance measures. Besides the applications to the task of identifying good notions of (dis)similarity between two top k lists, our results imply polynomial-time constant-factor approximation algorithms for the rank aggregation problem [DKNS01] with respect to a large class of distance measures. To appear in SIAM J. on Discrete Mathematics. Extended abstract to appear in 2003 ACM-SIAM Symposium on Discrete Algorithms (SODA ’03).
Fault Localization with Nearest Neighbor Queries
, 2003
Cited by 234 (2 self)
We present a method for performing fault localization using similar program spectra. Our method assumes the existence of a faulty run and a larger number of correct runs. It then selects according to a distance criterion the correct run that most resembles the faulty run, compares the spectra corresponding to these two runs, and produces a report of "suspicious" parts of the program. Our method is widely applicable because it does not require any knowledge of the program input and no more information from the user than a classification of the runs as either "correct" or "faulty". To experimentally validate the viability of the method, we implemented it in a tool, WHITHER, using basic block profiling spectra. We experimented with two different similarity measures and the Siemens suite of 132 programs with injected bugs. To measure the success of the tool, we developed a generic method for establishing the quality of a report. The method is based on the way an "ideal user" would navigate the program using the report to save effort during debugging. The best results we obtained were, on average, above 50%, meaning that our ideal user would avoid looking at half of the program.
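The select-then-compare procedure in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: it assumes each spectrum is simply the set of basic blocks a run executed, and it uses the size of the symmetric difference as the distance criterion. Both choices, and the names `nearest_correct_run` and `suspicious_blocks`, are assumptions made for this example.

```python
def nearest_correct_run(faulty, correct_runs):
    """Return the correct-run spectrum closest to the faulty one.

    Distance here is the number of blocks covered by exactly one of the
    two runs (an assumed, illustrative distance criterion).
    """
    return min(correct_runs, key=lambda run: len(faulty ^ run))


def suspicious_blocks(faulty, correct_runs):
    """Report blocks the faulty run executed but its nearest correct run did not."""
    nearest = nearest_correct_run(faulty, correct_runs)
    return faulty - nearest


# Hypothetical spectra: each set holds the basic blocks one run executed.
faulty = frozenset({"b1", "b2", "b4", "b7"})
correct_runs = [
    frozenset({"b1", "b2", "b3"}),
    frozenset({"b1", "b2", "b4"}),
    frozenset({"b5", "b6"}),
]
report = suspicious_blocks(faulty, correct_runs)
print(sorted(report))
```

With these made-up spectra the second correct run is the nearest neighbor, so the report contains only the block that distinguishes it from the faulty run.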
What do we know about the Metropolis algorithm?
 J. Comput. System. Sci
, 1998
Cited by 87 (14 self)
The Metropolis algorithm is a widely used procedure for sampling from a specified distribution on a large finite set. We survey what is rigorously known about running times. This includes work from statistical physics, computer science, probability and statistics. Some new results are given as an illustration of the geometric theory of Markov chains. 1. Introduction. Let X be a finite set and π(x) > 0 a probability distribution on X. The Metropolis algorithm is a procedure for drawing samples from X. It was introduced by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller [1953]. The algorithm requires the user to specify a connected, aperiodic Markov chain K(x, y) on X. This chain need not be symmetric but if K(x, y) > 0, one needs K(y, x) > 0. The chain K is modified
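As a rough illustration of the procedure this abstract introduces, the sketch below runs a Metropolis chain on a small finite set. It makes simplifying assumptions that are not taken from the paper: the base chain proposes a uniformly random state (a symmetric, connected, aperiodic choice), so a move from x to y is accepted with probability min(1, π(y)/π(x)), and the target is given as unnormalized positive weights. The function name `metropolis` and the example weights are hypothetical.

```python
import random
from collections import Counter


def metropolis(weights, steps, rng):
    """Sample from the distribution proportional to `weights` on {0, ..., n-1}.

    Assumes a uniform (hence symmetric) proposal chain, so the acceptance
    probability reduces to min(1, weights[y] / weights[x]).
    All weights must be strictly positive.
    """
    n = len(weights)
    x = rng.randrange(n)          # arbitrary starting state
    counts = Counter()
    for _ in range(steps):
        y = rng.randrange(n)      # propose a state uniformly at random
        if rng.random() < min(1.0, weights[y] / weights[x]):
            x = y                 # accept the move; otherwise stay at x
        counts[x] += 1
    return [counts[s] / steps for s in range(n)]


# Target proportional to (1, 2, 3, 4), i.e. probabilities 0.1, 0.2, 0.3, 0.4.
freqs = metropolis([1, 2, 3, 4], steps=200_000, rng=random.Random(0))
print(freqs)
```

Over a long run the visit frequencies should approach the normalized weights; how many steps that takes is the running-time question the survey addresses.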
Cranking: Combining Rankings Using Conditional Probability Models on Permutations
 In Proceedings of the 19th International Conference on Machine Learning
, 2002
Cited by 53 (1 self)
A new approach to ensemble learning is introduced that takes ranking rather than classification as fundamental, leading to models on the symmetric group and its cosets. The approach uses a generalization of the Mallows model on permutations to combine multiple input rankings. Applications include the task of combining the output of multiple search engines and multiclass or multilabel classification, where a set of input classifiers is viewed as generating a ranking of class labels.
The Markov chain Monte Carlo revolution
, 2008
Cited by 44 (0 self)
The use of simulation for high-dimensional intractable computations has revolutionized applied mathematics. Designing, improving and understanding the new tools leads to (and leans on) fascinating mathematics, from representation theory through microlocal analysis.
Merging the results of approximate match operations
 In VLDB
, 2004
Cited by 44 (1 self)
Data Cleaning is an important process that has been at the center of research interest in recent years. An important end goal of effective data cleaning is to identify the relational tuple or tuples that are “most related” to a given query tuple. Various techniques have been proposed in the literature for efficiently identifying approximate matches to a query string against a single attribute of a relation. In addition to constructing a ranking (i.e., ordering) of these matches, the techniques often associate, with each match, scores that quantify the extent of the match. Since multiple attributes could exist in the query tuple, issuing approximate match operations for each of them separately will effectively create a number of ranked lists of the relation tuples. Merging these lists to identify a final ranking and scoring, and returning the top-K tuples, is a challenging task. In this paper, we adapt the well-known footrule distance (for merging ranked lists) to effectively deal with scores. We study efficient algorithms to merge rankings, and produce the top-K tuples, in a declarative way. Since techniques for approximately matching a query string against a single attribute in a relation are typically best deployed in a database, we introduce and describe two novel algorithms for this problem and we provide SQL specifications for them. Our experimental case study, using real application data along with a realization of our proposed techniques on a commercial database system, highlights the benefits of the proposed algorithms and attests to the overall effectiveness and practicality of our approach.
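For readers unfamiliar with the footrule distance mentioned in this abstract, here is a minimal sketch of the classic, position-only version between two rankings of the same items. It is not the score-aware adaptation the paper proposes, and the function name `footrule_distance` and the tuple identifiers are made up for the example.

```python
def footrule_distance(ranking_a, ranking_b):
    """Spearman footrule: sum over items of |position in a - position in b|.

    Both arguments are lists ordered from best to worst and must contain
    the same items (an assumption of this simplified version).
    """
    if set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - position_b[item]) for i, item in enumerate(ranking_a))


# Hypothetical ranked lists of tuple identifiers from two match operations.
by_name = ["t1", "t2", "t3"]
by_address = ["t3", "t2", "t1"]
print(footrule_distance(by_name, by_address))
```

A merged ranking can then be judged by the sum of its footrule distances to the input lists, with a smaller total indicating closer agreement.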
Ordinal measures for visual correspondence
 Columbia Univ. Center for
, 1996
Cited by 36 (5 self)
We present ordinal measures for establishing image correspondence. Linear correspondence measures like correlation and the sum of squared differences are known to be fragile. Ordinal measures, which are based on relative ordering of intensity values in windows, have demonstrable robustness to depth discontinuities, occlusion and noise. The relative ordering of intensity values in each window is represented by a rank permutation which is obtained by sorting the corresponding intensity data. By using a novel distance metric between the rank permutations, we arrive at ordinal correlation coefficients. These coefficients are independent of absolute intensity scale, i.e., they are normalized measures. Further, since rank permutations are invariant to monotone transformations of the intensity values, the coefficients are unaffected by nonlinear effects like gamma variation between images. We have developed a simple algorithm for their efficient implementation. Experiments suggest the superiority of ordinal measures over existing techniques under nonideal conditions. Though we present ordinal measures in the context of stereo, they serve as a general tool for image matching that is applicable to other vision problems such as motion estimation and image registration.
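The invariance claim in this abstract can be checked with a small sketch. It computes the rank permutation of a window by sorting its intensities and shows that a gamma-style monotone transformation leaves the permutation unchanged. The window values, the gamma of 0.5, and the tie-breaking by index are assumptions made for this example, and the paper's distance metric between permutations is not reproduced here.

```python
def rank_permutation(values):
    """Return the 1-based rank of each value within its window.

    Ranks are obtained by sorting; ties are broken by original index,
    which is an illustrative choice rather than the paper's rule.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical intensity window and a gamma-transformed copy of it.
window = [10, 30, 70, 50, 90]
gamma_window = [v ** 0.5 for v in window]  # monotone increasing transform

print(rank_permutation(window))
print(rank_permutation(gamma_window))
```

Both calls yield the same permutation, which is why any measure built only on rank permutations is unaffected by such intensity changes.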