LETOR: A Benchmark Collection for Research on Learning to Rank for Information Retrieval
Cited by 11 (1 self)
LETOR is a benchmark collection for research on learning to rank for information retrieval, released by Microsoft Research Asia. In this paper, we describe the details of the LETOR collection and show how it can be used in different kinds of research. Specifically, we describe how the document corpora and query sets in LETOR are selected, how the documents are sampled, how the learning features and meta information are extracted, and how the datasets are partitioned for comprehensive evaluation. We then compare several state-of-the-art learning to rank algorithms on LETOR, report their ranking performance, and discuss the results. After that, we discuss possible new research topics that can be supported by LETOR, in addition to algorithm comparison. We hope that this paper can help people gain a deeper understanding of LETOR and enable more interesting research projects on learning to rank and related topics.
How to Make LETOR More Useful and Reliable
- SIGIR 2008 Workshop on Learning to Rank for Information Retrieval (LR4IR 2008)
, 2008
Cited by 6 (2 self)
Learning to rank has recently attracted great attention in both the information retrieval and machine learning communities. However, the lack of public datasets stood in its way until the LETOR benchmark dataset (actually a group of three datasets) was released at the SIGIR 2007 workshop on Learning to Rank for Information Retrieval (LR4IR 2007). Since then, this dataset has been widely used in many learning to rank papers and has greatly sped up the corresponding research. In this paper, we discuss how to further improve LETOR to make it more useful and reliable. First, we notice that some low-level information, such as the term frequency in each stream (title, body, url, anchor, etc.) and the stream length, is missing from the current feature set of LETOR. We propose adding this information to LETOR, so as to enable the reproduction or optimization of models like BM25. Second, we find that the sampling of documents associated with each query in LETOR was somewhat biased. We therefore propose a new document sampling strategy to reduce the bias. Third, the scale of LETOR (fewer than 100 queries) is relatively small for real-world ranking applications. We propose adding more queries to the current datasets in LETOR, and/or building even larger datasets by leveraging the effort of the entire information retrieval community.
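To illustrate why per-stream term frequency and stream length matter, here is a minimal sketch of a BM25 score computed from exactly those low-level quantities. It is not taken from the paper: the function name, argument names, the defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math


def bm25_score(term_freqs, doc_freqs, stream_len, avg_stream_len,
               num_docs, k1=1.2, b=0.75):
    """Score one stream (e.g. title or body) of one document for a query.

    term_freqs: {query term: frequency of the term in this stream}
    doc_freqs:  {query term: number of documents containing the term}
    All names and default parameters are illustrative assumptions.
    """
    # Length normalization: streams longer than average are penalized.
    norm = 1.0 - b + b * stream_len / avg_stream_len
    score = 0.0
    for term, tf in term_freqs.items():
        if tf == 0:
            continue
        df = doc_freqs[term]
        # Non-negative IDF variant (assumption; other variants exist).
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating term-frequency component.
        score += idf * tf * (k1 + 1.0) / (tf + k1 * norm)
    return score


if __name__ == "__main__":
    print(bm25_score({"rank": 2, "learn": 1}, {"rank": 10, "learn": 40},
                     stream_len=120, avg_stream_len=100, num_docs=1000))
```

Without the raw term frequencies and stream lengths, neither `k1` nor `b` can be tuned, which is the limitation the abstract points out.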
Ranking with Ordered Weighted Pairwise Classification
Cited by 2 (1 self)
In ranking with the pairwise classification approach, the loss associated with a predicted ranked list is the mean of the pairwise classification losses. This loss is inadequate for tasks like information retrieval, where we prefer ranked lists with high precision at the top of the list. We propose to optimize a larger class of loss functions for ranking, based on an ordered weighted average (OWA) (Yager, 1988) of the classification losses. Convex OWA aggregation operators range from the max to the mean depending on their weights, and can be used to focus on the top-ranked elements as they give more weight to the largest losses. When aggregating hinge losses, the optimization problem is similar to the SVM for interdependent output spaces. Moreover, we show that OWA aggregates of margin-based classification losses have good generalization properties. Experiments on the LETOR 3.0 benchmark dataset for information retrieval validate our approach.
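The OWA idea can be sketched in a few lines: sort the pairwise losses in decreasing order and take a weighted sum. The code below is an illustrative sketch, not the authors' implementation. The function names, the unit margin, and the choice to aggregate over all (relevant, irrelevant) pairs of a single query are assumptions.

```python
def pairwise_hinge_losses(scores, labels, margin=1.0):
    """Hinge loss for every (more relevant, less relevant) document pair."""
    losses = []
    for s_pos, y_pos in zip(scores, labels):
        for s_neg, y_neg in zip(scores, labels):
            if y_pos > y_neg:
                losses.append(max(0.0, margin - (s_pos - s_neg)))
    return losses


def owa(losses, weights):
    """Ordered weighted average: weights apply to losses sorted descending.

    With non-negative, non-increasing weights that sum to 1 the operator is
    convex; (1, 0, ..., 0) gives the max and uniform weights give the mean.
    """
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per loss")
    ordered = sorted(losses, reverse=True)
    return sum(w * l for w, l in zip(weights, ordered))


if __name__ == "__main__":
    losses = pairwise_hinge_losses([2.0, 0.5, 1.5], [1, 0, 0])
    print(owa(losses, [1.0, 0.0]))   # max of the pairwise losses
    print(owa(losses, [0.5, 0.5]))   # mean of the pairwise losses
```

Weights between these two extremes put more emphasis on the largest losses while still accounting for the rest.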
Clickthrough Log Analysis by Collaborative Ranking
- Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10)
Analyzing clickthrough log data is important for improving search performance as well as understanding user behaviors. In this paper, we propose a novel collaborative ranking model to tackle two difficulties in analyzing clickthrough logs. First, previous studies have shown that users tend to click top-ranked results even when they are less relevant. Therefore, we use pairwise ranking relations to avoid the position bias in clicks. Second, since click data are extremely sparse with respect to each query or user, we construct a collaboration model to eliminate the sparseness problem. We also find that the proposed model and previously popular click-based models address different aspects of clickthrough log data. We further propose a hybrid model that can achieve significant improvement compared to the baselines on a large-scale real-world dataset.
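The abstract does not say how the pairwise relations are derived from clicks. As one possible illustration only, the sketch below uses the common "clicked result beats unclicked results ranked above it" heuristic. The function name and the input format are hypothetical and are not the paper's method.

```python
def skip_above_preferences(ranked_docs, clicked):
    """Derive (preferred, less_preferred) pairs from one result page.

    ranked_docs: document ids in the order they were shown.
    clicked:     ids of the documents the user clicked.
    Assumption: a clicked document is preferred over every unclicked
    document shown above it, which sidesteps raw position bias.
    """
    clicked = set(clicked)
    prefs = []
    for pos, doc in enumerate(ranked_docs):
        if doc in clicked:
            for skipped in ranked_docs[:pos]:
                if skipped not in clicked:
                    prefs.append((doc, skipped))
    return prefs


if __name__ == "__main__":
    print(skip_above_preferences(["a", "b", "c", "d"], ["c"]))
```

A click on the top-ranked result alone yields no pairs under this heuristic, so it adds no evidence that merely reflects position.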
Cross-Market Model Adaptation with Pairwise Preference Data for Web Search Ranking
Machine-learned ranking techniques automatically learn a complex document ranking function given training data. These techniques have demonstrated the effectiveness and flexibility required of a commercial web search. However, manually labeled training data (with multiple absolute grades) has become the bottleneck for training a quality ranking function, particularly for a new domain. In this paper, we explore the adaptation of machine-learned ranking models across a set of geographically diverse markets with the market-specific pairwise preference data, which can be easily obtained from clickthrough logs. We propose a novel adaptation algorithm, Pairwise-Trada, which is able to adapt ranking models that are trained with multi-grade labeled training data to the target market using the target-market-specific pairwise preference data. We present results demonstrating the efficacy of our technique on a set of commercial search engine data.
Editor: blank
Learning to rank has recently attracted great attention in both the information retrieval and machine learning communities. However, the lack of public datasets stood in its way until the LETOR benchmark dataset (actually a group of three datasets) was released at the SIGIR 2007 workshop on Learning to Rank for Information Retrieval (LR4IR 2007). Since then, this dataset has been widely used in many learning to rank papers and has greatly sped up the corresponding research. Recently, we released the latest version, LETOR3.0. LETOR3.0 makes many improvements over the previous two versions and is more useful and reliable. In this paper, we describe the details of the datasets in LETOR3.0, including collection information, query sets, and the implementation of feature extraction. We also benchmark several widely used learning to rank methods on LETOR3.0, illustrating the results of these methods, suggesting new directions for research, and providing baseline results for future study.
Keywords: learning to rank, information retrieval, document sampling, feature extraction
New Learning Frameworks for Information Retrieval
, 2011
Recent advances in machine learning have enabled the training of increasingly complex information retrieval models. This dissertation proposes principled approaches to formalize the learning problems for information retrieval, with an eye towards developing a unified learning framework. This will conceptually simplify the overall development process, making it easier to reason about higher level goals and properties of the retrieval system. This dissertation advocates two complementary approaches, structured prediction and interactive learning, to learn feature-rich retrieval models that can perform well in practice.
A Short Introduction to Learning to Rank
- Special Section on Information-Based Induction Sciences and Machine Learning
SUMMARY Learning to rank refers to machine learning techniques for training the model in a ranking task. Learning to rank is useful for many applications in Information Retrieval,
Efficient Optimization of Performance Measures by Classifier Adaptation
In practical applications, machine learning algorithms are often needed to learn classifiers that optimize domain-specific performance measures. Previous research has focused on learning the needed classifier in isolation, yet learning a nonlinear classifier for nonlinear and nonsmooth performance measures is still hard. In this paper, rather than learning the needed classifier by optimizing a specific performance measure directly, we circumvent this problem by proposing a novel two-step approach called CAPO: first train nonlinear auxiliary classifiers with existing learning methods, and then adapt the auxiliary classifiers for specific performance measures. In the first step, auxiliary classifiers can be obtained efficiently by using off-the-shelf learning algorithms. For the second step, we show that the classifier adaptation problem can be reduced to a quadratic programming problem, which is similar to linear SVM^perf and can be solved efficiently. By exploiting nonlinear auxiliary classifiers, CAPO can generate a nonlinear classifier that optimizes a large variety of performance measures, including all performance measures based on the contingency table and AUC, whilst keeping high computational efficiency. Empirical studies show that CAPO is effective and computationally efficient, and that it is even more efficient than linear SVM^perf.
Index Terms: optimize performance measures, classifier adaptation, ensemble learning, curriculum learning

