  Experimentation, Theory

by Stephen Robertson, Hugo Zaragoza, Michael Taylor
http://research.microsoft.com/users/hugoz/pubs/pdf/ser_sigir04.pdf

Abstract:

This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g., title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed nonlinear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative, which weights term frequencies before the nonlinear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
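
The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the commonly cited BM25 term-frequency component tf * (k1 + 1) / (tf + k1 * ((1 - b) + b * dl / avdl)), the frequently used defaults k1 = 1.2 and b = 0.75, a fixed idf of 1.0, and made-up average lengths. The function names and the toy document are hypothetical. The sketch checks that weighting term frequencies before saturation matches scoring a flat document with the title repeated, and that a linear combination of per-field scores can exceed the single-term saturation bound of k1 + 1.

```python
def bm25_saturation(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Common BM25 term-frequency component (assumed form, assumed defaults)."""
    length_norm = (1.0 - b) + b * doc_len / avg_doc_len
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def weighted_tf_and_length(doc, weights, term):
    """Combine per-field term counts and field lengths using the field weights."""
    tf = sum(w * doc[field].count(term) for field, w in weights.items())
    length = sum(w * len(doc[field]) for field, w in weights.items())
    return tf, length


def score_weighted_tf(doc, weights, term, avg_doc_len, idf=1.0):
    """Weight term frequencies first, then apply the saturation once."""
    tf, length = weighted_tf_and_length(doc, weights, term)
    return idf * bm25_saturation(tf, length, avg_doc_len)


def score_linear_combination(doc, weights, term, avg_field_len, idf=1.0):
    """Score each field independently, then combine the scores linearly."""
    total = 0.0
    for field, w in weights.items():
        tokens = doc[field]
        field_score = bm25_saturation(
            tokens.count(term), len(tokens), avg_field_len[field]
        )
        total += w * idf * field_score
    return total


# Hypothetical toy document, already tokenized.
doc = {
    "title": ["bm25", "ranking"],
    "body": ["ranking", "of", "structured", "documents", "with", "bm25"],
}
avg_doc_len = 10.0                            # assumed collection average
avg_field_len = {"title": 2.0, "body": 6.0}   # assumed per-field averages

# A title weight of two behaves like a flat document with the title repeated twice.
weights = {"title": 2, "body": 1}
flat = doc["title"] * 2 + doc["body"]
score_fields = score_weighted_tf(doc, weights, "bm25", avg_doc_len)
score_flat = bm25_saturation(flat.count("bm25"), len(flat), avg_doc_len)

# With a heavy title weight, the two schemes diverge.
heavy = {"title": 10, "body": 1}
score_heavy_tf = score_weighted_tf(doc, heavy, "bm25", avg_doc_len)
score_heavy_linear = score_linear_combination(doc, heavy, "bm25", avg_field_len)
saturation_bound = 1.2 + 1.0  # k1 + 1 with idf = 1.0
```

Under these assumptions, `score_fields` and `score_flat` agree, `score_heavy_tf` stays below `saturation_bound` however large the title weight is, and `score_heavy_linear` exceeds it, so a single title match can outweigh what any number of matches could earn in an unstructured document.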
