General Terms: Experimentation, Theory
Abstract:
This paper describes a simple way of adapting the BM25 ranking formula to deal with structured documents. In the past it has been common to compute scores for the individual fields (e.g. title and body) independently and then combine these scores (typically linearly) to arrive at a final score for the document. We highlight how this approach can lead to poor performance by breaking the carefully constructed non-linear saturation of term frequency in the BM25 function. We propose a much more intuitive alternative which weights term frequencies before the nonlinear term frequency saturation function is applied. In this scheme, a structured document with a title weight of two is mapped to an unstructured document with the title content repeated twice. This more verbose unstructured document is then ranked in the usual way. We demonstrate the advantages of this method with experiments on Reuters Vol1 and the TREC dotGov collection.
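The difference between the two combination schemes can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the value `k1 = 1.2`, the constant `idf = 1.0`, and the omission of document-length normalisation are all assumptions made to keep the example short.

```python
from collections import Counter


def saturate(tf, k1=1.2):
    """BM25-style term-frequency saturation (length normalisation omitted; k1 is an assumed value)."""
    return tf * (k1 + 1) / (tf + k1)


def score_linear_combination(field_tfs, weights, idf=1.0, k1=1.2):
    """Saturate each field's term frequency separately, then combine the field scores linearly."""
    return sum(weights[f] * idf * saturate(tf, k1) for f, tf in field_tfs.items())


def score_weighted_tf(field_tfs, weights, idf=1.0, k1=1.2):
    """Weight the term frequencies first, then apply the saturation function once."""
    combined_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturate(combined_tf, k1)


def flatten(doc, weights):
    """Map a structured document to an unstructured one by repeating each field `weight` times.

    Only integer weights are supported in this sketch.
    """
    tokens = []
    for field, text in doc.items():
        tokens.extend(text.split() * weights[field])
    return tokens


# Hypothetical example: title weight 2, body weight 1, one query term.
weights = {"title": 2, "body": 1}
doc = {"title": "bm25 ranking", "body": "structured documents and bm25"}
term = "bm25"

field_tfs = {f: text.split().count(term) for f, text in doc.items()}
flat_tf = Counter(flatten(doc, weights))[term]

linear = score_linear_combination(field_tfs, weights)
weighted = score_weighted_tf(field_tfs, weights)
print(field_tfs, flat_tf, linear, weighted)
```

In this toy case the term occurs once in the title and once in the body, so the flattened document contains it three times, which equals the weighted term frequency. The linear combination yields 3.0, above the single-field ceiling of `k1 + 1 = 2.2`, whereas the weighted-frequency score is about 1.57 and always stays below that ceiling, so the saturation behaviour is preserved.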

