Is CORI effective for collection selection? an exploration of parameters, queries, and data
- in Proceedings of Australian Document Computing Symposium, 2004
- Cited by 5 (2 self)
Abstract. In distributed information retrieval, a wide range of techniques have been proposed for choosing collections to interrogate. Many of these collection-selection techniques are based on ranking the lexicons; of these, arguably the best known is the CORI collection ranking metric, which includes several parameters that, in principle, should be tuned for different data sets. However, parameters chosen in early work on CORI have been used without alteration in almost all subsequent work, despite drastic differences in the data collections. We have explored the behaviour of CORI for a range of data sets and parameter values. It appears that parameters cannot reliably be chosen for CORI: not only do the optimal choices vary between data sets, but they also vary between query types and, indeed, vary wildly within query sets. Coupled with the observation that even CORI with optimal parameters is usually less effective than other methods, we conclude that the use of CORI as a benchmark collection selection method is inappropriate.
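For readers unfamiliar with the parameters the abstract refers to, the following is a minimal sketch of the commonly cited form of the CORI belief function. The constants 50, 150 and 0.4 are the defaults usually quoted in the literature, not values taken from this abstract, and the function and argument names are hypothetical.

```python
import math


def cori_belief(df, cf, cw, avg_cw, num_collections,
                df_base=50.0, df_factor=150.0, default_belief=0.4):
    """Belief that one collection satisfies one query term (CORI-style sketch).

    df: number of documents in this collection containing the term
    cf: number of collections containing the term
    cw: number of words in this collection
    avg_cw: mean number of words per collection
    df_base, df_factor, default_belief: the tunable parameters
        (commonly quoted defaults; an assumption, not from the abstract)
    """
    if df == 0 or cf == 0:
        # The term gives no evidence for this collection.
        return default_belief
    t = df / (df + df_base + df_factor * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return default_belief + (1.0 - default_belief) * t * i


def cori_score(term_stats, cw, avg_cw, num_collections, **params):
    """Average the per-term beliefs for a query.

    term_stats: list of (df, cf) pairs, one per query term.
    """
    beliefs = [cori_belief(df, cf, cw, avg_cw, num_collections, **params)
               for df, cf in term_stats]
    return sum(beliefs) / len(beliefs)
```

Changing df_base, df_factor or default_belief changes the scores each collection receives, which is the kind of parameter sensitivity the abstract describes.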
Experimentation, Measurement
Algorithms in distributed information retrieval often rely on accurate knowledge of the size of a collection. The “multiple capture-recapture” method of Shokouhi et al. is one of the more reliable algorithms for determining collection size, but it relies on samples with a uniform number of documents. Such uniform samples are often hard to obtain in a working system. A simple generalisation of multiple capture-recapture does not rely on uniform sample sizes. Simulations show it is as accurate as the original method even when sample sizes vary considerably, making it a useful technique in real tools.
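The abstract does not give the estimator itself. As an illustration only, the sketch below shows one natural way to pool capture-recapture evidence across samples of unequal size: sum the products of sample sizes over all pairs of samples and divide by the total number of documents seen in both samples of a pair. This is an assumption for illustration, not necessarily the formula used in the paper; with two samples it reduces to the classical Lincoln-Petersen estimate. The function name is hypothetical.

```python
from itertools import combinations


def estimate_collection_size(samples):
    """Estimate collection size from document-ID samples of any sizes.

    samples: iterable of iterables of document identifiers.
    Returns sum(|A| * |B|) / sum(|A & B|) over all pairs of samples,
    an illustrative pooled capture-recapture estimate.
    """
    sets = [set(s) for s in samples]
    pair_products = 0
    overlaps = 0
    for a, b in combinations(sets, 2):
        pair_products += len(a) * len(b)
        overlaps += len(a & b)
    if overlaps == 0:
        # No document was seen twice, so the estimate is undefined.
        raise ValueError("no documents recaptured; cannot estimate size")
    return pair_products / overlaps
```

For example, three samples of sizes 4, 3 and 5 with pairwise overlaps of 2, 1 and 1 documents give (12 + 20 + 15) / 4.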
Evaluating Server Selection for Federated Search
Abstract. Previous evaluations of server selection methods for federated search have either used metrics which are unconnected with user satisfaction, or have not been able to account for confounding factors due to other search components. We propose a new framework for evaluating federated search server selection techniques. In our model, we isolate the effect of other confounding factors such as server summaries and result merging. Our results suggest that state-of-the-art server selection techniques are generally effective but result merging methods can be significantly improved. Furthermore, we show that the performance differences among server selection techniques can be obscured by ineffective merging.
To what problem is distributed information retrieval the solution?
Distributed information retrieval, where a single broker coordinates retrieval from many independent search services, has been extensively studied, but typically without any particular application and sometimes even without any explicit motivation. There have been a handful of arguments given for distributed IR (coverage, effectiveness, and ease of use, for example), but these are not borne out by experience. I ask: are there in fact uses for distributed IR? There are, but generally for organisational, not technical, reasons, and they have not been well-studied.

