Query-Based Sampling of Text Databases
- ACM Transactions on Information Systems
, 1999
Cited by 134 (13 self)
... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
A Decision-Theoretic Approach to Database Selection in Networked IR
- ACM Transactions on Information Systems
, 1996
Cited by 113 (14 self)
this paper, we address the resource discovery issue, which consists of two subtasks, namely database detection and database selection. Database detection can be performed relatively easily, either by exploiting the name conventions used in the domain name service of the internet (e.g. names of ftp servers should start with `ftp.', names of Web servers with `www.') or by establishing central registries (e.g. the directory-of-servers for WAIS systems)
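The name-convention heuristic mentioned in the abstract above can be illustrated with a minimal sketch. The function name `guess_service` and the returned labels are hypothetical and are not taken from the paper; only the `ftp.` and `www.` prefixes come from the abstract.

```python
def guess_service(hostname: str) -> str:
    """Guess the service type of a host from its DNS name prefix.

    Assumption: only the two conventions named in the abstract are checked.
    """
    name = hostname.lower()
    if name.startswith("ftp."):
        return "ftp"
    if name.startswith("www."):
        return "web"
    # No recognized convention; a central registry would be needed instead.
    return "unknown"
```

A host that follows neither convention is reported as unknown, which is the case where the abstract's alternative of a central registry would apply.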
Building efficient and effective metasearch engines
- ACM Computing Surveys
, 2002
Cited by 107 (9 self)
Frequently a user's information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents to a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We will also point out some problems that need to be further researched.
Effective Retrieval with Distributed Collections
, 1998
Cited by 96 (13 self)
This paper evaluates the retrieval effectiveness of distributed information retrieval systems in realistic environments. We find that when a large number of collections are available, the retrieval effectiveness is significantly worse than that of centralized systems, mainly because typical queries are not adequate for the purpose of choosing the right collections. We propose two techniques to address the problem. One is to use phrase information in the collection selection index and the other is query expansion. Both techniques enhance the discriminatory power of typical queries for choosing the right collections and hence significantly improve retrieval results. Query expansion, in particular, brings the effectiveness of searching a large set of distributed collections close to that of searching a centralized collection. 1 Introduction In today's network environments, information is highly distributed. The Internet or World Wide Web, for example, contains thousands of collections. ...
Concept Hierarchy Based Text Database Categorization
, 2000
Cited by 35 (6 self)
Document categorization as a technique to improve the retrieval of useful documents has been extensively investigated. One important issue in a large-scale metasearch engine is to select text databases that are likely to contain useful documents for a given query. We believe that database categorization can be a potentially effective technique for good database selection, especially in the Internet environment where short queries are usually submitted. In this paper, we propose and evaluate several database categorization algorithms. This study indicates that while some document categorization algorithms could be adopted for database categorization, algorithms that take into consideration the special characteristics of databases may be more effective. Preliminary experimental results are provided to compare the proposed database categorization algorithms. A prototype database categorization system based on one of the proposed algorithms has been developed.
Estimating the Usefulness of Search Engines
, 1999
Cited by 32 (14 self)
In this paper, we present a statistical method to estimate the usefulness of a search engine for any given query. The estimates can be used by a metasearch engine to choose local search engines to invoke. For a given query, the usefulness of a search engine in this paper is defined to be a combination of the number of documents in the search engine that are sufficiently similar to the query and the average similarity of these documents. Experimental results indicate that the proposed estimation method is quite accurate. 1 Introduction Many search engines have been created on the Internet to help ordinary users find desired data. Each search engine has a corresponding database that defines the set of documents that can be searched by the search engine. Usually, an index for all documents in the database is created and stored in the search engine to speed up query processing. The amount of data in the Internet is huge (it is believed that by the end of 1997, there were more than 300 mil...
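The usefulness measure defined in the abstract above (the number of sufficiently similar documents together with their average similarity) can be sketched as follows. This is only an illustration of the quantity being estimated: it computes the value exactly from known per-document similarities, whereas the paper proposes a statistical method for estimating it. The function name `usefulness`, the strict `>` comparison, and the threshold parameter are assumptions.

```python
from typing import Sequence, Tuple


def usefulness(similarities: Sequence[float], threshold: float) -> Tuple[int, float]:
    """Return (count, average similarity) of documents above the threshold.

    Assumption: `similarities` holds one query-document similarity per
    document in the search engine's database.
    """
    above = [s for s in similarities if s > threshold]
    if not above:
        # No document is sufficiently similar to the query.
        return 0, 0.0
    return len(above), sum(above) / len(above)
```

A metasearch engine could rank local search engines by these two values and invoke only the top-ranked ones; how the two values are combined into a single ranking is not specified in the abstract.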
A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval
, 1999
Cited by 30 (0 self)
A model for optimal information retrieval over a distributed document collection is described and experimentally evaluated. The fusion of retrieval results corresponding to document subcollections is performed according to the Probability Ranking Principle. Part of the model is a selection criterion for effectively limiting the ranking process to a subset of subcollections.
Efficient and Effective Metasearch for a Large Number of Text Databases
, 1999
Cited by 25 (9 self)
Metasearch engines can be used to facilitate ordinary users for retrieving information from multiple local sources (text databases). In a metasearch engine, the contents of each local database is represented by a representative. Each user query is evaluated against the set of representatives of all databases in order to determine the appropriate databases to search. When the number of databases is very large, say in the order of tens of thousands or more, then a traditional metasearch engine may become inefficient as each query needs to be evaluated against too many database representatives. Furthermore, the storage requirement on the site containing the metasearch engine can be very large. In this paper, we propose to use a hierarchy of database representatives to improve the efficiency. We provide an algorithm to search the hierarchy. We show that the retrieval effectiveness of our algorithm is the same as that of evaluating the user query against all database representatives. We als...
Towards a Highly-Scalable and Effective Metasearch Engine
, 2001
Cited by 20 (2 self)
A metasearch engine is a system that supports unified access to multiple local search engines. Database selection is one of the main challenges in building a large-scale metasearch engine. The problem is to efficiently and accurately determine a small number of potentially useful local search engines to invoke for each user query. In order to enable accurate selection, metadata that reflect the contents of each search engine need to be collected and used. In this paper, we propose a highly scalable and accurate database selection method. This method has several novel features. First, the metadata for representing the contents of all search engines are organized into a single integrated representative. Such a representative yields both computation efficiency and storage efficiency. Second, our selection method is based on a theory for ranking search engines optimally. Experimental results indicate that this new method is very effective. An operational prototype system has been built based on the proposed approach.
Detection of Heterogeneities in a Multiple Text Database Environment
- In Proceedings of the Fourth IFCIS International Conference on Cooperative Information Systems
, 1999
Cited by 20 (7 self)
As the number of text retrieval systems (search engines) grows rapidly on the World Wide Web, there is an increasing need to build search brokers (metasearch engines) on top of them. Often, the task of building an effective and efficient metasearch engine is hindered by the heterogeneities among the underlying local search engines. In this paper, we first analyze the impact of various heterogeneities on building a metasearch engine. We then present some techniques that can be used to detect the most prominent heterogeneities among multiple search engines. Applications of utilizing the detected heterogeneities in building better metasearch engines will be provided.

