Evaluating the Performance of Distributed Architectures for Information Retrieval using a Variety of Workloads
ACM Transactions on Information Systems, 1997
Cited by 33 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model which we use to present a series of experiments using a variety of workloads that measure system performance. We vary numerous system parameters such as the number of users, document collections, terms per query, query term frequency, think time, answers returned, and workload. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems -- distributed applications; C.4 [Performance of Systems]: Performance Attributes; H.3.4 [Information Storage and Retrieval]: Systems and Software
General Terms: Experimentation, Performance
Additional Key Words and Phrases: Distributed information retrieval architectures
This material is based on work supported by ...
Partial Collection Replication versus Caching for Information Retrieval Systems
In the ACM International Conference on Research and Development in Information Retrieval, 2000
Cited by 16 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we compare two mechanisms to improve IR system performance: partial collection replication and caching. When queries have locality, both mechanisms return results more quickly than sending queries to the original collection(s). Caches return results when a query exactly matches a previous one. Partial replicas are a form of caching that returns results when the IR technology determines the query is a good match. Caches are simpler and faster, but replicas can increase locality by detecting similarity between queries that are not exactly the same. We use real traces from THOMAS and Excite to measure query locality and similarity. With a very restrictive definition of query similarity, similarity improves query locality up to 15% over exact match. We use a validated simulator to compare the performance of the two mechanisms, and find that even if the partial replica hit rate increases only 3 to 6%, it will outperform simple caching under a variety of configurations. A combined approach will probably yield the best performance.
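The difference between an exact-match cache and a similarity-based lookup, as described in the abstract above, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the threshold of 0.5, and the sample query are hypothetical, and Jaccard term overlap is swapped in as a simple stand-in for the "good match" decision that the paper leaves to the IR technology. A real partial replica stores documents and searches them; this sketch only stores earlier queries and their answers.

```python
from typing import Optional


def terms(query: str) -> frozenset:
    """Lower-case the query and split it into a set of terms."""
    return frozenset(query.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Term-overlap similarity in [0, 1]; a stand-in for an IR match score."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cache_lookup(query: str, cache: dict) -> Optional[list]:
    """Exact-match cache: hit only if the identical query string was stored."""
    return cache.get(query)


def replica_lookup(query: str, replica: dict, threshold: float = 0.5) -> Optional[list]:
    """Similarity lookup: hit if some stored query overlaps enough with this one."""
    q = terms(query)
    best_score, best_answers = 0.0, None
    for stored_query, answers in replica.items():
        score = jaccard(q, terms(stored_query))
        if score > best_score:
            best_score, best_answers = score, answers
    return best_answers if best_score >= threshold else None


# Hypothetical earlier query and its answers (not data from the paper).
previous = {"health care reform bill": ["doc12", "doc40"]}

print(cache_lookup("health care reform", previous))    # miss: not the identical string
print(replica_lookup("health care reform", previous))  # hit: 3 of 4 distinct terms overlap
```

The sketch shows why similarity can raise the hit rate: a query that differs from a stored one by a single term misses the cache but still clears the overlap threshold.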
Design of a Parallel and Distributed Web Search Engine
In Proceedings of Parallel Computing (PARCO) 2001 Conference. Imperial, 2001
Cited by 14 (4 self)
This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase the throughput and efficiently use communication, storage, and computational resources. We used
Scalable Distributed Architectures for Information Retrieval
1999
Cited by 8 (4 self)
Scalable Distributed Architectures for Information Retrieval. May 1999. Zhihong Lu. B.Sc., Tongji University; M.Sc., Institute of Computing Technology, Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst. Directed by: Professor Kathryn S. McKinley.
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining ...
Efficiency Considerations for Scalable Information Retrieval Servers
2000
Cited by 7 (0 self)
We review a variety of techniques to improve efficiency in information retrieval. Given the increasing volumes of data that are available electronically, understanding and using such techniques is critical.
Partial collection replication for information retrieval
Information Retrieval, 1999
Cited by 4 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from the larger collection(s) and the corresponding inference network search mechanisms. For each query, the distributed system determines if a partial replica is a good match and then searches it, or it searches the original collection. We demonstrate that the performance of partial replication is better than that of systems that use caches which only store previous query and answer pairs. We first use logs from THOMAS and Excite to show how to build partial replicas and caches from frequent queries. We show that searching replicas can improve locality (from 3 to 20%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
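To see why a few percentage points of extra locality can matter, a back-of-the-envelope model helps. The Python sketch below is not the validated simulator from the paper; it is a simple weighted average that ignores queueing, load, and network effects. The function name and the service times of 1 and 10 seconds are hypothetical values chosen only for illustration.

```python
def expected_response_time(hit_rate: float, t_replica: float, t_original: float) -> float:
    """Mean response time when a fraction hit_rate of queries is served by the
    replica (or cache) and the remainder goes to the original collection."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * t_replica + (1.0 - hit_rate) * t_original


# Hypothetical service times in seconds, not measurements from the paper.
for rate in (0.30, 0.34):
    mean = expected_response_time(rate, t_replica=1.0, t_original=10.0)
    print(f"hit rate {rate:.2f}: mean response time {mean:.2f} s")
```

Under these assumed numbers, each additional point of hit rate moves a query from the slow path to the fast path, so the gain grows with the gap between the two service times. In a loaded system the effect is likely larger than this linear model suggests, because offloading the original collection also shortens its queues.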
Maintaining Retrieval Effectiveness in Distributed, Dynamic Information Retrieval Systems
1996
Cited by 3 (1 self)
Traditional information retrieval (IR) techniques were developed under the tacit assumptions of static, centralized archives of documents. Advanced techniques invariably use information derived from the entire collection in an effort to produce high-quality responses to user queries. In dynamic, distributed information environments, these assumptions are clearly not met. Heretofore easily obtainable collection-wide information (CWI) may be unavailable to some or all member sites in a distributed document archive, so some degree of incompleteness or inconsistency must be tolerated. In this dissertation, we present a rigorous empirical study investigating how allowing the view of CWI to drift from rigorously defined values influences retrieval effectiveness. We give a generic model for searching a document collection that allows for the use of CWI derived from a subset of the collection. Within this model, we identify two realistic scenarios where the use of subset-derived collection stat...
Searching a Terabyte of Text Using Partial Replication
1999
Cited by 1 (1 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. In this paper, we investigate using partial replication to search a terabyte of text in our distributed IR system. We use a replica selection database to direct queries to relevant replicas that maintain query effectiveness, but at the same time restrict some searches to a small percentage of data to improve performance and scalability, and to reduce network latency. Using a validated simulator, we compare database partitioning to partial replication with load balancing, and find partial replication is much more effective at decreasing query response time, even with fewer resources, and it requires only modest query locality. We also demonstrate average query response times under 10 seconds for a variety of workloads with partial replication on a terabyte text database. We further investigate query locality with respect to time, replica size, and replica updating costs using real logs from THOMAS and Excite, and discuss the sensitivity of our results to these sample points.
The Effect Of Collection Organization And Query Locality On Information Retrieval System Performance And Design
1999
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. Collection selection and partial collection replication with replica selection are two such mechanisms that enable IR systems to search a small percentage of data and thus improve performance and scalability. To maintain effectiveness at the same time, IR systems must be configured carefully, taking into account workload locality, possible collection organizations, and any interaction that results. This work builds on previous results which have focused on maintaining effectiveness. We propose IR system architectures with collection selection and partial replication based on collection organization and query locality characteristics which maintain accuracy and achieve high performance. Using a validated simulator, we compare configurations that partition data and replicate data, and their sensitivities to ...

