Evaluating the Performance of Distributed Architectures for Information Retrieval using a Variety of Workloads
- ACM Transactions on Information Systems
, 1997
Cited by 33 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model which we use to present a series of experiments using a variety of workloads that measure system performance. We vary numerous system parameters such as the number of users, document collections, terms per query, query term frequency, think time, answers returned, and workload. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.
Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems - distributed applications; C.4 [Performance of Systems]: Performance Attributes; H.3.4 [Information Storage and Retrieval]: Systems and Software
General Terms: Experimentation, Performance
Additional Key Words and Phrases: Distributed information retrieval architectures
This material is based on work supported by ...
Performance Evaluation of a Distributed Architecture for Information Retrieval
- In Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1996
Cited by 28 (7 self)
Information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this paper, we describe a fully functional distributed IR system based on the Inquery unified IR system. To refine this prototype, we implement a flexible simulation model that analyzes performance issues given a wide variety of system parameters and configurations. We present a series of experiments that measure response time and system utilization and identify bottlenecks. We vary numerous system parameters, such as the number of users, text collections, terms per query, and workload to generalize our results for other distributed IR systems. Based on our initial results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of info...
Methodologies for Distributed Information Retrieval
- In ICDCS
, 1998
Cited by 17 (2 self)
Text collections have traditionally been located at a single site and managed as a monolithic whole. However, it is now common for a collection to be spread over several hosts and for these hosts to be geographically separated. In this paper we examine several alternative approaches to distributed text retrieval. We report on our experience with a full implementation of these methods, and give retrieval efficiency and retrieval effectiveness results for collections distributed over both a local area network and a wide area network. We conclude that, compared to monolithic systems, distributed information retrieval systems can be fast and effective, but that they are not efficient.
Partial Collection Replication versus Caching for Information Retrieval Systems
- IN THE ACM INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL
, 2000
Cited by 16 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms to attain timely and accurate retrieval of unstructured text. In this paper, we compare two mechanisms to improve IR system performance: partial collection replication and caching. When queries have locality, both mechanisms return results more quickly than sending queries to the original collection(s). Caches return results when a query exactly matches a previous one. Partial replicas are a form of caching that returns results when the IR technology determines the query is a good match. Caches are simpler and faster, but replicas can increase locality by detecting similarity between queries that are not exactly the same. We use real traces from THOMAS and Excite to measure query locality and similarity. With a very restrictive definition of query similarity, similarity improves query locality up to 15% over exact match. We use a validated simulator to compare their performance, and find that even if the partial replica hit rate increases only 3 to 6%, it will outperform simple caching under a variety of configurations. A combined approach will probably yield the best performance.
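The difference between an exact-match cache hit and a similarity-based replica hit, as described in this abstract, can be illustrated with a minimal sketch. The paper relies on the IR system itself to judge whether a query is a good match; the Jaccard term-overlap test, the threshold of 0.6, and every function name below are assumptions introduced only for illustration and are not the authors' method.

```python
def normalize(query):
    """Lowercase a query and reduce it to a set of terms (illustrative only)."""
    return frozenset(query.lower().split())


def exact_cache_hit(query, cached_queries):
    """A cache answers only when the query string matches a previous one exactly."""
    return query in cached_queries


def jaccard(a, b):
    """Term-overlap similarity between two term sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def replica_hit(query, replica_queries, threshold=0.6):
    """Assumed stand-in for replica selection: hit if any earlier query is similar enough."""
    terms = normalize(query)
    return any(jaccard(terms, normalize(q)) >= threshold for q in replica_queries)


def hit_rates(trace, seen, threshold=0.6):
    """Return (exact-match hit rate, similarity hit rate) for a query trace."""
    exact = sum(exact_cache_hit(q, seen) for q in trace)
    similar = sum(replica_hit(q, seen, threshold) for q in trace)
    return exact / len(trace), similar / len(trace)
```

On a trace containing near-duplicate queries, the similarity-based rate is at least as high as the exact-match rate, which is the locality gain the abstract refers to.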
Design of a Parallel and Distributed Web Search Engine
- IN PROCEEDINGS OF PARALLEL COMPUTING (PARCO) 2001 CONFERENCE. IMPERIAL
, 2001
Cited by 14 (4 self)
This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can be easily adjusted to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase the throughput and efficiently use communication, storing and computational resources. We used
Scalable Distributed Architectures for Information Retrieval
, 1999
Cited by 8 (4 self)
Scalable Distributed Architectures for Information Retrieval. May 1999. Zhihong Lu, B.Sc., Tongji University; M.Sc., Institute of Computing Technology, Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst. Directed by: Professor Kathryn S. McKinley.
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining ...
Performance Analysis of Distributed Information Retrieval
- Artificial Intelligence Laboratory, Massachusetts Institute of Technology
, 1995
Cited by 4 (1 self)
Large document collections are increasingly available over the network. In order for users to access these collections, information retrieval systems must provide coordinated, concurrent, and distributed access. Since even unified information retrieval (IR) systems place heavy demands on system resources, it is unclear how performance will be affected as user demand increases and the distributed IR systems grow in size. In this paper, we present the implementation of a simulator and a prototype system, and the design for experiments to study the performance of distributed IR systems. The prototype distributed information retrieval system is based on Inquery, an existing, unified IR system. We have implemented a flexible simulation model to serve as a platform for analyzing performance issues given a wide variety of system parameters and configurations. We validate the accuracy of our simulation model using the prototype. We present a series of experiments that are designed to measure system utilization and identify bottlenecks. We vary numerous system parameters, such as the number of users and text collections, number of terms per query, response time, and system load to generalize our results for other distributed IR systems.
Partial collection replication for information retrieval
- Information Retrieval
, 1999
Cited by 4 (0 self)
The explosion of content in distributed information retrieval (IR) systems requires new mechanisms in order to attain timely and accurate retrieval of unstructured text. This paper shows how to exploit locality by building, using, and searching partial replicas of text collections in a distributed IR system. In this work, a partial replica includes a subset of the documents from larger collection(s) and the corresponding inference network search mechanisms. For each query, the distributed system determines if a partial replica is a good match and then searches it, or it searches the original collection. We demonstrate that the performance of partial replication is better than that of systems that use caches which only store previous query and answer pairs. We first use logs from THOMAS and Excite to show how to build partial replicas and caches from frequent queries. We show that searching replicas can improve locality (from 3 to 20%) over the exact match required by caching. Replicas increase locality because they satisfy queries which are distinct but return the same or very similar answers. We then present a novel inference network replica selection function. We vary its parameters and compare it to previous collection selection functions, demonstrating a configuration that directs most of the appropriate queries to replicas in a replica hierarchy. We then explore the performance of partial replication in a distributed IR system. We compare it with caching and partitioning. Our validated simulator shows that the increases in locality due to replication make it preferable to caching alone, and that even a small increase of 4% in locality translates into a performance advantage. We also show a hybrid system with caches and replicas that performs better than either on its own.
Basic issues on the processing of web queries
- In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05)
, 2005
Cited by 3 (2 self)
Search engines represent a key component of the Web economy these days. Despite that, there is not much technical literature available on their design, fine tuning, and internal operation. In this work, we make a preliminary attempt to partially fill this gap. We distinguish two phases of Web query processing: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. The second phase has a cost that is basically constant in the size of the collection, while the cost of the first phase is affected by the size of the collection. Thus, we concentrate here on studying the behavior of a search engine while executing the first phase of query processing. Using real data and a small cluster of index servers, we study four basic and key issues related to this first phase of query processing: load balance, broker behavior, performance by individual index servers, and overall throughput. Our study, while preliminary, does reveal interesting tradeoffs: (1) that load imbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) that the broker is not a bottleneck, (3) that disk and CPU utilization at individual servers depends on the relationship between memory size and the distribution of frequencies for the query terms, and (4) that load imbalance at high loads prevents higher throughput. Our results suggest that further studying and evaluating search engines is a promising research avenue.
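The first finding above, randomizing the distribution of documents among index servers, can be sketched in a few lines. This is only an illustration under stated assumptions: the shuffle-then-deal scheme, the max-over-mean imbalance measure, and the names `randomized_partition` and `imbalance` are hypothetical and are not taken from the paper.

```python
import random


def randomized_partition(doc_ids, num_servers, seed=0):
    """Shuffle document ids, then deal them round-robin to the index servers.

    Shuffling breaks up runs of related documents (for example, pages crawled
    from the same site) that would otherwise land on a single server.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return [shuffled[i::num_servers] for i in range(num_servers)]


def imbalance(loads):
    """Ratio of the busiest server's load to the mean load (1.0 means balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0
```

Every document is assigned to exactly one server and partition sizes differ by at most one; `imbalance` can then be applied to per-server work counts measured for a query stream.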
Distributed Processing of Conjunctive Queries
Cited by 1 (0 self)
We distinguish two phases of Web query processing: (a) retrieving information on documents related to the queries and ranking them, and (b) generating snippets, title, and URL information for the answer page. Using real data and a small cluster of index servers, we study four basic and key issues related to this first phase of query processing: load balance, broker behavior, performance by individual index servers, and overall throughput. Our study reveals interesting tradeoffs: (1) that load imbalance at low query arrival rates can be controlled with a simple measure of randomizing the distribution of documents among the index servers, (2) that the broker is not a bottleneck, (3) that disk and CPU utilization at individual servers depends on the relationship between memory size and the distribution of frequencies for the query terms, and (4) that load imbalance at high loads prevents higher throughput.
Categories and Subject Descriptors: H.3.4 [Information Storage and Retrieval]: Systems

