Content-based full-text search remains a particularly challenging problem in peer-to-peer (P2P) systems. Traditionally, there have been two index partitioning structures: partitioning based on the document space and partitioning based on keywords. The former requires searching every node in the system to answer a query, whereas the latter transmits a large amount of data when processing multi-term queries. In this paper, we propose eSearch, a P2P keyword search system based on a novel hybrid indexing structure. In eSearch, each node is responsible for certain terms. Given a document, eSearch uses a modern information retrieval algorithm to select a small number of top (important) terms in the document and publishes the complete term list for the document to the nodes responsible for those top terms. This selective replication of term lists allows a multi-term query to be processed locally at the nodes responsible for the query terms. We also propose automatic query expansion to alleviate the degradation in search quality caused by selective replication, overlay source multicast to reduce the cost of disseminating term lists, and techniques to balance term list distribution across nodes. eSearch is scalable and efficient, and obtains search results as good as state-of-the-art centralized systems. Despite the use of replication, eSearch actually consumes less bandwidth than systems based on keyword partitioning when publishing metadata for a document. During a retrieval operation, it searches only a small number of nodes and typically transmits a small amount of data (3.3 KB) that is independent of the size of the corpus and grows slowly (logarithmically) with the number of nodes in the system. eSearch's efficiency comes at a modest storage cost, 6.8 times that of systems based on keyword partitioning. This cost can be further reduced by adopting index compression or pruning techniques.
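The hybrid indexing idea described above can be illustrated with a minimal sketch. It is not the eSearch implementation: the hash-based mapping of terms to nodes (`node_for`), the use of raw term frequency in place of the paper's information retrieval algorithm for picking top terms (`top_terms`), the constants `NUM_NODES` and `TOP_K`, and the simple sum-of-frequencies score are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter, defaultdict

NUM_NODES = 8  # hypothetical system size
TOP_K = 3      # hypothetical number of top terms selected per document


def node_for(term: str) -> int:
    """Map a term to the node responsible for it (assumed: simple hash partitioning)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_NODES


def top_terms(terms, k=TOP_K):
    """Pick the k most frequent terms (a stand-in for a real IR term-weighting scheme)."""
    counts = Counter(terms)
    # Descending frequency, ties broken alphabetically so the result is deterministic.
    return sorted(counts, key=lambda t: (-counts[t], t))[:k]


class Node:
    def __init__(self):
        # term -> {doc_id: complete term-frequency list of that document}
        self.index = defaultdict(dict)


def publish(nodes, doc_id, text):
    """Replicate the document's complete term list only to nodes owning its top terms."""
    terms = text.lower().split()
    full_term_list = Counter(terms)
    for term in top_terms(terms):
        nodes[node_for(term)].index[term][doc_id] = full_term_list


def search(nodes, query):
    """Visit only the nodes owning the query terms; score each hit locally."""
    query_terms = query.lower().split()
    scores = {}
    for term in query_terms:
        postings = nodes[node_for(term)].index.get(term, {})
        for doc_id, full_term_list in postings.items():
            # The replicated term list lets this node evaluate every query term
            # without contacting the nodes responsible for the other terms.
            scores[doc_id] = sum(full_term_list[t] for t in query_terms)
    return [d for d, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]
```

In this sketch, a document is found only if at least one query term is among its selected top terms, which mirrors the quality degradation that the abstract says query expansion is meant to alleviate. A multi-term query that includes one top term is still scored against all its terms at a single node, because that node holds the document's complete term list.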