by Edleno S. De Moura, Altigran S. Silva, Pavel Calado, Daniel R. Fernandes, Mario A. Nascimento
In Proceedings of the 14th International Conference on World Wide Web
Download: http://www.www2005.org/cdrom/docs/p235.pdf
Abstract:
The fast and continuous growth in the volume of indexed (and indexable) documents on the Web poses a great challenge for search engines. This is true regarding not only search effectiveness but also time and space efficiency. In this paper we present an index pruning technique targeted at search engines that addresses the latter issue without disregarding the former. To this end, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. Experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. When compared to the index size with no compression at all, the compression rate is higher than 88%, i.e., the pruned index occupies less than one eighth of the original size. More importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. The new method yields significant improvements when compared against the best known static pruning method for search engine indices. In addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine.