With the proliferation of the world's "information highways," interest in efficient document indexing techniques has been renewed. This paper addresses the problem of incremental updates of inverted lists using a new dual-structure index. The index dynamically separates long and short inverted lists and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, we describe a space of engineering tradeoffs that ranges from optimizing update time to optimizing query performance. We explore this space quantitatively by using actual data and hardware in combination with a simulation of an information retrieval system. We then describe the best algorithm for a variety of criteria.
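To make the idea of separating long and short inverted lists concrete, the following is a minimal in-memory sketch. It is not the paper's algorithm. The class name `DualStructureIndex`, the `long_threshold` parameter, and the promotion rule (move a list to the long structure once it exceeds the threshold) are all assumptions made for illustration. The abstract does not specify how lists are classified or stored.

```python
class DualStructureIndex:
    """Toy index that keeps short and long inverted lists in separate stores.

    All names and the promotion policy here are hypothetical; the source
    abstract only states that long and short lists are separated dynamically.
    """

    def __init__(self, long_threshold=4):
        # Hypothetical tuning knob: lists longer than this are treated as "long".
        self.long_threshold = long_threshold
        self.short_lists = {}  # word -> list of document ids (small lists)
        self.long_lists = {}   # word -> list of document ids (append-only)

    def add_document(self, doc_id, words):
        """Incrementally add one document's words to the index."""
        for word in set(words):  # each word is recorded once per document
            if word in self.long_lists:
                # Long lists are only ever appended to.
                self.long_lists[word].append(doc_id)
                continue
            postings = self.short_lists.setdefault(word, [])
            postings.append(doc_id)
            if len(postings) > self.long_threshold:
                # Promote: the list has outgrown the short structure.
                self.long_lists[word] = self.short_lists.pop(word)

    def lookup(self, word):
        """Return a copy of the inverted list for a word (empty if unseen)."""
        if word in self.long_lists:
            return list(self.long_lists[word])
        return list(self.short_lists.get(word, []))


if __name__ == "__main__":
    index = DualStructureIndex(long_threshold=2)
    index.add_document(1, ["index", "update"])
    index.add_document(2, ["index"])
    index.add_document(3, ["index", "query"])
    print(index.lookup("index"))
    print("index" in index.long_lists, "update" in index.short_lists)
```

In this sketch a frequent word migrates to the long structure as documents arrive, while rare words stay in the short structure. A real system would map the two structures to different on-disk layouts, which is where the update-time versus query-time tradeoff mentioned in the abstract would arise.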