| A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996. |
....to element names or indexes can be answered directly, the ancestor-descendant relationship can be inferred from the sequence of path steps, and the sequence index allows for identifying preceding or following elements. As an example, consider the path book[1,1]/chapter[1,3]/section[2,3]/p[3,4]. Here the different steps are separated by a slash. For each step, we first give the element name and then a pair of indexes, namely the element index and the sequence index. The chapter element for example is located on the second level of the XML document tree. It is the first chapter node ....
.... An obvious format for encoding such paths is described by the following syntax: path path length element element index sequence index. As a simple example, consider an inverted list entry which refers to a document position referenced by the path book[1,1]/chapter[1,3]/section[2,3]/p[3,4]. The corresponding information to encode would look like this (here angular brackets and spacing are used for illustration purposes only): 1 1 2 3 1 3 3 4. Given that entries in inverted lists are sorted by position, there is some redundancy, which can be used for compression. As a first ....
[Article contains additional citation context not shown here]
Alistair Moffat and Justin Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....3 Query Processing. Recent work has focused on improving query run-time efficiency. Moffat and Zobel have shown that query performance can be improved by modifying the inverted index to support fast scanning of a posting list [28, 29]. Other work has shown that reasonable effectiveness can be obtained by retrieving fewer terms in the query [18]. A recent study showed that the computation can be reduced even further by eliminating some of the complexity found in the vector space model [21]. In this section, we review some ....
....next partition or if the current partition should be scanned. The process continues until the partition is found and the document desired is matched against the elements of the partition. A partition size of about 1,000 resulted in the best CPU time for a set of TREC queries against the TREC data [29]. 3.2 Partial Result Set Retrieval. Another way to improve run-time performance is to stop processing after some threshold of computational resources has been expended. One approach has been to count disk I/O and stop after a threshold has been .... [Table 10: Training Sets vs Number of Unique Terms] ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....Science conference, Lecture Notes in Computer Science, Springer-Verlag 1530 (1998), pages 186-196. 1 Introduction and Motivation. Given a text string, full-text indexing is the problem of preprocessing the text so that search for a pattern string can be done efficiently. While inverted lists [4, 18] and signature files [27] can be used for indexing texts that are structured as long sequences of words or keys, suffix trees and suffix arrays are much more directly assignable to the problem of full-text search. A suffix tree [17] is a trie in which the leaves correspond to the suffixes of the ....
A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems 14(4) (1996) 349-379.
....structure (or of having to process the inverted lists in stages to stay within a memory limit). A simple heuristic to address this problem is to directly merge the inverted lists rather than decode them in turn. On the one hand, merging has the disadvantage that techniques such as skipping [11] cannot be as easily used to reduce processing costs (although, as we discuss later, skipping does not necessarily yield significant benefits). On the other hand, merging of at least some of the inverted lists is probably the only viable option when all the query terms are moderately common. ....
....of the term will be in a query phrase, and thus such reordering is unlikely to be effective. Offsets only have to be decoded when there is a document match, but they still have to be retrieved. Other techniques do have the potential to reduce query evaluation time, in particular skipping [11], in which additional information is placed in inverted lists to reduce the decoding required in regions in the list that cannot contain postings that will match documents that have been identified as potential matches. On older machines, on which CPU cycles were relatively scarce, skipping could ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, Oct. 1996.
....slower: as we show later, it is up to twice as slow as the fastest compressed scheme. In this paper, we revisit compression schemes for the inverted list component of inverted indexes. There have been a great many reports of experiments on compression of indexes with bitwise compression schemes [5, 7, 12, 14, 15], which use an integral number of bits to represent each integer, usually with no restriction on the alignment of the integers to byte or machine word boundaries. We consider several aspects of these schemes: how to decode bitwise representations of integers efficiently; how to minimise the ....
....each inverted list, an accumulator weight A_d is increased; the magnitude of the increase is dependent on the similarity measure used, and can consider the weight w_{q,t} of term t in the query q, the weight w_{d,t} of the term t in the document d, and other factors. Fourth, after processing part [1, 5] or all of the lists, the accumulator scores are partially sorted to identify the most similar documents. Last, for a typical search engine, document summaries of the top ten documents are generated or retrieved and shown to the user. The offsets stored in each inverted list posting are not used in ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, Oct. 1996.
....with the mixed list scheme, access to specific portions of the inverted list is available. For example, in Figure 6, to retrieve locations for cat starting at 311, we do not have to read the portion of the list for locations 100-280. The skipped list and random inverted list structures of [15] and [16] also provide selective access to portions of an inverted list, by dividing the inverted list into blocks each containing a fixed number of postings. However, those schemes assume a custom inverted file implementation and are not built on top of an existing data management system. Hot ....
....updates. We wished to explore strategies for gathering statistics during index construction. A great deal of work has been done on several other issues, relevant to inverted-index-based information retrieval, that have not been discussed in this paper. Such issues include index compression [15, 16, 25], incremental updates [9, 22, 25, 26] and distributed query performance [21]. 7. CONCLUSIONS. In this paper we addressed the problem of efficiently constructing inverted indexes over large collections of Web pages. We proposed a new pipelining technique to speed up index construction and showed ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....Performance issues in hybrid file sharing P2P systems have been compared to issues studied in information retrieval (IR) systems, since both systems provide a lookup service, and use inverted lists. Much work has been done on optimizing inverted list and overall IR system performance (e.g. [17, 25]). However, while the IR domain has many ideas applicable to P2P systems, there are differences between the two types of systems such that many optimization techniques cannot be directly applied. For example, IR and P2P systems have large differences in update frequency. Large IR systems with ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
.... text of n binary symbols, suffix arrays use n words of log n bits each (a total of n log n bits) while suffix trees require between 4n and 5n words (or between 4n log n and 5n log n bits) [35]. In contrast, inverted lists require less than 0.1 n / log n words (or 0.1 n bits) in many practical cases [38] in order to index a set of words consisting of a total of n bits. However, as previously mentioned, inverted files have less functionality than suffix arrays and suffix trees since only the words are indexed, whereas suffix arrays and suffix trees index all substrings of the text. No data ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, Oct. 1996.
....the disk blocks in [97] and dynamically selects large inverted lists to be managed separately. It is notable that they expect the scheme with the best incremental update performance to have the worst query processing performance due to fragmentation of the long inverted lists. Moffat and Zobel [60] describe an inverted list implementation that supports jumping forward in the list using skip pointers. This is useful for document based access into the list during conjunctive style processing. The purpose of these skip pointers is to provide synchronization points for decompression, allowing ....
....only update the scores of existing documents. This scheme can make no guarantees about the membership of the set. It does, however, calculate complete scores for the documents in the candidate set, guaranteeing a correct relative ranking. The second variation was proposed by Moffat and Zobel [60, 58, 59]. Rather than use an insertion threshold related to a term's potential score contribution, a hard limit is placed on the size of the candidate document set. The disjunctive phase proceeds until the candidate set is full. Then, the conjunctive phase proceeds until all of the query terms have been ....
[Article contains additional citation context not shown here]
Moffat, A. and Zobel, J. Self-indexing inverted files for fast text retrieval. Technical Report 94/2, Collaborative Information Technology Research Institute, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, Feb. 1994.
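The hard limit on the candidate set described in the context above can be sketched roughly as follows. This is a minimal illustration, not the cited implementation: the function name `rank_with_accumulator_limit`, the in-memory `postings` dictionary, the precomputed per-posting weights, and the choice to process shorter lists first are all assumptions made for the example.

```python
def rank_with_accumulator_limit(postings, query_terms, limit):
    """Score documents while capping the number of accumulators.

    postings maps term -> list of (doc_id, weight) pairs; the weights are
    assumed to be precomputed score contributions (hypothetical layout).
    """
    # Assumption: rarer terms (shorter lists) are processed first.
    ordered = sorted(query_terms, key=lambda t: len(postings.get(t, [])))
    accumulators = {}
    for term in ordered:
        for doc_id, weight in postings.get(term, []):
            if doc_id in accumulators:
                # Existing candidates are always updated (conjunctive-style).
                accumulators[doc_id] += weight
            elif len(accumulators) < limit:
                # New candidates are admitted only while the set is not full
                # (disjunctive phase).
                accumulators[doc_id] = weight
    # Highest score first; ties broken by document number.
    return sorted(accumulators.items(), key=lambda kv: (-kv[1], kv[0]))
```

With a limit of 2 and a rare term that matches documents 1 and 5, a later common term can raise the scores of documents 1 and 5 but cannot add further documents to the candidate set.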
....3 Query Processing. Recent work has focused on improving query run-time efficiency. Moffat and Zobel have shown that query performance can be improved by modifying the inverted index to support fast scanning of a posting list [21, 22]. Other work has shown that reasonable effectiveness can be obtained by retrieving fewer terms in the query [15]. A recent study showed that the computation can be reduced even further by eliminating some of the complexity found in the vector space model [17]. In this section, we review some ....
....next partition or if the current partition should be scanned. The process continues until the partition is found and the document desired is matched against the elements of the partition. A small size, d, of about 1,000 resulted in the best CPU time for a set of TREC queries against the TREC data [22]. 3.2 Partial Result Set Retrieval. Another way to improve run-time performance is to stop processing after some threshold of computational resources has been expended. One approach has been to count disk I/O and stop after a threshold has been reached [33]. The key to this approach is to sort ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
No context found.
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
No context found.
A. Moffat and J. Zobel, `Self-indexing inverted files for fast text retrieval', ACM Transactions on Information Systems. To appear.
No context found.
A. Moffat and J. Zobel, `Self-indexing inverted files for fast text retrieval', ACM Transactions on Information Systems. To appear. Preliminary version in Proc 5'th Australasian Database Conference, Christchurch, New Zealand, January 1994, pp. 79--91.
....of approximately K answers, but in a different order, and so must also provide an approximation to the full ranking. Furthermore, it is the longest inverted lists that are left untouched, so execution is very fast. Detailed descriptions and pseudocode for these two strategies are given elsewhere [11]. For TREC, the number of accumulators increases quickly as terms are processed. Even the rarer terms in the TREC queries occur in thousands of documents. For example, when processing topics 51-100 a limit of 10,000 accumulators is on average reached after the inverted lists of only six terms ....
....greatly reduce the time required to compute a ranking. One method to reclaim some of this waste is to insert into each inverted list a simple index, consisting of pointers at intervals that in effect divide the inverted list into a sequence of regular-sized blocks chained together by the pointers [11]. Each pointer or skip stores the bit address of the start of the next skip as well as the first d value in the block. This arrangement is illustrated in Figure 4. [Figure 4: Adding skips] The skips can be used to decide whether any of the pairs within the ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems. To appear.
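The skip structure described in the context above can be illustrated with a small sketch. It is a simplification under stated assumptions: the list is held uncompressed in memory, each block is a tuple of its first document number and its postings, and the "bit address of the next skip" is replaced by simply moving to the next list element. The names `build_blocks` and `contains` are hypothetical.

```python
def build_blocks(doc_ids, block_size):
    """Split a sorted list of document numbers into fixed-size blocks.

    Each block is (first_doc, docs); first_doc plays the role of the
    d value stored in a skip.
    """
    return [
        (doc_ids[i], doc_ids[i:i + block_size])
        for i in range(0, len(doc_ids), block_size)
    ]


def contains(blocks, target):
    """Return (found, postings_examined) for a lookup of target."""
    examined = 0
    for i, (first_doc, docs) in enumerate(blocks):
        # If the next block starts at or before the target, the target
        # cannot be in this block, so the block is skipped undecoded.
        if i + 1 < len(blocks) and blocks[i + 1][0] <= target:
            continue
        if first_doc > target:
            return False, examined
        # Only this one block has to be scanned.
        for d in docs:
            examined += 1
            if d == target:
                return True, examined
            if d > target:
                break
        return False, examined
    return False, examined
```

For a nine-entry list split into blocks of three, looking up a document in the second block examines at most three postings instead of scanning the list from the start, which is the saving the skips are meant to provide when decoding is the dominant cost.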
....identifying which documents should be returned. However, with a good heuristic ranked queries provide more effective retrieval than Boolean queries in terms of satisfying an information need [16] and ranked queries can be evaluated in time similar to that required by an equivalent Boolean query [14, 15]. The drawback of ranked queries in a distributed environment is that the most successful heuristics make use of several collection-dependent statistics, including the total number of documents; the number of distinct terms; the number of documents in which each term appears; and the number of ....
....stores, for each term t that appears in the collection, a list of document numbers d in which t appears together with f_{d,t}. While large, the inverted file can be stored compressed, and modern text retrieval systems generate indexes that typically occupy 10% or less of the volume of the text [14, 26]. The second important structure is a table of document weights computed by W d = # ( d,t ). These are precalculated and stored as part of the database. To allow measurement of the effectiveness of an information retrieval technique three resources are required: a corpus of text; a set of ....
[Article contains additional citation context not shown here]
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....the advent of intranets, the Internet, and online document authoring systems, there is rapid growth in both the number of users accessing text databases and in the volume of stored text. This pressure has led to the development of new, efficient algorithms for indexing and ranked query evaluation [20, 23, 35] that allow queries to be resolved much more rapidly and with fewer resources than with the best techniques of only a few years ago. These techniques include the use of compression, to reduce the size of index and text; heuristics that drastically reduce the number of documents to be considered as ....
....user. [Figure 1: Term order or TO processing for document ranking.] ....codes. Compared to lists with fixed-length fields, disk transfer costs are reduced by a factor of about three to six, and seek costs are somewhat reduced (because of the smaller number of disk tracks occupied by inverted lists) [20]. The trade-off is the requirement for decoding during query evaluation; we consider the impact of this decoding below. Using an inverted list there are in broad terms two standard strategies for evaluating a ranked query, term-ordered query evaluation and document-ordered query evaluation. We now ....
[Article contains additional citation context not shown here]
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....Typical Boolean queries of 3-5 terms returning a handful of documents operate with sub-second response, and even ranked queries of 30-50 terms and returning a megabyte or more of text operate in just tens of seconds. A detailed description of the indexing method employed may be found elsewhere [17]; and an overview of all components of the system is provided by Witten et al. [24]. This paper has concentrated on the text compression methods suitable for such a document database. Word-based models have been advocated by several researchers. In large information retrieval systems of the type ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. Technical Report 94/2, Collaborative Information Technology Research Institute, RMIT and The University of Melbourne, February 1994.
....in disk traffic because only part of each inverted list must be retrieved. Frequency sorting can potentially have an adverse impact on index size, because index compression techniques rely on the small differences between adjacent documents in longer inverted lists to achieve size reductions [1, 10]. We show, however, that it is possible to use frequency sorting to achieve a net reduction in index size, regardless of whether the index is compressed. Together, these improvements make information retrieval possible for small machines such as PCs, and for large multi-user document systems such ....
....number of documents to be presented to the user) and retrieve the corresponding documents. [Figure 1: Basic algorithm for computing a cosine measure] ....in an Elias code such as the gamma code [3]. Overall, such inverted index compression techniques can reduce index size by a factor of six or more [1, 10]. For a large document database indexed by an inverted file, the index can be used to simultaneously compute the cosine correlation between each document in a collection and the query as follows [4, 10, 13, 14]. An accumulator is created for each document, either by initially allocating an ....
[Article contains additional citation context not shown here]
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, in press.
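The contexts above note that index compression relies on small differences between adjacent document numbers, coded with an Elias code such as the gamma code. A minimal sketch of that idea follows, assuming document numbers start at 1 and are strictly increasing, and using one common bit-string convention for gamma (leading zeros followed by the binary value). The function names are hypothetical and the bit strings are for illustration only; a real index would pack bits.

```python
def gamma_encode(n):
    """Elias gamma code for an integer n >= 1, returned as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for n >= 1")
    binary = bin(n)[2:]
    # len(binary) - 1 zeros announce how many bits follow the leading 1.
    return "0" * (len(binary) - 1) + binary


def encode_postings(doc_ids):
    """Gamma-code the gaps between successive document numbers."""
    bits = []
    prev = 0
    for d in doc_ids:
        bits.append(gamma_encode(d - prev))
        prev = d
    return "".join(bits)


def decode_postings(bits):
    """Invert encode_postings: rebuild document numbers from the gaps."""
    doc_ids = []
    pos = 0
    prev = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        doc_ids.append(prev)
    return doc_ids
```

Because short gaps receive short codes, a long list in which adjacent documents are close together compresses well, which is also why reordering a list (for example by frequency) can work against this kind of coding.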
....be used for decompression, yielding a net reduction in time overall. Both of the major components of text retrieval systems, the index and the stored text, can be effectively compressed. Index compression techniques are typically based on efficient representation of integers by variable-bit codes [6, 10, 12], allowing space reduction by a factor of three to six. These compression algorithms must compete for resources with other components of the retrieval system, such as the query evaluator. In practical systems it is essential that the compression algorithms can operate in limited memory: databases ....
....hand estimation indicated that it was likely to be the most compact. In each of the two lists, the positions are sorted, allowing differences to be taken and representation with a variable-bit Golomb code [2]; this representation of positions is the same as is widely used for index compression [6, 10, 12]. However, in this context the positions tend to cluster, and alternative codes may yield better compression [5]. The characters are represented with a Huffman code. Compared to the escape method, for some words the edit method can potentially yield considerable savings. Consider (in rough ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....tool for increasing the sensitivity of these plagiarism detection techniques. A final consideration is efficiency, in space and speed. With an efficient representation, the full-text index required for the identity measure is expected to occupy less than 10% of the space needed for the data itself [8, 18]. An index for the anchor strategy is likely to be even more compact. In our experiments, using code designed for flexibility (to test a range of co-derivative detection techniques) rather than efficiency, fingerprinting queries took substantially longer to evaluate than did ranked queries, but in a ....
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....that the index for Swiss-PROT Release 25, a collection of around 10 Mb, requires almost 2.8 Gb of disk space, around 280 times the size of the Swiss-PROT collection. 3 Indexed Genomic Retrieval with Cafe. Inverted files have been shown to be a successful tool for large text database retrieval [27, 39, 54]. In such environments, indexes are used to selectively retrieve relevant records without exhaustive scanning of the database. Indexing trades space against time; for the cost of storing the index, retrieval is typically many orders of magnitude faster than an exhaustive search. To address the ....
....matching in databases of names [57, 58]. To ensure efficient query evaluation, we use a query evaluation technique adapted from such methods. A significant feature of cafe is our method of addressing the problem of index size, where we have adapted compression techniques developed for text indexing [27]. In text indexing, index size is reduced with careful compression by a factor of three to six, while in genomic databases we have found that more than three-fold reductions are possible. We have previously discussed the compression of cafe indexes in a preliminary description of our approach ....
[Article contains additional citation context not shown here]
A. Moffat and J. Zobel. Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379, October 1996.
....and the number of occurrences of each query term in each document. Both forms of query can be extended to, for example, make use of structure such as document markup [Sacks-Davis et al. 1997]. To allow efficient and effective resolution of queries, every word occurring in every document is indexed [Moffat and Zobel, 1996] [Persin et al. 1996]. The most efficient data structure for evaluating ranked or Boolean queries on large collections is an inverted index [Zobel et al. ...]. An inverted index consists of a vocabulary, containing each distinct word w that occurs in the database, and a set of inverted lists, one for ....
Moffat, A. and Zobel, J. (1996). Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379.
No context found.
Moffat A and Zobel J (1996) Self-indexing inverted files for fast text retrieval. ACM Transactions on Information Systems, 14(4):349--379.
No context found.
Moffat, A. and Zobel, J. (1996). Self-indexing Inverted Files for Fast Text Retrieval, ACM Transactions On Information Systems, Vol. 14, No. 4, ACM Press, pp. 349--379. URL: http://www.acm.org/pubs/citations/journals/tois/1996-14-4/p349-moffat/
No context found.
A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems, 1996, 14(4): 349-379.