D. Hawking. Overview of TREC-7 very large collection track. In E. M. Voorhees and D. K. Harman, editors, The Seventh Text REtrieval Conference, 1998.
....and require huge resources, often taking days to complete. As a measure of comparison with traditional IR systems, our 40 million page WebBase repository (Section 3.3) represents only about 4% of the publicly indexable Web but is already larger than the 100 GB very large TREC 7 collection [34], the benchmark for large IR systems. In addition, since content on the Web changes rapidly (see Section 1), periodic crawling and rebuilding of the index is necessary to maintain freshness. Index rebuilds become necessary because most incremental index update techniques perform poorly when ....
D. Hawking and N. Craswell. Overview of TREC-7 very large collection track. In Proc. of the Seventh Text Retrieval Conf., pages 91--104, November 1998.
....instead of accumulating weights, it is necessary to construct a temporary inverted list for the phrase, by fetching the inverted list of each of the individual terms and combining them. If the inverted list for Matthew is as above and the inverted list for Richardson is <1, 7, [52]> <2, 12, [1, 4]> <1, 44, [83]> then both words occur in document 7 and as an ordered pair. Only Richardson occurs in document 12, both words occur in document 44 but not as a pair, and only Matthew occurs in document 117. The list for Matthew Richardson is therefore <1, 7, [51]>. After this, the ....
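The combining step described in this excerpt can be sketched as a positional intersection. This is only an illustrative sketch, not the cited authors' implementation: the function name `phrase_postings` and the dictionary layout are hypothetical, and because the list for Matthew is not shown in the excerpt, its offsets in documents 44 and 117 are assumed values chosen to match the stated outcome (offset 51 in document 7 follows from the resulting phrase list).

```python
def phrase_postings(first, second):
    """Return {doc: [offsets]} where `second` occurs immediately after `first`.

    Both arguments map a document number to a sorted list of word offsets.
    """
    result = {}
    # Only documents containing both terms can contain the phrase.
    for doc in sorted(first.keys() & second.keys()):
        following = set(second[doc])
        # Keep each offset of the first term that is directly followed
        # by the second term.
        starts = [pos for pos in first[doc] if pos + 1 in following]
        if starts:
            result[doc] = starts
    return result


# Offsets for "Matthew" in documents 44 and 117 are assumptions for illustration.
matthew = {7: [51], 44: [10], 117: [3]}
# Offsets for "Richardson" are taken from the excerpt.
richardson = {7: [52], 12: [1, 4], 44: [83]}

phrase = phrase_postings(matthew, richardson)
print(phrase)
```

Under these assumed inputs, only document 7 survives, at offset 51, which corresponds to the single entry in the phrase list given in the excerpt.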
....that the number of fetches from memory to cache during decompression is halved. Small collection. Figure 2 shows the relative performance of the integer compression schemes we have described for storing offsets, on a 500 Mb collection of 94,802 Web documents drawn from the TREC Web track data [4]; timing results are averaged over 10,000 ranked queries drawn from an Excite search engine query log [11]. The index contains 703,518 terms. These results show the effect of varying the coding scheme used for document numbers d, frequencies f_{d,t}, and offsets o. In all cases where both bitwise and ....
[Article contains additional citation context not shown here]
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 very large collection track. In E. Voorhees and D. Harman, editors, Proc. Text Retrieval Conference (TREC), pages 91--104, Washington, 1999. National Institute of Standards and Technology Special Publication 500-242.
....most commercial search engines, a combination of text and link based methods are employed. 1 IR systems. As a measure of comparison, the 40 million page WebBase repository [10] represents only about 4% of the publicly indexable web but is already larger than the 100 GB very large TREC 7 collection [9], the benchmark for large IR systems. Rate of change. Since the content on the Web changes extremely rapidly [5], there is a need to periodically crawl the Web and rebuild the inverted index. Many techniques for incremental update of inverted indexes reportedly perform poorly when confronted with ....
....and the value. Table 5 shows that the mixed list storage scheme scales very well to large collections. The size of the index is consistently about 7% the size of the input HTML text. This compares favorably with the sizes reported for the VLC2 track (which also used crawled web pages) at TREC 7 [9] where the best reported index size was approximately 7.7% the size of the input HTML. Our index sizes are also comparable to other recently reported sizes for non Web document collections using compressed inverted files [18]. ....
D. Hawking and N. Craswell. Overview of TREC-7 very large collection track. In Proceedings of the Seventh Text Retrieval Conference, pages 91-104, November 1998.
.... becoming more vulnerable to system failures). As a measure of comparison, the 40 million page (220 GB) WebBase repository [8] represents only about 4% of the estimated size of the publicly indexable Web as of January 2000 [24] but is already larger than the 100 GB very large TREC 7 collection [7], the benchmark for large IR systems. Rate of change. Since the content on the Web changes extremely rapidly [3], there is a need to periodically crawl the Web and update the inverted index. Indexes can either be updated incrementally or periodically rebuilt, after every crawl. With both ....
....from Figure 7. Table 3 shows that the mixed list storage scheme scales very well to large collections. The size of the index is consistently below 7% the size of the input HTML text. This compares favorably with the sizes reported for the VLC2 track (which also used crawled Web pages) at TREC 7 [7] where the best reported index size was approximately 7.7% the size of the input HTML. Our index sizes are also comparable to other recently reported sizes for non Web document collections using compressed inverted files [16]. Note that exact index sizes are dependent on the type and amount of ....
D. Hawking and N. Craswell. Overview of TREC-7 very large collection track. In Proc. of the Seventh Text Retrieval Conf., pages 91--104, Nov 1998.
.... becoming more vulnerable to system failures). As a measure of comparison, the 40 million page (220 GB) WebBase repository [8] represents only about 4% of the estimated size of the publicly indexable Web as of January 2000 [24] but is already larger than the 100 GB very large TREC 7 collection [7], the benchmark for large IR systems. Rate of change. Since the content on the Web changes extremely rapidly [3], there is a need to periodically crawl the Web and update the inverted index. Indexes can either be updated incrementally or periodically rebuilt, after every crawl. With both ....
....from Figure 7. Table 3 shows that the mixed list storage scheme scales very well to large collections. The size of the index is consistently below 7% the size of the input HTML text. This compares favorably with the sizes reported for the VLC2 track (which also used crawled Web pages) at TREC 7 [7] where the best reported index size was approximately 7.7% the size of the input HTML. Our index sizes are also comparable to other recently reported sizes for non Web document collections using compressed inverted files [16]. Note that exact index sizes are dependent on the type and amount of ....
D. Hawking and N. Craswell. Overview of TREC-7 very large collection track. In Proc. of the Seventh Text Retrieval Conf., pages 91-104, Nov 1998.
.... link information in some form [1, 25]. How much more effective are link based methods in the web environment as compared to a state of the art keyword based method developed for the TREC ad hoc task? This question has been studied in a limited number of studies, especially under TREC's web track [5, 6, 7]. The results from these studies indicate that for web search, link based methods do not hold any advantage over the state of the art keyword based methods developed for TREC ad hoc search. These results are quite counter intuitive given the general wisdom in the web search community that some kind ....
....keyword based methods developed for TREC ad hoc search. These results are quite counter intuitive given the general wisdom in the web search community that some kind of linkage analysis does improve web page site ranking. Our work is motivated by this discrepancy between the results presented in [5, 6, 7], and the general belief in the web search community. Different web search engines make competing claims regarding their coverage and search effectiveness. In this study, we don't concentrate on comparing the search effectiveness of different web search engines. There have been several studies that ....
[Article contains additional citation context not shown here]
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 91-104. NIST Special Publication 500-242, July 1999.
....Internet traffic. Although we also adopt a replication hierarchy to store the documents of the most frequently used queries, our replicas are searchable and thus can speed up both query and document processing. InQuery, our base system, is not the fastest text retrieval system available today [17]. For the experiments in this paper, we model it and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [17] have recently reported results ....
....retrieval system available today [17]. For the experiments in this paper, we model it and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [17] have recently reported results for a single query (rather than a loaded system) that are less than a second on 100 GB collection. We believe our results on replication versus partitioning are directly applicable to these systems as well, although they store more data on a single machine. ....
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of TREC-7 very large collection track. In Proceedings of the 7th Text Retrieval Conference, 1998.
....the user interface plays an even more critical role. Some efforts are underway to address this problem. Harman and Over [8] sponsored a panel on tools for searching the Web that provides some insight into the complex process of answering questions on the Web. The Very Large Corpus track at TREC [9] is based heavily on Web data and may evolve into a Web track. Much work remains, however. 4 Conclusions The SIGIR 98 Workshop on Hypertext Information Retrieval for the Web achieved its goal of bringing together researchers and practitioners to discuss current issues and approaches related to ....
D. Hawking and P. Thistlewaite. Overview of TREC-6 very large collection track. In E. M. Voorhees and D. K. Harman, editors, The Sixth Text REtrieval Conference (TREC-6), pages 93--105, Gaithersburg, MD, 1998. National Institute of Standards and Technology Special Publication 500-240. http://trec.nist.gov/pubs/trec6/papers/vlc track.ps.
....required to reliably evaluate new search techniques and to perform repeatable experiments in the context of the World Wide Web. The track used a frozen snapshot of the web as its document collection. This collection, known as the VLC2 collection and used in last year's Very Large Collection track [5], is over 100 gigabytes and represents some 18.5 million web pages. The track defined two subtasks, the small web and the large web tasks, based on the amount of the web data used. The small web task used a 2 gigabyte, 250,000 document subset of the VLC2 collection, while the large web task used ....
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of the TREC-7 very large collection track. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 91--103, August 1999. NIST Special Publication 500-242. Electronic version available at http://trec.nist.gov/pubs.html.
....more servers [5, 12, 17, 20]. Only Couvreur et al. [9] and Cahoon et al. [6, 7] use simulation to experiment with more than 100 GB of data. None of these previous studies include partial replication or caching. InQuery, our base system, is not the fastest text retrieval system available today [13]. We model and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [13] have recently reported results for a single query (rather than a ....
....base system, is not the fastest text retrieval system available today [13]. We model and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [13] have recently reported results for a single query (rather than a loaded system) that are less than a second on a 100 GB collection. We simulate such response times, and find similar trends. We report some of these results in Section 5. We thus believe our results on replication, caching, and the ....
[Article contains additional citation context not shown here]
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of TREC-7 very large collection track. In Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 91--104, Gaithersburg, MD, 1998.
.... The previous research on SMPs investigated a part of system, such as the disk system [47], or it compared the cost factors of the SMP architecture with other architectures [20]. Recently the TREC conference reported results on a single query (rather than a loaded system) against a 100 GB collection [42], where some of its participated institutions used SMPs. None of the commercial systems and the previous research on SMPs reported how to balance hardware and software resources in order to achieve scalable performance in a SMP system. In this dissertation, we conduct a systematic study on a ....
....based on term identifiers performs the best when the term distribution is less skewed (i.e. when the term distribution in the query is uniformly distributed). Partitioning based on document identifiers performs the best when term distribution is highly skewed. Recently the TREC conference [42] reported results for SMPs to process a single query (rather than in a loaded system as we do here) on a 100 GB collection. Their fastest SMP system used a 8 CPU 266 MHZ Alpha with 8 disks and achieved the response time for a single query less than 2 second. In this report, the InQuery system used ....
Hawking, David, Craswell, Nick, and Thistlewaite, Paul. Overview of TREC-7 very large collection track. In Proceedings of the 7th Text Retrieval Conference (1998).
....for searching a terabyte of text in distributed information retrieval systems. None of these previous studies include partial replication, and of course they do not compare database partitioning with replication. InQuery, our base system, is not the fastest text retrieval system available today [17]. For the experiments in this paper, we model and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [17] have recently reported results ....
....retrieval system available today [17]. For the experiments in this paper, we model and validate against a 3 processor 250MHz Alpha which can maintain response times of under 10 seconds with 4 to 5 disks on a collection size of up to 16 GB for a heavily loaded system. Other multiprocessor systems [17] have recently reported results for a single query (rather than a loaded system) that are less than a second on 100 GB collection. We have simulated such response times, and found the same trends we report here. We thus believe our results on replication versus partitioning are directly applicable ....
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of TREC-7 very large collection track. In Proceedings of the 7th Text Retrieval Conference, 1998.
....Either a document contained relevant content and was judged relevant or it was judged irrelevant. To ensure consistency of results, all the documents retrieved for a query were judged by the same person. Each document retrieved in response to a query was judged by only one judge, as earlier work ([9] and [21]) failed to demonstrate any particular benefit of multiple judgments. 3.5 Our Relevance Assessment Tool We used special relevance judging software known as the RAT (Relevance Assessment Tool) which was developed by Jason Haines and Paul Thistlewaite in 1996 for use in the TREC Very ....
David Hawking and Paul Thistlewaite. Overview of TREC-6 Very Large Collection Track. In E. M. Voorhees and D. K. Harman, editors, Proceedings of TREC-6, pages 93--106, Gaithersburg MD, November 1997. NIST special publication 500-240, http://trec.nist.gov.
....all five search engines performed below the median P@20 for title only VLC2 submissions and substantially below the medians for the longer topic runs. The median performance of the VLC2 groups increases sharply with increasing use of topic words. A full report of the TREC 7 VLC track is available [9]. 3.4. Discussion of results Since Web search engines search varying samples of the Web [3,17] and the Internet Archive snapshot is different again, we cannot compare the effectiveness of ranking algorithms in isolation, but only the effectiveness of each combination of spidering run and ranking ....
D. Hawking, N. Craswell and P. Thistlewaite, Overview of TREC-7 very large collection track, in: E.M. Voorhees and D.K. Harman (Eds.), Proc. 7th Text Retrieval Conference (TREC-7), Gaithersburg, MD, U.S. National Institute Standards Technology, NIST Special Publication 500-?, 1998.
....combinations, the explanation of the observed poorer performance by the search engines is unlikely to lie in their use of larger data sets than the TREC systems. On the contrary, experiments with scaling up collections have consistently shown an increase in P@20 with increasing collection size [8,11]. It is also unlikely that the poorer performance was due to the shortness of the queries submitted to the search engines. First, it is not clear that better results would have been obtained by feeding more of the topic description to the search engines. Second, as shown in Table 4, the median of ....
....and alternative methods such as TREC pooling [21] are unlikely to be effective over that amount of data. 4.4. Speed measurement The Web snapshot was used in a Very Large Collection special interest track of TREC 7 in which speed and scalability of both query processing and indexing were measured [8]. One participating group (the MultiText project, based at the University of Waterloo [7]) demonstrated an indexing rate of almost 20 gigabytes per hour, coupled with sub second query processing rates and better effectiveness than popular search engines, using less than US$10,000 of hardware. The ....
D. Hawking and P. Thistlewaite, Overview of TREC-6 very large collection track, in: E.M. Voorhees and D.K. Harman (Eds.), Proc. 6th Text Retrieval Conference (TREC-6), Gaithersburg, MD, U.S. National Institute Standards Technology, NIST Special Publication 500-240, pp. 93--106, 1997.
....and Information Sciences, CSIRO david.hawking@cmis.csiro.au 1 Introduction Experiments using TREC style topic descriptions and relevance judgments have recently been carried out for the first time over real Web data. One interesting result is that systems of TREC VLC Track participants [5] were more effective than live Web systems [6]. A number of factors could explain this, but one important possibility is that Web and TREC ad hoc systems are solving different problems. If this were the case, it would be unfair to use TREC ad hoc evaluation methods as a benchmark for the success of ....
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of TREC-7 Very Large Collection Track. In Voorhees and Harman [8]. NIST special publication 500-?
....combinations, the explanation of the observed poorer performance by the search engines is unlikely to lie in their use of larger data sets than the TREC systems. On the contrary, experiments with scaling up collections have consistently shown an increase in P@20 with increasing collection size. [11, 9] It is also unlikely that the poorer performance was due to the shortness of the queries submitted to the search engines. First, it is not clear that better results would have been obtained by feeding more of the topic description to the search engines. Second, as shown in Table 4, the median of ....
David Hawking and Paul Thistlewaite. Overview of TREC-6 Very Large Collection Track. In Voorhees and Harman [21], pages 93--106. NIST special publication 500-240.
....all five search engines performed below the median P@20 for title only VLC2 submissions and substantially below the medians for the longer topic runs. The median performance of the VLC2 groups increases sharply with increasing use of topic words. A full report of the TREC 7 VLC track is available. [8] 3.4 Discussion of Results Since Web search engines search varying samples of the Web [17, 3] and the Internet Archive snapshot is different again, we cannot compare the effectiveness of ranking algorithms in isolation, but only the effectiveness of each combination of spidering run and ranking ....
....and alternative methods such as TREC pooling [21] are unlikely to be effective over that amount of data. 4.4 Speed Measurement The Web snapshot was used in a Very Large Collection special interest track of TREC 7 in which speed and scalability of both query processing and indexing were measured. [8] One participating group (the MultiText project, based at the University of Waterloo [7]) demonstrated an indexing rate of almost 20 gigabytes per hour, coupled with sub second query processing rates and better effectiveness than popular search engines, using less than US$10,000 of hardware. ....
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of TREC-7 Very Large Collection Track. In Voorhees and Harman [22]. NIST special publication 500-?
No context found.
D. Hawking. Overview of TREC-7 very large collection track. In E. M. Voorhees and D. K. Harman, editors, The Seventh Text REtrieval Conference, 1998.
No context found.
D. Hawking, N. Craswell, and P. B. Thistlewaite. Overview of TREC-7 very large collection track. In The Seventh Text REtrieval Conference (TREC-7), pages 91--104, Gaithersburg, Maryland, USA, November 1998.
No context found.
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In Proceedings of the TREC-7, 1998.
No context found.
D. Hawking, N. Craswell, and P. Thistlewaite. Overview of the TREC-7 very large collection track. In Seventh Text REtrieval Conference, November 1998.
No context found.
Hawking, D., Craswell, N., and Thistlewaite, P. Overview of TREC-7 Very Large Collection Track. In Proceedings of TREC-7 (Gaithersburg, MD, 1998), NIST Special Publication 500-242, 91-104.
No context found.
David Hawking, Nick Craswell, and Paul Thistlewaite. Overview of TREC-7 very large collection track. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Seventh Text REtrieval Conference (TREC-7), pages 91-103, August 1999. NIST Special Publication 500-242. Electronic version available at http://trec.nist.gov/pubs.html.