D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....the number of them does not. Thus burstsort and MSD radixsort have the same asymptotic computational cost as given earlier. 4 Experiments We have used three kinds of data in our experiments: words, genomic strings, and web URLs. The words are drawn from the large web track in the TREC project [6, 7], and are alphabetic strings delimited by non-alphabetic characters in web pages (after removal of tags, images, and other non-text information). The web URLs have been drawn from the same collection. The genomic strings are from GenBank. For word and Some of these data sets are available under ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....of the effectiveness of an information retrieval technique three resources are required: a corpus of text; a set of test queries; and a set of relevance judgements, human evaluations as to which of the sample documents are relevant to which of the queries. The data of the NIST TREC project [6] supplies these three components. The experiments described below were performed on the gigabyte of data on TREC disk two, and two subsets of the queries, 51-200 and 202-250. The queries were split to allow investigation of both long and short queries. In the first group, after simple ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, May 1995.
.... with the same specification as standard strcmp() yielded, for example, a reduction in total time of around 20 for periodic rotation based splaying 6 RESULTS We tested the data structures described by applying them to a collection of around 1 Gb of world wide web text data derived from the TREC [9] Very Large Collection. TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA. In all, the collection contains 171,397,973 word occurrences of 1,973,187 distinct words in 616,944 documents, with an average of 3.2 new words per document. ....
....669.4 (42.2) 6.2 COMPARISON OF STRUCTURES Overall results comparing self adjusting structures to a BST, red black tree, and hashing are shown in Table 1. In these experiments, we present average results with five different text collections of around 1 Gb in size derived from the TREC [9] Very Large Collection web data. The hash table is moderate in size, containing 220,000 slots. All self adjusting trees tested use parent pointers for fast tree reorganisation, and all structures use our implementation of strcmp(). We use the C library function rand() to generate priorities in the ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....In the case of English text, where frequency of occurrence of words is skewed and follows the Zipf distribution [8], vocabulary size is typically smaller than main memory. As an example, in a medium size collection of around 1 Gb of English text derived from the TREC world wide web data [2], there are around 170 million word occurrences, of which just under 2 million are distinct words. The single most frequent word, "the", occurs almost 6.5 million times, almost twice as often as the second most frequent word, "of", while there are more than 900,000 words that occur once only. ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....comparison point that is less restrictive than dict. In all classes we truncated strings at 32 characters. 3 Data To examine trends in occurrences of new words we used the data gathered for the Web track of TREC, an international collaborative evaluation of information retrieval techniques [2]. We used approximately 45 Gb of the Web data, which derives from a crawl of the Web undertaken in 1997. Before extracting words from the data we preprocessed it. We eliminated material in tags, excepting comments, because such material includes long non-word strings generated for purposes such ....
....explained adequately by statistics, while also faithfully viewed by others as illustration of Zipf's principle of least human effort [6]. Consider the frequency of distinct words in English text as an example. Using text from the Wall Street Journal (WSJ) distributed as part of the TREC project [2], the constant # in f r =1 (r #) has been observed to be 0.1 for the frequency of words. In WSJ, the most frequently occurring word ("the") occurs just over twice as often as the second most frequent word ("of"), that in turn occurs 1.1 times more frequently than the next word ("to"), and so ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....results strongly suggest that taking into account even elementary structure such as paragraph or sentence boundaries degrades effectiveness; use of such structure assumes that it can be reliably identified, and reintroduces problems such as length normalisation. For the FR subcollection of TREC [14], which contains documents of greatly varying length, we showed that use of overlapping fixed length passages of 150 to 300 words markedly improves retrieval effectiveness, by up to 37 over full document ranking [17]. Smaller but consistent improvements were observed for the full TREC collection, ....
....containing a reasonably large number of documents. Second is a standard set of queries. Third is, for each query, a set of human relevance judgements, manual decisions as to which of the documents in the collection satisfy the information need. In this paper we use the TREC test collection [14], which consists of several gigabytes of text data, several hundred queries, and several thousand relevance judgements for each query. We focus on the data generated in the 1996 round of TREC, specifically the 2 gigabytes of data on TREC disks 2 and 4 and queries 251-300. Given a set of test ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....an even length phrase becomes an odd length phrase, and an odd length phrase an even one. 5 Experiments In our experiments, we used 981 megabytes of words extracted from the TREC Very Large Collection web data (WEB) and 508 megabytes of the Wall Street Journal (WSJ) from TREC disks 1 and 2 [6]. TREC is an ongoing international collaborative information retrieval experiment sponsored by the NIST and ARPA. The nextword index for WSJ requires 278 MB of disk space or 56% of the collection size, while the WEB index requires 696 MB or 71% of the collection size; these indexes are large ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
.... by reference to Zipf's distribution, which, while highly inaccurate as a description of real text collections, does succinctly describe the phenomenon of common words dominating as a proportion of word occurrences [21]. For example, in the Wall Street Journal component of the first TREC disk [10], the word "the" accounts for around one word occurrence in 17, and the commonest five words ("the", "of", "to", "a", and "and", respectively) account for around one word occurrence in six. A related characteristic is locality: the degree to which a recently observed word is likely to occur again, ....
....times with and without sorting after accumulation of per document data. The sort involves copying references to all list nodes into an array and sorting with a standard quicksort package. 3 Test data and methodology Several text collections taken from the large Web track in the TREC project [10] are used in our experiments. The file TREC1 contains the data on the first TREC CD ROM; the three files Web S, Web M, and Web L contain collections of web pages extracted from the Internet Archive used in the TREC experiments. We also included the file Gen of n grams of genome data, which, in ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....Word occurrences (10 ) 6 131 631 963 2,704 179 Documents 18,726 347,075 1,780,983 2,680,922 7,904,237 510,634 Parsing time (sec) 2.4 49.4 238.0 363.9 1,014.0 69.3 4 Experiments Test data The principal test data we used is six data sets drawn from the large Web track in the TREC project [24]. The five files Web S, Web M, Web L, Web XL, and Web XXL contain collections of web pages extracted from the Internet Archive for use in TREC retrieval experiments. The file TREC1 is the data on the first TREC CD ROM. The statistics of these collections are shown in Table 1. A word, in these ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....Mb. Compared to previous techniques, our approaches achieve the same compression efficiency using models that are smaller by a factor of ten. Test data We have used several test collections in the experiments reported in this paper. Most are drawn from the data accumulated by the TREC experiments [3]. In this paper we report on the experiments with two of these collections, for which the results were representative of all the collections used. The first is partwsj, the first 63.3 Mb (1,000,000 lines of text) of the Wall Street Journal component of TREC disk 1, which contains approximately ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....of a single original document. Each was produced by a different author and has significant differences, but, when compared, it is clear that they were derived from the same source. The remainder of the documents in this collection are articles from the Internet, extracted from the TREC web data [4]. The queries used for this collection were the ten documents. Both the query documents and the documents in the collection were reduced to strings of words with all formatting information removed. The second collection, Linux data, was a larger collection known to contain multiple versions of ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
....the retrieval problem and are not always rich enough to distinguish between retrieval methods of di#erent power. However, their size did allow complete relevance judgements to be formed. The 1990s has seen widespread use of much larger experimental databases, in particular the TREC collection [3], which provides a more realistic test environment but prohibits comprehensive relevance assessment. With such collections it is necessary to use techniques such as pooling to identify documents to be considered for relevance assessment, but it is possible for pooling to introduce bias. For ....
....pool depth is increased. These results suggest a variation on standard pooling strategies that can increase the number of relevant documents discovered for given judgement effort, without introducing bias. 2 Test Data For the results in this paper we have used the data generated by the TREC project [3], managed by NIST. In TREC, each participating group is given the same data and queries, and returns to NIST their runs, a listing of the identifiers of the top ranked documents for each query. Each run contains up to 1000 identifiers. For each query, the top 100 identifiers from each run are ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, May 1995.
....of at least one word; output is either the number of matching instances (documents and occurrences within documents); the list of nextwords for the phrase; or the documents with the matching instances. As test data we used the 508 megabytes of the Wall Street Journal (WSJ) from TREC disks 1 and 2 [Harman, 1995]. TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA. As test queries we generated lists of phrases as follows. From WSJ we randomly extracted 100 two and five word phrases (200 phrases in total), giving query sets W2 and W5; in these ....
Harman, D. (1995). Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289.
....main advantage of path indexes is the efficiency of query evaluation. Consider, for example, a query to retrieve the documents where a given element contains a particular word. With path indexes, a single inverted list is retrieved, whose length is determined by the frequency of the word. In the TREC data [8], for example, a typical query term occurs in around 0.1 of the stored documents and the inverted list would occupy a small number of disk blocks. For position based indexes, two inverted lists must be retrieved, one for the element and one for the word, and these lists must be intersected. The ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271-289, 1995.
....fusion) we have implemented a retrieval model based on the logistic regression methodology. Participation: Category B, ad hoc automatic Introduction There exist many reasons for considering multiple sources of evidence in information retrieval (Katzer et al., 1982), (Saracevic & Kantor, 1988), (Harman, 1995), and their integration is usually studied in two distinct contexts. Various retrieval strategies or query formulations may operate on the same collection (data fusion problem) (Belkin et al., 1995), (Lee, 1995), subject described in the first part. The second part deals with the collection fusion ....
Harman, D. (1995). Overview of the second text retrieval conference (TREC-2).
....on the retrieval method (what IR data is to be stored) and on the retrieval algorithm (how is the IR data accessed). The primary goal of an IR search is a high retrieval effectiveness. We compared several retrieval methods that performed very well at the latest TRECs (Text REtrieval Conferences) [Har96, Har95, Har94]. At TRECs, different IR research groups run their retrieval methods against a large collection of textual data. Looking at some successful ad hoc methods [CCB94, CCG94, RWJ 95, BSM96] the following IR data that can be precomputed (and stored) has been used: n number of documents ff( i ; d j ) ....
Harman, D. (1994). Overview of the Second Text REtrieval Conference (TREC-2). In Proceedings of TREC-2, Special Publication 500-215, pp. 1--20.
....document in the collection manually against each of the test queries. Assessing precision at a given level of returned documents is easier, but still a considerable undertaking. One of the most important outcomes of TREC, a large scale international project involving many prominent research groups (Harman, 1995), is a large set of queries and relevance judgements against a particular multi gigabyte document collection. These form an invaluable resource, and considerable improvement in retrieval technology, as measured by recall and precision averages, has been reported at TREC conferences over a ....
Harman, D. (1995). Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289.
....that these terms appear with similar frequencies in uninteresting pages). The weighted fraction used (0.25) was empirically determined reasonable in TREC 2 (Harman, 1994). Pages within a certain distance of the prototype (as determined by the cosine similarity measure) are considered interesting. A distance threshold is chosen that maximizes the accuracy on the training set. 3.5. Neural nets We used two approaches to learning with neural nets. In the perceptron ....
Harman, D.K. (1994). Overview of the second Text Retrieval Conference (TREC-2). Proceedings of the Second Text Retrieval Conference (TREC-2), NIST Special Publication.
....For convenience, we collect notations used in this paper into table 1. They will be introduced when they are used. In our research, we use two data collections to test and verify our results. The two data collections are the CACM collection and a document subset extracted from the TREC collection [4]. A short description of the two collections is given in the appendix. 2 Fast Heuristic Search Algorithms There has been extensive research on how to improve the search reduction ratio. Most notable ones are key based partitioning methods like the Floating key Partitioning [5, 6] and ....
HARMAN, D. Overview of the second text retrieval conference (TREC-2). In Proceedings of the 2nd Text Retrieval Conference (Gaithersburg, Md., Aug., 1993b), pp. 1--20.
....schemes we constructed three test collections of text and weather data. The smallest file in our test collection, smalltrec, is 2.86 Mb of text data taken from the TREC collection. TREC is an ongoing international collaborative experiment in information retrieval sponsored by NIST and ARPA [4]. The weather data (weather) contains 20,175 records collected from each of 5 weather stations, where each station record contains 4 sets of 22 measurements (such as temperatures, elevations, rainfall, and humidity); in total, weather is 38.2 Mb in size. Comact, the largest file in the test ....
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, Volume 31, Number 3, pages 271--289, 1995.
....2.3 Information retrieval evaluation Generally, one of the main tasks of evaluating IR systems is to obtain information about the satisfaction of the user's task in a specific work environment. Traditional IR experiments have been carried out for almost forty years such as the Cranfield and TREC (Harman, 1995) studies. Studies conducted by Robertson and Hancock-Beaulieu (1992) and Su (1992) investigate user behaviour, interaction and IR systems. Within HCI research, there has been extensive work within the usability evaluation area. To begin with, we need to make a distinction between formative and ....
Harman, D. (1995), Overview of the second text retrieval conference (TREC-2). Information Processing & Management, Vol. 31 (3), 271-289.
....a standard test collection (consisting of documents, queries, and manual relevance assessments for each query) and computing an average effectiveness across all queries. A great many different scoring functions for ranking have been proposed [20] and, as initiatives such as the TREC experiments [6] have shown, different systems can be of similar effectiveness, yet retrieve very different sets of answers [19]. Most previous investigations of stemming have focused on the impact on effectiveness of changing the stemming mechanism [5, 7, 9, 10, 18]. Harman [5] evaluated the performance of the ....
....additional linguistic knowledge. Last, Jacquemin has described a truncation-based technique for conflating similar phrases [8]. 4 Test data We chose to test the stemmers by examining the conflation sets they produce for a real text database. The database used was the second disk of TREC data [6], which is one gigabyte of text with a vocabulary of 363,553 distinct terms after case folding. Rather than examine the conflations of all terms, we restricted our attention to terms occurring in TREC queries, both to limit the resources required to complete this work and to obtain a realistic ....
Donna Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, Volume 31, Number 3, pages 271--289, 1995.
No context found.
D. Harman. Overview of the second text retrieval conference (TREC-2). Information Processing & Management, 31(3):271--289, 1995.
No context found.
Donna K. Harman. Overview of the second text retrieval conference. In The Second Text REtrieval Conference, pages 1--20, Gaithersburg, Maryland, USA, August 31-September 2 1993. National Institute of Standards and Technology.
No context found.
Harman, D. (1995a). Overview of the second text retrieval conference (TREC-2).