G. Salton, "Developments in Automatic Text Retrieval", Science, Vol. 253, pages 974-979, 1991.
....value for an example is the set of tokens that are present in that field for this example. The information retrieval community has similarly spent many years developing robust retrieval methods applicable to many retrieval tasks concerning text-containing documents, with vector space methods [Sal91] being the best-known examples of techniques in this area. Although for many years the focus has been primarily on retrieval tasks, here, too, the last decade has seen a significant increase in interest in the use of such methods for text classification tasks. The most common techniques use the ....
Gerard Salton. Developments in automatic text retrieval. Science, 253:974--979, 1991.
....topics ranked according to their relevance. However, as discussed above, the feedback for each document is the set y and thus the topics for each document in the training corpus are not ranked but rather marked as relevant or non-relevant. Each document is represented using the vector space model [9] as vectors in R^n. We denote a document by its vector representation x ∈ R^n. All the topic ranking algorithms we discuss in this paper use the same mechanism: each algorithm maintains a set of k prototypes, w1, w2, ..., wk. Analogous to the representation of documents, each ....
G. Salton. Developments in automatic text retrieval. Science, 253:974-980, 1991.
....categorizing this text becomes more important. A variety of recent work has demonstrated the success of statistical approaches for learning to classify text documents [Joachims, 1997; Koller and Sahami, 1997; Yang and Pedersen, 1997; Nigam et al., 1998]. These approaches, such as TFIDF [Salton, 1991] and naive Bayes [Lewis and Ringuette, 1994], typically represent documents as vectors of words, and learn by gathering statistics from the observed frequencies of these words within documents belonging to the different classes. Because they rely on these learned word statistics, these ....
G. Salton. Developments in automatic text retrieval. Science, 253:974--979, 1991.
....to a server. The server records the submitter and time of submission. It automatically assigns the nugget a headline; the nugget headline is the page title for URLs or the first few words of the nugget for text data. It also selects relevant keywords for the nugget using Salton's TF-IDF algorithm [22] and revisits each URL every few weeks to recompute the keywords. Finally, if the user specified a category or a relationship with another nugget, the server records this information as well. We considered allowing users to manually specify the keywords and headline. We also debated whether ....
Salton, G. (1991). Developments in Automatic Text Retrieval. Science, Vol 253, pp. 974-97.
.... [equation (4) not recoverable from the extracted text] This distance is normalized by its value on an empty cache: [equation (5) not recoverable from the extracted text]. The decision for assigning one or two labels is made by taking the one or two lowest distances (eq. 5). 3.3 The TFIDF Classifier. The TFIDF classifier [7] represents topics as vectors. Each one is characterized by a set of distinct words, the number of words of the topic, and the weight of each word [symbols and weight formula lost in extraction], where one factor of the weight is the term frequency, i.e. the number of times the ....
G. Salton. Developments in automatic text retrieval. Science, 253:974--980, 1991.
....in the respective category. For Pr(wil ) and Pr(wil ) the so-called Laplace estimator is used [Joachims, 1997]. 4.2 Rocchio Algorithm. This type of classifier is based on the relevance feedback algorithm originally proposed by Rocchio [Rocchio, 1971] for the vector space retrieval model [Salton, 1991]. It has been extensively used for text classification. First, both the normalized document vectors of the positive examples as well as those of the negative examples are summed up. The linear component of the decision rule is then computed as [equation (19) not recoverable from the extracted text]. Rocchio requires that negative ....
Salton, G. (1991). Developments in automatic text retrieval. Science, 253:974-979.
....Nonetheless, it is interesting to ask what the performance is using different representations. Although this is not the focus of this work, we performed additional experiments regarding this issue. A well-known approach in the context of text classification is the tf-idf representation [12]. Specifically, each word count is multiplied by the inverse document frequency of the word, defined by idf(w) = log(D / D(w)), where D(w) is the number of documents in which w occurred. Rich empirical evidence supports the benefits of using this representation for text processing applications. ....
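The weighting described in the context above can be sketched in a few lines of Python. This is only an illustration, not the cited paper's implementation: it assumes that D in the idf formula is the total number of documents (the excerpt defines only D(w)), and the function names `idf` and `tf_idf` and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def idf(docs):
    """Return idf(w) = log(D / D(w)) for every word w seen in docs.

    Assumption: D is the total number of documents; D(w) is the number
    of documents in which w occurs at least once.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each word once per document
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}


def tf_idf(doc, idf_weights):
    """Multiply each raw word count in doc by the word's idf weight."""
    counts = Counter(doc)
    # Words never seen when computing idf get weight 0.0 in this sketch.
    return {w: c * idf_weights.get(w, 0.0) for w, c in counts.items()}


# Hypothetical toy corpus, already tokenized.
docs = [
    ["text", "retrieval", "text"],
    ["text", "classification"],
    ["vector", "space", "retrieval"],
]
weights = idf(docs)
vec = tf_idf(docs[0], weights)
```

Here "text" occurs in two of the three documents, so its idf is log(3/2), and its weight in the first document is twice that because it appears there twice.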
G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974-980, 1991.
....application, the data set, and the performance measures. In section 4 we describe the results, and in section 5, the conclusions and comments on future work. 2. Vector models and Classification. Vector models have been successful in information retrieval (IR) and text classification (TC) Salton [7], Salton [8]. Vector models are based on the basic assumption that a document can be represented as a vector, dismissing the order of words and other grammatical issues, and that this representation is able to retain enough useful information. In order to reduce the number of distinct terms in ....
Salton, G. (1991). Developments in automatic text retrieval. Science , 253, 974--979.
....predefined categories to text documents. Most techniques used to tackle this problem are based on the assumption that a document can be represented as a vector, dismissing the order of words and other grammatical issues, and that this representation is able to retain enough useful information[8, 9]. Thus, document classification can be thought of as a problem of mapping the vector space corresponding to the input documents to the space of output classes, which allows the use of standard statistical classification methods[1, 2] and machine learning techniques to solve it. The dimension of ....
....and present the results obtained. Finally, in the last section (Section 5) we draw some conclusions. 2 Vector Models, Weighting and Feature Selection. As stated in the introduction, most statistical techniques used in TC are based on the assumption that a document can be represented as a vector [8, 9]. Figure 2 shows how a vector model can be constructed in a simple toy example.

              D1  D2  D3
Instituto      1   0   0
Fisica         1   1   1
Rosario        1   0   0
Laboratorio    0   1   1
queda          0   0   1
piso           0   0   1

Doc 1. Instituto de Fisica Rosario (classes C1, C2)
Doc 2. Laboratorio de Fisica (classes C1, C3)
Doc 3. El ....
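The toy example in the context above can be reproduced with a short Python sketch. It is not the cited paper's code: the stop-word list, the lowercasing, and the function names are assumptions made for illustration, and only Doc 1 and Doc 2 are used because the text of Doc 3 is truncated in the excerpt.

```python
# Assumption: "de" and "el" are dropped as stop words, since they do not
# appear as rows in the excerpt's matrix.
STOP_WORDS = {"de", "el"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def term_document_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is 1 if term i occurs in doc j."""
    token_sets = [set(tokenize(d)) for d in docs]
    vocab = []
    for d in docs:
        for t in tokenize(d):
            if t not in vocab:  # keep first-seen order for readability
                vocab.append(t)
    matrix = [[1 if term in s else 0 for s in token_sets] for term in vocab]
    return vocab, matrix


docs = ["Instituto de Fisica Rosario", "Laboratorio de Fisica"]
vocab, matrix = term_document_matrix(docs)
for term, row in zip(vocab, matrix):
    print(f"{term:12s} {row}")
```

Each column of `matrix` is the binary vector for one document, so word order is discarded exactly as the vector model assumes.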
G. Salton. Developments in automatic text retrieval. Science, 253:974--979, 1991.
....section, we include a number of useful techniques for detecting and extracting salient elements from unstructured text. Interrelationships between these salient elements will form the basis for the visualization of knowledge structures. 2.1. Information Retrieval Models The vector space model [27] originally developed for information retrieval, is a widely used framework for indexing documents based on term frequencies. In this model, each document d is represented by a vector V of terms t s. Terms are weighted to indicate how important they are in representing the document. The distance ....
....their relative prevalence in the collection. Thematic peaks and valleys in ThemeView produce a simplified representation of the complex content of a document corpus. Figure 7. Valleys and peaks in ThemeView. (Pacific Northwest National Laboratory) The SPIRE used the classic vector space model [27]. The greatest advantage of the ThemeView approach over the use of traditional high-dimensional vector spaces is that the user is able to establish connections easily between the construction and the final visualization. In particular, their procedure usually results in 300-500 nouns to be ....
Salton, G. Developments in automatic text retrieval. Science, 253. 974-980.
G. Salton, "Developments in Automatic Text Retrieval", Science, Vol. 253, pages 974-979, 1991.
Salton, G. (1991). Developments in automatic text retrieval. Science, 253:974-979.
G. Salton, "Developments in automatic text retrieval", Science, 253, pp. 974-979, 1991.
G. Salton. Developments in automatic text retrieval. 253:974--979, 1991.
G. Salton, 'Developments in automatic text retrieval', Science, 253, 974--979, (1991).
Gerard Salton. Developments in automatic text retrieval. Science, Number 253, pages 974--979, 1991.
Gerard Salton. Developments in automatic text retrieval. Science, (253):974--979, 1991.
G. Salton, Developments in automatic text retrieval, Science 253 (1991) 974--979.
Salton G., Developments in Automatic Text Retrieval. Science, 253, 974-979.
Gerard Salton. Developments in automatic text retrieval. Science, 253:974--980, 1991.
G. Salton. Developments in automatic text retrieval. Science, 253:974--980, 1991.
G. Salton. Developments in automatic text retrieval. Science, 253:974--980, 1991.
Gerard Salton. Developments in automatic text retrieval. Science, 253:974--980, 1991.
G. Salton. Developments in automatic text retrieval. Science, 253:974-980, 1991.
G. Salton. Developments in Automatic Text Retrieval. Science, Vol. 253, pages 974--980, 1991.
CiteSeer.IST - Copyright Penn State and NEC