David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301-315, Las Vegas, US, 1995.
....As with MMP, these algorithms use the same pivoted length normalization as their vector space representation and employ the same form of category ranking by using a set of prototypes $w_1, \ldots, w_k$. Rocchio: We implemented an adaptation of Rocchio's method as adapted by Ittner et al. [5] to text categorization. In this variant of Rocchio the set of prototype vectors $w_1, \ldots, w_k$ are set as follows, $$w_r \stackrel{\mathrm{def}}{=} \max\left\{0,\; \frac{\beta}{|R_r|} \sum_{i \in R_r} x_i \;-\; \frac{\gamma}{|\bar{R}_r|} \sum_{i \in \bar{R}_r} x_i \right\}$$ where $R_r$ is the set of documents which contain the topic $r$ as one of their relevant topics and $\bar{R}_r$ is ....
....$$w_r \stackrel{\mathrm{def}}{=} \max\left\{0,\; \frac{\beta}{|R_r|} \sum_{i \in R_r} x_i \;-\; \frac{\gamma}{|\bar{R}_r|} \sum_{i \in \bar{R}_r} x_i \right\}$$ where $R_r$ is the set of documents which contain the topic $r$ as one of their relevant topics and $\bar{R}_r$ is its complement, i.e. all the documents for which $r$ is not one of their relevant topics. Following the parameterization in [5], we set $\beta = 16$ and $\gamma = 4$. Last, as suggested by Amit Singhal in a private communication, we normalize all of the prototypes to a unit norm. Perceptron: We also implemented the Perceptron algorithm. Since the Perceptron algorithm is designed for binary classification problems, we decomposed the ....
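The prototype construction described in the excerpt above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are plain lists of floats of equal length, both document sets are non-empty, the clipping at zero is applied per component, and the function name `rocchio_prototype` is hypothetical. The defaults 16 and 4 are the parameter values quoted in the excerpt.

```python
import math

def rocchio_prototype(relevant, non_relevant, beta=16.0, gamma=4.0):
    """Return a unit-norm prototype built from two non-empty lists of vectors.

    Hypothetical helper: each component is beta times the mean over relevant
    documents minus gamma times the mean over non-relevant documents,
    clipped at zero, and the result is scaled to unit Euclidean norm.
    """
    dim = len(relevant[0])
    proto = []
    for j in range(dim):
        pos = sum(doc[j] for doc in relevant) / len(relevant)
        neg = sum(doc[j] for doc in non_relevant) / len(non_relevant)
        proto.append(max(0.0, beta * pos - gamma * neg))
    norm = math.sqrt(sum(w * w for w in proto))
    # Leave an all-zero prototype untouched rather than dividing by zero.
    return [w / norm for w in proto] if norm > 0 else proto

# Toy data (made up for illustration): three indexing terms, two documents per set.
relevant = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.0]]
non_relevant = [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
prototype = rocchio_prototype(relevant, non_relevant)
```

In this toy run the second term occurs mostly in non-relevant documents, so its component is clipped to zero, while the first term dominates the prototype.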
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301-315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
....obtaining a linear classifier is logistic regression. Logistic regression is closely related to support vector machines, which have recently gained much popularity. There has been a long history of using logistic regression in information retrieval, as can be seen from the following partial list [2, 5, 6, 10, 13, 20]. However, for a number of reasons, the method was not used in an effective way for text categorization. As a result, the comparison in [20] suggested negative opinions on the performance of logistic regression. The combination of the following factors could have led to the negative results in ....
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301-315, 1995.
....This method brings instance-based learning closer to most other classifier induction methods, in which negative training instances play a fundamental role in the individuation of a best decision surface (i.e. classifier) that separates positive from negative instances. Even methods like Rocchio [3, 4], in which
     Microaveraging                         Macroaveraging
     Proportional thresh.  CSV thresh.      Proportional thresh.  CSV thresh.
k    Re    Pr    F1        Re    Pr    F1   Re    Pr    F1        Re    Pr    F1
05   .711  .823  .763      .682  .419  .519 .545  .716  .512      .563  .763  .544
10   .718  .830  .770      .676  .418  .517 .557  .721  .524      ....
....parameter value $\beta = 1$, which places equal emphasis on $Pr$ and $Re$. 4.2 Feature selection experiments. We have performed our feature selection experiments first with the standard k-NN classifier of Section 3 (with $k = 30$) and subsequently with a Rocchio classifier we have implemented following [3, 4] (the Rocchio parameters were set to $\beta = 16$ and $\gamma = 4$; see [3, 4, 12] for a full discussion of the Rocchio method). In these experiments we have compared two baseline feature selection functions, i.e. $\#_{avg}(t_k) = \sum_{i=1}^{m} \#(t_k, c_i) P(c_i)$ and $\chi^2_{\max}(t_k) = \max_{i=1}^{m} \chi^2(t_k$ ....
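The two baseline functions named in the excerpt above appear to turn per-category term scores into a single score per term, one by a prior-weighted sum and one by a maximum over categories. A minimal sketch, assuming the per-category scores and the category priors $P(c_i)$ are already computed; the function names and the numbers are hypothetical:

```python
def global_avg(scores, priors):
    """Prior-weighted sum of one term's per-category scores."""
    return sum(s * p for s, p in zip(scores, priors))

def global_max(scores):
    """Largest per-category score of one term."""
    return max(scores)

# Made-up scores of one term against m = 3 categories, and made-up priors.
scores = [0.2, 0.6, 0.1]
priors = [0.5, 0.3, 0.2]
avg_score = global_avg(scores, priors)
max_score = global_max(scores)
```

The weighted sum favours terms that score well on frequent categories, whereas the maximum keeps a term that is highly indicative of even a single category.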
[Article contains additional citation context not shown here]
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, US, 1995.
....significant time to classify a single data point [Joa98]. The following methods were specifically tested for automated e-mail classification. [Coh96] compares results for e-mail classification of a new rule induction method and adaptation of Rocchio's relevance feedback algorithm [Roc71] in [ILA95]. [SDHH98] employs Naive Bayes classifier to filter junk e-mail. [Boo98] uses a combination of nearest neighbor and TF-IDF approaches. Naive Bayes classifier is used for classifying e-mail into multiple categories in [Ren00]. Support Vector Machines approach is implemented for e-mail authorship ....
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proc. 4th Annual Symposium Document Analysis and Information Retrieval (SDAIR'95), pages 301--315, Las Vegas, US, 1995.
....relevance feedback method [13] on a pre-classified set of documents (training set). Rocchio's algorithm is a well-known algorithm in the IR community, traditionally used for relevance feedback. Classifiers based on Rocchio have proven to be quite effective in filtering [16] and classification [12, 8] tasks. When training documents are to be ranked for a topic, an ideal classifier should rank all relevant documents above the non-relevant ones. However, such an ideal classifier might just not exist; therefore, we settle for a classifier that maximizes the difference between the average score ....
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text Categorization of Low Quality Images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; University of Nevada.
....models have been proposed in literature and their distinctive aspects will be here briefly summarized. k-Neighbor is an example-based classifier (Yang, 1994) making use of document-to-document similarity estimation that selects a class for a document through a k-nearest heuristics. Rocchio (Ittner et al. 1995; Cohen and Singer, 1996) often refers to TC systems based on the Rocchio's formula for profile estimation. RIPPER (Cohen and Singer, 1996) uses an extended notion of profile, by learning contexts that are positively correlated with the target classes. A machine learning algorithm allows the ....
....3.2. Weighting in LSTC. In LSTC two weighting schemes have been implemented and experimental results will be shown for both. The first is novel, introducing new corpus-derived parameters: it will be referred to as IWF. The second is the common weighting scheme associated with the Rocchio's classifier (Ittner et al. 1995). 3.2.1. IWF-based weighting. In order to define the IWF weighting policy, a number of definitions is necessary. Given a training set, a feature $t \in \{t_1, \ldots, t_n\}$ to describe it, a generic document $d_h$ of the corpus and the target set of classes $C_1, C_2, \ldots$, let the following notations ....
[Article contains additional citation context not shown here]
Ittner D. J., Lewis D. D., and Ahn D. D. (1995). Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, US.
....combining information from several sources (general classification indexes, specialized thesauri, etc.). Techniques for automatically deriving representations of categories (category profile extraction) and performing classification have been developed within the area of text categorization [Ittner 95, Lewis 96, Ng 97, Schütze 95, Yang 94, Yang 97], a discipline at the crossroads between information retrieval and machine learning. Text categorization uses machine learning techniques to inductively build representations of a given set of categories from a training set of documents ....
....some crosstalk. Therefore we are planning to implement category profiles also for multiple-word titles. In building category profiles we have several options: create them by hand, possibly by means of some interactive tools like in ACAB [Attardi 99], or use learning techniques like those by [Ittner 95, Lewis 96]. The latter techniques require a training set of categorized documents, which raises problems of bootstrapping. A possible solution is to start with a catalogue built with Theseus with minimal category profiles, made just with synonyms of 13 titles. A learning phase could then be ....
Ittner, D.J., Lewis, D.D., Ahn, D.D.: "Text categorization of low quality images", Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, US, 301--315, 1995.
....by performing a TSR by a factor of 10 using document frequency, only such words are removed, while the words from low to medium to high document frequency are preserved. Finally, note that a slightly more empirical form of feature selection by document frequency is adopted by many authors (e.g. [18, 28, 46]) who remove from consideration all terms that occur
Function | Denoted by | Mathematical form | Used in
Document frequency | $\#(t_k, c_i)$ | $P(t_k, c_i)$ | [1, 53]
Information gain | $IG(t_k, c_i)$ | $P(t_k, c_i) \log \frac{P(t_k, c_i)}{P(c_i) P(t_k)} + P(\bar{t}_k, c_i) \log \frac{P(\bar{t}_k, c_i)}{P(c_i) \cdots}$ ....
....name by Lewis [21, page 44]. In the $\chi^2$ and CC formulae, $g$ is as usual the cardinality of the training set. In the $RS(t_k, c_i)$ formula $d$ is a constant damping factor. in at most $x$ training documents (popular values for $x$ range from 1 to 3), either as the only form of dimensionality reduction [18] or before applying another more sophisticated form [28, 46]. 4.1.2 Other information theoretic TSR functions. Other more sophisticated information theoretic functions have been used in the literature; the most important of them are summarised in Table 1. Probabilities are interpreted as usual on ....
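The document-frequency heuristic mentioned in the excerpt above (dropping every term that occurs in at most $x$ training documents) is simple enough to sketch directly. The following is an illustration only, assuming documents are already tokenized into lists of strings; `prune_rare_terms` is a hypothetical name, not a function from any of the cited papers.

```python
from collections import Counter

def prune_rare_terms(docs, x=3):
    """Remove terms that occur in at most x documents.

    docs is a list of token lists; returns the pruned documents and the
    set of terms that were kept.
    """
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    kept = {term for term, n in df.items() if n > x}
    return [[t for t in doc if t in kept] for doc in docs], kept

# Made-up toy corpus for illustration.
docs = [
    ["wheat", "price", "rise"],
    ["wheat", "export"],
    ["price", "wheat", "fall"],
]
pruned, kept = prune_rare_terms(docs, x=1)
```

With `x=1`, only "wheat" (three documents) and "price" (two documents) survive in this toy corpus.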
[Article contains additional citation context not shown here]
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, US, 1995.
.... to the $j$-th feature of the profile is assigned by the following formula: $$w_j = \alpha\, w_{1j} + \beta\, \frac{1}{|C|} \sum_{d \in C} x_{dj} - \gamma\, \frac{1}{|\bar{C}|} \sum_{d \in \bar{C}} x_{dj} \qquad (4)$$ where $\alpha$, $\beta$ and $\gamma$ are parameters and can have different values: in many cases, standard values are used, such as 0, 26 and 4 respectively (as done by Lewis in [10]). In the formula, $x_{dj}$ is the weight of the $j$-th feature for the $d$-th document. The information used can be e.g. the one provided by TF-IDF or could be a simpler value, such as 0 if the feature is absent, 1 if present. This technique derives from studies on relevance feedback, where the profile ....
David J. Ittner, David D. Lewis, David D. Ahn, Text Categorization of Low Quality Images, Symp. on Document Analysis and Information Retrieval, April, 1995, pp. 301-315.
....$N$ and $n_i$ were estimated on the training data. Term weights for profiles were calculated by the Rocchio relevance feedback method [Rocchio, 1971]. Rocchio was developed in the vector space model and classifiers based on it have proven to be quite effective in filtering and classification tasks [Ittner et al. 1995, Schapire et al. 1998, Ragas and Koster, 1998]. Given a set of documents to be ranked for a topic, an ideal classifier should rank all relevant documents above the non-relevant ones. Such an ideal classifier might just not exist. Therefore, Rocchio settles for a classifier that maximizes the ....
Ittner, D. J., Lewis, D. D., and Ahn, D. D. (1995). Text Categorization of Low Quality Images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV. ISRI; University of Nevada.
....A partition is formed by a test on some attribute (e.g. is the feature database equal to 0). ID3 selects the test that provides the highest gain in information content. 3.4. Rocchio's algorithm. We have used a version of Rocchio's algorithm (Rocchio, 1971) adapted to text classification by Ittner et al. (1995). Rather than representing a document by a set of Boolean features indicating the presence or absence of a word, Rocchio's method uses the TF-IDF weight for each informative word. TF-IDF is one of the most successful and well-tested weighting schemes in Information Retrieval (IR). The computation ....
....Therefore, the TF-IDF weight of a term in one document is the product of its term frequency (TF) and the inverse of its document frequency (IDF). In addition, to prevent longer documents from having a better chance of retrieval, the weighted term vectors are normalized to unit length. Following Ittner et al. (1995), we use the average of the TF-IDF vectors of all examples of the interesting pages, and subtract away a weighted fraction (0.25) of the TF-IDF vectors of the uninteresting pages in order to get a prototype vector for the interesting class. Subtracting TF-IDF vectors of the uninteresting pages ....
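The procedure in the excerpt above (TF-IDF weighting, unit-length normalization, then a prototype formed from the interesting pages minus 0.25 times the uninteresting ones) can be sketched as follows. Several details are assumptions, since the excerpt is truncated: IDF is taken here as log(N / df), the uninteresting vectors are averaged before being subtracted, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """Map token lists to unit-length sparse TF-IDF vectors (dicts)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumed IDF form: log(N / df); other variants are common too.
        vec = {t: tf[t] * math.log(n / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else vec)
    return vectors

def mean_vector(vectors):
    """Component-wise average of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for t, w in vec.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}

def build_prototype(interesting, uninteresting, fraction=0.25):
    """Mean of interesting vectors minus a fraction of the uninteresting mean."""
    pos, neg = mean_vector(interesting), mean_vector(uninteresting)
    return {t: pos.get(t, 0.0) - fraction * neg.get(t, 0.0)
            for t in set(pos) | set(neg)}

def score(proto, vec):
    """Dot product between a prototype and a document vector."""
    return sum(w * proto.get(t, 0.0) for t, w in vec.items())

# Made-up pages: the first two are "interesting", the last two are not.
docs = [
    ["jazz", "music", "concert"],
    ["jazz", "album", "music"],
    ["stock", "market", "music"],
    ["stock", "price", "market"],
]
vectors = tfidf_vectors(docs)
proto = build_prototype(vectors[:2], vectors[2:])
```

A new page would then be scored with `score(proto, vec)`; in this toy example the jazz pages score higher against the prototype than the stock-market pages.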
Ittner, D., Lewis, D., & Ahn, D. (1995). Text categorization of low quality images. Symposium on Document Analysis and Information Retrieval (pp. 301--315). UNLV, Las Vegas, NV, ISRI.
....(or in the case of call routing, destinations) most closely matches a caller's request. Call routing is distinguished from text categorization by requiring a single destination to be selected, but allowing a request to be refined in an interactive dialogue. The closest previous work to ours is (Ittner, Lewis, and Ahn, 1995), in which noisy documents produced by optical character recognition are classified against multiple categories. We are further interested in carrying out the routing process using natural, conversational language. The only work on natural language call routing to date that we are aware of is ....
Ittner, David J., David D. Lewis, and David D. Ahn. (1995). Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas.
....the presence and absence of phrases (sparse n-gram). This also enables a fairer comparison with Ripper, whose rules are based only on the presence or absence of words in documents. 2.3 Rocchio. We also implemented a version of Rocchio's algorithm [Rocchio, 1971] as adapted to text categorization by [Ittner et al. 1995]. We represent the data (both training and test documents) as vectors of numeric weights. The weight vector for the $m$th document is $v^m = (v^m_1, v^m_2, \ldots, v^m_l)$ where $l$ is the number of indexing terms used. We use single words as terms. We follow the TF-IDF weighting [Salton, ....
....IDF weights. We plan to explore these extensions in future research. 3.5.2 Sensitivity of Rocchio to parameter settings. In the experiments reported above, the parameters $\beta$ and $\gamma$ for Rocchio were chosen based on experiments performed by different researchers on a different classification task [Ittner et al. 1995]. We also performed some smaller scale experiments to explore the sensitivity of Rocchio to these parameters, and the degree to which performance might be improved by parameter tuning. We choose eight different categories 19 from the ModApte split of the Reuters-21578 dataset, and ran Rocchio ....
[Article contains additional citation context not shown here]
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
.... p, the cumulative loss of the master algorithm over all t can be bounded relative to the loss suffered by the best possible fixed weight vector. 2.3 Rocchio. As a basis for comparison, we implemented a version of Rocchio's algorithm [Rocchio, 1971] as adapted to text categorization by [Ittner et al. 1995]. We represent the data (both training and test documents) as vectors of numeric weights. The weight vector for the $m$th document is $v^m = (v^m_1, v^m_2, \ldots, v^m_l)$ where $l$ is the number of indexing terms used. We use single words as terms. We follow the TF-IDF weighting [Salton, ....
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
....of simple terms. It does not provide a ranking of the possible labels for a given document. Therefore, the only performance measure we can use for comparison is the error rate. Rocchio. We implemented a version of Rocchio's algorithm [25] as adapted to text categorization by Ittner et al. [13] and modified to multiclass problems. In Rocchio, we represent the data (both training and test documents) as vectors of numeric weights. The weight vector for the $i$th document is $v^i = (v^i_1, v^i_2, \ldots, v^i_l)$ where $l$ is the number of indexing terms used. We use single words as ....
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
....example, p, the cumulative loss of the master algorithm over all t can be bounded relative to the loss suffered by the best possible fixed weight vector. 2.3 Rocchio. As a basis for comparison, we implemented a version of Rocchio's algorithm [Rocchio, 1971] as adapted to text categorization by [Ittner et al. 1995]. We represent the data (both training and test documents) as vectors of numeric weights. The weight vector for the $m$th document is $v^m = (v^m_1, v^m_2, \ldots, v^m_l)$ where $l$ is the number of indexing terms used. We use single words
Parameters: $\beta \in (0, 1)$, $C \in (0, 1)$, number of ....
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
.... Machine learning researchers tend to be aware of the large pattern recognition literature on naive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper we briefly review the naive Bayes classifier and its use in information retrieval. We concentrate on the particular issues that arise in applying the model to textual data, and ....
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301-315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
.... Machine learning researchers tend to be aware of the large pattern recognition literature on naive Bayes, but may be less aware of an equally large information retrieval (IR) literature dating back almost forty years [37, 38]. In fact, naive Bayes methods, along with prototype formation methods [44, 45, 24], accounted for most applications of supervised learning to information retrieval until quite recently. In this paper we briefly review the naive Bayes classifier and its use in information retrieval. We concentrate on the particular issues that arise in applying the model to textual data, and ....
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, NV, 1995. ISRI; Univ. of Nevada, Las Vegas.
No context found.
David J. Ittner, David D. Lewis, and David D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301-315, Las Vegas, US, 1995.
No context found.
ITTNER, D., LEWIS, D. and AHN, D. 1995, Text categorization of low quality images. Symposium of document analysis and information retrieval, UNLV, Las Vegas, NV, USA, ISRI, 301-315.
No context found.
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval, pages 301--315, Las Vegas, US, 1995.
No context found.
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, pages 301--315, Las Vegas, US, 1995.
No context found.
D. J. Ittner, D. D. Lewis, and D. D. Ahn. Text categorization of low quality images. In Proceedings of SDAIR-95, pages 301--315, Las Vegas, US, 1995.
No context found.
Ittner, D., Lewis, D., Ahn, D., 1995, Text categorization of low quality images, Fourth Annual Symposium on Document Analysis and Information Retrieval, 301-315, Las Vegas,
No context found.
D. J. Ittner, D. D. Lewis, D.D. Ahn. Text Categorization of Low Quality Images. Proc. of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR 95), Las Vegas, Nevada, April 24 - 26, 1995, pp. 301-315.