F. Damerau C. Apte and S. Weiss. 1994. Toward language independent automated learning of text categorization models. In Proceedings SIGIR-94.
.... in the document). Many current systems that learn on text use the bag-of-words representation, using either Boolean features indicating whether a specific word occurred in a document (e.g. [6, 39, 40, 41, 42, 43, 44, 26, 45, 46, 10, 11, 12, 47, 48]) or the frequency of a word in a given document (e.g. [49, 50, 7, 51, 52, 53, 54, 55, 56, 4, 37, 48, 57]). There is also some work that uses additional information such as word position [40, 58, 12] or word tuples called n-grams [59, 37, 38, 60] (e.g. "machine learning" is a 2-gram and "World Wide Web" is a 3-gram). Some recent work [47] indicates that the usage of hypertext structure and graph ....
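As a rough illustration of the representations this passage names, the sketch below builds Boolean features, word-frequency features, and n-grams for one document. The whitespace tokenizer and all function names are assumptions made for the example; none of the cited systems is being reproduced here.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and split on whitespace.
    return text.lower().split()


def frequency_features(tokens):
    # Bag of words with counts: word -> number of occurrences in the document.
    return Counter(tokens)


def boolean_features(tokens):
    # Bag of words with Boolean features: the set of words that occur at all.
    return set(tokens)


def ngrams(tokens, n):
    # Word tuples of length n, joined with spaces for readability.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("Machine learning on the World Wide Web uses machine learning")
freq = frequency_features(tokens)
present = boolean_features(tokens)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

Here `bigrams` contains "machine learning" and `trigrams` contains "world wide web", matching the 2-gram and 3-gram examples in the passage.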
....different document representations over several domains showing clear advantages of some representations. 2) One of the frequently used approaches to reduce the number of different words is to remove words that occur in a stop list containing common English words like "a", "the", "with" (e.g. [49, 7, 56, 43, 37, 10, 11, 12, 61]) or to prune the infrequent words (word frequency < min. frequency) (e.g. [40, 53, 54, 37]). Connected to the particular language is also word stemming, used for example in [50, 7, 12, 61], which reduces the number of different words using a language-specific stemming algorithm (e.g. works, ....
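The two vocabulary-reduction steps described here, stop-list removal and pruning of infrequent words, can be sketched as follows. The stop list holds only the three example words from the passage, and the threshold value and the reading "keep a word only if it occurs at least `min_frequency` times in the collection" are assumptions for illustration.

```python
from collections import Counter

# Only the example stop words named in the passage; a real stop list is longer.
STOP_WORDS = {"a", "the", "with"}
# Assumed threshold, chosen only for this example.
MIN_FREQUENCY = 2


def reduce_vocabulary(documents, stop_words=STOP_WORDS, min_frequency=MIN_FREQUENCY):
    # Count each word over the whole collection (whitespace tokenization assumed).
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # Keep words that are not on the stop list and are frequent enough.
    return {w for w, c in counts.items() if w not in stop_words and c >= min_frequency}


docs = ["the cat sat with a dog", "the dog chased a cat", "a bird sang"]
vocab = reduce_vocabulary(docs)
```

In this toy collection only "cat" and "dog" survive: "a" and "the" are frequent but on the stop list, and every other word occurs once.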
[Article contains additional citation context not shown here]
Apté, C., Damerau, F., Weiss, S.M., Toward Language Independent Automated Learning of Text Categorization Models, Proc. of the 7th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, 1994.
....which learning algorithm is used. Table 2 summarizes them over some related papers in order to give an idea about the current trends. Systems given in Table 1 are included in this more detailed analysis, if there was

| Paper reference | Document Representation | Feature Selection | Learning |
|---|---|---|---|
| Apté et al. [2] | bag of words (freq) | stop list, frequency weight | Decision Rules |
| Armstrong et al. [3] | bag of words | informativity | TFIDF, Winnow, WordStat |
| Balabanović and Shoham [4] | bag of words (freq) | stop list, stemming, keep 10 best words | TFIDF |
| Bartell et al. [6] | bag of words (freq) | latent semantic indexing | .... |
.... (the number of times it occurs in the document). Many current systems that learn on text use the bag-of-words representation, using either Boolean features indicating whether a specific word occurred in a document (e.g. [3, 11, 32, 34, 40, 41, 47]) or the frequency of a word in a given document (e.g. [2, 4, 6, 7, 21, 29, 47]). There is also some work that uses additional information such as word position [11] or word tuples called n-grams [13, 45] (e.g. "machine learning" is a 2-gram and "World Wide Web" is a 3-gram). 2) One of the frequently used approaches to reduce the number of different words is to use ....
[Article contains additional citation context not shown here]
Apté, C., Damerau, F., Weiss, S.M., Toward Language Independent Automated Learning of Text Categorization Models, Proc. of the 7th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, 1994.
....test sets. In addition, we have tested our final version of the classifier on two common partitions of the complete Reuters collection, and compare the results with those of other works. The two partitions used are those of Lewis (Lewis, 1992) (14704 documents for training, 6746 for testing) and Apte (Apte, Damerau, and Weiss, 1994) (10645 training, 3672 testing, omitting documents with no topical category). To evaluate performance, the usual measures of recall and precision were used. Specifically, we measured the effectiveness of the classification by keeping track of the following four numbers:
- p_1 = number of ....
.... | 83.3 | 74.7
Experts unigram (Cohen and Singer, 1996) | 64.7 | 65.6
Neural Network (Wiener, Pedersen, and Weigend, 1995) | 77.5 | NA
Rocchio (Rocchio, 1971) | 74.5 | 66.0
Ripper (Cohen and Singer, 1996) | 79.6 | 71.9
Decision trees (Lewis and Ringuette, 1994) | NA | 67.0
Bayes (Lewis and Ringuette, 1994) | NA | 65.0
SWAP (Apte, Damerau, and Weiss, 1994) | 78.9 | NA
Table 2: Break-even points comparison.
The data is split into training set and test set based on Lewis's split (Lewis, 1992) (14704 documents for training, 6746 for testing) and Apte's split (Apte, Damerau, and Weiss, 1994) (10645 training, 3672 testing, omitting documents with no ....
[Article contains additional citation context not shown here]
Apte, C., F. Damerau, and S. Weiss. 1994. Towards language independent automated learning of text categorization models. In Proceedings of ACM-SIGIR Conference on Information Retrieval.
....selected feature subset. The usual way of learning on text defines a feature for each word that occurred in training documents. This can easily result in several tens of thousands of features. Most methods for feature subset selection that are used in information retrieval and text learning (e.g. [1], [3], [11]) are very simple compared to the methods developed in machine learning. Basically, some scoring measure that is used on a single feature is selected, a score is assigned to each feature independently, features are sorted according to the assigned score and a predefined number of the ....
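The score-and-rank procedure outlined in this passage might look like the sketch below. The passage does not name a scoring measure, so document frequency is used purely as a stand-in, and keeping the top-ranked features is an assumed reading of the truncated final clause. All names are hypothetical.

```python
from collections import Counter


def document_frequency(documents):
    # Stand-in scoring measure (an assumption): number of documents containing the word.
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    return df


def select_top_features(scores, k):
    # Each feature was scored independently; sort by score (ties broken
    # alphabetically) and keep a predefined number k of features.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [feature for feature, _ in ranked[:k]]


docs = ["rule induction for text", "text categorization with rules", "text learning"]
scores = document_frequency(docs)
selected = select_top_features(scores, 2)
```

Because each feature is scored on its own, the ranking ignores interactions between features, which is the simplicity the passage contrasts with methods developed in machine learning.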
Apté, C., Damerau, F., Weiss, S.M., Toward Language Independent Automated Learning of Text Categorization Models, Proc. of the 7th Annual Int. ACM-SIGIR Conference on Research and Development in Information Retrieval, 1994.
....for maximizing their predictive performance. 1 Background Our initial methodology for automatic text categorization was built around the use of rule induction, coupled with a new approach to constructing feature vectors that emphasized the use of local dictionaries and numerical features [Apté et al. 1994a, Apté et al. 1994b]. More recently, we have begun exploring methods for maximizing the predictive accuracy of the models constructed from the mining process. This is an important requirement, particularly in real world applications, where noisy and limited samples are a pervasive problem. One ....
....predictive performance. 1 Background Our initial methodology for automatic text categorization was built around the use of rule induction, coupled with a new approach to constructing feature vectors that emphasized the use of local dictionaries and numerical features [Apté et al. 1994a, Apté et al. 1994b]. More recently, we have begun exploring methods for maximizing the predictive accuracy of the models constructed from the mining process. This is an important requirement, particularly in real world applications, where noisy and limited samples are a pervasive problem. One particular approach that ....
[Article contains additional citation context not shown here]
C. Apté, F. Damerau, and S. Weiss. Towards Language Independent Automated Learning of Text Categorization Methods. pages 23--30, 1994.
No context found.
F. Damerau C. Apte and S. Weiss. 1994. Toward language independent automated learning of text categorization models. In Proceedings SIGIR-94.
No context found.
C. Apté, F. Damerau, S. M. Weiss. Towards Language Independent Automated Learning of Text Categorization Models. Proc. of 17th International Conference on Research and Development in Information Retrieval (SIGIR'94). Dublin City, Ireland, July 3 - 6, 1994, pp. 23-30.
No context found.
C. Apté, F. Damerau, and S. M. Weiss. Towards Language Independent Automated Learning of Text Categorization Models. SIGIR'94. Annual ACM SIGIR Conference on Research and Development in Information Retrieval. P24-30, 1994.
No context found.
, pages 23--30, Dublin, Ireland, July 3-6 1994.