| Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996. |
....It is, however, a difficult problem to find a suitable subset of words that still represents the essential characteristics of the documents. It is also important to remove the words which are not informative, hence most common words like and, with, to etc., which are also known as stop words [Yang and Wilbur, 1996], are removed from the text while creating the vector. In addition to term frequency (TF), its IDF (Inverse Document Frequency) [Tokunaga and Iwayama, 1994] is also used to score a term. IDF of term t is defined as $\mathrm{IDF}(t) = \log \frac{N}{N_t}$ (1) where N is the total number of documents in ....
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
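The IDF definition quoted in the excerpt above can be illustrated with a minimal Python sketch. The stop-word list, the whitespace tokenizer, the natural logarithm, and the function names `tokenize` and `idf_table` are all assumptions made for illustration; the excerpt specifies none of them.

```python
import math
from collections import Counter

# Illustrative stop list only; the excerpt names "and", "with", "to" as examples.
STOP_WORDS = {"and", "with", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (assumed tokenizer)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def idf_table(docs):
    """Return {term: log(N / N_t)} where N_t counts documents containing the term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term is counted once per document
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


docs = ["cats and dogs", "dogs to parks", "parks with cats and dogs"]
idf = idf_table(docs)
```

A term that appears in every document (here `dogs`) gets an IDF of zero, so it contributes nothing to a TF-IDF weight, which is the same intuition that motivates removing stop words outright.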
....words, and using weighted document vectors. 6 Implementation Details We implemented the following algorithm for creating document vectors, training category models, and finally, finding categories of new documents. 1. Perform preprocessing on documents. This includes removal of stop words [Yang and Wilbur, 1996] and stemming [Porter, 1997]. 2. Prepare TFIDF vectors for training documents. IDF (Inverse Document Frequency) for a word w is defined as [Tokunaga and Iwayama, 1994] $\mathrm{IDF}(w) = \log \frac{N}{N_w}$ (11) where N is the total number of documents and $N_w$ is the number of documents in which word ....
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
....a news story we can get the corresponding speech file. In speech queries we generally say "I would like to", "Is there any information about", etc. Such parts of speech are not essential for our retrieval process and they have to be ignored to obtain better results. These are called stop words [Yang and Wilbur, 1996]. Hence stop word removal from the query should be performed before any query processing is done. The document and query can be represented in terms of vectors where each component of the vector is an indexing term. The weight of each term i in the vector for document d is popularly found as ....
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
....(e.g. [Yang and Pedersen, 1997], [Moulinier et al., 1996]). The most popular approach to feature selection is to select a subset of the available features using methods like DF thresholding [Yang and Pedersen, 1997], the χ² test [Schütze et al., 1995], or the term strength criterion [Yang and Wilbur, 1996]. The most commonly used and often most effective [Yang and Pedersen, 1997] method for selecting features is the information gain criterion. It will be used in this paper following the setup in [Yang and Pedersen, 1997]. All words are ranked according to their information gain. To select a subset ....
Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357-369.
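The excerpt above says that all words are ranked by information gain before a subset is selected. A minimal sketch of that ranking follows, assuming single-label documents, binary term presence, and base-2 logarithms; the helper names `entropy`, `information_gain`, and `top_terms` are hypothetical and do not come from the cited papers.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in category entropy from knowing whether `term` is present."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def top_terms(docs, labels, k):
    """Rank the vocabulary by information gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))[:k]


# Each document is represented as a set of words (toy data, for illustration only).
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "match"}, {"vote", "party"}]
labels = ["sport", "sport", "politics", "politics"]
selected = top_terms(docs, labels, 2)
```

In this toy data `goal` and `vote` separate the two categories perfectly, while `match` occurs equally in both and carries no information, so it would be dropped first.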
....procedures for news categorization becomes a practical as well as a challenging problem. Typically, such a system is built by using a training set to extract the features that characterize the individual news categories. The techniques employed include machine learning [6][14][16], statistical [12][19], knowledge based [7] or the combinations [5]. After the initial training stage, periodic maintenance of the news categorization system is needed to avoid performance deterioration due to presence of new terms and new topics being discussed in the news articles. Also, outdated and expired terms ....
....Square Fit Mapping, to reduce noise for computational efficiency. In this study, multiple noise reducing strategies were used and the results show significant improvements in efficiency without losing categorization accuracy. The author uses corpus statistics in text categorization [19]. In this approach, each word in the training text is associated with a "word strength" indicating its importance, and words are removed if the associated word strength values are less than some threshold. Work has also been done on batch updates in B trees [10] for efficiency purposes. It is ....
Y. Yang. "Using Corpus Statistics to Remove Redundant Words in Text Categorization", JASIS, pp. 13-22, 1996.
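The thresholding step described in the excerpt above (drop every word whose strength falls below a cutoff) can be sketched as follows. The strength scores are taken as a precomputed input, since the excerpt does not say how they are derived; the function name `remove_weak_words`, the example scores, and the choice to treat unscored words as strength 0 are assumptions.

```python
def remove_weak_words(tokens, strength, threshold):
    """Keep only tokens whose word strength is at least `threshold`.

    `strength` maps word -> score; words without a score are treated as 0.0
    (an assumption, so unseen words are removed for any positive threshold).
    """
    return [w for w in tokens if strength.get(w, 0.0) >= threshold]


# Hypothetical scores, for illustration only.
strength = {"categorization": 0.62, "corpus": 0.48, "the": 0.03, "of": 0.02}
tokens = ["the", "categorization", "of", "corpus", "texts"]
kept = remove_weak_words(tokens, strength, threshold=0.1)
```

Raising the threshold removes more of the vocabulary, which is the trade-off between efficiency and accuracy that the excerpt refers to.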
....procedures for news categorization becomes a practical as well as a challenging problem. A typical solution uses a training set to extract the features that characterize the individual news categories. The techniques employed include machine learning [2][10][13][21][23][25], statistical [18][27], knowledge based [15] or the combinations [9]. After the initial training stage, the news categorization system may apply periodic maintenance to avoid performance deterioration caused by the presence of new terms and new topics being discussed in the news articles. Also, outdated and expired ....
....Linear Least Square Fit Mapping, to reduce noise for computational efficiency. In this study, multiple noise reducing strategies were used and the results show significant improvements in efficiency without losing categorization accuracy. The author uses corpus statistics in text categorization [27]. In this approach, each word in the training text is associated with a word strength indicating its importance, and words are removed if the associated word strength values are less than some threshold. Work has also been done on batch updates in B trees [16] for efficiency purposes. It is ....
Y. Yang. "Using Corpus Statistics to Remove Redundant Words in Text Categorization", JASIS, pp. 13-22, 1996.
.... (e.g. [Yang and Pedersen, 1997], [Moulinier et al., 1996]). The most popular approach to feature selection is to select a subset of the available features using methods like DF thresholding [Yang and Pedersen, 1997], the χ² test [Schütze et al., 1995], or the term strength criterion [Yang and Wilbur, 1996]. The most commonly used and often most effective [Yang and Pedersen, 1997] method for selecting features is the information gain criterion. It will be used in this paper following the setup in [Yang and Pedersen, 1997]. All words are ranked according to their information gain. To select a subset ....
Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369.
....been proposed with different word weighting schemes [53]. A more recent variant known as latent semantics indexing [12] performs a dimension reduction by singular value decomposition. Related methods of feature selection have been proposed for text categorization, e.g. the term strength criterion [66]. In contrast, we propose a model based statistical approach and present a family of finite mixture models [59, 35] as a way to deal with the data sparseness problem. Since mixture or class based models can also be combined with other models, our goal is orthogonal to standard interpolation ....
Y. Yang and J. Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369, 1996.
....retrieval. In general the proportion of removable words may be less but still very significant. Yang and Wilbur have applied the aggressive word removal method to document categorization to remove non informative words from documents before applying a categorization method to these documents [18]. The effects on several categorization methods on different document collections have been studied and the effectiveness has been evident in the experiments. For all the methods tested, including two statistical learning methods based on manual category assignments and a baseline text matching ....
Yang Y, W.J. Wilbur. (1995) Using Corpus Statistics to Remove Redundant Words in Text Categorization, J Amer Soc Inf Sci (accepted).
No context found.
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
No context found.
Y. Yang and J. W. Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369, 1996.
No context found.
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
No context found.
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.