13 citations found.
Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.


This paper is cited in the following contexts:
Constructing Better Document Vectors Universal.. - Shah, Chowdhary.. (2002)   (Correct)

....It is, however, a difficult problem to find a suitable subset of words that still represents the essential characteristics of the documents. It is also important to remove the words which are not informative, hence most common words like "and", "with", "to", etc., which are also known as stop words [Yang and Wilbur, 1996], are removed from the text while creating the vector. In addition to term frequency (TF), its IDF (Inverse Document Frequency) [Tokunaga and Iwayama, 1994] is also used to score a term. The IDF of term t is defined as $\mathrm{IDF}(t) = \log \frac{N}{N_t}$ (1), where N is the total number of documents in ....

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.


A Study for Evaluating the Importance of Various Parts of.. - Shah, Bhattacharyya (2002)   (Correct)

....words, and using weighted document vectors. 6 Implementation Details. We implemented the following algorithm for creating document vectors, training category models, and finally, finding categories of new documents. 1. Perform preprocessing on documents. This includes removal of stop words [Yang and Wilbur, 1996] and stemming [Porter, 1997]. 2. Prepare TFIDF vectors for training documents. IDF (Inverse Document Frequency) for a word w is defined as [Tokunaga and Iwayama, 1994] $\mathrm{IDF}(w) = \log \frac{N}{N_w}$ (11), where N is the total number of documents and $N_w$ is the number of documents in which word ....

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
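
The IDF definition quoted in the context above, IDF(w) = log(N / N_w), can be illustrated with a minimal Python sketch. The function names, the toy corpus, and the use of the natural logarithm are assumptions made for illustration; the quoted paper does not state the log base, and this is not the authors' implementation.

```python
import math
from collections import Counter


def idf_scores(documents):
    """Return IDF(w) = log(N / N_w) for every word in a tokenized corpus.

    N is the number of documents and N_w is the number of documents
    that contain word w. Natural log is an assumption of this sketch.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word at most once per document.
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight raw term frequencies by IDF; words unseen in training are skipped."""
    tf = Counter(tokens)
    return {word: count * idf[word] for word, count in tf.items() if word in idf}


# Hypothetical toy corpus, already tokenized and stop-word filtered.
docs = [["stock", "market", "news"], ["market", "report"], ["sports", "news"]]
idf = idf_scores(docs)
vector = tfidf_vector(["market", "market", "stock"], idf)
```

A word that occurs in every document gets IDF 0 under this definition, so it contributes nothing to the vector.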


Spoken Document Retrieval (SDR) for Broadcast News in Indian.. - Shah, Khan (2001)   (Correct)

....a news story we can get the corresponding speech file. In speech queries we generally say "I would like to", "Is there any information about", etc. Such parts of speech are not essential for our retrieval process and they have to be ignored to obtain better results. These are called stop words [Yang and Wilbur, 1996]. Hence stop word removal from the query should be performed before any query processing is done. The document and query can be represented in terms of vectors where each component of the vector is an indexing term. The weight of each term i in the vector for document d is popularly found as ....

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.
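
The stop word removal step described in the context above might be sketched as follows. The stop list here is a small hypothetical set built from the example phrases in the snippet, and whitespace tokenization is an assumption; neither comes from the cited papers.

```python
# Hypothetical stop list derived from the example phrases quoted above.
STOP_WORDS = {"i", "would", "like", "to", "is", "there", "any", "information", "about"}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Lowercase a query, split on whitespace, and drop stop words."""
    tokens = query.lower().split()
    return [token for token in tokens if token not in stop_words]


terms = remove_stop_words("I would like to hear about cricket scores")
```

The remaining terms are what would be passed on to the vector representation of the query.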


Text Categorization with Support Vector Machines: Learning with.. - Joachims (1997)   (357 citations)  (Correct)

....(e.g. [Yang and Pedersen, 1997] [Moulinier et al. 1996]). The most popular approach to feature selection is to select a subset of the available features using methods like DF thresholding [Yang and Pedersen, 1997], the χ² test [Schütze et al. 1995], or the term strength criterion [Yang and Wilbur, 1996]. The most commonly used and often most effective [Yang and Pedersen, 1997] method for selecting features is the information gain criterion. It will be used in this paper following the setup in [Yang and Pedersen, 1997]. All words are ranked according to their information gain. To select a subset ....

Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357-369.


Feature Reduction and Database Maintenance in NETNEWS.. - Hsu, Lang   (Correct)

....procedures for news categorization becomes a practical as well as a challenging problem. Typically, such a system is built by using a training set to extract the features that characterize the individual news categories. The techniques employed include machine learning [6] [14] [16], statistical [12] [19], knowledge based [7], or the combinations [5]. After the initial training stage, periodic maintenance of the news categorization system is needed to avoid performance deterioration due to the presence of new terms and new topics being discussed in the news articles. Also, outdated and expired terms ....

....Square Fit Mapping, to reduce noise for computational efficiency. In this study, multiple noise reducing strategies were used and the results show significant improvements in efficiency without losing categorization accuracy. The author uses corpus statistics in text categorization [19]. In this approach, each word in the training text is associated with a "word strength" indicating its importance, and words are removed if the associated word strength values are less than some threshold. Work has also been done on batch updates in B trees [10] for efficiency purposes. It is ....

Y. Yang. "Using Corpus Statistics to Remove Redundant Words in Text Categorization", JASIS, pp. 13-22, 1996.


Classification Algorithms for NETNEWS Articles - Hsu, Lang (1999)   (2 citations)  (Correct)

....procedures for news categorization becomes a practical as well as a challenging problem. A typical solution uses a training set to extract the features that characterize the individual news categories. The techniques employed include machine learning [2] [10] [13] [21] [23] [25], statistical [18] [27], knowledge based [15], or the combinations [9]. After the initial training stage, the news categorization system may apply periodic maintenance to avoid performance deterioration caused by the presence of new terms and new topics being discussed in the news articles. Also, outdated and expired ....

....Linear Least Square Fit Mapping, to reduce noise for computational efficiency. In this study, multiple noise reducing strategies were used and the results show significant improvements in efficiency without losing categorization accuracy. The author uses corpus statistics in text categorization [27]. In this approach, each word in the training text is associated with a word strength indicating its importance, and words are removed if the associated word strength values are less than some threshold. Work has also been done on batch updates in B trees [16] for efficiency purposes. It is ....

Y. Yang. "Using Corpus Statistics to Remove Redundant Words in Text Categorization", JASIS, pp. 13-22, 1996.
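
The threshold rule described in the context above (remove a word when its word strength is below a threshold) can be sketched in Python. The strength values and the threshold below are made-up inputs for illustration; the sketch does not compute word strength itself, which is defined in the cited paper.

```python
def prune_weak_words(word_strength, threshold):
    """Keep only words whose strength is at least the threshold.

    word_strength maps each word to a precomputed strength score.
    Words scoring below the threshold are removed, as in the quoted rule.
    """
    return {word for word, strength in word_strength.items() if strength >= threshold}


# Hypothetical scores; real values would come from corpus statistics.
strengths = {"the": 0.02, "of": 0.03, "election": 0.61, "budget": 0.45}
kept = prune_weak_words(strengths, threshold=0.1)
```

Raising the threshold removes more words, trading vocabulary size against the risk of discarding informative terms.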


Text Categorization with Support Vector Machines: Learning with.. - Joachims (1998)   (357 citations)  (Correct)

.... (e.g. [Yang and Pedersen, 1997] [Moulinier et al. 1996]). The most popular approach to feature selection is to select a subset of the available features using methods like DF thresholding [Yang and Pedersen, 1997], the χ² test [Schütze et al. 1995], or the term strength criterion [Yang and Wilbur, 1996]. The most commonly used and often most effective [Yang and Pedersen, 1997] method for selecting features is the information gain criterion. It will be used in this paper following the setup in [Yang and Pedersen, 1997]. All words are ranked according to their information gain. To select a subset ....

Yang, Y. and Wilbur, J. (1996). Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369.


Statistical Models for Co-occurrence Data - Hofmann, Puzicha (1998)   (24 citations)  (Correct)

....been proposed with different word weighting schemes [53] A more recent variant known as latent semantics indexing [12] performs a dimension reduction by singular value decomposition. Related methods of feature selection have been proposed for text categorization, e.g. the term strength criterion [66]. In contrast, we propose a model based statistical approach and present a family of finite mixture models [59, 35] as a way to deal with the data sparseness problem. Since mixture or class based models can also be combined with other models our goal is orthogonal to standard interpolation ....

Y. Yang and J. Willbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369, 1996.


Noise Reduction in a Statistical Approach to Text Categorization - Yang (1995)   (16 citations)  Self-citation (Yang)   (Correct)

....retrieval. In general the proportion of removable words may be less but still very significant. Yang and Wilbur have applied the aggressive word removal method to document categorization to remove non-informative words from documents before applying a categorization method to these documents [18]. The effects on several categorization methods on different document collections have been studied and the effectiveness has been evident in the experiments. For all the methods tested, including two statistical learning methods based on manual category assignments and a baseline text matching ....

Yang Y, W.J. Wilbur. (1995) Using Corpus Statistics to Remove Redundant Words in Text Categorization, J Amer Soc Inf Sci (accepted).


Evaluating High Accuracy Retrieval Techniques - Shah, Croft (2004)   (Correct)

No context found.

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.


Combining Machine Learning and Hierarchical Structures for Text.. - Ruiz (2001)   (1 citation)  (Correct)

No context found.

Y. Yang and J. W. Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society for Information Science, 47(5):357--369, 1996.


Evaluating High Accuracy Retrieval Techniques - Chirag Shah Bruce   (Correct)

No context found.

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.


Improving Document Vectors Representation Using Semantic.. - Shah, Bhattacharyya   (Correct)

No context found.

Yiming Yang and John Wilbur. Using corpus statistics to remove redundant words in text categorization. Journal of the American Society of Information Science, 47(5), 1996.

