Kumar Han, Karypis. Text categorization using weight adjusted k-nearest neighbour classification. Technical report, Dept. of CS, University of Minnesota. 62
....k denotes the number of neighbours included in the evaluation. As our documents can have more than one category assigned to them, we will try a number of ways of selecting categories from kNN. Although not perfect, research has shown that kNN is the best overall performing system on diverse sets [9, 32, 31, 11]. In practice kNN compares the document to be categorized with each of the already categorized documents and ranks them. This leads, as Mitchell and others [17, 15] point out, to an undesirable computational cost. 1.2.2 Rocchio To reduce the demand for processing cost we will also ....
....the correctness of our tests has enough contextual information, meaning that we can easily extract useful information about the behaviour of our system. 4.1 The Reuters Collection For the evaluation and test of different categorizing approaches we will therefore use a widely acknowledged [30, 14, 26, 11, 23] and distributed corpus originally used for information retrieval purposes, namely the Reuters-21578 collection [16]. In recent years the collection has been modified to fit text categorization purposes. The documents in the Reuters-21578 collection appeared as Reuters newswire stories in 1987. ....
Kumar Han, Karypis. Text categorization using weight adjusted k-nearest neighbour classification. Technical report, Dept. of CS, University of Minnesota. 62
....classifier, but kept multiply labeled documents in the train and test sets. A few papers report classification error rate (or accuracy, which is one minus error rate) on Reuters. Han et al. report a classification accuracy of 90 also by using 2000 selected features, and a weighted kNN approach [13]. Karypis et al. describe a method that determines the columns of the projection matrix as the differences of the means between clusters or classes in the data [18]. This is a similar but slightly more heuristic criterion than using only the between class covariance matrix in LDA and paying no ....
Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Text categorization using weight adjusted k-nearest neighbor classification. In Proc. PAKDD, 2001.
....improved retrieval effectiveness with insensitivity to weight variations. In addition to text retrieval, a combination of weighting schemes and similarity measures can be directly applied to text classification as a distance measure. Applying BTWS to, for example, k-nearest neighbor classification [4] and centroid-oriented classification [16] will be the next step to demonstrate its usefulness in classification. ....
E.S. Han, G. Karypis, and V. Kumar. Text categorization using weight adjusted k-nearest neighbor classification. Computer Science Technical Report TR99-019, Department of Computer Science, University of Minnesota, Minneapolis, Minnesota, 1999.
....used in kNN is that it uses all features in computing distances. In many document datasets, only a smaller number of the total vocabulary may be useful in categorizing documents. A possible approach to overcome this problem is to learn weights for different features (or words in document data sets) [14]. 1.2.2 Feature Projection Text Classifier FPTC is another nearest neighbor algorithm that is developed to make kNN more time efficient [12]. It is an extension of the kNN algorithm and based on the idea of representing training instances as their projections on each ....
.... k nearest neighbors of test instance t on feature f into Bag
    Bag = kBag(f, t, k)
    for each class c
        vote[c] = vote[c] + count[c, Bag]
    prediction = UNDETERMINED /* class 0 */
    for each class c
        if vote[c] > vote[prediction] then
            prediction = c
    return(prediction)
    end.
Figure 3.1: Classification in the FPTC Algorithm
instance t on feature f, computes the votes of a feature. As mentioned in Equation 3.1, distance between the values on a feature dimension is computed by using the diff(f, x, y) metric. Note that the bag returned by kBag(f, t, k) does not contain ....
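The voting procedure quoted in this context can be sketched in Python as follows. This is only an illustrative reading of the excerpt, not the cited authors' implementation: the function names `k_bag` and `fptc_predict` are hypothetical, the per-feature distance is assumed to be the absolute difference (Equation 3.1 is not shown in the excerpt), and ties are broken by class name for determinism.

```python
from collections import Counter

def k_bag(train, labels, f, t, k):
    """Return the class labels of the k training instances nearest to t on feature f.

    Assumption: diff(f, x, y) is taken to be |x[f] - y[f]|.
    """
    order = sorted(range(len(train)), key=lambda i: abs(train[i][f] - t[f]))
    return [labels[i] for i in order[:k]]

def fptc_predict(train, labels, t, k):
    """Sum per-feature votes and return the class with the most votes."""
    vote = Counter()
    for f in range(len(t)):
        bag = k_bag(train, labels, f, t, k)
        # vote[c] = vote[c] + count[c, Bag]
        for c, n in Counter(bag).items():
            vote[c] += n
    prediction = None  # stands in for UNDETERMINED
    for c in sorted(vote):
        if prediction is None or vote[c] > vote[prediction]:
            prediction = c
    return prediction

# Toy data: two features, two classes (made up for illustration).
train = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]]
labels = ["a", "a", "b", "b"]
print(fptc_predict(train, labels, [0.05, 0.1], k=2))
```

Each feature votes independently using only distances along its own dimension, which is what distinguishes this scheme from plain kNN, where a single distance is computed over all features.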
Han, E., Karypis, G., Kumar, V., Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. University of Minnesota, Minneapolis, USA. In Proceedings of The Twelfth International Joint Conference on Artificial Intelligence, 1991.
....grants CUHK4166 97E and CUHK4437 99E. Some experimental results will be given in section 4 and we will draw a conclusion in section 5. 2 Document Partitioning by Hyperlinks There are many algorithms proposed for web document categorization, like Bookmark Organizer (BO) [13] and HyPursuit [15]. WAKNN [8], PEBLS [4] and VSM [12, 14] are categorization algorithms that are based on the kNN classification paradigm [5]. The authors in [11, 3] use Bayesian classifiers [9] for document classification. The concept of mobile agent is also applied in [16] for classifying web documents. Because of the page limit, ....
Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar. Text categorization using weight adjusted k-nearest neighbor classification. In Technical Report # 99-019, 1999.
....terms. This concept centric nature of documents is also one of the reasons why the problem of document categorization (i.e. assigning a document into a pre determined class or topic) is particularly challenging. Over the years a variety of document categorization algorithms have been developed [12, 22, 50, 33, 42, 3, 69, 45, 25], both from the machine learning as well as from the Information Retrieval (IR) community. A surprising result of this research has been that naive Bayesian, a relatively simple classification algorithm, performs well [47, 48, 46, 54, 17] for document categorization, even when compared against ....
....due to the inconsistent performance of such schemes in these data sets. In particular, the right number of dimensions for different data sets varies considerably. For detailed experiments showing the characteristics of feature selection schemes in text categorization, readers are advised to see [70, 25]. 7 Conclusion and Directions of Future Work In this paper we presented a new fast dimensionality reduction technique called concept indexing that can be used equally well for reducing the dimensions in a supervised and in an unsupervised setting. CI reduces the dimensionality of a document ....
Eui-Hong Han. Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. PhD thesis, University of Minnesota, October 1999.
....that can help people find information from these huge resources. Text categorization presents huge challenges due to a large number of attributes, attribute dependency, multi modality and large training set. The various document categorization algorithms that have been developed over the years [36, 1, 8, 11, 25, 16, 19, 2, 42, 20, 13] fall under two general categories. The first category contains traditional machine learning algorithms such as decision trees, rule sets, instance based classifiers, probabilistic classifiers, support vector machines, etc. that have either been used directly or after being adapted for use in the ....
....results, primarily on low dimensional data sets. Unfortunately, one of the characteristics of document data sets is that there is a relatively large number of features that characterize each class. Decision tree based schemes like C4.5 do not work very well in this scenario due to overfitting [5, 13]. The overfitting occurs because the number of samples is relatively small with respect to the number of distinguishing words, which leads to very large trees with limited generalization ability. The C4.5 results were obtained using a locally modified version of the C4.5 algorithm capable of ....
Eui-Hong Han. Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. PhD thesis, University of Minnesota, October 1999.
....huge resources. Text categorization presents unique challenges due to the large number of attributes present in the data set, large number of training samples, attribute dependency, and multi modality of categories. This has led to the development of a variety of text categorization algorithms [31, 22, 25, 2, 55, 26, 19] that address these challenges to varying degrees. In this paper we focus on a simple centroid based document classification algorithm that has not been extensively studied and analyzed despite its simplicity and, as our experiments show, its robust performance. In this algorithm, a centroid ....
....Section 5 analyzes the classification model of the centroid based classifier and compares it against those used by other algorithms. Finally, Section 6 provides directions for future research. 2 Previous Work The various document categorization algorithms that have been developed over the years [47, 1, 10, 17, 31, 22, 25, 2, 55, 26, 19] fall under two general categories. The first category contains traditional machine learning algorithms such as decision trees, rule sets, instance based classifiers, probabilistic classifiers, support vector machines, etc. that have either been used directly or being adapted for use in the ....
Eui-Hong Han. Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. PhD thesis, University of Minnesota, October 1999.