Introducing a Family of Linear Measures for Feature Selection in Text Categorization
Venue: IEEE Transactions on Knowledge and Data Engineering
Citations: 16 (2 self)
Citations
13212 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...t linear or nonlinear threshold functions to separate the examples of a certain category from the rest. They are based on the Structural Risk Minimization principle from computational learning theory [17]. The idea of structural risk minimization is to find a hypothesis h for which the lowest true error is guaranteed. The true error of h is the probability that h will make an error on an unseen...
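For reference, the guarantee behind structural risk minimization is usually stated as a VC-style bound (a textbook formulation, not quoted from this paper): with probability at least 1 − η over a sample of size n, a hypothesis h from a class of VC dimension d satisfies

    R(h) \le R_{emp}(h) + \sqrt{\frac{d\left(\ln\frac{2n}{d} + 1\right) - \ln\frac{\eta}{4}}{n}}

where R(h) is the true error and R_emp(h) the training error; SRM selects the hypothesis class that minimizes the right-hand side rather than the training error alone.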
4015 | Introduction to Modern Information Retrieval
- Salton, McGill
- 1983
Citation Context: ...t one consists of removing the stop words because they are useless for the classification. The second one involves mapping words with the same meaning to one morphological root, which is known as stemming [2]. The Porter algorithm [12] is used in this paper for this purpose. This algorithm strips common terminating strings (suffixes) from words in order to reduce them to their roots or stems. A list of su...
3701 | Support-vector networks
- Cortes, Vapnik
- 1995
Citation Context: ...This transformation makes possible the use of binary classifiers for the multicategory classification problem [11]. In this paper, the classification is performed using Support Vector Machines (SVM) [14], since they have been shown to perform fast [15] and well [16] in TC. The key to this good performance is that SVM are able to handle many features and to deal well with sparse examples. SVM are univ...
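As a concrete illustration of this setup, here is a minimal sketch of a linear SVM over sparse bag-of-words vectors, assuming scikit-learn is available (CountVectorizer and LinearSVC are standard scikit-learn classes; the toy corpus and labels are invented):

    # Linear SVM on a sparse term-frequency matrix (assumes scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["wheat prices rise", "stocks fall sharply", "wheat harvest begins"]
    labels = [1, 0, 1]  # 1 = in the category, 0 = rest (one binary problem)

    vec = CountVectorizer()
    X = vec.fit_transform(docs)       # sparse matrix: many features, few nonzeros
    clf = LinearSVC().fit(X, labels)  # learns a linear threshold function
    print(clf.predict(vec.transform(["wheat futures"])))  # likely [1]: "wheat" indicates the category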
2301 | Text Categorization with Support Vector Machines: Learning with Many Relevant Features
- Joachims
- 1997
Citation Context: ...rd (tf) in order to weight its importance in the document. Another measure for this purpose is tfidf, which takes into account the distribution of the words in the documents, or its variant tfc (see [11]), which also considers the different lengths of the documents. In this paper, tf is chosen because it is one of the most widely used [1], [6]. In the document representation, different sets of words can be...
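To make the weighting schemes concrete, here is a small sketch of tf and tfidf computed from raw counts (my own illustration, not the paper's code; the unsmoothed idf variant shown is one of several in use):

    import math
    from collections import Counter

    docs = [["wheat", "prices", "wheat"], ["stocks", "fall"], ["wheat", "harvest"]]

    def tf(word, doc):
        return Counter(doc)[word]  # raw term frequency in one document

    def idf(word, corpus):
        df = sum(1 for d in corpus if word in d)  # document frequency
        return math.log(len(corpus) / df)         # rarer words score higher

    def tfidf(word, doc, corpus):
        return tf(word, doc) * idf(word, corpus)

    print(tfidf("wheat", docs[0], docs))  # 2 * log(3/2)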
852 | A re-examination of text categorization methods
- Yang, Liu
- 1999
Citation Context: ...rs for the multicategory classification problem [11]. In this paper, the classification is performed using Support Vector Machines (SVM) [14], since they have been shown to perform fast [15] and well [16] in TC. The key to this good performance is that SVM are able to handle many features and to deal well with sparse examples. SVM are universal binary classifiers able to find out linear or nonlinear t...
650 | Inductive learning algorithms and representations for text categorization
- Dumais, Platt, et al.
- 1998
Citation Context: ...nary classifiers for the multicategory classification problem [11]. In this paper, the classification is performed using Support Vector Machines (SVM) [14], since they have been shown to perform fast [15] and well [16] in TC. The key to this good performance is that SVM are able to handle many features and to deal well with sparse examples. SVM are universal binary classifiers able to find out linear...
311 | Automated Learning of Decision Rules for Text Categorization”,
- Apte, Damerau, et al.
- 1994
Citation Context: ...document from a finite set of m categories is commonly converted into m binary problems, each one consisting of determining whether a document belongs to a fixed category or not (one-against-the-rest) [13]. This transformation makes possible the use of binary classifiers for the multicategory classification problem [11]. In this paper, the classification is performed using Support Vector Machines (SVM)...
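A hedged sketch of the one-against-the-rest transformation described above (my own illustration; the category names and documents are invented):

    # Turn one m-category problem into m binary problems (one-against-the-rest).
    categories = ["grain", "trade", "crude"]  # hypothetical label set
    labeled_docs = [("wheat prices rise", "grain"),
                    ("oil exports grow", "crude")]

    binary_problems = {}
    for c in categories:
        # Positive examples belong to c; every other document is negative.
        binary_problems[c] = [(doc, 1 if label == c else 0)
                              for doc, label in labeled_docs]

One binary classifier is then trained per category, and a new document is assigned to every category whose classifier accepts it.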
154 | Incremental reduced error pruning
- Fürnkranz, Widmer
- 1994
144 | Feature selection for unbalanced class distribution and Naive Bayes
- Mladenic, Grobelnik
- 1999
Citation Context: ...know that the word w does not occur in it. Usually, these probabilities are estimated by means of the corresponding relative frequencies. In the same direction, expected cross entropy for text (CET) [6] only takes into account the presence of the word in a category. It is defined by CET(w, c) = P(w) · P(c|w) · log(P(c|w) / P(c)). Yang and Pedersen [4] introduced the χ² statistic for feature reduct...
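A small sketch of CET estimated from relative frequencies, as the quoted passage describes (my own illustration; the counts are toy numbers):

    import math

    # Toy contingency counts for word w and category c (hypothetical numbers).
    n_docs = 100  # total documents
    n_w = 20      # documents containing w
    n_wc = 15     # documents containing w that belong to c
    n_c = 30      # documents in category c

    p_w = n_w / n_docs
    p_c = n_c / n_docs
    p_c_given_w = n_wc / n_w

    # CET(w, c) = P(w) * P(c|w) * log(P(c|w) / P(c))
    cet = p_w * p_c_given_w * math.log(p_c_given_w / p_c)
    print(cet)  # large when w concentrates in category c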
58 | Experiments on the use of feature selection and negative evidence in automated text categorization
- Galavotti, Sebastiani, et al.
- 2000
Citation Context: ...en [4] introduced the χ² statistic for feature reduction, which measures the lack of independence between a word and a category. A modification of this measure is S-χ², proposed by Galavotti et al. [7], who defined it by S-χ²(w, c) = P(w, c) · P(w̄, c̄) − P(w, c̄) · P(w̄, c). It has been shown that it performs better than χ² (see [7]). 2.3 Machine Learning Measures In [8], [9], we proposed severa...
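A sketch of the simplified χ² score computed from document counts (my own illustration; the bars denote absence of the word or category, and the counts are toy numbers):

    # Toy contingency table for word w and category c (hypothetical counts).
    n = 100           # total documents
    n_wc = 15         # w present, document in c
    n_w_notc = 5      # w present, document not in c
    n_notw_c = 15     # w absent, document in c
    n_notw_notc = 65  # w absent, document not in c

    # S-chi2(w, c) = P(w,c) * P(~w,~c) - P(w,~c) * P(~w,c)
    s_chi2 = (n_wc / n) * (n_notw_notc / n) - (n_w_notc / n) * (n_notw_c / n)
    print(s_chi2)  # positive when w and c co-occur more often than chance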
44 | A Comparative Study on Feature Selection
- Yang, Pedersen
- 1997
Citation Context: ...to use them. For this reason, the use of filtering measures is prominent in TC. It is based on scoring the features with some relevance measure and selecting a predefined number from the top ranked [4]. In this paper, we introduce a new family of filtering measures for FS in TC. They are simpler than most existing measures, but the experiments carried out show that they perform equally well or better than...
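The filtering approach the passage describes reduces to scoring a vocabulary and truncating the ranked list; a minimal sketch (my own illustration; the scoring function is a stand-in for any of the measures above):

    def select_features(words, score, k):
        """Keep the k words with the highest relevance score (filtering FS)."""
        ranked = sorted(words, key=score, reverse=True)
        return ranked[:k]

    # Usage with a stand-in scorer; in practice score would be CET, chi2, etc.
    vocabulary = ["wheat", "the", "harvest", "of"]
    toy_scores = {"wheat": 0.9, "the": 0.01, "harvest": 0.7, "of": 0.02}
    print(select_features(vocabulary, toy_scores.get, 2))  # ['wheat', 'harvest']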
36 | Machine learning in automated text categorisation
- Sebastiani
- 1999
Citation Context: ...e collections of text files. One of the main tasks in this processing is that of assigning the documents of a corpus to a set of previously fixed categories, which is known as Text Categorization (TC) [1]. This process involves some understanding of the contents of the documents and/or some previous knowledge of the topics. For this reason, this task has been traditionally performed by human readers...
19 | An algorithm for suffix stripping
- Porter
Citation Context: ...the stop words because they are useless for the classification. The second one involves mapping words with the same meaning to one morphological root, which is known as stemming [2]. The Porter algorithm [12] is used in this paper for this purpose. This algorithm strips common terminating strings (suffixes) from words in order to reduce them to their roots or stems. A list of suffixes to be removed is spe...
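For illustration, suffix stripping with the Porter algorithm via NLTK's implementation (a sketch assuming the nltk package is installed; PorterStemmer is a standard NLTK class):

    # Reduce morphological variants to a common stem (assumes nltk).
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["connection", "connected", "connecting"]:
        print(word, "->", stemmer.stem(word))  # all reduce to the stem "connect"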
14 | Scoring and Selecting Terms for Text Categorization
- Montanes, Díaz, et al.
Citation Context: ...by Galavotti et al. [7], who defined it by S-χ²(w, c) = P(w, c) · P(w̄, c̄) − P(w, c̄) · P(w̄, c). It has been shown that it performs better than χ² (see [7]). 2.3 Machine Learning Measures In [8], [9], we proposed several measures taken from the Machine Learning (ML) environment. Specifically, they are measures previously applied to quantify the quality of the rules induced by an ML algorithm. In...
2 | Measures of Rule Quality for Feature Selection
- Montanes, Fernández, et al.
- 2003
Citation Context: ...osed by Galavotti et al. [7], who defined it by S-χ²(w, c) = P(w, c) · P(w̄, c̄) − P(w, c̄) · P(w̄, c). It has been shown that it performs better than χ² (see [7]). 2.3 Machine Learning Measures In [8], [9], we proposed several measures taken from the Machine Learning (ML) environment. Specifically, they are measures previously applied to quantify the quality of the rules induced by an ML algorithm...
1 | Improving Performance of Text Categorisation by Combining Filtering and Support Vector
- Díaz, Ranilla, et al.
- 2004
Citation Context: ...ord occurring in many documents will have tfidf smaller than others, with the same tf, but appearing in fewer documents. Despite their simple appearance, these measures perform well in many situations [5]. 2.2 Information Theory Measures Measures taken from Information Theory (IT) have been widely used because it is interesting to consider the distribution of a word over the different categories. Amon...