Extracting key-substring-group features for text classification
- In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’06), 2006
Cited by 6 (0 self)
In many text classification applications, it is appealing to treat every document as a string of characters rather than a bag of words. Previous research in this area has mostly focused on different variants of generative Markov chain models. Although discriminative machine learning methods such as the Support Vector Machine (SVM) have been quite successful in text classification with word features, it is neither effective nor efficient to apply them directly with all substrings in the corpus as features. In this paper, we propose to partition all substrings into statistical equivalence groups, and then pick those groups which are important (in the statistical sense) as features (named key-substring-group features) for text classification. In particular, we propose a suffix-tree-based algorithm that can extract such features in linear time (with respect to the total number of characters in the corpus). Our experiments on English, Chinese and Greek datasets show that SVM with key-substring-group features can achieve outstanding performance for various text classification tasks.
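The grouping idea in this abstract can be illustrated with a small sketch. It is not the paper's suffix-tree algorithm: it enumerates substrings naively (so it is far from linear time), and it assumes one simple reading of "statistical equivalence", namely that two substrings belong to the same group when they occur at exactly the same start positions in the corpus. The function name `substring_groups` and the `max_len` cap are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def substring_groups(docs, max_len=4):
    """Group substrings that occur at exactly the same (doc, start) positions.

    Assumption: "equivalence" means identical occurrence sets. This is a
    naive enumeration, not the linear-time suffix-tree method.
    """
    # Map each substring (up to max_len characters) to its occurrence set.
    positions = defaultdict(set)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                positions[text[i:j]].add((d, i))

    # Substrings sharing one occurrence set fall into one group.
    groups = defaultdict(list)
    for sub, occ in positions.items():
        groups[frozenset(occ)].append(sub)
    return [sorted(g, key=len) for g in groups.values()]


if __name__ == "__main__":
    for group in substring_groups(["abab"]):
        print(group)
```

Under this reading, each group could be represented by a single feature, which is why grouping shrinks the feature space compared with using every substring separately. Selecting the "important" groups would be a further step that is not shown here.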
Text Augmentation: Inserting XML tags into natural language text with PPM Models and Viterbi-like search
, 2003
Cited by 2 (0 self)
This thesis develops work on using Hidden Markov Models to insert tags into natural language text. A taxonomy of tags is developed, unifying the fields of text segmentation tagging, part-of-speech tagging, proper noun extraction and hierarchical entity extraction. The search spaces for inserting tags are examined from both a theoretical and an experimental point of view across the taxonomy and on four corpora. An analysis of different correctness measures for different types of tag-insertion problem is undertaken, and a technique to determine whether tag-insertion errors are the result of a modelling failure or a searching failure is discovered.
Using Compression to Identify Classes of Inauthentic Texts
- Proceedings of the 2006 SIAM Conference on Data Mining
, 2006
Cited by 2 (0 self)
Recent events have made it clear that some kinds of technical texts, generated by machine and essentially meaningless, can be confused with authentic technical texts written by humans. We identify this as a potential problem, since no existing systems for, say, the web can or do discriminate on this basis. We believe that there are subtle short- and long-range word or even string co-occurrences extant in human texts, but not in many classes of computer-generated texts, that can be used to discriminate based on meaning. In this paper we employ universal lossless source coding algorithms to generate features in a high-dimensional space and then apply support vector machines to discriminate between the classes of authentic and inauthentic texts. Compression profiles for the two kinds of text are distinct: the authentic texts are bounded by various classes of more compressible or less compressible texts that are computer generated. This in turn led to the high prediction accuracy of our models, which supports our conjecture that there exists a relationship between meaning and compressibility. Our results show that the learning algorithm based upon the compression profile outperformed standard term-frequency text categorization schemes on several non-trivial classes of inauthentic texts.
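A compression profile of the kind this abstract describes can be sketched in a few lines. The abstract does not name the compressors or the exact feature definition, so the choice of `zlib`, `bz2` and `lzma` from the Python standard library, the use of compressed-to-original size ratios as features, and the function name `compression_profile` are all assumptions made for illustration. The classification step with support vector machines is not shown.

```python
import bz2
import lzma
import zlib


def compression_profile(text):
    """Return one compression ratio per compressor for the given text.

    Assumption: a "profile" is the vector of compressed/original size
    ratios; the paper's actual features may be defined differently.
    """
    data = text.encode("utf-8")
    if not data:
        raise ValueError("text must be non-empty")
    n = len(data)
    return [
        len(zlib.compress(data, 9)) / n,  # LZ77-based (DEFLATE)
        len(bz2.compress(data, 9)) / n,   # Burrows-Wheeler based
        len(lzma.compress(data)) / n,     # LZMA
    ]


if __name__ == "__main__":
    print(compression_profile("the quick brown fox " * 200))
```

Highly repetitive text yields ratios close to zero, while text with little repeated structure yields ratios closer to one; vectors like these could then be passed to a classifier as features.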

