Results 1 - 3 of 3
Hirsch: Evolving Rules for Document Classification
- Proceedings of the 8th European Conference on Genetic Programming 3447 (2005) 85–95
Cited by 6 (0 self)
Abstract. We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that, because the induced rules are meaningful to a human analyst, they may have a number of other uses beyond classification and provide a basis for text mining applications.
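As a rough illustration of the idea in this abstract, the sketch below scores a candidate rule built from character n-grams against labelled training documents. It is not the paper's method: the rule combinators (`contains`, `any_of`, `all_of`), the use of F1 to combine precision and recall, and the toy `TRAIN` documents are all assumptions made for this example, and no genetic search is performed.

```python
from typing import Callable, List, Tuple

# A rule maps a document to True (assign to the category) or False.
Rule = Callable[[str], bool]


def contains(ngram: str) -> Rule:
    """Hypothetical terminal: fires when the character n-gram occurs in the text."""
    return lambda doc: ngram in doc.lower()


def any_of(*rules: Rule) -> Rule:
    """Hypothetical function node: logical OR over sub-rules."""
    return lambda doc: any(r(doc) for r in rules)


def all_of(*rules: Rule) -> Rule:
    """Hypothetical function node: logical AND over sub-rules."""
    return lambda doc: all(r(doc) for r in rules)


def precision_recall(rule: Rule, docs: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Evaluate a rule against (text, is_in_category) training pairs."""
    tp = sum(1 for text, label in docs if rule(text) and label)
    fp = sum(1 for text, label in docs if rule(text) and not label)
    fn = sum(1 for text, label in docs if not rule(text) and label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_fitness(rule: Rule, docs: List[Tuple[str, bool]]) -> float:
    """Assumed fitness: harmonic mean of precision and recall."""
    p, r = precision_recall(rule, docs)
    return 2 * p * r / (p + r) if p + r else 0.0


# Toy training set (invented for illustration, not Reuters 21578 data).
TRAIN = [
    ("Wheat exports rose sharply", True),
    ("Grain and wheat prices fell", True),
    ("Corn harvest beats forecast", True),
    ("Bank raises interest rates", False),
    ("Oil output cut announced", False),
]

good_rule = any_of(contains("whea"), contains("corn"))
weak_rule = contains("ra")

print(f1_fitness(good_rule, TRAIN))
print(f1_fitness(weak_rule, TRAIN))
```

A search procedure would generate many such rule trees and keep the ones with the highest fitness; because each rule is a readable combination of n-grams, an analyst can inspect it directly.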
ADtrees for Sequential Data and N-gram Counting
, 2007
Abstract—We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.
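For context, the sketch below shows the traditional prefix tree that this abstract uses as a baseline, not the adapted ADtree itself, whose construction is not described here. The names `TrieNode`, `build_ngram_trie`, and `ngram_count`, and the toy token sequence, are assumptions made for this example.

```python
from typing import Dict, Sequence


class TrieNode:
    """One node per distinct n-gram prefix; count is how often that prefix occurs."""

    __slots__ = ("count", "children")

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[str, "TrieNode"] = {}


def build_ngram_trie(tokens: Sequence[str], max_n: int) -> TrieNode:
    """Insert every n-gram of length 1..max_n that starts at each position."""
    root = TrieNode()
    for start in range(len(tokens)):
        node = root
        for tok in tokens[start:start + max_n]:
            node = node.children.setdefault(tok, TrieNode())
            node.count += 1
    return root


def ngram_count(root: TrieNode, ngram: Sequence[str]) -> int:
    """Look up the stored count of an n-gram; 0 if it was never seen."""
    node = root
    for tok in ngram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count


# Toy corpus (invented for illustration).
tokens = "the cat sat on the cat mat".split()
trie = build_ngram_trie(tokens, max_n=3)

print(ngram_count(trie, ["the", "cat"]))
print(ngram_count(trie, ["cat", "sat", "on"]))
print(ngram_count(trie, ["dog"]))
```

N-grams that share a prefix share nodes, which is where the saving over a flat table of every n-gram comes from.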
Published version
This document is the author deposited version. You are advised to consult the publisher's version if you wish to cite from it.