
Hash table sizes for storing n-grams for text processing (2000)

by Z Gu, D Berleant

Results 1 - 3 of 3

Evolving Rules for Document Classification

by Laurence Hirsch, Masoud Saeedi, Robin Hirsch - Proceedings of the 8th European Conference on Genetic Programming 3447 (2005) 85–95
Cited by 6 (0 self)
Abstract. We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that because the induced rules are meaningful to a human analyst they may have a number of other uses beyond classification and provide a basis for text mining applications.

Citation Context

...ngle terms, phrases or particular sequences of terms. Where N-Grams or phrases are used the length of the phrase or N-Gram must also be determined. Although many of these options have been researched [19] it is often the case that effects on the performance of the classifier will depend on the particular classifier and the particular text environment [20]. We have developed a GP system where many of t...

ADtrees for Sequential Data and N-gram Counting

by Robert Van Dam, Dan A. Ventura, 2007
Abstract—We consider the problem of efficiently storing n-gram counts for large n over very large corpora. In such cases, the efficient storage of sufficient statistics can have a dramatic impact on system performance. One popular model for storing such data derived from tabular data sets with many attributes is the ADtree. Here, we adapt the ADtree to benefit from the sequential structure of corpora-type data. We demonstrate the usefulness of our approach on a portion of the well-known Wall Street Journal corpus from the Penn Treebank and show that our approach is exponentially more efficient than the naïve approach to storing n-grams and is also significantly more efficient than a traditional prefix tree.

Citation Context

...rrection, part of speech tagging [1], [2], [3], [4], [5] and have since been appropriated by researchers in many other fields, particularly information retrieval, bioinformatics, and data compression [6], [7], [8]. N-grams have become so ubiquitous due in great part to their flexibility. Not only can models be built which take into account vastly different amounts of context (the n in n-gram) but th...
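
For readers unfamiliar with the two baselines the abstract above compares against, the following is a minimal sketch of a flat hash table of n-gram counts next to a plain prefix tree. It is not the ADtree from the paper, and every name in it (`ngram_counts_flat`, `TrieNode`, `ngram_counts_trie`, `trie_lookup`) is hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngram_counts_flat(tokens, n):
    """Naive storage: one hash-table entry per distinct k-gram, 1 <= k <= n.

    Each key is a full tuple of tokens, so shared prefixes are stored repeatedly.
    """
    counts = Counter()
    for i in range(len(tokens)):
        for k in range(1, n + 1):
            if i + k <= len(tokens):
                counts[tuple(tokens[i:i + k])] += 1
    return counts


class TrieNode:
    """One node per distinct k-gram; the path from the root spells the k-gram."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def ngram_counts_trie(tokens, n):
    """Prefix-tree storage: k-grams that share a prefix share the same nodes."""
    root = TrieNode()
    for i in range(len(tokens)):
        node = root
        # Walk (and extend) the path for the window starting at position i.
        for tok in tokens[i:i + n]:
            child = node.children.get(tok)
            if child is None:
                child = node.children[tok] = TrieNode()
            child.count += 1
            node = child
    return root


def trie_lookup(root, gram):
    """Return the stored count for a k-gram, or 0 if it never occurred."""
    node = root
    for tok in gram:
        node = node.children.get(tok)
        if node is None:
            return 0
    return node.count


if __name__ == "__main__":
    tokens = "a b a b c".split()
    flat = ngram_counts_flat(tokens, 2)
    root = ngram_counts_trie(tokens, 2)
    print(flat[("a", "b")], trie_lookup(root, ("a", "b")))
```

Both structures hold the same counts; the prefix tree only avoids repeating the leading tokens of each key. How much either one saves on a real corpus depends on the data and is not something this sketch measures.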

Published version

by Laurence Hirsch, Robin Hirsch, Masoud Saeedi
This document is the author deposited version. You are advised to consult the publisher's version if you wish to cite from it.
