Text Classification using String Kernels

by Huma Lodhi , Craig Saunders , John Shawe-Taylor , Nello Cristianini , Chris Watkins
Citations:495 - 7 self
BibTeX

@MISC{Lodhi_textclassification,
    author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Chris Watkins},
    title = {Text Classification using String Kernels},
    year = {}
}


Abstract

We propose a novel approach for categorizing text documents based on the use of a special kernel. The kernel is an inner product in the feature space generated by all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences that are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. Experimental comparisons of the performance of the kernel compared with a standard word feature space kernel Joachims (1998) show positive results on modestly sized datasets. The case of contiguous subsequences is also considered for comparison with the subsequences kernel with different decay factors. For larger documents and datasets the paper introduces an approximation technique that is shown to deliver good approximations efficiently for large datasets.
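
The kernel described in the abstract can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the authors' code: the function names `ssk_bruteforce` and `ssk_dp` and the parameter name `lam` (the decay factor, taken to lie in (0, 1]) are hypothetical, and the recursion is the commonly cited dynamic-programming formulation of the subsequence kernel rather than anything stated in the abstract itself. The brute-force version builds the explicit feature vector (one coordinate per length-k subsequence, weighted by `lam` raised to the span the occurrence covers), and the dynamic-programming version computes the same inner product in roughly O(k·|s|·|t|) time. Both assume k >= 1.

```python
from collections import defaultdict
from itertools import combinations


def ssk_bruteforce(s, t, k, lam):
    """Inner product of explicit feature vectors (exponential cost)."""
    def features(text):
        phi = defaultdict(float)
        # Every increasing index tuple of size k is one occurrence.
        for idx in combinations(range(len(text)), k):
            span = idx[-1] - idx[0] + 1  # full length covered in the text
            phi["".join(text[i] for i in idx)] += lam ** span
        return phi

    phi_s, phi_t = features(s), features(t)
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)


def ssk_dp(s, t, k, lam):
    """Same kernel value via dynamic programming over prefixes."""
    n, m = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i(s[:a], t[:b]).
    kp = [[[0.0] * (m + 1) for _ in range(n + 1)] for _ in range(k)]
    for a in range(n + 1):
        for b in range(m + 1):
            kp[0][a][b] = 1.0  # K'_0 is 1 for every pair of prefixes
    for i in range(1, k):
        for a in range(i, n + 1):
            kpp = 0.0  # running K''_i(s[:a], t[:b]) as b grows
            for b in range(i, m + 1):
                kpp *= lam
                if s[a - 1] == t[b - 1]:
                    kpp += lam * lam * kp[i - 1][a - 1][b - 1]
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    total = 0.0
    # Close each match of the final (k-th) character in both strings.
    for a in range(k, n + 1):
        for b in range(k, m + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[k - 1][a - 1][b - 1]
    return total


if __name__ == "__main__":
    lam = 0.5
    # "cat" and "car" share only the subsequence "ca" for k = 2.
    print(ssk_dp("cat", "car", 2, lam), lam ** 4)
```

For "cat" and "car" with k = 2, the only shared subsequence is "ca", which occurs contiguously (span 2) in both strings, so the kernel value is `lam**2 * lam**2`. In practice the value is often divided by `sqrt(K(s, s) * K(t, t))` to remove the dependence on document length; that normalization is left out of the sketch.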

Keyphrases

text classification    string kernel    feature space    inner product    large datasets    feature vector    prohibitive amount    different decay factor    good approximation    full length    contiguous subsequence    dynamic programming technique    text document    special kernel    modest value    novel approach    experimental comparison    positive result    approximation technique    direct computation    ordered sequence
