
Text Classification and Segmentation Using Minimum Cross-Entropy (2000)  (2 citations)
W.J. Teahan
Proceeding of RIAO-00, 6th International Conference ``Recherche d'Information Assistee par Ordinateur''



View or download:
133.23.229.11/~ysuzuki/Proce...78DO3.ps

From:  133.23.229.11/~ysuzuki/Proceed...

Abstract: Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate...
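
The ranking idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the PPM escape mechanism is replaced here by plain add-one smoothing, and the class name, the order of 2, the alphabet size of 256, and the toy training strings are all assumptions made for illustration. One model is trained per class, and a text is assigned to the class whose model codes it in the fewest bits per character.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: add-one smoothing stands in for PPM's escape mechanism.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        # counts[context][char] = times char followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, text):
        # Pad with NUL characters so the first characters have a context.
        padded = "\0" * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1
            self.totals[context] += 1

    def cross_entropy(self, text):
        """Average number of bits per character to code `text`."""
        if not text:
            return 0.0
        padded = "\0" * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            count = self.counts.get(context, {}).get(padded[i], 0)
            total = self.totals.get(context, 0)
            prob = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits / len(text)


def classify(text, models):
    """Return the label whose model gives the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


# Hypothetical toy training data, far too small for real use.
models = {"english": CharMarkovModel(), "french": CharMarkovModel()}
models["english"].train("the cat sat on the mat and the dog ran to the house")
models["french"].train("le chat est sur le tapis et le chien est dans la maison")
print(classify("the dog and the cat", models))
```

Ranking candidate labels by `cross_entropy` rather than only taking the minimum gives the ordered list of classes that the abstract refers to.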

Context of citations to this paper:

.... have also been cited as possible candidates for this problem [37]. A compression based method, similar to ours, has been described in [36]. This paper describes the use of PPM based compression techniques for resolving the authorship attribution problem for two authors as...

Cited by:
Spam Filtering using Character-level Markov Models: - Experiments For The
Using Compression For Source Based Classification Of Text - Thaper (2001)

Similar documents (at the sentence level):
8.2%:   Correcting English text using PPM models - Teahan, Inglis, Cleary, Holmes (1998)
6.4%:   Unbounded Length Contexts for PPM - Cleary, Teahan, Witten (1995)

Active bibliography (related documents):
0.6:   A Compression-based Algorithm for Chinese Word Segmentation - Teahan, Wen, McNab, Witten (2000)
0.3:   Models of English text - Teahan, Cleary (1997)
0.3:   Segmenting Documents by Stylistic Character - Graham, Hirst, Marthi (2004)

Similar documents based on text:
0.5:   Using Literal and Grammatical Statistics for.. - Kukushkina.. (2002)
0.3:   Combining PPM models using a text mining approach - Teahan, Harper (2001)
0.3:   Using Markov Chains for Identification of Writers - Khmelev, Tweedie (2002)

Related documents from co-citation:
2:   Text categorization using compression models - Frank, Chui et al. - 2000
2:   Data compression using adaptive coding and partial string matching - Cleary, Witten - 1984

BibTeX entry:

W. J. Teahan. Text classification and segmentation using minimum cross-entropy. In International Conference on Content-Based Multimedia Information Access (RIAO), 2000. http://citeseer.ist.psu.edu/teahan00text.html

@inproceedings{ teahan00text,
    author = "William J. Teahan",
    title = "Text classification and segmentation using minimum cross-entropy",
    booktitle = "Proceeding of {RIAO}-00, 6th International Conference ``Recherche d'Information Assistee par Ordinateur''",
    address = "Paris, FR",
    year = "2000",
    url = "citeseer.ist.psu.edu/teahan00text.html" }
Citations (may not include all citations):
3972   Introduction to algorithms - Cormen, Leiserson et al. - 1990
1447   A mathematical theory of communication - Shannon - 1948
368   Text compression - Bell, Cleary et al. - 1990
337   Error bounds for convolutional codes and an asymptotically o.. - Viterbi - 1967
328   A maximum likelihood approach to continuous speech recogniti.. - Bahl, Jelinek et al. - 1983
274   Estimation of probabilities from sparse data for the languag.. - Katz - 1987
128   Self-organized language modeling for speech recognition - Jelinek - 1990
108   Prediction and entropy of printed English - Shannon - 1951
104   Techniques for automatically correcting words in text - Kukich - 1992
78   Frequency analysis of English usage: lexicon and grammar - Francis, Kucera - 1982
64   A tree-based statistical language model for natural language.. - Bahl, Brown et al. - 1989
61   Unbounded length contexts for PPM - Cleary, Teahan - 1997
49   An estimate of an upper bound for the entropy of English - Brown, Della et al. - 1992
39   Part-of-speech tagging with neural networks - Schmid - 1994
36   The computational analysis of English - Garside, Leech et al. - 1987
32   A comparison of event models for Naive Bayes text classificati.. - McCallum, Nigam - 1998
22   Applied Bayesian and classical inference: the case of the Fe.. - Mosteller, Wallace - 1984
20   Sequential coding algorithms: A survey and cost analysis - Anderson, Mohan - 1984
16   A spelling correction program based on a noisy channel model - Kernighan, Church et al. - 1990
15   USeg: A retargetable word segmentation procedure for informa.. - Ponte, Croft - 1996
15   Modelling English text - Teahan - 1998
14   The entropy of English using PPM-based models - Teahan, Cleary - 1996
14   Disambiguation of prepositional phrases in automatically lab.. - Boggess, Agarwal et al. - 1991
11   A compression-based algorithm for Chinese word segmentation - Teahan, Wen et al. - 2000
10   The design and analysis of efficient lossless data compression .. - Howard - 1997
8   Correcting English text using PPM models - Teahan, Inglis et al. - 1998
8   Improving text classification by shrinkage in a hierarchy of c.. - McCallum, Rosenfeld et al. - 1998
7   in ACM Transactions on Information Systems - Lewis, Hayes - 1994
5   The tagged LOB Corpus - Johansson, Atwell et al. - 1986
3   Error-driven learning of Chinese word segmentation - Hockenmaier, Brew - 1998
3   What Can We Do With Small Corpora - Juola - 1998
1   Statistical techniques for language recognition: an introduc.. - Ganeson, Sherman - 1993
1   Statistical identification of language - Dunning - 1994
1   Mining on-line text - Knight - 1999
1   Proceedings of the 33rd Annual Meeting of the ACL - Ristad, Thomas - 1995

Documents on the same site (http://133.23.229.11/~ysuzuki/Proceedingsall/RIAO2000/):
Short-circuiting information overload in documents - the HINTS.. - Burnett (2000)
Statistical Consistency of Keywords Dictionary Parameters - Martynenko (2000)
Lexical Cohesion, Discourse Segmentation and Document.. - Boguraev, Neff (2000)
