Abstract: Several methods for classifying and segmenting text are described. These are based on ranking text sequences by their cross-entropy, calculated using a fixed-order character-based Markov model adapted from the PPM text compression algorithm. Experimental results show that the methods are a significant improvement over previously used methods in a number of areas. For example, text can be classified with a very high degree of accuracy by authorship, language, dialect and genre. Highly accurate...
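The ranking idea in the abstract can be illustrated with a minimal sketch: train one character model per class, then assign a text to the class whose model gives it the lowest cross-entropy (fewest bits per character). This is not the paper's PPM implementation. It assumes a plain fixed-order model with add-one smoothing in place of PPM's escape mechanism, and the names `CharMarkovModel`, `cross_entropy`, `classify` and the `alphabet_size` parameter are hypothetical.

```python
import math
from collections import Counter


class CharMarkovModel:
    """Fixed-order character model with add-one smoothing.

    Assumption: add-one smoothing stands in for PPM's escape/blending
    scheme, which the abstract does not spell out.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol count for smoothing
        self.context_counts = Counter()     # counts of each length-`order` context
        self.ngram_counts = Counter()       # counts of context + next character

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def cross_entropy(self, text):
        """Average number of bits per character needed to code `text`."""
        if not text:
            return 0.0
        padded = " " * self.order + text
        total_bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            seen = self.ngram_counts[context + padded[i]] + 1
            total = self.context_counts[context] + self.alphabet_size
            total_bits -= math.log2(seen / total)
        return total_bits / len(text)


def classify(text, models):
    """Return the label whose model yields the minimum cross-entropy."""
    return min(models, key=lambda label: models[label].cross_entropy(text))


if __name__ == "__main__":
    english = CharMarkovModel(order=2)
    english.train("the quick brown fox jumps over the lazy dog "
                  "and the cat sat on the mat " * 20)
    french = CharMarkovModel(order=2)
    french.train("le renard brun saute par dessus le chien paresseux "
                 "et le chat est sur le tapis " * 20)
    models = {"english": english, "french": french}
    print(classify("the cat sat on the mat", models))
```

The same scoring function could in principle rank candidate authors, dialects or genres, given one trained model per class. Segmentation would need an additional search over boundary positions, which this sketch does not attempt.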
Context of citations to this paper:
.... have also been cited as possible candidates for this problem [37] A compression based method, similar to ours, has been described in [36]. This paper describes the use of PPM based compression techniques for resolving the authorship attribution problem for two authors as...
Cited by:
Spam Filtering using Character-level Markov Models: - Experiments For The
Using Compression For Source Based Classification Of Text - Thaper (2001)
Similar documents (at the sentence level):
8.2%: Correcting English text using PPM models - Teahan, Inglis, Cleary, Holmes (1998)
6.4%: Unbounded Length Contexts for PPM - Cleary, Teahan, Witten (1995)
Active bibliography (related documents):
0.6: A Compression-based Algorithm for Chinese Word Segmentation - Teahan, Wen, McNab, Witten (2000)
0.3: Models of English text - Teahan, Cleary (1997)
0.3: Segmenting Documents by Stylistic Character - Graham, Hirst, Marthi (2004)
Similar documents based on text:
0.5: Using Literal and Grammatical Statistics for.. - Kukushkina.. (2002)
0.3: Combining PPM models using a text mining approach - Teahan, Harper (2001)
0.3: Using Markov Chains for Identification of Writers - Khmelev, Tweedie (2002)
Related documents from co-citation:
2: Text categorization using compression models - Frank, Chui et al. - 2000
2: Data compression using adaptive coding and partial string matching - Cleary, Witten - 1984
BibTeX entry:
W. J. Teahan. Text classification and segmentation using minimum cross-entropy. In International Conference on Content-Based Multimedia Information Access (RIAO), 2000. http://citeseer.ist.psu.edu/teahan00text.html
@inproceedings{ teahan00text,
author = "William J. Teahan",
title = "Text classification and segmentation using minimum cross-entropy",
booktitle = "Proceedings of {RIAO}-00, 6th International Conference ``Recherche d'Information Assistee par Ordinateur''",
address = "Paris, FR",
year = "2000",
url = "citeseer.ist.psu.edu/teahan00text.html" }
Citations (may not include all citations):
3972: Introduction to algorithms - Cormen, Leiserson et al. - 1990
1447: A mathematical theory of communication - Shannon - 1948
368: Text compression - Bell, Cleary et al. - 1990
337: Error bounds for convolutional codes and an asymptotically o.. - Viterbi - 1967
328: A maximum likelihood approach to continuous speech recogniti.. - Bahl, Jelinek et al. - 1983
274: Estimation of probabilities from sparse data for the languag.. - Katz - 1987
128: Self-organized language modeling for speech recognition - Jelinek - 1990
108: Prediction and entropy of printed English - Shannon - 1951
104: Techniques for automatically correcting words in text - Kukich - 1992
78: Frequency analysis of English usage: lexicon and grammar - Francis, Kučera - 1982
64: A tree-based statistical language model for natural language.. - Bahl, Brown et al. - 1989
61: Unbounded length contexts for PPM - Cleary, Teahan - 1997
49: An estimate of an upper bound for the entropy of English - Brown, Della et al. - 1992
39: Part-of-speech tagging with neural networks - Schmid - 1994
36: The computational analysis of English - Garside, Leech et al. - 1987
32: A comparison of event models for Naive Bayes text classificati.. - McCallum, Nigam - 1998
22: Applied Bayesian and classical inference: the case of the Fe.. - Mosteller, Wallace - 1984
20: Sequential coding algorithms: A survey and cost analysis - Anderson, Mohan - 1984
16: A spelling correction program based on a noisy channel model - Kernighan, Church et al. - 1990
15: USeg: A retargetable word segmentation procedure for informa.. - Ponte, Croft - 1996
15: Modelling English text - Teahan - 1998
14: The entropy of English using PPM-based models - Teahan, Cleary - 1996
14: Disambiguation of prepositional phrases in automatically lab.. - Boggess, Agarwal et al. - 1991
11: A compression-based algorithm for Chinese word segmentation - Teahan, Wen et al. - 2000
10: The design and analysis of efficient lossless data compression .. - Howard - 1997
8: Correcting English text using PPM models - Teahan, Inglis et al. - 1998
8: Improving text classification by shrinkage in a hierarchy of c.. - McCallum, Rosenfeld et al. - 1998
7: in ACM Transactions on Information Systems - Lewis, Hayes - 1994
5: The tagged LOB Corpus - Johansson, Atwell et al. - 1986
3: Error-driven learning of Chinese word segmentation - Hockenmaier, Brew - 1998
3: What Can We Do With Small Corpora - Juola - 1998
1: Statistical techniques for language recognition: an introduc.. - Ganeson, Sherman - 1993
1: Statistical identification of language - Dunning - 1994
1: Mining on-line text - Knight - 1999
1: Proceedings of the 33rd Annual Meeting of the ACL - Ristad, Thomas - 1995
Documents on the same site (http://133.23.229.11/~ysuzuki/Proceedingsall/RIAO2000/):
Short-circuiting information overload in documents - the HINTS.. - Burnett (2000)
Statistical Consistency of Keywords Dictionary Parameters - Martynenko (2000)
Lexical Cohesion, Discourse Segmentation and Document.. - Boguraev, Neff (2000)