Spam Filtering using Character-level Markov Models: Experiments for the TREC 2005 Spam Track (2005) [6 citations — 3 self]
Abstract:
This paper summarizes our participation in the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. We experimented with two different compression algorithms under varying model parameters. All four filters that we submitted exhibited strong performance in the official evaluation, indicating that data compression models are well suited to the spam filtering problem.
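As a rough illustration of how a character-level model can act as a classifier, the following minimal sketch trains one fixed-order Markov model per class and assigns a message to the class whose model encodes it in fewer bits. It is not the authors' method: the abstract does not name the two compression algorithms here, and their models are adaptive, whereas this sketch uses a static order-k model with add-one smoothing and assumes equal class priors. The names `CharMarkovModel`, `code_length`, and `classify` are hypothetical.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """Static order-k character model with add-one smoothing.

    An assumed stand-in for the adaptive compression models in the abstract.
    """

    def __init__(self, order=2, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed symbol alphabet size
        self.counts = defaultdict(lambda: defaultdict(int))  # context -> char -> count
        self.totals = defaultdict(int)  # context -> number of observations

    def train(self, text):
        # Pad with spaces so the first characters also have a full context.
        padded = " " * self.order + text
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            self.counts[context][padded[i]] += 1
            self.totals[context] += 1

    def code_length(self, text):
        """Return the negative log2-likelihood of text, i.e. its cost in bits."""
        padded = " " * self.order + text
        bits = 0.0
        for i in range(self.order, len(padded)):
            context = padded[i - self.order:i]
            count = self.counts.get(context, {}).get(padded[i], 0)
            total = self.totals.get(context, 0)
            prob = (count + 1) / (total + self.alphabet_size)
            bits -= math.log2(prob)
        return bits


def classify(message, spam_model, ham_model):
    # With equal priors, the shorter code length is the more probable class.
    if spam_model.code_length(message) < ham_model.code_length(message):
        return "spam"
    return "ham"
```

A model is trained per class with `train`, after which `classify(message, spam_model, ham_model)` returns a label. Unequal priors could be handled by adding the negative log2 of each prior to the corresponding code length.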
Citations
| 251 | Data Compression Using Adaptive Coding and Partial String Matching – Cleary, Witten - 1984 |
| 44 | The Design and Analysis of Efficient Lossless Data Compression Systems – Howard - 1993 |
| 15 | Text categorization using compression models – Frank, Chui, et al. - 2000 |
| 14 | Chung-Kwei: a Pattern-discovery-based System for the Automatic Identification of Unsolicited Email Messages (SPAM) – Rigoutsos, Huynh - 2004 |
| 11 | Spam corpus creation for TREC – Cormack, Lynam - 2005 |
| 11 | Text classification and segmentation using minimum cross-entropy – Teahan - 2000 |
| 8 | Spam filtering using compression models – Bratko, Filipic - 2005 |
| 8 | A study of supervised spam detection applied to eight months of personal email – Cormack, Lynam - 2004 |
| 4 | Context tree weighting: Multi-alphabet sources – Tjalkens, Shtarkov, et al. - 1993 |
| 3 | Introduction to Data Compression, Chapter 6.2.4 – Sayood - 2000 |

