Text categorization of low quality images (1995)
| Venue: | In Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval |
| Citations: | 52 - 2 self |
BibTeX
@INPROCEEDINGS{Ittner95textcategorization,
author = {David J. Ittner and David D. Lewis and David D. Ahn},
title = {Text categorization of low quality images},
booktitle = {Proceedings of SDAIR-95, 4th Annual Symposium on Document Analysis and Information Retrieval},
year = {1995},
pages = {301--315}
}
Abstract
Categorization of text images into content-oriented classes would be a useful capability in a variety of document handling systems. Many methods can be used to categorize texts once their words are known, but OCR can garble a large proportion of words, particularly when low quality images are used. Despite this, we show for one data set that fax quality images can be categorized with nearly the same accuracy as the original text. Further, the categorization system can be trained on noisy OCR output, without need for the true text of any image, or for editing of OCR output. The use of a vector space classifier and training method robust to large feature sets, combined with discarding of low frequency OCR output strings, are the key to our approach.







