Quality Assurance in High Volume Document Digitization: A Survey
, 2006
Cited by 2 (0 self)
Keywords: quality assurance, document image analysis, OCR, digital library
Quality assurance (QA) plays a critical role in high volume document digitization projects by making sure that the specified quality standard is reached under cost and time constraints. This paper takes a systematic view of this issue by summarizing and abstracting related existing work: quality bottlenecks and technical solutions throughout the whole processing pipeline, including cataloging, capture, image analysis and recognition, and error cascading; and various strategies for conducting cost-effective QA, such as combining auto-QA and manual QA, batch QA, a special QA user interface, and open source QA.
Mother Fugger: Mining Historical Manuscripts with Local Color Patches
Cited by 1 (1 self)
already archived more than ten million books in digital format, and within the next decade the majority of the world’s books will be online. Although most of the data will naturally be text, there will also be tens of millions of pages of images, many in color. While there is an active research community pursuing data mining of text from historical manuscripts, there has been very little work that exploits the rich color information which is often present. In this work we introduce a simple color
Interactive degraded document enhancement and ground truth generation
Degraded documents arise in many situations. Examples of degraded document collections include historical document depositories, documents obtained in legal and security investigations, and legal and medical archives. Degraded document images are hard to read and hard to analyze using computerized techniques. There is hence a need for systems that are capable of enhancing such images. We describe a language-independent semi-automated system for enhancing degraded document images that is capable of exploiting inter- and intra-document coherence. The system can process document images with high levels of degradation and can be used for ground truthing of degraded document images. Ground truthing of degraded document images is extremely important in several respects: it enables quantitative performance measurements of enhancement systems and facilitates model estimation that can be used to improve performance. Performance evaluation is provided using the historical Frieder diaries collection.
Ornamental Letters Image Classification Using Local Dissimilarity Maps
- 10th International Conference on Document Analysis and Recognition, 2009
This article describes a new method for segmenting and recognizing ornamental letters in ancient books. The purpose of our work is to automatically determine the letter represented in an ornamental letter image. Our process is divided into two parts: a segmentation step followed by a recognition step. The segmentation process uses multiresolution analysis to filter out background decorations, followed by a binarisation step and a morphological reconstruction of the expected letter. The recognition process takes the previously obtained reconstruction and compares it, using the Local Dissimilarity Map (LDM) distance, with capital letter images that serve as a dictionary of shapes.
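The recognition step can be illustrated with a minimal sketch. It assumes one common formulation of the local dissimilarity map for binary images, in which a pixel where the two images disagree is scored by its distance to the nearest foreground pixel of the other image, and pixels where they agree score zero. The abstract does not give the exact variant, so the brute-force distance computation, the summed score `ldm_score`, and the helper `recognise` are illustrative assumptions rather than the authors' implementation.

```python
import math

def distance_to_shape(mask):
    """Distance from every pixel to the nearest foreground (truthy) pixel."""
    points = [(r, c) for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]

    def nearest(r, c):
        if not points:
            return math.inf  # empty shape: nothing to measure against
        return min(math.hypot(r - pr, c - pc) for pr, pc in points)

    return [[nearest(r, c) for c in range(len(mask[0]))]
            for r in range(len(mask))]

def local_dissimilarity_map(a, b):
    """Per-pixel dissimilarity between two binary images of equal size."""
    da, db = distance_to_shape(a), distance_to_shape(b)
    return [[max(da[r][c], db[r][c]) if bool(a[r][c]) != bool(b[r][c]) else 0.0
             for c in range(len(a[0]))]
            for r in range(len(a))]

def ldm_score(a, b):
    """Assumed scalar distance: the sum of the map's values."""
    return sum(sum(row) for row in local_dissimilarity_map(a, b))

def recognise(letter, dictionary):
    """Return the dictionary key whose shape is closest to the letter."""
    return min(dictionary, key=lambda name: ldm_score(letter, dictionary[name]))

# Hypothetical 3x3 dictionary shapes and a noisy query.
SHAPES = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}
QUERY = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
```

Calling `recognise(QUERY, SHAPES)` picks the shape with the smallest summed dissimilarity. A real system would replace the brute-force search with a proper distance transform.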
Spatial and Spectral Based Segmentation of Text in Multispectral Images of Ancient Documents
- 10th International Conference on Document Analysis and Recognition, 2009
In this paper we propose a character segmentation method for multispectral images of ancient documents. Because of the low quality of the images, the main idea of this study is to combine the multispectral behavior with contextual spatial information. We therefore use a Markov Random Field model that combines the spectral information of the images with stroke properties to capture the spatial dependencies of the characters. Since the stroke properties and the Gaussian parameters for the imaging model are evaluated automatically, the proposed segmentation method requires no training phase. We compared the method to state-of-the-art character segmentation methods and demonstrate the effectiveness of combining spectral and spatial features for the segmentation of characters in multispectral images.
Enhanced Text Extraction from Arabic Degraded Document Images using EM Algorithm
- 10th International Conference on Document Analysis and Recognition, 2009
This paper presents a new enhanced algorithm for extracting text from degraded document images on the basis of probabilistic models. The observed document image is modeled as a mixture of Gaussian densities that represent the foreground and background components of the document image. The EM algorithm is used to estimate and iteratively refine the parameters of the mixture densities. The initial parameters of the EM algorithm are estimated by the k-means clustering method. After the parameter estimation, the document image is partitioned into text and background classes by means of a maximum likelihood (ML) approach. The performance of the proposed approach is evaluated on a variety of degraded documents from the collections of the National Library of Tunisia.
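The pipeline in this abstract (k-means initialisation, EM refinement of a two-component Gaussian mixture, then maximum likelihood labelling) can be sketched on a flat list of gray levels. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the shared initial variance, the variance floor `min_var`, and the convention that the darker component is text are all hypothetical choices.

```python
import math

def kmeans_two_means(values, iters=20):
    """Two-cluster k-means on scalar intensities; returns [dark, bright] means."""
    lo, hi = float(min(values)), float(max(values))
    for _ in range(iters):
        dark = [v for v in values if abs(v - lo) <= abs(v - hi)]
        bright = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not dark or not bright:
            break
        lo, hi = sum(dark) / len(dark), sum(bright) / len(bright)
    return [lo, hi]

def gaussian(x, mean, var):
    """Density of a univariate Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(values, iters=50, min_var=1.0):
    """EM for a two-component Gaussian mixture, with means seeded by k-means."""
    n = len(values)
    means = kmeans_two_means(values)
    overall = sum(values) / n
    spread = max(sum((v - overall) ** 2 for v in values) / n, min_var)
    variances, weights = [spread, spread], [0.5, 0.5]  # assumed starting point
    for _ in range(iters):
        # E-step: responsibility of each component for each pixel.
        resp = []
        for v in values:
            p = [weights[k] * gaussian(v, means[k], variances[k]) for k in range(2)]
            total = sum(p) or 1e-300  # guard against numerical underflow
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            variances[k] = max(
                sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk,
                min_var,
            )
    return means, variances

def segment(values):
    """Label each pixel True (text) or False (background) by likelihood."""
    means, variances = fit_mixture(values)
    text = 0 if means[0] <= means[1] else 1  # assumption: ink is darker than paper
    back = 1 - text
    return [gaussian(v, means[text], variances[text]) >
            gaussian(v, means[back], variances[back]) for v in values]
```

A 2-D image would be flattened before calling `segment` and the labels reshaped afterwards. Degraded pages with strongly non-uniform backgrounds may need more than two components or a local model.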
Handwritten Text Image Compression for Indic Script
In this paper, a compression scheme is presented for handwritten text document images in Indian languages. Document image compression is an active area of research. Current OCR technology is not effective at handling handwritten text images. The proposed compression scheme deals with handwritten gray-level documents in Devnagri script. The method is based on separating the foreground and background of an image and on connected component labeling. Experiments are done with handwritten images in Devnagri (Hindi and Marathi). Compression schemes are available for printed text in Indian languages, but little work has been reported on compression standards for handwritten text images. The modules show a good compression ratio. Hence compression of handwritten text images in Indian languages is important.
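The two building blocks named in this abstract, foreground/background separation and connected component labeling, can be sketched as follows. The fixed threshold, the 8-connectivity, and the function name `label_components` are assumptions made for illustration. The abstract does not say how the authors separate foreground from background or how the labeled components are then encoded.

```python
from collections import deque

def label_components(gray, threshold=128):
    """Label 8-connected groups of dark pixels in a gray-level image.

    Pixels below `threshold` are treated as foreground (ink). Returns a
    label image (0 = background, 1..count = components) and the count.
    """
    rows, cols = len(gray), len(gray[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if gray[r][c] >= threshold or labels[r][c]:
                continue  # background, or already part of a component
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and gray[ny][nx] < threshold
                                and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

# Hypothetical 3x4 gray-level patch with two separate strokes.
PATCH = [
    [255, 0, 255, 255],
    [255, 0, 255, 0],
    [255, 255, 255, 0],
]
```

Calling `label_components(PATCH)` returns the label image and the number of strokes found. A compression scheme could then store each component's bounding box and pixels separately from the background, though that step is not described in the abstract.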

