Results 1 - 6 of 6
Background Variability Modeling for Statistical Layout Analysis
Geometric layout analysis plays an important role in document image understanding. Many algorithms known in the literature work well on standard document images, achieving high text line segmentation accuracy on the UW-III dataset. These algorithms rely on certain assumptions about document layouts and fail when those assumptions are not met. They also do not provide confidence scores for their output. These two problems limit the usefulness of general-purpose layout analysis methods in large-scale applications. In this contribution, we propose a statistically motivated, model-based, trainable layout analysis system that allows assumption-free adaptation to different layout types and produces likelihood estimates of the correctness of the computed page segmentation. The approach is tested on a subset of the Google 1000 books dataset, where it achieved a text line segmentation accuracy of 98.4% on layouts where other general-purpose algorithms failed to produce a correct segmentation.
A Realistic Dataset for Performance Evaluation of Document Layout Analysis
- 2009 10th International Conference on Document Analysis and Recognition
There is a significant need for a realistic dataset on which to evaluate layout analysis methods and examine their performance in detail. This paper presents a new dataset (and the methodology used to create it) based on a wide range of contemporary documents. Strong emphasis is placed on comprehensive and detailed representation of both complex and simple layouts, and on colour originals. In-depth information is recorded both at the page and region level. Ground truth is efficiently created using a new semi-automated tool and stored in a new comprehensive XML representation, the PAGE format. The dataset can be browsed and searched via a web-based front end to the underlying database and suitable subsets (relevant to specific evaluation goals) can be selected and downloaded.
Picture Detection in Document Page Images
We present a method for picture detection in document page images, which can come from scanned or camera images, or rendered from electronic file formats. Our method uses OCR to separate out the text and applies the Normalized Cuts algorithm to cluster the non-text pixels into picture regions. A refinement step uses the captions found in the OCR text to deduce how many pictures are in a picture region, thereby correcting for under- and over-segmentation. A performance evaluation scheme is applied which takes into account the detection quality and fragmentation quality. We benchmark our method against the ABBYY application on page images from conference papers.
Coupled Snakelet Model for Curled Textline Segmentation of Camera-Captured Document Images
- 2009 10th International Conference on Document Analysis and Recognition
Detection of curled textlines is important for dewarping hand-held camera-captured document images. Baselines and the lines following the top of the x-height of characters (x-lines) are then estimated for dewarping. Existing curled textline segmentation approaches are sensitive to outlier points and perspective distortions. Furthermore, these approaches use regression over the top and bottom points of a segmented textline to estimate its x-line and baseline separately, which may result in inaccurate estimation. Here we propose a novel curled textline segmentation approach based on active contours (snakes), in which we perform segmentation by estimating pairs of x-lines and baselines, solving both problems together. Starting from a connected component, we jointly trace a pair of x-line and baseline using coupled snakes and external energies of neighboring top-bottom points. We grow the neighborhood region iteratively during tracing, which results in robustness to perspective distortions, and maintain the natural property of similar distance within each x-line and baseline pair, which results in robustness to outlier points. We achieved a one-to-one match-score accuracy of 90.76% for curled textline segmentation on the CBDAR 2007 Document Image Dewarping Contest dataset, with good estimation of x-line and baseline pairs.
Recognition and Retrieval of Mathematical Expressions
- INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION
Document recognition and retrieval technologies complement one another, providing improved access to increasingly large document collections. While recognition and retrieval of textual information is fairly mature, with wide-spread availability of Optical Character Recognition (OCR) and text-based search engines, recognition and retrieval of graphics such as images, figures, tables, diagrams, and mathematical expressions are in comparatively early stages of research. This paper surveys the state of the art in recognition and retrieval of mathematical expressions, organized around four key problems in math retrieval (query construction, normalization, indexing, and relevance feedback), and four key problems in math recognition (detecting expressions, detecting and classifying symbols, analyzing symbol layout, and constructing a representation of meaning). Of special interest is the machine learning problem of jointly optimizing the component algorithms in a math recognition system, and developing effective indexing, retrieval and relevance feedback algorithms for math retrieval. Another important open problem is developing user interfaces that seamlessly integrate recognition and retrieval. Activity in these important research areas is increasing, in part because math notation provides an excellent domain for studying problems common to many document and graphics recognition and retrieval applications, and also because mature applications will likely provide substantial benefits for education, research, and mathematical literacy.
Cairo University
Text and non-text segmentation and text line extraction from document images are the most challenging problems in information indexing of Arabic document images such as books, technical articles, business letters and faxes, and they must be solved in order to process these documents successfully in systems such as OCR. Research on the Arabic language related to document digitization has focused on word and handwriting recognition. Few approaches have been proposed for layout analysis of scanned or camera-captured Arabic documents. In this paper we present a page segmentation method that deals with the complexity of Arabic language characteristics and fonts by combining two algorithms. The first is run-length smoothing. The second is connected component labeling, with text and non-text classification using an SVM. The two are combined through AND and OR operations between their outputs, applied under certain conditions. A dynamic horizontal projection is then applied, in which the threshold is updated dynamically to match the noise associated with different documents and the noise between text lines. Performance is evaluated using manually generated ground truth representations from a dataset of Arabic document images captured using cameras and hardware built for this purpose. Experimental results demonstrate that the proposed text extraction method is independent of document size, text size, font, and shape, and is robust for Arabic document segmentation and text line extraction.
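Two of the building blocks named in this abstract, run-length smoothing and a horizontal projection, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy image, the fixed `max_gap` value, and the fixed `min_ink` threshold are all assumptions made for illustration, and the abstract's dynamic threshold update, SVM classification, and AND/OR combination are not reproduced here.

```python
# Illustrative sketch only. Images are lists of rows; 1 = ink, 0 = background.

def smooth_row(row, max_gap):
    """Fill background runs of length <= max_gap that lie between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply horizontal run-length smoothing to every row of a binary image."""
    return [smooth_row(row, max_gap) for row in image]


def text_line_bands(image, min_ink=1):
    """Return (top, bottom) row ranges whose ink count per row is >= min_ink.

    min_ink is a fixed threshold here (an assumption); the abstract describes
    a threshold that is updated dynamically, which this sketch does not model.
    """
    profile = [sum(row) for row in image]  # horizontal projection profile
    bands, start = [], None
    for y, count in enumerate(profile):
        if count >= min_ink and start is None:
            start = y
        elif count < min_ink and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


# Hypothetical 4 x 9 toy page with two text lines separated by a blank row.
page = [
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0, 0, 0, 0],
]
smoothed = rlsa_horizontal(page, max_gap=2)
bands = text_line_bands(smoothed)
```

In this toy example, gaps of up to two background pixels are merged into their neighboring ink, while the four-pixel gap in the first two rows is left open. The projection then splits the page at the blank row.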

