Results 1 - 10 of 56
The OCRopus Open Source OCR System
"... OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms curre ..."
Cited by 33 (8 self)
OCRopus is a new, open source OCR system emphasizing modularity, easy extensibility, and reuse, aimed at both the research community and large scale commercial document conversions. This paper describes the current status of the system, its general architecture, as well as the major algorithms currently being used for layout analysis and text line recognition.
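As a rough illustration of the modularity the abstract emphasizes, the sketch below models an OCR pipeline as interchangeable stages; the component names and interfaces are invented for illustration and are not the actual OCRopus API.

    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Page:
        image: object                                        # raw page image
        lines: List[object] = field(default_factory=list)    # extracted text-line images
        text: List[str] = field(default_factory=list)        # recognized text lines

    def run_pipeline(page: Page,
                     binarize: Callable[[object], object],
                     layout: Callable[[object], List[object]],
                     recognize: Callable[[object], str]) -> Page:
        # Each stage is an interchangeable callable, so alternative layout
        # analyzers or line recognizers can be swapped in without touching the rest.
        clean = binarize(page.image)                          # image cleanup / binarization
        page.lines = layout(clean)                            # layout analysis: image -> line images
        page.text = [recognize(line) for line in page.lines]  # per-line recognition
        return page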
Performance comparison of six algorithms for page segmentation
- in 7th IAPR Workshop on Document Analysis Systems, 2006
"... Abstract. This paper presents a quantitative comparison of six algorithms for page segmentation: X-Y cut, smearing, whitespace analysis, constrained text-line finding, Docstrum, and Voronoi-diagram-based. The evaluation is performed using a subset of the UW-III collection commonly used for evaluatio ..."
Cited by 32 (8 self)
Abstract. This paper presents a quantitative comparison of six algorithms for page segmentation: X-Y cut, smearing, whitespace analysis, constrained text-line finding, Docstrum, and Voronoi-diagram-based. The evaluation is performed using a subset of the UW-III collection commonly used for evaluation, with a separate training set for parameter optimization. We compare the results using both default parameters and optimized parameters. In the course of the evaluation, the strengths and weaknesses of each algorithm are analyzed, and it is shown that no single algorithm outperforms all other algorithms. However, we observe that the three best-performing algorithms are those based on constrained text-line finding, Docstrum, and the Voronoi-diagram.
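For concreteness, here is a minimal sketch of one of the six compared algorithms, the recursive X-Y cut: project the ink onto each axis, split at the widest whitespace gap, and recurse. The gap threshold and the block representation are illustrative choices, not the parameter settings evaluated in the paper.

    import numpy as np

    def widest_gap(profile, min_gap):
        # Return (start, end) of the widest all-zero run in a 1-D projection, or None.
        best, start = None, None
        for i, v in enumerate(profile):
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i)
                start = None
        if start is not None and (best is None or len(profile) - start > best[1] - best[0]):
            best = (start, len(profile))
        return best if best and best[1] - best[0] >= min_gap else None

    def xy_cut(ink, y0=0, x0=0, min_gap=10):
        # Recursively split a binary ink image (1 = ink) at its widest whitespace gap.
        if ink.size == 0 or not ink.any():
            return []
        h_gap = widest_gap(ink.sum(axis=1), min_gap)   # horizontal whitespace band
        v_gap = widest_gap(ink.sum(axis=0), min_gap)   # vertical whitespace band
        if h_gap:
            a, b = h_gap
            return xy_cut(ink[:a], y0, x0, min_gap) + xy_cut(ink[b:], y0 + b, x0, min_gap)
        if v_gap:
            a, b = v_gap
            return xy_cut(ink[:, :a], y0, x0, min_gap) + xy_cut(ink[:, b:], y0, x0 + b, min_gap)
        return [(y0, x0, ink.shape[0], ink.shape[1])]  # leaf block: (top, left, height, width)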
Performance Evaluation and Benchmarking of Six Page Segmentation Algorithms
2007
"... Informative benchmarks are crucial for optimizing the page segmentation step of an OCR system, frequently the performance limiting step for overall OCR system performance. We show that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify ..."
Cited by 30 (20 self)
Informative benchmarks are crucial for optimizing the page segmentation step of an OCR system, frequently the performance-limiting step for overall OCR system performance. We show that current evaluation scores are insufficient for diagnosing specific errors in page segmentation and fail to identify some classes of serious segmentation errors altogether. This paper introduces a vectorial score that is sensitive to, and identifies, the most important classes of segmentation errors (over-, under-, and miss-segmentation) and what page components (lines, blocks, etc.) are affected. Unlike previous schemes, our evaluation method has a canonical representation of ground truth data and guarantees pixel-accurate evaluation results for arbitrary region shapes. We present the results of evaluating widely used segmentation algorithms (x-y cut, smearing, whitespace analysis, constrained text-line finding, docstrum, and Voronoi) on the UW-III database and demonstrate that the new evaluation scheme permits the identification of several specific flaws in individual segmentation methods.
Index Terms: Document page segmentation, OCR, performance evaluation, performance metric
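A simplified sketch of the kinds of counts such a vectorial score aggregates is given below: from pixel-level region labelings for ground truth and a segmentation result, it tallies missed, split (over-segmented) and merged (under-segmented) regions. The overlap tolerance and the counting rules here are assumptions, not the paper's exact definition.

    import numpy as np

    def segmentation_errors(gt, seg, tol=0.05):
        # gt, seg: 2-D int arrays of region ids (0 = background / don't care).
        def significant_partners(a, b):
            # For each region id in `a`, the ids in `b` covering > tol of its pixels.
            partners = {}
            for rid in np.unique(a[a > 0]):
                mask = a == rid
                overlap = b[mask]
                ids, counts = np.unique(overlap[overlap > 0], return_counts=True)
                partners[rid] = set(ids[counts > tol * mask.sum()])
            return partners

        gt_part = significant_partners(gt, seg)    # GT region -> seg regions covering it
        seg_part = significant_partners(seg, gt)   # seg region -> GT regions under it

        missed = sum(1 for p in gt_part.values() if not p)        # no hypothesis found
        over   = sum(1 for p in gt_part.values() if len(p) > 1)   # GT region split
        under  = sum(1 for p in seg_part.values() if len(p) > 1)  # GT regions merged
        return {"missed": missed, "over": over, "under": under}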
Pixel-accurate representation and evaluation of page segmentation in document images
- in 18th Int. Conf. on Pattern Recognition, 2006
"... This paper presents a new representation and evaluation procedure of page segmentation algorithms and analyzes six widely-used layout analysis algorithms using the procedure. The method permits a detailed analysis of the behavior of page segmentation algorithms in terms of over- and undersegmentatio ..."
Cited by 23 (13 self)
This paper presents a new representation and evaluation procedure for page segmentation algorithms and analyzes six widely-used layout analysis algorithms using the procedure. The method permits a detailed analysis of the behavior of page segmentation algorithms in terms of over- and undersegmentation at different layout levels, as well as determination of the geometric accuracy of the segmentation. The representation of document layouts relies on labeling each pixel according to its function in the overall segmentation, permitting pixel-accurate representation of layout information of arbitrary layouts and allowing background pixels to be classified as "don't care". Our representations can be encoded easily in standard color image formats like PNG, permitting easy interchange of segmentation results and ground truth.
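The pixel-labeling idea lends itself to a compact interchange format. The sketch below packs each pixel's region id into the RGB channels of a PNG and reads it back; the channel layout and the reserved "don't care" value are assumptions for illustration, not the paper's exact encoding.

    import numpy as np
    from PIL import Image

    DONT_CARE = 0xFFFFFF   # assumed convention: background / "don't care" pixels carry this id

    def labels_to_png(labels, path):
        # labels: 2-D int array of region ids, one id per pixel.
        rgb = np.zeros(labels.shape + (3,), dtype=np.uint8)
        rgb[..., 0] = (labels >> 16) & 0xFF
        rgb[..., 1] = (labels >> 8) & 0xFF
        rgb[..., 2] = labels & 0xFF
        Image.fromarray(rgb, "RGB").save(path)

    def png_to_labels(path):
        # Recover the per-pixel region ids from the color image.
        rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.uint32)
        return (rgb[..., 0] << 16) | (rgb[..., 1] << 8) | rgb[..., 2]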
Logical Structure Recovery in Scholarly Articles with Rich Document Features
"... Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation and summarization. We describe SectLabel, a module that further develops e ..."
Cited by 15 (1 self)
Scholarly digital libraries increasingly provide analytics to information within documents themselves. This includes information about the logical document structure of use to downstream components, such as search, navigation and summarization. We describe SectLabel, a module that further develops existing software to detect the logical structure of a document from existing PDF files, using the formalism of conditional random fields. While previous work has assumed access only to the raw text representation of the document, a key aspect of our work is to integrate the use of a richer representation of the document that includes features from optical character recognition (OCR), such as font size and text position. Our experiments reveal that using such rich features improves logical structure detection by a significant 9 F1 points, over a suitable baseline, motivating the use of richer document representations in other digital library applications.
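To make the "rich features" point concrete, a minimal sketch follows of per-line feature dictionaries that mix textual cues with OCR-derived layout cues (font size, vertical position), as inputs to a linear-chain CRF. The field names, feature set, and the sklearn_crfsuite toolkit are illustrative assumptions, not SectLabel's actual code.

    import sklearn_crfsuite

    def line_features(line, page_height):
        # line: dict with 'text', 'font_size', 'y' (assumed fields from an OCR engine).
        text = line["text"]
        return {
            "lower": text.lower()[:20],
            "is_upper": text.isupper(),
            "starts_digit": text[:1].isdigit(),
            "font_size": line["font_size"],              # rich OCR-derived feature
            "rel_y": round(line["y"] / page_height, 2),  # vertical position on the page
        }

    def page_to_features(lines, page_height):
        return [line_features(l, page_height) for l in lines]

    # X: list of pages (each a list of feature dicts); y: list of label sequences
    # (e.g. "title", "section-header", "body-text", "reference").
    # crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    # crf.fit(X, y); predicted = crf.predict(X)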
Automatic categorization of figures in scientific documents
- in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, 2006
"... Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other ..."
Cited by 12 (5 self)
Figures are very important non-textual information contained in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functionalities in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are utilized in the architecture for discriminating among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific-document digital library.
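As a purely illustrative stand-in for the global texture and line-part features mentioned above, the sketch below computes a few gradient-based statistics per figure image and feeds them to an off-the-shelf classifier; the specific features and thresholds are invented, not the paper's feature set.

    import numpy as np
    from sklearn.svm import SVC

    def figure_features(img):
        # img: 2-D grayscale array with values in [0, 1].
        gy, gx = np.gradient(img.astype(float))
        edge = np.hypot(gx, gy)
        return np.array([
            img.mean(), img.std(),        # global intensity statistics
            edge.mean(),                  # overall edge density (crude "texture")
            (np.abs(gx) > 0.2).mean(),    # share of vertical edges (line-like parts)
            (np.abs(gy) > 0.2).mean(),    # share of horizontal edges
        ])

    # clf = SVC(kernel="rbf").fit(np.stack([figure_features(i) for i in train_imgs]),
    #                             train_labels)   # labels: e.g. diagram, plot, photo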
Learning Non-Generative Grammatical Models for Document Analysis
"... We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to di ..."
Cited by 12 (0 self)
We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX.
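A toy sketch of the core mechanism, a global search for the minimum-cost parse of a page under a grammar with externally supplied costs, is given below; in the paper the costs are set discriminatively by machine learning, whereas the grammar, labels, and cost values here are made up for illustration.

    from functools import lru_cache

    # Binary rules: parent -> (left child, right child, rule cost).
    RULES = {
        "Page": [("Header", "Body", 0.5)],
        "Body": [("TextBlock", "Body", 0.2), ("TextBlock", "TextBlock", 0.2)],
    }

    def parse_cost(terminal_costs, symbol="Page"):
        # terminal_costs[i][label] = cost of giving region i that terminal label
        # (in practice these would come from a learned classifier).
        n = len(terminal_costs)

        @lru_cache(maxsize=None)
        def best(sym, i, j):
            # Minimum cost of deriving the span of regions i..j-1 from symbol sym.
            if j - i == 1:
                return terminal_costs[i].get(sym, float("inf"))
            cost = float("inf")
            for left, right, rule_cost in RULES.get(sym, []):
                for k in range(i + 1, j):
                    cost = min(cost, rule_cost + best(left, i, k) + best(right, k, j))
            return cost

        return best(symbol, 0, n)

    # Example: three regions scored by a hypothetical region classifier.
    costs = [{"Header": 0.1, "TextBlock": 2.0},
             {"Header": 3.0, "TextBlock": 0.2},
             {"Header": 2.5, "TextBlock": 0.3}]
    print(parse_cost(costs))   # cost of the best "Page" parse of the sequence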
A statistical learning approach to document image analysis
- in 8th International Conference on Document Analysis and Recognition, 2005
"... In the field of computer analysis of document images, the problems of physical and logical layout analysis have been approached through a variety of heuristic, rule-based, and grammar-based techniques. In this paper we investigate the effectiveness of statistical pattern recognition algorithms for s ..."
Cited by 10 (1 self)
In the field of computer analysis of document images, the problems of physical and logical layout analysis have been approached through a variety of heuristic, rule-based, and grammar-based techniques. In this paper we investigate the effectiveness of statistical pattern recognition algorithms for solving these two problems, and report results suggesting that these more complex and powerful techniques are worth pursuing. First, we developed a new software environment for manual page image segmentation and labeling, and used it to create a dataset containing 932 page images from academic journals. Next, a physical layout analysis algorithm based on a logistic regression classifier was developed, and found to outperform existing algorithms of comparable complexity. Finally, three statistical classifiers were applied to the logical layout analysis problem, also with encouraging results.
1. Background. Document image understanding is the process of automatically extracting useful information from page images. The problem can be broken down into a series of subproblems, with each working from the results of the previous. Two key steps in such a series are segmentation (or page physical structure analysis), in which regions of ink are identified, and labeling (or page logical structure analysis), in which these regions are assigned meaningful labels. Many existing algorithms for segmentation and labeling involve heuristic or rule-based approaches [4], sometimes enhanced by decision trees [1] or page grammars [2]. While these approaches have met with some success, they do not always generalize well to documents outside the development set. In addition, the complexity of these techniques makes them difficult to replicate, making quantitative performance comparisons nearly impossible. In response to these difficulties, Song Mao et al. have called for an approach to these problems based on formal models [4]. They note specific advantages including the ability to estimate parameter values for the model from training data, and the possibility of selecting a model of appropriate
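A minimal sketch in the spirit of the physical layout step described above: a handful of geometric features per segmented region and a logistic-regression classifier. The region fields and feature choices are assumptions for illustration, not the paper's exact feature set.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def region_features(region):
        # region: dict with bounding box and ink statistics (assumed to be available
        # from a prior segmentation step).
        w, h = region["width"], region["height"]
        return [w, h, w / max(h, 1),       # size and aspect ratio
                region["ink_density"],     # fraction of black pixels in the region
                region["x"], region["y"]]  # position on the page

    # X = np.array([region_features(r) for r in train_regions])
    # y = np.array(train_labels)           # e.g. "text", "figure", "table"
    # model = LogisticRegression(max_iter=1000).fit(X, y)
    # predicted = model.predict(np.array([region_features(r) for r in test_regions]))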
Document Logical Structure Analysis Based on Perceptive Cycles
- in 7th IAPR Workshop on Document Analysis Systems (DAS 2006), 2006
"... Abstract. This paper describes a Neural Network (NN) approach for logical document structure extraction. In this NN architecture, called Transparent Neural Network (TNN), the document structure is stretched along the layers, allowing an interpretation decomposition from physical (NN input) to logica ..."
Cited by 7 (3 self)
Abstract. This paper describes a Neural Network (NN) approach for logical document structure extraction. In this NN architecture, called the Transparent Neural Network (TNN), the document structure is stretched along the layers, allowing an interpretation decomposition from the physical (NN input) to the logical (NN output) level. The intermediate layers represent successive interpretation steps. Each neuron is apparent and is associated with a logical element. Recognition proceeds by repetitive perceptive cycles that propagate information through the layers. In case of a low recognition rate, an enhancement is achieved by error backpropagation, leading the system to correct its result or pick a better-adapted input feature subset. Several feature subsets are created using a modified filter method. The first experiments, performed on scientific documents, are encouraging.
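The following is only a loose, hypothetical illustration of the perceptive-cycle idea: propagate evidence through layers whose units stand for layout and logical elements, and if the final confidence is low, adjust the input feature subset and try again. The weights, layer sizes, and the re-selection rule are invented and do not reproduce the TNN described in the paper.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def perceptive_cycles(x, weights, threshold=0.6, max_cycles=3):
        # x: input feature vector; weights: list of matrices, one per layer
        # (physical evidence at the bottom, logical labels at the top).
        active = np.ones_like(x, dtype=bool)          # current input feature subset
        for _ in range(max_cycles):
            h = np.where(active, x, 0.0)
            for W in weights:                         # propagate through the layers
                h = softmax(W @ h)
            if h.max() >= threshold:                  # confident logical labeling
                return h
            # Low confidence: drop the weakest active feature and cycle again.
            weakest = np.argmin(np.where(active, np.abs(x), np.inf))
            active[weakest] = False
        return h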
Layout analysis for Arabic historical document images using machine learning
- in Proceedings of the International Conference on Frontiers in Handwriting Recognition, 2012
"... Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page mar-gins (a.k.a side-notes text) from manuscripts with com-plex layout format. Simple and discriminative features are extracted in a connected-componen ..."
Cited by 6 (3 self)
Page layout analysis is a fundamental step of any document image understanding system. We introduce an approach that segments text appearing in page margins (a.k.a. side-note text) from manuscripts with complex layout formats. Simple and discriminative features are extracted at the connected-component level, and robust feature vectors are subsequently generated. A multilayer perceptron classifier is exploited to classify connected components into the relevant class of text. A voting scheme is then applied to refine the resulting segmentation and produce the final classification. In contrast to state-of-the-art segmentation approaches, this method is independent of block segmentation as well as pixel-level analysis. The proposed method has been trained and tested on a dataset that contains a variety of complex side-note layout formats, achieving a segmentation accuracy of about 95%.
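A sketch of the pipeline shape described above, with assumed feature choices: connected-component labeling, a small feature vector per component, an MLP classifier, and a nearest-neighbor voting pass to smooth the labels. This is not the authors' implementation, and the reported 95% figure does not apply to it.

    import numpy as np
    from scipy import ndimage
    from sklearn.neural_network import MLPClassifier

    def component_features(binary_page):
        # binary_page: 2-D bool array (True = ink). Returns per-component features and centroids.
        labeled, n = ndimage.label(binary_page)
        feats, centers = [], []
        for sl in ndimage.find_objects(labeled):
            h, w = sl[0].stop - sl[0].start, sl[1].stop - sl[1].start
            cy, cx = (sl[0].start + sl[0].stop) / 2, (sl[1].start + sl[1].stop) / 2
            feats.append([h, w, h * w, w / max(h, 1), cx, cy])
            centers.append((cy, cx))
        return np.array(feats), np.array(centers)

    def vote(pred, centers, k=5):
        # Replace each component's label by the majority label of its k nearest components.
        out = pred.copy()
        for i, c in enumerate(centers):
            d = np.linalg.norm(centers - c, axis=1)
            nearest = np.argsort(d)[:k]
            vals, counts = np.unique(pred[nearest], return_counts=True)
            out[i] = vals[np.argmax(counts)]
        return out

    # clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_train, y_train)
    # pred = vote(clf.predict(X_test), centers_test)   # e.g. "main text" vs "side note"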