Zone Content Classification And Its Performance Evaluation
- Pattern Recognition, 2001
Cited by 14 (1 self)
This paper presents an improved zone content classification method and its performance evaluation. We added two new features to the feature vector of a previously published method [1]. We assumed a different independence relationship in each of two zone sets: in one set we used an optimized binary decision tree to estimate the maximum zone content class probability, and in the other we used the Viterbi algorithm to find the optimal solution for a zone sequence. The training, pruning and testing data sets for the algorithm include images drawn from the UWCDROM III document image database. The classifier assigns each given scientific and technical document zone to one of nine classes: text classes (of font size pt and font size pt), math, table, halftone, map/drawing, ruling, logo, and others. Compared with our previous work [2], it raised the accuracy rate to and reduced the mean false alarm rate to .
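The abstract reports an accuracy rate and a mean false alarm rate but does not define them here. As a minimal sketch, assuming accuracy is the fraction of correctly labeled zones and the false alarm rate of a class is the fraction of zones outside that class that were labeled as it, both can be computed from a confusion matrix. The function name and the matrix layout are illustrative assumptions, not the paper's evaluation protocol.

```python
def accuracy_and_mean_false_alarm(confusion):
    """Compute overall accuracy and the class-averaged false alarm rate.

    confusion[i][j] is the number of zones whose true class is i and
    whose predicted class is j (an assumed layout, not the paper's).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    rates = []
    for c in range(n):
        predicted_c = sum(confusion[i][c] for i in range(n))
        false_pos = predicted_c - confusion[c][c]   # labeled c, but not c
        negatives = total - sum(confusion[c])       # zones whose true class is not c
        if negatives > 0:
            rates.append(false_pos / negatives)

    mean_far = sum(rates) / len(rates) if rates else 0.0
    return correct / total, mean_far


# Toy two-class example (made-up counts, not results from the paper).
acc, far = accuracy_and_mean_false_alarm([[8, 2], [1, 9]])
```

Averaging the per-class rates gives every class equal weight, so rare classes such as logo or ruling count as much as text; a zone-weighted average would be a different design choice.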
Design of an End-to-End Method to Extract Information From Tables
- International Journal Document Analysis Research
Cited by 14 (1 self)
This paper plans an end-to-end method for extracting information from tables embedded in documents; the input format is ASCII, to which any richer format can be converted while preserving all textual and much of the layout information. We start by defining what a table is. Then we describe the steps involved in extracting information from tables and analyse table-related research in order to place the contributions of different authors, find the paths research is following, and identify issues that are still unsolved. We then analyse current approaches to evaluating table processing algorithms and propose two new metrics for the task of segmenting cells/columns/rows. We proceed to design our own end-to-end method, in which there is more interaction between the different steps; we indicate how back loops in the usual order of the steps can reduce the possibility of errors and contribute to solving previously unsolved problems. Finally, we explore how the actual interpretation of the table not only allows the accuracy of the overall extraction process to be inferred but also helps improve its quality. To do so, we believe interpretation has to consider context-specific knowledge; we explore how this knowledge can be added in a plug-in/out manner, so that the overall method remains operable in different contexts.
Document image zone classification - a simple high-performance approach
- in 2nd Int. Conf. on Computer Vision Theory and Applications, 2007
Cited by 4 (3 self)
We describe a simple, fast, and accurate system for document image zone classification, an important subproblem of document image analysis, that results from a detailed analysis of different features. Using a novel combination of known algorithms, we achieve a very competitive error rate of 1.46% (n = 13811), compared with Wang et al. (2006), who report an error rate of 1.55% (n = 24177) using more complicated techniques. The experiments were performed on zones extracted from the widely used UW-III database, which is representative of images of scanned journal pages and contains ground-truthed real-world data.
A study on the document zone content classification problem
- In Fifth IAPR International Workshop on Document Analysis Systems, 2002
Cited by 2 (0 self)
A document can be divided into zones on the basis of its content. For example, a zone can be either text or non-text. Given the segmented document zones, correctly determining the zone content type is very important for the subsequent processes within any document image understanding system. This paper describes an algorithm for determining the type of a given zone within an input document image. In our zone classification algorithm, zones are represented as feature vectors. Each feature vector consists of a set of 25 measurements of pre-defined properties. A probabilistic model, a decision tree, is used to classify each zone on the basis of its feature vector. Two methods are used to optimize the decision tree classifier and eliminate the data over-fitting problem. To enrich our probabilistic model, we incorporate context constraints for certain zones within their neighboring zones. We also model zone class context constraints as a Hidden Markov Model and use the Viterbi algorithm to obtain optimal classification results. The training, pruning and testing data sets for the algorithm include 1,600 images drawn from the UWCDROM-III document image database. With a total of 24,177 zones in the data set, cross-validation was used in the performance evaluation of the classifier. The classifier assigns each given scientific and technical document zone to one of nine classes: 2 text classes (of font size 4-18 pt and font size 19-32 pt), math, table, halftone, map/drawing, ruling, logo, and others. A zone content classification performance evaluation protocol is proposed. Using this protocol, our algorithm's accuracy is 98.45% with a mean false alarm rate of 0.50%.
Table Detection via Probability Optimization
- in Proceedings of Document Analysis Systems (DAS’02), 2002
This paper presents a table detection algorithm that uses an optimization method. We define the table detection problem within the whole page segmentation framework. To reach a good table detection result, we emphasize optimizing the probabilities of the table region, its neighboring text block and their separator. An iterative updating method is used to optimize the whole page segmentation probability. The training and testing data sets for the algorithm include document pages having in table entities and a total of cell entities. Compared with our previous work [12], it raised the accuracy rate to 461-3 from 9-361 and to 13-36 from 67-36 .

