Results 1 - 7 of 7
AIDAS: Incremental Logical Structure Discovery in PDF Documents
- In 6th International Conference on Document Analysis and Recognition (ICDAR), 2001
Cited by 16 (3 self)
We describe the approach AIDAS uses to extract the logical document structure from PDF documents. The approach is based on the idea that the layout structure contains cues about the logical structure and that the logical structure can be discovered incrementally.
Document Structure Analysis and Performance Evaluation
- 1999
Cited by 8 (0 self)
By Jisheng Liang; Chair of Supervisory Committee: Professor Robert M. Haralick, Electrical Engineering. The goal of document structure analysis is to find an optimal partition of the set of glyphs on a given document into a hierarchical tree structure, where entities within the hierarchy are associated with their physical properties and semantic labels. In this dissertation, we present a unified document structure extraction algorithm that is probability based, where the probabilities are estimated from an extensive training set of various kinds of measurements of distances between the terminal and non-terminal entities with which the algorithm works. The off-line probabilities estimated in training then drive all decisions in the on-line segmentation module. An iterative, relaxation-like method is used to find the partitioning solution that maximizes the joint probability. This approach can be uniformly applied to the cons...
Combining Visual Layout and Lexical Cohesion Features for Text Segmentation
- 2001
Cited by 4 (1 self)
We propose integrating features from lexical cohesion with elements from layout recognition to build a composite framework. We use supervised machine learning on this composite feature set to derive discourse structure on the topic level. We demonstrate a system based on this principle and use both an intrinsic evaluation and the task of genre classification to assess its performance. Introduction: A document structure tree can be defined as a data structure that allows navigation of a document by sections. These trees can be hierarchically organized, having subsections of sections, and may embed special items, such as figures, tables or hyperlinks. They may be used directly by an end user for document access, or indirectly through other applications. This paper describes a strategy to compute document structure using a framework that handles both rich, semi-structured documents with layout features and impoverished, text stream-like documents. Our system, the Comb...
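The document structure tree described in the abstract above can be pictured with a small sketch. It is not the paper's system: the `Section` class, its `kind` labels, and the sample outline are hypothetical names chosen only to show a hierarchy of sections that embeds special items and supports navigation.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Section:
    """One node of a hypothetical document structure tree."""
    title: str
    kind: str = "section"  # assumed labels: "section", "figure", "table", "hyperlink"
    children: List["Section"] = field(default_factory=list)

    def add(self, child: "Section") -> "Section":
        """Attach a subsection or embedded item and return it."""
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Section"]]:
        """Yield (depth, node) pairs in reading order (pre-order traversal)."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)

    def find(self, title: str) -> Optional["Section"]:
        """Return the first node with the given title, or None."""
        for _, node in self.walk():
            if node.title == title:
                return node
        return None


# Build a tiny illustrative outline (invented sample data).
doc = Section("Report", kind="document")
intro = doc.add(Section("Introduction"))
intro.add(Section("Figure 1", kind="figure"))
methods = doc.add(Section("Methods"))
methods.add(Section("Features"))

outline = [(depth, node.title) for depth, node in doc.walk()]
```

Here `outline` lists every node with its nesting depth, which is the kind of section-by-section navigation the abstract refers to.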
Malerba D.: Mining spatial association rules from document layout structures
- In: Proc. of the 3rd Workshop on Document Layout Interpretation and its Application (DLIA 2003), 2003
Cited by 3 (1 self)
In this paper we investigate the discovery of spatial association rules from a particular kind of image, namely document images. Document images are initially processed to extract both their layout structures and their logical structures. To take into account the inherent spatial nature of the layout structure, a spatial data mining algorithm is applied, which returns spatial association rules. We present possible applications of spatial association rules detected from document layout. We also illustrate and comment on experimental results on a set of multi-page documents extracted from IEEE PAMI.
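As a rough illustration of what a spatial association rule over layout components measures, the sketch below computes the standard support and confidence of a rule. It is not the mining algorithm used in the paper: the `(component, relation, component)` fact encoding, the relation names, and the four sample pages are all invented for the example.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical encoding: a spatial fact is (component, relation, component).
Fact = Tuple[str, str, str]

# Invented sample data: each page is the set of spatial facts observed on it.
pages: List[FrozenSet[Fact]] = [
    frozenset({("title", "on_top", "abstract"), ("abstract", "on_top", "body")}),
    frozenset({("title", "on_top", "abstract"), ("abstract", "on_top", "body")}),
    frozenset({("title", "on_top", "abstract"), ("figure", "to_right", "body")}),
    frozenset({("body", "on_top", "page_number")}),
]


def support(facts: Set[Fact], pages: List[FrozenSet[Fact]]) -> float:
    """Fraction of pages on which all the given facts hold."""
    return sum(1 for page in pages if facts <= page) / len(pages)


def confidence(antecedent: Set[Fact], consequent: Set[Fact],
               pages: List[FrozenSet[Fact]]) -> float:
    """Support of antecedent and consequent together, relative to the antecedent."""
    base = support(antecedent, pages)
    return support(antecedent | consequent, pages) / base if base else 0.0


# Rule: if the title is above the abstract, the abstract is above the body.
rule_if = {("title", "on_top", "abstract")}
rule_then = {("abstract", "on_top", "body")}

rule_support = support(rule_if | rule_then, pages)
rule_confidence = confidence(rule_if, rule_then, pages)
```

A mining algorithm would search for rules whose support and confidence exceed chosen thresholds; this sketch only scores a single rule that is supplied by hand.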
Hierarchies in HTML Documents: Linking Text to Concepts
- In 15th Int’l Workshop on Database and Expert Systems Applications, 2004
Cited by 2 (0 self)
For the successful establishment of the Semantic Web, it is necessary to provide tools for linking the large amounts of data that are currently available in HTML documents to Semantic Web ontologies. Because HTML code is enormously variable, defining direct bindings between patterns of the HTML code and the concepts is very limiting. We propose an approach based on modeling the visual part of the rendered document and describing the key characteristics of the data presentation in a general way. As a next step, we propose a way of using this model to locate instances of the concepts in the document using approximate tree matching algorithms and regular expressions.
Visual HTML Document Modeling for Information Extraction
- In Proceedings of RAWS 2005
Cited by 2 (0 self)
Current methods of information extraction from HTML documents are mostly based on the discovery of patterns in the HTML code that are expected to identify particular information in the document.
Automatic Indexing of Documents With Ontologies
- In 13th Belgian/Dutch Conference on Artificial Intelligence (BNAIC), 2001
Indexing large bodies of data is necessary to enable satisfactory search results.

