Cut and Paste, 1998
Cited by 71 (10 self)
The paper develops Editor, a language for manipulating semi-structured documents, such as the ones typically available on the Web. Editor programs are based on two simple ideas, taken from text editors: "search" instructions are used to select regions of interest in a document, and "cut & paste" to restructure them. We study the expressive power and the complexity of these programs. We show that they are computationally complete, in the sense that any computable document restructuring can be expressed in Editor. We also study the complexity of a safe subclass of programs, showing that it captures exactly the class of polynomial-time restructurings. The language has been implemented in Java, and is currently used in the Araneus project as a basis for a wrapper-generation toolkit.
1 Introduction
It is well known that databases provide robust technology for querying highly structured data in a flexible and efficient way. Recently, the manipulation of less structured information has als...
Grammars Have Exceptions, 1998
Cited by 30 (5 self)
Extending database-like techniques to semi-structured and Web data sources is becoming a prominent research field. These data sources are essentially collections of textual documents. Hence, in this context, one of the key tasks consists in wrapping documents to build database abstractions of their content that can be manipulated using high-level tools. However, the degree of heterogeneity and the lack of structure make standard grammar parsers excessively rigid, and often unable to capture the richness of constructs in these documents. This paper presents Minerva, a formalism for writing wrappers around Web sites and other textual data sources. The key feature of Minerva is the attempt to couple the benefits of a declarative, grammar-based approach with the flexibility of procedural programming. This is done by enriching regular grammars with an explicit exception-handling mechanism. The contributions of the paper lie in the definition of the formalism, and in the description of its i...
Electronic Style Sheets, 1992
Cited by 4 (4 self)
Document processing systems must provide formatted versions of documents, where the specification of formats is the task of the document designer. To match the stylistic quality expected in the traditional publishing process, electronic style sheets need to support the design mechanisms that have evolved over the centuries. The designer's craft should not depend on the formatter; in particular, it should not involve programming the formatter. We propose four basic mechanisms called transcription types that are sufficient to express a wide range of layouts. Building on these four transcription types, we have defined a layout specification language, Designer, that is declarative and formatter-independent.
Putting on the agony, putting on the style. --- Lonnie Donnagon (c. 1960)
1 Introduction
In the traditional publishing process, a style sheet is a running account of rules about diction and language usage adopted for a particular manuscript. It is kept by the manuscript edito...
IDEOMS: An Integrated Document Environment based on OMS Object-Oriented Database System
- In Proceedings of the 4th Doctoral Consortium in conjunction with the Conference on Advanced Information Systems Engineering (CAiSE'97), 1997
Cited by 2 (0 self)
In this paper, we describe an integrated document management system based on the object-oriented database system OMS. The underlying idea is to replace the current vision of a file system with that of a document database system, thereby offering higher-level user support for the management of all forms of computer-stored information and work activities. The object-oriented database system provides access to existing tools such as editors and compilers through object methods, while providing standard database features for managing dependencies between documents, queries and controlled information sharing. In particular, the OMS system supports powerful methods for document classification independent of the type of the documents. This extended database view of a file system allows the users to handle documents at a logical level of abstraction independent of the physical file system.
1 Introduction
In most implementations of file systems in modern operating systems, the level of abstracti...
Stochastic Grammatical Inference of Text . . .
- MACHINE LEARNING, 2000
For a document collection in which structural elements are identified with markup, it is often necessary to construct a grammar retrospectively that constrains element nesting and ordering. This has been addressed by others as an application of grammatical inference. We describe an approach based on stochastic grammatical inference which scales more naturally to large data sets and produces models with richer semantics. We adopt an algorithm that produces stochastic finite automata and describe modifications that enable better interactive control of results. Our experimental evaluation uses four document collections with varying structure.

