Adding nesting structure to words
- In Developments in Language Theory, LNCS 4036, 2006
Cited by 119 (17 self)
We propose the model of nested words for representation of data with both a linear ordering and a hierarchically nested matching of items. Examples of data with such dual linear-hierarchical structure include executions of structured programs, annotated linguistic data, and HTML/XML documents. Nested words generalize both words and ordered trees, and allow both word and tree operations. We define nested word automata, finite-state acceptors for nested words, and show that the resulting class of regular languages of nested words has all the appealing theoretical properties that the classical regular word languages enjoy: deterministic nested word automata are as expressive as their nondeterministic counterparts; the class is closed under union, intersection, complementation, concatenation, Kleene-*, prefixes, and language homomorphisms; membership, emptiness, language inclusion, and language equivalence are all decidable; and definability in monadic second order logic corresponds exactly to finite-state recognizability. We also consider regular languages of infinite nested words and show that the closure properties, MSO-characterization, and decidability of decision problems carry over. The linear encodings of nested words give the class of visibly pushdown languages of words, and this class lies between balanced languages and deterministic context-free languages. We argue that for algorithmic verification of structured programs, instead of viewing the program as a context-free language over words, one should view it as a regular language of nested words (or equivalently, a visibly pushdown language), and this would allow model checking of many properties (such as stack inspection, pre-post conditions) that are not expressible in existing specification logics.
We also study the relationship between ordered trees and nested words, and the corresponding automata: while the analysis complexity of nested word automata is the same as that of classical tree automata, they combine both bottom-up and top-down traversals, and enjoy expressiveness and succinctness benefits over tree automata.
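As a rough illustration of the dual linear-hierarchical structure described in this abstract, the following Python sketch recovers the nesting edges of a linearly encoded nested word with a single stack. The tagging of each position as "call", "ret", or "int", the function name nesting_edges, and the choice to report unmatched positions as pending rather than reject them are assumptions made for this example; it is not the paper's construction.

```python
from typing import List, Tuple

# A position is (kind, symbol); kind is "call", "ret", or "int" (internal).
# This tagging scheme is an assumption of the sketch.
Tagged = Tuple[str, str]


def nesting_edges(word: List[Tagged]):
    """Return (matched, pending_calls, pending_returns) for a tagged word.

    matched         -- list of (call_position, return_position) pairs
    pending_calls   -- call positions that are never closed
    pending_returns -- return positions with no open call
    """
    stack: List[int] = []            # positions of currently open calls
    matched: List[Tuple[int, int]] = []
    pending_returns: List[int] = []

    for pos, (kind, _symbol) in enumerate(word):
        if kind == "call":
            stack.append(pos)
        elif kind == "ret":
            if stack:
                # A return closes the most recent open call.
                matched.append((stack.pop(), pos))
            else:
                pending_returns.append(pos)
        elif kind != "int":
            raise ValueError(f"unknown kind {kind!r} at position {pos}")

    # Whatever is left on the stack was opened but never closed.
    return matched, stack, pending_returns
```

Because the kind of each symbol alone decides whether the stack is pushed, popped, or left untouched, one left-to-right pass suffices, which is the intuition behind the "visibly pushdown" terminology mentioned in the abstract.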
Recent advances in a feature-rich framework for treebank annotation
- In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), 2008
Cited by 15 (4 self)
This paper presents recent advances in an established treebank annotation framework comprising an abstract XML-based data format, a fully customizable editor of tree-based annotations, a toolkit for all kinds of automated data processing with support for cluster computing, and a work-in-progress database-driven search engine with a graphical user interface built into the tree editor.
System for Querying Syntactically Annotated Corpora
Cited by 13 (1 self)
This paper presents a system for querying treebanks. The system consists of a powerful query language with natural support for cross-layer queries, a client interface with a graphical query builder and visualizer of the results, a command-line client interface, and two substitutable query engines: a very efficient engine using a relational database (suitable for large static data), and a slower, but parallel-computing enabled, engine operating on treebank files (suitable for “live” data).
Ontology-Based XQuery’ing of XML-Encoded Language Resources on Multiple Annotation Layers
- 2008
Cited by 9 (5 self)
We present an approach for querying collections of heterogeneous linguistic corpora that are annotated on multiple layers using arbitrary XML-based markup languages. An OWL ontology provides a homogenising view on the conceptually different markup languages so that a common querying framework can be established using the method of ontology-based query expansion. In addition, we present a highly flexible web-based graphical interface that can be used to query corpora with regard to several different linguistic properties, such as syntactic tree fragments. This interface can also be used for ontology-based querying of multiple corpora simultaneously.
Mining Syntactically Annotated Corpora with XQuery
Cited by 6 (1 self)
This paper presents a uniform approach to data extraction from syntactically annotated corpora encoded in XML. XQuery, which incorporates XPath, has been designed as a query language for XML. The combination of XPath and XQuery offers flexibility and expressive power, while corpus-specific functions can be added to reduce the complexity of individual extraction tasks. We illustrate our approach using examples from dependency treebanks for Dutch.
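To give a flavour of the XPath-style extraction this abstract describes, here is a minimal Python sketch using the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and no XQuery. The element name node, the attributes rel, cat, pos, and word, the sample sentence, and the helper words_with_relation are all hypothetical and chosen for illustration; they are not taken from the paper or from any particular treebank format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML encoding of one dependency-annotated sentence.
SENTENCE = """
<sentence>
  <node rel="top" cat="smain">
    <node rel="su" pos="noun" word="Jan"/>
    <node rel="hd" pos="verb" word="leest"/>
    <node rel="obj1" cat="np">
      <node rel="det" pos="det" word="een"/>
      <node rel="hd" pos="noun" word="boek"/>
    </node>
  </node>
</sentence>
"""


def words_with_relation(xml_text: str, rel: str) -> list:
    """Return the words of all leaf nodes bearing dependency relation `rel`.

    Plays the role of a small corpus-specific helper wrapped around a
    generic path expression.
    """
    root = ET.fromstring(xml_text)
    # './/node[@rel="..."]' selects descendant <node> elements by attribute.
    hits = root.findall(f".//node[@rel='{rel}']")
    # Phrasal nodes carry no word attribute, so they are skipped.
    return [n.get("word") for n in hits if n.get("word") is not None]
```

For the sample sentence, words_with_relation(SENTENCE, "hd") collects the heads in document order, while a relation that only labels a phrasal node, such as "obj1", yields an empty list.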
Querying Linguistic Trees
Cited by 6 (1 self)
Large databases of linguistic annotations are used for testing linguistic hypotheses and for training language processing models. These linguistic annotations are often syntactic or prosodic in nature, and have a hierarchical structure. Query languages are used to select particular structures of interest, or to project out large slices of a corpus for external analysis. Existing languages suffer from a variety of problems in the areas of expressiveness, efficiency, and naturalness for linguistic query. We describe the domain of linguistic trees and discuss the expressive requirements for a query language. Then we present a language that can express a wide range of queries over these trees, and show that the language is first-order complete over trees.
Representing and Querying Standoff XML
Cited by 4 (0 self)
The paper discusses the representation and exploitation of multi-level annotated linguistic data. We first present a standoff XML representation, which distributes information over separate, standoff layers and allows us to represent annotations of various kinds in a uniform, generic way. This format serves as our interchange format. We further introduce an XML-inline representation that is designed to allow more efficient processing of the data. This format is computed on the basis of the standoff representation and uses fragments to represent overlapping elements. We then compare both representations by testing their performance on a test suite. Not surprisingly, the inline variant performs much better than the standoff variant, in particular with more complex queries.
Fangorn: A system for querying very large treebanks
Cited by 3 (0 self)
The efficiency and robustness of statistical parsers have made it possible to create very large treebanks. These serve as the starting point for further work including enrichment, extraction, and curation: semantic annotations are added, syntactic features are mined, erroneous analyses are corrected. In many such cases manual processing is required, and this must operate efficiently on the largest scale. We report on an efficient web-based system for querying very large treebanks called Fangorn. It implements an XPath-like query language which is extended with a linguistic operator to capture proximity in the terminal sequence. Query results are displayed using scalable vector graphics and decorated with the original query, making it easy for queries to be modified and resubmitted. Fangorn is built on the Apache Lucene text search engine and is available under the Apache License.
LPath+: A First-Order Complete Language for Linguistic Tree Query
Cited by 2 (0 self)
Large databases of linguistic annotations are used for testing linguistic hypotheses, and for training language processing models. Linguistic annotations are often syntactic or prosodic and typically have a tree structure. Our goal is to develop a language that can express a wide range of linguistic tree queries and has an efficient implementation. We argue that by adding some simple closures to the LPath language, we can meet this goal. We call this new addition to the XPath family LPath+. We place LPath and LPath+ in the hierarchy of XPath languages and conclude that LPath+ is first-order complete over trees. This means LPath+ is efficiently implementable in SQL.