The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars
- Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics, 2002
Cited by 32 (9 self)
The grammar matrix is an open-source starter-kit for the development of broad-coverage HPSGs. By using a type hierarchy to represent cross-linguistic generalizations and providing compatibility with other open-source tools for grammar engineering, evaluation, parsing and generation, it facilitates not only quick start-up but also rapid growth towards the wide coverage necessary for robust natural language processing and the precision parses and semantic representations necessary for natural language understanding.
Feature Selection for a Rich HPSG Grammar Using Decision Trees
- In Proceedings of the 6th Conference on Natural Language Learning, 2002
Cited by 19 (4 self)
This paper examines feature selection for log-linear models over rich constraint-based grammar (HPSG) representations by building decision trees over features in corresponding probabilistic context-free grammars (PCFGs). We show that single decision trees do not make optimal use of the available information; constructed ensembles of decision trees based on different feature subspaces show significant performance gains (14% parse selection error reduction). We compare the performance of the learned PCFG grammars and log-linear models over the same features.
Ensemble-based Active Learning for Parse Selection
Cited by 18 (3 self)
Supervised estimation methods are widely seen as being superior to semi-supervised and fully unsupervised methods. However, supervised methods crucially rely upon training sets that need to be manually annotated. This can be very expensive, especially when skilled annotators are required. Active learning (AL) promises to help reduce this annotation cost. Within the complex domain of HPSG parse selection, we show that ideas from ensemble learning can help further reduce the cost of annotation. Our main results show that at times, an ensemble model trained with randomly sampled examples can outperform a single model trained using AL. However, converting the single-model AL method into an ensemble-based AL method shows that even this much stronger baseline model can be improved upon. Our best results show a [...] reduction in annotation cost compared with single-model random sampling.
Maximum Entropy Models for Realization Ranking
- In Proceedings of the 10th Machine Translation Summit (pp. 109
, 2005
Cited by 16 (1 self)
In this paper we describe and evaluate different statistical models for the task of realization ranking, i.e. the problem of discriminating between competing surface realizations generated for a given input semantics. Three models are trained and tested: an n-gram language model, a discriminative maximum entropy model using structural features, and a combination of these two. Our realization component forms part of a larger, hybrid MT system.
Paraphrasing Treebanks for Stochastic Realization Ranking
- In Proceedings of the 3rd Workshop on Treebanks and Linguistic Theories, 2004
Cited by 13 (4 self)
This paper describes a novel approach to the task of realization ranking, i.e. the choice among competing paraphrases for a given input semantics, as produced by a generation system. We also introduce a notion of symmetric treebanks, which we define as the combination of (a) a set of pairings of surface forms and associated semantics plus (b) the sets of alternative analyses for the surface form and sets of alternate realizations of the semantics. For inclusion of alternate analyses and realizations in the symmetric treebank, we propose to make the underlying linguistic theory explicit and operational, viz. in the form of a broad-coverage computational grammar. Extending earlier work on grammar-based treebanks in the Redwoods (Oepen et al. [13]) paradigm, we present a fully automated procedure to produce a symmetric treebank from existing resources. To evaluate the utility of an initial (albeit smallish) such 'expanded' treebank, we report on experimental results for training stochastic discriminative models for the realization ranking task. Our work is set...
The leaf projection path view of parse trees: Exploring string kernels for HPSG parse selection
- In Proceedings of EMNLP 2004, 2004
Cited by 11 (0 self)
We present a novel representation of parse trees as lists of paths (leaf projection paths) from leaves to the top level of the tree. This representation allows us to achieve significantly higher accuracy in the task of HPSG parse selection than standard models, and makes the application of string kernels natural. We define tree kernels via string kernels on projection paths and explore their performance in the context of parse disambiguation. We apply SVM ranking models and achieve an exact sentence accuracy of 85.40% on the Redwoods corpus.
Validation and Evaluation of Automatically Acquired Multiword Expressions for Grammar Engineering
Cited by 10 (2 self)
This paper focuses on the evaluation of methods for the automatic acquisition of Multiword Expressions (MWEs) for robust grammar engineering. First we investigate the hypothesis that MWEs can be detected by the distinct statistical properties of their component words, regardless of their type, comparing three statistical measures: mutual information (MI), χ² and permutation entropy (PE). Our overall conclusion is that at least two measures, MI and PE, seem to differentiate MWEs from non-MWEs. We then investigate the influence of the size and quality of different corpora, using the BNC and the Web search engines Google and Yahoo. We conclude that, in terms of language usage, web-generated corpora are fairly similar to more carefully built corpora, like the BNC, indicating that the lack of control and balance in these corpora is probably compensated for by their size. Finally, we show a qualitative evaluation of the results of automatically adding extracted MWEs to existing linguistic resources. We argue that such a process improves qualitatively if a more compositional approach to automated grammar/lexicon extension is adopted.
Automated deep lexical acquisition for robust open texts processing
- 2006
Cited by 7 (4 self)
In this paper, we report on methods to detect and repair lexical errors for deep grammars. The lack of coverage has long been the major problem for deep processing. The existence of various errors in large hand-crafted grammars prevents their usage in real applications. The manual detection and repair of errors requires a significant amount of human effort. An experiment with the British National Corpus shows that about 70% of the sentences contain one or more words unknown to the English Resource Grammar (ERG; Copestake and Flickinger, 2000). With the help of error mining methods, many lexical errors are discovered, which cause a large part of the parsing failures. Moreover, with a lexical type predictor based on a maximum entropy model, new lexical entries are automatically generated. The contributions of various features to the model are evaluated. With the disambiguated full parsing results, the precision of the predictor is enhanced significantly.
1. Background
Deep linguistic processing delivers fine-grained syntactic and semantic analyses which are difficult to achieve with shallow methods. The core part of deep processing is a complex rule system, called the deep grammar. Linguistic data is processed by recursively applying the grammar
Formal Investigations of Underspecified Representations
- 2005
Cited by 6 (1 self)
In this thesis, two requirements on Underspecified Representation Formalisms are investigated in detail in the context of underspecification of scope. The requirement on partial disambiguation, stating that partially disambiguated ambiguities need to be represented, does not carry much content unless it is clear exactly what those ambiguities are. In line with König and Reyle (1999), I argue that all theoretically possible patterns of ambiguity, i.e. subsets of readings, can occur in natural language and that therefore an underspecified representation formalism can only be regarded as expressively complete if it provides representations for all of these subsets. This discussion is couched in a general formal setting, which facilitates clean definitions and allows for the derivation of formally precise results. With those formal definitions at hand, various underspecified representation formalisms are evaluated. As it turns out, none of the investigated formalisms is expressively complete, which answers a corresponding question raised in (König and Reyle, 1999). These incompleteness results allow for a straightforward comparison of the discussed approaches with respect to expressive power, which forms the
Automatic Adaptation of Annotation Standards: Chinese Word Segmentation and POS Tagging – A Case Study
Cited by 5 (3 self)
Manually annotated corpora are valuable but scarce resources, yet for many annotation tasks such as treebanking and sequence labeling there exist multiple corpora with different and incompatible annotation guidelines or standards. This seems to be a great waste of human effort, and it would be nice to automatically adapt one annotation standard to another. We present a simple yet effective strategy that transfers knowledge from a differently annotated corpus to the corpus with the desired annotation. We test the efficacy of this method in the context of Chinese word segmentation and part-of-speech tagging, where no segmentation and POS tagging standards are widely accepted due to the lack of morphology in Chinese. Experiments show that adaptation from the much larger People’s Daily corpus to the smaller but more popular Penn Chinese Treebank results in significant improvements in both segmentation and tagging accuracies (with error reductions of 30.2% and 14%, respectively), which in turn helps improve Chinese parsing accuracy.

