Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation
, 1998
Cited by 125 (8 self)
Resnik and Yarowsky (1997) made a set of observations about the state of the art in automatic word sense disambiguation and, motivated by those observations, offered several specific proposals regarding improved evaluation criteria, common training and testing resources, and the definition of sense inventories. Subsequent discussion of those proposals resulted in Senseval, the first evaluation exercise for word sense disambiguation (Kilgarriff and Palmer forthcoming). This article is a revised and extended version of our 1997 workshop paper, reviewing its observations and proposals and discussing them in light of the Senseval exercise. It also includes a new in-depth empirical study of translingually-based sense inventories and distance measures, using statistics collected from native-speaker annotations of 222 polysemous contexts across 12 languages. These data show that monolingual sense distinctions at most levels of granularity can be effectively captured by translations into some ...
Comparing a Linguistic and a Stochastic Tagger
- Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics
, 1997
Cited by 45 (2 self)
Concerning different approaches to automatic PoS tagging: EngCG-2, a constraint-based morphological tagger, is compared in a double-blind test with a state-of-the-art statistical tagger on a common disambiguation task using a common tag set. The experiments show that for the same amount of remaining ambiguity, the error rate of the statistical tagger is one order of magnitude greater than that of the rule-based one. The two related issues of priming effects compromising the results and disagreement between human annotators are also addressed.
Implementing an Efficient Part-of-Speech Tagger
- Software–Practice and Experience
, 1999
Cited by 42 (6 self)
An efficient implementation of a part-of-speech tagger for Swedish is described. The stochastic tagger uses a well-established Markov model of the language. The tagger tags 92% of unknown words correctly and up to 97% of all words. Several implementation and optimization considerations are discussed. The main contribution of this paper is the thorough description of the tagging algorithm and the addition of a number of improvements. The paper contains enough detail for the reader to construct a tagger for his own language.
Keywords: part-of-speech tagging, word tagging, optimization, hidden Markov models.
Introduction
In part-of-speech (POS) tagging of a text, each word and punctuation mark in the text is assigned its morphosyntactic tag. Different tagging systems use different sets of tags, but typically a tag describes a word class and some word class specific features, such as number and gender. The number of different tags varies between a dozen and several hundred. Constructing ...
Serial Combination of Rules and Statistics: A Case Study in Czech Tagging
Cited by 31 (0 self)
A hybrid system is described which combines the strengths of manual rule-writing and statistical learning, obtaining results superior to those of either method applied separately. The combination of a rule-based system and a statistical one is not parallel but serial: the rule-based system, which performs partial disambiguation with recall close to 100%, is applied first, and a trigram HMM tagger runs on its results. An experiment in Czech tagging has been performed with encouraging results.
Developing a hybrid NP parser
- In Proceedings of ANLP-97
, 1997
Cited by 12 (4 self)
We describe the use of energy function optimization in very shallow syntactic parsing. The approach can use linguistic rules and corpus-based statistics, so the strengths of both linguistic and statistical approaches to NLP can be combined in a single framework. The rules are contextual constraints for resolving syntactic ambiguities expressed as alternative tags, and the statistical language model consists of corpus-based n-grams of syntactic tags. The success of the hybrid syntactic disambiguator is evaluated against a held-out benchmark corpus. The contributions of the linguistic and statistical language models to the hybrid model are also estimated.
Citation recognition for scientific publications in digital libraries
- In First International Workshop on Document Image Analysis for Libraries (DIAL’04)
, 2004
Cited by 11 (0 self)
In this paper, a method based on part-of-speech (PoS) tagging is used to recognize the structure of bibliographic references. This method operates on a roughly structured ASCII file produced by OCR. Because of the heterogeneity of the reference structure, the method acts in a bottom-up way, without an a priori model, gathering structural elements from basic tags to sub-fields and fields. Significant tags are first grouped into homogeneous classes according to their categories and then reduced to canonical forms corresponding to record fields: “authors”, “title”, “conference name”, “date”, etc. Unlabeled tokens are integrated into one field or another either by applying PoS correction rules or by using an inter- or intra-field model generated from well-detected records. The designed prototype performs satisfactorily on different record layouts and character recognition qualities. Without manual intervention, 96.6% of words are correctly attributed, and about 75.9% of the 2,575 references are completely segmented.
Annotating topological fields and chunks - and revising POS tags at the same time
, 2002
Cited by 10 (1 self)
Annotating a corpus of German with chunks, topological fields and clause boundaries is both a goal in itself and a step towards further syntactic annotation. Partial annotation can serve as data to test linguistic hypotheses, and it can be used as a pre-structuring for further linguistic annotation steps. If, however, the underlying part-of-speech (POS) annotation is imperfect, these errors will be passed on to the subsequent levels of annotation and increase annotation errors on those levels. Errors are especially damaging for subsequent annotation when they affect the POS tags that provide the framework of the German sentence by demarcating the topological fields and the clause boundaries (e.g. subordinators and verbs). This paper presents a method to automatically annotate a corpus of German with chunks, topological fields and clause boundaries, and to improve tagging accuracy at the same time in order to increase the overall annotation accuracy. Tag improvement primarily relies on the linguistic knowledge encoded in the grammar for annotating the topological fields.
Syllable pattern-based unknown morpheme estimation for hybrid part-of-speech tagging of Korean
- Computational Linguistics
, 1999
Cited by 9 (6 self)
This paper presents a syllable pattern directed generalized unknown morpheme handling method with POSTAG (POStech TAGger),
Towards Learning a Constraint Grammar from Annotated Corpora Using Decision Trees
, 1995
Cited by 9 (2 self)
Within the framework of robust parsers for the syntactic analysis of unrestricted text, the aim of this work is the construction of a system capable of automatically learning Constraint Grammar rules from a POS-annotated corpus. The system presented can currently acquire constraint rules for POS tagging, and we plan to extend it to cover syntactic rules. The learning process uses a supervised learning algorithm based on building a discrimination forest, with a decision tree attached to each case of POS ambiguity. The system has been applied to four representative cases of ambiguity in a Spanish corpus. The results obtained in these experiments and some discussion about the appropriateness of the proposed learning technique are presented in this paper. This research has been partially funded by the Spanish Research Department (CICYT) and inscribed as TIC92-0671.
1 Introduction
The task of developing automatic procedures for parsing unrestricted natural langua...