TnT - A Statistical Part-Of-Speech Tagger
, 2000
Cited by 540 (5 self)
Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. Contrary to claims found elsewhere in the literature, we argue that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework. A recent comparison has even shown that TnT performs significantly better for the tested corpora. We describe the basic model of TnT, the techniques used for smoothing and for handling unknown words. Furthermore, we present evaluations on two corpora.
Modeling Out-Of-Vocabulary Words For Robust Speech Recognition
, 2000
Cited by 86 (6 self)
This thesis concerns the problem of unknown or out-of-vocabulary (OOV) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes the OOV word with a similarly sounding word from its vocabulary. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words, dramatically degrading overall recognition performance.
A Stochastic Topological Parser for German
- In Proceedings of COLING 2002
, 2002
Cited by 20 (5 self)
of German which is corpus-based and built on a simple model of probabilistic CFG parsing. The topological field model of German provides a linguistically motivated, flat macro structure for complex sentences. Besides the practical aspect of developing a robust and accurate topological parser for hybrid shallow and deep NLP, we investigate to what extent topological structures can be handled by context-free probabilistic models. We discuss experiments with systematic variants of a topological treebank grammar, which yield competitive results.
Improving the PoS tagging accuracy of Icelandic text
- In Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA-2009)
, 2009
Cited by 6 (4 self)
Previous work on part-of-speech (PoS) tagging Icelandic has shown that the morphological complexity of the language poses considerable difficulties for PoS taggers. In this paper, we increase the tagging accuracy of Icelandic text by using two methods. First, we present a new tagger, by integrating an HMM tagger into a linguistic rule-based tagger. Our tagger obtains state-of-the-art tagging accuracy of 92.31% using the standard test set derived from the IFD corpus, and 92.51% using a corrected version of the corpus. Second, we design an external tagset, by removing information from the internal tagset which reflects distinctions that are not morphologically based. Using the external tagset for evaluation, the tagging accuracy further increases to 93.63%.
Efficient Stochastic Part-of-Speech Tagging for Hungarian
- In Proc. of the Third LREC, pages 710–717, Las Palmas, Spain
, 2002
Cited by 6 (0 self)
Many of the methods developed for Western European languages and widely used to produce annotated language resources cannot readily be applied to Central and Eastern European languages, due to the large number of novel phenomena exhibited in the syntax and morphology of these languages, which these methods have to handle but have not been designed to cope with. The process of morphological tagging when applied to Hungarian data to produce corpora annotated at least at the morphosyntactic level is most indicative of this problem: several of the algorithms (either rule-based or statistical) that have been used very successfully in other domains cannot readily be applied to a language exhibiting such a varied morphology and huge number of wordforms as Hungarian. The paper will describe a robust tagging scenario for Hungarian using a relatively simple stochastic system augmented with external morphological processing, which can overcome the two most conspicuous problems: the complexity of morphosyntactic descriptions and, most importantly, the huge number of possible wordforms.
A simple method for tagset comparison
- Proc. of LREC
, 2008
Cited by 4 (2 self)
Based on the idea that local contexts predict the same basic category across a language, we develop a simple method for comparing tagsets across corpora. The principal differences between tagsets are evidenced by variation in categories in one corpus in the same contexts where another corpus exhibits only a single tag. Such mismatches highlight differences in the definitions of tags which are crucial when porting technology from one annotation scheme to another.
Representations for category disambiguation
- In Proceedings of the 22nd International Conference on Computational Linguistics (COLING-08)
, 2008
Cited by 3 (2 self)
we investigate the information needed to disambiguate a word in a local context, when using corpus categories. Specifically, we increase the recall of an error detection method by abstracting the word to be disambiguated to a representation containing information about some of its inherent properties, namely the set of categories it can potentially have. This work thus provides insights into the relation of corpus categories to categories derived from local contexts.
Determining Ambiguity Classes for Part-of-Speech Tagging
- In Proceedings of RANLP-07. Borovets
, 2007
Cited by 2 (2 self)
We examine how words group together in the lexicon, in terms of ambiguity classes, and use this information in a redefined tagset to improve POS tagging. In light of errors in the training data and a limited amount of annotated data, we investigate ways to define ambiguity classes for words which consider the lexicon as a whole and predict unknown uses of words. Fitting words to typical ambiguity classes is shown to provide more accurate ambiguity classes for words and to significantly improve tagging performance.
Evaluating Distributional Properties of Tagsets
, 2010
Cited by 1 (0 self)
We investigate which distributional properties should be present in a tagset by examining different mappings of various current part-of-speech tagsets, looking at English, German, and Italian corpora. Given the importance of distributional information, we present a simple model for evaluating how a tagset mapping captures distribution, specifically by utilizing a notion of frames to capture the local context. In addition to an accuracy metric capturing the internal quality of a tagset, we introduce a way to evaluate the external quality of tagset mappings so that we can ensure that the mapping retains linguistically important information from the original tagset. Although most of the mappings we evaluate are motivated by linguistic concerns, we also explore an automatic, bottom-up way to define mappings, to illustrate that better distributional mappings are possible. Comparing our initial evaluations to POS tagging results, we find that more distributional tagsets can sometimes result in worse accuracy, underscoring the need to carefully define the properties of a tagset.
Modelling Out-of-Vocabulary Words for Robust Speech Recognition
, 2002
This thesis concerns the problem of unknown or out-of-vocabulary (OOV) words in continuous speech recognition. Most of today's state-of-the-art speech recognition systems can recognize only words that belong to some predefined finite word vocabulary. When encountering an OOV word, a speech recognizer erroneously substitutes the OOV word with a similarly sounding word from its vocabulary. Furthermore, a recognition error due to an OOV word tends to spread errors into neighboring words, dramatically degrading overall recognition performance. In this thesis we propose a novel approach for handling OOV words within a single-stage recognition framework. To achieve this goal, an explicit and detailed model of OOV words is constructed and then used to augment the closed-vocabulary search space of a standard speech recognizer. This OOV model achieves open-vocabulary recognition through the use of more flexible subword units that can be concatenated during recognition to form new phone sequences corresponding to potential new words. Examples of such subword units are phones, syllables, or some automatically-learned multi-phone sequences. Subword units