Results 1 - 10 of 12
Probabilistic Part-of-Speech Tagging Using Decision Trees
, 1994
Cited by 413 (4 self)
In this paper, a new probabilistic tagging method is presented which avoids the problems that Markov-model-based taggers face when they have to estimate transition probabilities from sparse data. In this tagging method, transition probabilities are estimated using a decision tree. Based on this method, a part-of-speech tagger (called TreeTagger) has been implemented which achieves 96.36% accuracy on Penn Treebank data, better than that of a trigram tagger (96.06%) on the same data.
Keywords: Corpus-based NLP, Statistical NLP, Part-of-Speech Tagging.
1 Introduction
Word forms are often ambiguous in their part-of-speech (POS). The English word form store, for example, can be either a noun, a finite verb or an infinitive. In an utterance, this ambiguity is normally resolved by the context of a word: e.g. in the sentence "The 1977 PCs could store two pages of data.", store can only be an infinitive. The predictability of the part-of-speech from the context is used by automatic part-...
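The abstract above says only that transition probabilities are estimated with a decision tree. The following is a toy sketch of that general idea, not the TreeTagger implementation: the binary tests on the two preceding tags, the information-gain criterion, the stopping rule, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (bits) of a Counter of tag frequencies."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def build_tree(samples, min_gain=1e-9):
    """Grow a binary tree from ((tag_minus2, tag_minus1), tag) samples.

    Each inner node asks "is the tag at context position pos equal to value?".
    Each leaf stores a relative-frequency distribution over the next tag.
    """
    dist = Counter(tag for _, tag in samples)
    base = entropy(dist)
    best = None
    tests = {(pos, ctx[pos]) for ctx, _ in samples for pos in (0, 1)}
    for pos, value in sorted(tests):
        yes = [s for s in samples if s[0][pos] == value]
        no = [s for s in samples if s[0][pos] != value]
        if not yes or not no:
            continue  # the test does not split the data
        remainder = sum(
            len(part) / len(samples) * entropy(Counter(t for _, t in part))
            for part in (yes, no)
        )
        gain = base - remainder
        if best is None or gain > best[0]:
            best = (gain, pos, value, yes, no)
    if best is None or best[0] <= min_gain:
        total = sum(dist.values())
        return {"leaf": {t: c / total for t, c in dist.items()}}
    _, pos, value, yes, no = best
    return {
        "test": (pos, value),
        "yes": build_tree(yes, min_gain),
        "no": build_tree(no, min_gain),
    }


def transition_probs(tree, context):
    """Walk the tree for a (tag_minus2, tag_minus1) context; return the leaf distribution."""
    while "leaf" not in tree:
        pos, value = tree["test"]
        tree = tree["yes"] if context[pos] == value else tree["no"]
    return tree["leaf"]


# Hypothetical training samples: (two preceding tags) -> observed next tag.
samples = [
    (("DET", "ADJ"), "NOUN"),
    (("DET", "ADJ"), "NOUN"),
    (("DET", "ADJ"), "ADJ"),
    (("DET", "NOUN"), "VERB"),
    (("ADJ", "NOUN"), "VERB"),
    (("PRON", "MD"), "VERB"),
    (("NOUN", "MD"), "VERB"),
]
tree = build_tree(samples)

# The context (NOUN, ADJ) never occurs in the samples, so a raw trigram count
# would be zero; the tree still returns an estimate because it only tests
# the tag that turned out to be informative.
probs = transition_probs(tree, ("NOUN", "ADJ"))
```

In this toy data the tree splits on whether the immediately preceding tag is ADJ, so contexts that were never observed as a full trigram share a leaf with similar observed contexts. That sharing is one plausible way a tree can soften the sparse-data problem the abstract mentions.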
Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation
, 1998
Cited by 88 (8 self)
Resnik and Yarowsky (1997) made a set of observations about the state of the art in automatic word sense disambiguation and, motivated by those observations, offered several specific proposals regarding improved evaluation criteria, common training and testing resources, and the definition of sense inventories. Subsequent discussion of those proposals resulted in SENSEVAL, the first evaluation exercise for word sense disambiguation (Kilgarriff and Palmer forthcoming). This article is a revised and extended version of our 1997 workshop paper, reviewing its observations and proposals and discussing them in light of the SENSEVAL exercise. It also includes a new in-depth empirical study of translingually-based sense inventories and distance measures, using statistics collected from native-speaker annotations of 222 polysemous contexts across 12 languages. These data show that monolingual sense distinctions at most levels of granularity can be effectively captured by translations into some ...
Part-of-Speech Tagging and Partial Parsing
- Corpus-Based Methods in Language and Speech
, 1996
Cited by 85 (0 self)
m we can carve off next. `Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate.
1 Tagging
The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Combining Linguistic Knowledge and Statistical Learning in French Part-of-Speech Tagging
- In EACL SIGDAT Workshop
, 1995
Cited by 13 (6 self)
This paper presents a new part-of-speech tagger that takes into account both linguistic knowledge and statistical learning.
Comparative State-of-the-Art Survey and Assessment Study of . . .
, 1994
Cited by 3 (2 self)
Contents
1 Introduction
1.1 Rationale
1.2 Method of the survey
1.3 Structure of this document
2 Tokenization
2.1 Motivation
2.2 Tokenizer Survey
2.2.1 Unix lex Tokenizer
2.2.1.1 Sample input/output
2.2.2 RXRC Finite-state tokenizer
2.2.2.1 Sample input/output
2.3 Tokeniz
Use of weighted finite state transducers in part of speech tagging
, 1997
Cited by 2 (1 self)
This paper addresses issues in part of speech disambiguation using finite-state transducers and presents two main contributions to the field. One of them is the use of finite-state machines for part of speech tagging. Linguistic and statistical information is represented in terms of weights on transitions in weighted finite-state transducers. Another contribution is the successful combination of techniques – linguistic and statistical – for word disambiguation, compounded with the notion of word classes.
Tagging French Without Lexical Probabilities - Combining Linguistic Knowledge And Statistical Learning
Cited by 1 (0 self)
This paper explores morpho-syntactic ambiguities for French to develop a strategy for part-of-speech disambiguation that a) reflects the complexity of French as an inflected language, b) optimizes the estimation of probabilities, c) allows the user flexibility in choosing a tagset. The problem in extracting lexical probabilities from a limited training corpus is that the statistical model may not necessarily represent the use of a particular word in a particular context. In a highly morphologically inflected language, this argument is particularly serious since a word can be tagged with a large number of parts of speech. Due to the lack of sufficient training data, we argue against estimating lexical probabilities to disambiguate parts of speech in unrestricted texts. Instead, we use the strength of contextual probabilities along wi...
[Footnote: The work was achieved while the author was at AT&T Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974-0636. Running head: Evelyne Tzoukermann et al.]
Decision Trees and NLP: A Case Study in POS Tagging
, 1999
Cited by 1 (0 self)
This paper presents a machine learning approach to the problems of part-of-speech disambiguation and unknown word guessing, as they appear in Modern Greek. Both problems are cast as classification tasks carried out by decision trees. The data model acquired is capable of capturing the idiosyncratic behavior of underlying linguistic phenomena. Decision trees are induced with three algorithms; the first two produce generalized trees, while the third produces binary trees. To meet the requirements of the linguistic datasets, all three algorithms are able to handle set-valued attributes. Evaluation results reveal a subtle differentiation in the performance of the three algorithms, which achieve an accuracy range of 93-95% in POS disambiguation and 82-88% in guessing the POS of unknown words.
INTRODUCTION
It has recently become apparent that empirical ML can find in NLP an exciting application area. The increasing use of corpus-based learning in place of manual encoding has led to ...
Comparison of Unigram, Bigram, HMM and Brill’s POS Tagging Approaches for some South Asian Languages
Cited by 1 (1 self)
Part-of-Speech (POS) Tagging is a process that assigns each word in a sentence a suitable tag from a given set of tags. POS Tagging is important in various areas of Natural Language Processing. Different methods of automating the process have been developed and employed for English and other Western languages. Some similar work, most of which uses stochastic approaches to POS Tagging, has also been done for South Asian languages. We experimented with some of the widely used approaches for POS Tagging on three South Asian languages, Bangla, Hindi and Telegu, using corpora of different sizes. We observed the performance of the approaches and found the performance of Brill’s transformation-based tagger to be superior to that of the other approaches in all of our experiments, though the use of this approach has been very limited until recently.
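As a point of reference for the simplest approach named in the title above, here is a minimal sketch of a unigram tagger: each word gets the tag it carried most often in training, regardless of context. The function names, the fallback to the overall most frequent tag for unknown words, and the English toy corpus are assumptions for illustration; they are not taken from the paper, which works on Bangla, Hindi and Telegu corpora.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Return (lexicon, default_tag) learned from lists of (word, tag) pairs."""
    word_tag_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_tag_counts[word][tag] += 1
            tag_counts[tag] += 1
    # Most frequent tag per known word.
    lexicon = {w: c.most_common(1)[0][0] for w, c in word_tag_counts.items()}
    # Assumed fallback for unknown words: the most frequent tag overall.
    default_tag = tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag_unigram(words, lexicon, default_tag):
    """Tag each word independently of its neighbours."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


# Hypothetical toy training corpus.
train = [
    [("the", "DET"), ("store", "NOUN"), ("opens", "VERB")],
    [("the", "DET"), ("store", "NOUN"), ("closes", "VERB")],
    [("they", "PRON"), ("store", "VERB"), ("data", "NOUN")],
    [("the", "DET"), ("data", "NOUN")],
]
lexicon, default_tag = train_unigram(train)
tagged = tag_unigram(["they", "store", "files"], lexicon, default_tag)
```

Here "store" is tagged NOUN even after "they", because a unigram model never looks at neighbouring words. Bigram, HMM and transformation-based taggers differ precisely in how they bring that context in.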
Stefan Trausan-Matu, Philippe Dessus (Eds.) Natural Language Processing in Support of Learning: Metrics, Feedback and Connectivity
In supporting Lifelong Learning (LLL) on the Social Web (Web2.0), Natural Language Technologies (LT) increasingly play a central role because text is the leading medium of communication and collaboration. LT now cover a wide range of topics, including advanced semantic resources and applications like ontologies, knowledge extraction, text mining, Natural Language Processing (NLP) and Latent Semantic Analysis (LSA). The peculiarities of Web2.0 also call for considering the use of LT for social software (social network analysis) and collaborative interactions in chats and forums. Pragmatics, discourse and conversation analysis are very important analysis domains. For LLL, providing feedback entails measuring differences among learners; between learners and their desired characteristics (e.g., knowledge, competences, motivation, self-regulation processes); or between learners and their looked-for resources (e.g. web-links, articles, courses). Differences have often been measured by computing and analyzing 'distances' using several techniques like factorial analysis, instance-based learning, clustering, and so on. Corpora on which

