A Language-Independent Unsupervised Model for Morphological Segmentation
, 2007
Cited by 21 (1 self)
Morphological segmentation has been shown to be beneficial to a range of NLP tasks such as machine translation, speech recognition, speech synthesis and information retrieval. Recently, a number of approaches to unsupervised morphological segmentation have been proposed. This paper describes an algorithm that draws from previous approaches and combines them into a simple model for morphological segmentation that outperforms other approaches on English and German, and also yields good results on agglutinative languages such as Finnish and Turkish. We also propose a method for detecting variation within stems in an unsupervised fashion. The segmentation quality reached with the new algorithm is good enough to improve grapheme-to-phoneme conversion.
Paramor: From Paradigm Structure to Natural Language Morphology Induction
, 2008
Cited by 17 (2 self)
Most of the world’s natural languages have complex morphology. But the expense of building morphological analyzers by hand has prevented the development of morphological analysis systems for the large majority of languages. Unsupervised induction techniques, which learn from unannotated text data, can facilitate the development of computational morphology systems for new languages. Such unsupervised morphological analysis systems have been shown to help natural language processing tasks including speech recognition (Creutz, 2006) and information retrieval (Kurimo and Turunen, 2008). This thesis describes ParaMor, an unsupervised induction algorithm for learning morphological paradigms from large collections of words in any natural language. Paradigms are sets of mutually substitutable morphological operations that organize the inflectional morphology of natural languages. ParaMor focuses on the most common morphological process, suffixation. ParaMor learns paradigms in a three-step algorithm. First, a recall-centric search scours a space of candidate partial paradigms for those which possibly model suffixes of true paradigms. Second, ParaMor merges selected candidates that appear to model portions
STRUCTURES AND DISTRIBUTIONS IN MORPHOLOGY LEARNING
, 2008
Cited by 14 (5 self)
One of the great challenges in linguistics and cognitive science is to understand the nature of the mental representation of language. The precise mechanisms of the mind are unknown, but can be modeled through observation and experimentation. By viewing the mind as a computational device that receives input (primary linguistic data) and produces output (the development of grammatical speech) during language acquisition, one can reason about what representations and algorithms must be internal to the learner. In this thesis, I investigate the acquisition of morphology. The principal challenges are how to learn a theory in the presence of sparse data, and in a manner that can provide explanations for the developmental processes in child language acquisition. The main idea underlying this work is that a consideration of the different aspects of language acquisition places strong constraints on cognitively plausible representations and algorithms that are internal to the learner. To develop a model of morphology acquisition, I pursue three lines of work: First, I formulate a cognitively-oriented computational framework for studying language acquisition that consists of four components: the linguistic representation, the
Improving Morphology Induction by Learning Spelling Rules
Cited by 7 (0 self)
Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take+ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules. We show that the addition of spelling rules improves performance over the baseline morphology-only model.
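The difference between the segmentation-only analysis (tak.ing) and the stem-plus-rule analysis (take+ing) can be illustrated with a toy sketch. This is not the paper's Bayesian model: the single hand-written e-deletion rule, the vowel set, and the function names `attach_suffix` and `candidate_analyses` are assumptions introduced here purely for illustration.

```python
VOWELS = "aeiou"  # assumption: a simple orthographic vowel set


def attach_suffix(stem: str, suffix: str) -> str:
    """Join stem and suffix, deleting a stem-final 'e' before a vowel-initial suffix."""
    if stem.endswith("e") and suffix and suffix[0] in VOWELS:
        return stem[:-1] + suffix  # e-deletion: take + ing -> taking
    return stem + suffix


def candidate_analyses(word: str, suffixes: list[str]) -> list[tuple[str, str, str]]:
    """Enumerate (stem, suffix, rule) analyses of a word for a given suffix list."""
    analyses = []
    for suffix in suffixes:
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            base = word[: -len(suffix)]
            # Plain segmentation: the stem is exactly the remaining substring.
            analyses.append((base, suffix, "none"))
            # Undo e-deletion: hypothesize that the stem originally ended in 'e'.
            if suffix[0] in VOWELS:
                analyses.append((base + "e", suffix, "e-deletion"))
    return analyses
```

For the word taking and the suffix list `["ing"]`, the sketch proposes both tak+ing (no rule) and take+ing (e-deletion). Choosing between such candidates is the job of the probabilistic model, which this sketch deliberately leaves out.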
Learning probabilistic paradigms for morphology in a latent class model
- In Proceedings of the Eighth Meeting of the ACL Special Interest Group on Computational Phonology and Morphology at HLT-NAACL 2006
, 2006
Cited by 3 (0 self)
This paper introduces the probabilistic paradigm, a probabilistic, declarative model of morphological structure. We describe an algorithm that recursively applies Latent Dirichlet Allocation with an orthogonality constraint to discover morphological paradigms as the latent classes within a suffix-stem matrix. We apply the algorithm to data preprocessed in several different ways, and show that when suffixes are distinguished for part of speech and allomorphs or gender/conjugational variants are merged, the model is able to correctly learn morphological paradigms
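As a rough picture of the suffix-stem matrix the abstract mentions, the sketch below counts which stems co-occur with which suffixes in a word list. It covers only this data structure, not the recursive Latent Dirichlet Allocation step or the orthogonality constraint. The fixed candidate suffix list and the name `suffix_stem_matrix` are assumptions made for illustration, and the paper's actual preprocessing may differ.

```python
from collections import defaultdict


def suffix_stem_matrix(words, suffixes):
    """Return a nested mapping {stem: {suffix: count}} built from a word list.

    `suffixes` is a hypothetical list of candidate suffixes; the empty string
    stands for the null suffix, so every word is also counted as a bare stem.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for word in words:
        for suffix in suffixes:
            if suffix == "":
                matrix[word][""] += 1
            elif word.endswith(suffix) and len(word) > len(suffix):
                stem = word[: -len(suffix)]
                matrix[stem][suffix] += 1
    return matrix


words = ["walk", "walks", "walked", "talk", "talks"]
matrix = suffix_stem_matrix(words, ["", "s", "ed"])
# Row for the stem "walk": seen once each with the null suffix, "s", and "ed".
walk_row = dict(matrix["walk"])
```

Stems that share the same set of non-zero columns (here walk and talk both take the null suffix and s) are the kind of pattern a paradigm learner would try to group into a latent class.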
Applying Collocation Segmentation to the ACL Anthology Reference Corpus
Collocation is a well-known linguistic phenomenon which has a long history of research and use. In this study I employ collocation segmentation to extract terms from the large and complex ACL Anthology Reference Corpus, and also briefly research and describe the history of the ACL. The results of the study show that until 1986, the most significant terms were related to formal/rule-based methods. Starting in 1987, terms related to statistical methods became more important, for instance language model, similarity measure, and text classification. In 1990, the terms Penn Treebank, Mutual Information, statistical parsing, bilingual corpus, and dependency tree became the most important, showing that newly released language resources appeared together with many new research areas in computational linguistics. Although Penn Treebank was a significant term only temporarily in the early nineties, the corpus is still used by researchers today. The most recent significant terms are Bleu score and semantic role labeling. While machine translation as a term is significant throughout the ACL ARC corpus, it is not significant for any particular time period. This shows that some terms can be significant globally while remaining insignificant at a local level.
Draft Version Towards Interactive and Automatic Refinement of Translation Rules
, 2004
Although Machine Translation (MT) has advanced recently for language pairs with large amounts of parallel data, translation quality has not yet reached satisfactory levels, especially for resource-poor languages with little if any parallel text to train statistical or example-based MT systems. Rule-based transfer MT systems are the only feasible solution for resource-poor scenarios. However, it can prove very costly and time consuming to refine and extend translation rule sets manually by trained computational linguists with knowledge of both languages. If the translation rules are written manually, no matter how many rules there are, coverage and accuracy can always be increased. If they are automatically learned, they might be either too general or too specific. Either way, in the face of unseen examples, the translation rules will need to be refined to account for new data. Thus, the goal of this thesis is to generalize post-edition efforts in an effective way, by identifying and correcting
Using Resource-Rich Languages to Improve Morphological Analysis of Under-Resourced Languages
The world-wide proliferation of digital communications has created the need for language and speech processing systems for under-resourced languages. Developing such systems is challenging if only small data sets are available, and the problem is exacerbated for languages with highly productive morphology. However, many under-resourced languages are spoken in multi-lingual environments together with at least one resource-rich language and thus have numerous borrowings from resource-rich languages. Based on this insight, we argue that readily available resources from resource-rich languages can be used to bootstrap the morphological analyses of under-resourced languages with complex and productive morphological systems. In a case study of two such languages, Tagalog and Zulu, we show that an easily obtainable English wordlist can be deployed to seed a morphological analysis algorithm from a small training set of conversational transcripts. Our method achieves a precision of 100% and identifies 28 and 66 of the most productive affixes in Tagalog and Zulu, respectively. Keywords: morphology, language contact, code switching
CENTER FOR THE STUDY OF LANGUAGE AND INFORMATION
Font-Llitjos et al., 2005) is a machine translation system that automatically learns translation rules between two languages. In the Avenue scenario, one of the languages is a resource-rich language like English