Results 1 - 10
of
19
Improving Morphology Induction by Learning Spelling Rules
"... Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately an ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take+ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules. We show that the addition of spelling rules improves performance over the baseline morphology-only model. 1
Hybrid Stemmer for Gujarati
- Proceedings of the 1 st Workshop on South and Southeast Asian Natural Languages Processing (WSSANLP), the 23 rd International Conference on Computational Linguistics (COLING), Beijing
, 2010
"... In this paper we present a lightweight stemmer for Gujarati using a hybrid approach. Instead of using a completely unsupervised approach, we have harnessed linguistic knowledge in the form of a hand-crafted Gujarati suffix list in order to improve the quality of the stems and suffixes learnt during ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
In this paper we present a lightweight stemmer for Gujarati using a hybrid approach. Instead of using a completely unsupervised approach, we have harnessed linguistic knowledge in the form of a hand-crafted Gujarati suffix list in order to improve the quality of the stems and suffixes learnt during the training phase. We used the EMILLE corpus for training and evaluating the stemmer’s performance. The use of hand-crafted suffixes boosted the accuracy of our stemmer by about 17 % and helped us achieve an accuracy of 67.86 %. 1
Discovering suffixes: A Case Study for Marathi Language
"... Abstract — Suffix stripping is a pre-processing step required in a number of natural language processing applications. Stemmer is a tool used to perform this step. This paper presents and evaluates a rule-based and an unsupervised Marathi stemmer. The rule-based stemmer uses a set of manually extrac ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
Abstract — Suffix stripping is a pre-processing step required in a number of natural language processing applications. Stemmer is a tool used to perform this step. This paper presents and evaluates a rule-based and an unsupervised Marathi stemmer. The rule-based stemmer uses a set of manually extracted suffix stripping rules whereas the unsupervised approach learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of both the stemmers has been compared on a test dataset consisting of 1500 manually stemmed word. Keywords-component; Marathi morphology, Marathi stemmer, Unsupervised stemmer, Rule-based stemmer, Natural language processing
Punjabi Language Stemmer for nouns and proper names
"... This paper concentrates on Punjabi language noun and proper name stemming. The purpose of stemming is to obtain the stem or radix of those words which are not found in dictionary. If stemmed word is present in dictionary, then that is a genuine word, otherwise it may be proper name or some invalid w ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
This paper concentrates on Punjabi language noun and proper name stemming. The purpose of stemming is to obtain the stem or radix of those words which are not found in dictionary. If stemmed word is present in dictionary, then that is a genuine word, otherwise it may be proper name or some invalid word. In Punjabi language stemming for nouns and proper names, an attempt is made to obtain stem or radix of a Punjabi word and then stem or radix is checked against Punjabi noun and proper name dictionary. An in depth analysis of Punjabi news corpus was made and various possible noun suffixes were identified like ੀ ਆਂ īāṃ, ਿੀਆ ਂ iāṃ, ੀ ਆ ਂ ūāṃ, ੀ ੀ ਂ āṃ, ੀ ਏ īē etc. and the various rules for noun and proper name stemming have been generated. Punjabi language stemmer for nouns and proper names is applied for Punjabi Text Summarization. The efficiency of Punjabi language noun and Proper name stemmer is 87.37%. 1
Morpheme Segmentation for Kannada Standing on the Shoulder of Giants
"... This paper studies the applicability of a set of state-of-the-art unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in Kannada, a resource-poor language with highly inflectional and agglutinative morphology. The choice of the algorithms for the experim ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
This paper studies the applicability of a set of state-of-the-art unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in Kannada, a resource-poor language with highly inflectional and agglutinative morphology. The choice of the algorithms for the experiment is based in part on their performance with highly inflected languages such as Finnish and Bengali (complex morphology similar to that of Kannada). When trained on a corpus of about 990K words, the best performing algorithm had an F-measure of 73 % on a test set. The performance was better on a set of inflected nouns than on a set of inflected verbs. Key advantages of the algorithms conducive to efficient morphological analysis of Kannada were identified. An important by-product of this study is an empirical analysis of some aspects of vocabulary growth in Kannada based on the word frequency distribution of the words in the reference corpus.
Optimal Stem Identification in Presence of Suffix List
"... Abstract Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming of words using suffixes of a language as linguistic information is in comparison an easier problem. In this work we considered stemming as a ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming of words using suffixes of a language as linguistic information is in comparison an easier problem. In this work we considered stemming as a process of obtaining minimum number of lexicon from an unannotated corpus by using a suffix set. We proved that the exact lexicon reduction problem is NP-hard and came up with a polynomial time approximation. One probabilistic model that minimizes the stem distributional entropy is also proposed for stemming. Performances of these models are analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family. 1
Dcu-lingo24 participation in wmt 2014 hindi-english translation task
- In Proceedings of the Ninth Workshop on Statistical Machine Translation
, 2014
"... This paper describes the DCU-Lingo24 submission to WMT 2014 for the Hindi-English translation task. We exploit miscellaneous methods in our system, ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
This paper describes the DCU-Lingo24 submission to WMT 2014 for the Hindi-English translation task. We exploit miscellaneous methods in our system,
A Survey of Common Stemming Techniques and Existing Stemmers for Indian Languages
"... Abstract—Stemming is an operation that relates morphological variants of a word. The purpose of stemming is to obtain the stem or radix of those words which are not found in dictionary. If stemmed word is present in dictionary, then that is a genuine word, otherwise it may be proper name or some inv ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract—Stemming is an operation that relates morphological variants of a word. The purpose of stemming is to obtain the stem or radix of those words which are not found in dictionary. If stemmed word is present in dictionary, then that is a genuine word, otherwise it may be proper name or some invalid word. Stemming is the process for reducing inflected or sometimes derived words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word, it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming is used in Information Retrieval systems to improve performance. The design of stemmers is language specific, and requires some to significant linguistic expertise in the language, as well as the understanding of the needs for a spelling checker for that language. A stemmer’s performance and effectiveness in applications such as spelling checker vary across languages. A typical simple stemmer algorithm involves removing suffixes using a list of frequent suffixes, while a more complex one would use morphological knowledge to derive a stem from the words. In this paper a survey of common stemming techniques and existing stemmers for Indian languages have been presented.
Statistical Stemming for Kannada
"... Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compa ..."
Abstract
- Add to MetaCart
(Show Context)
Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming using simple truncation, clustering and an unsupervised morpheme segmentation algorithm on a sample from a text collection. We observe that a distance measure that rewards longest prefix matches is the best performing clustering-based stemmer. However, using a reasonably performing unsupervised morpheme segmentation seems to outperform the other stemming schemes considered. 1