Unsupervised morphological parsing of Bengali. Language Resources and Evaluation (2006)

by S Dasgupta, V Ng

Results 1 - 10 of 19

Improving Morphology Induction by Learning Spelling Rules

by Jason Naradowsky
Cited by 7 (0 self)
Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take+ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules. We show that the addition of spelling rules improves performance over the baseline morphology-only model.

Hybrid Stemmer for Gujarati

by Pratikkumar Patel, Kashyap Popat, Pushpak Bhattacharyya - Proceedings of the 1st Workshop on South and Southeast Asian Natural Languages Processing (WSSANLP), the 23rd International Conference on Computational Linguistics (COLING), Beijing, 2010
Cited by 4 (1 self)
In this paper we present a lightweight stemmer for Gujarati using a hybrid approach. Instead of using a completely unsupervised approach, we have harnessed linguistic knowledge in the form of a hand-crafted Gujarati suffix list in order to improve the quality of the stems and suffixes learnt during the training phase. We used the EMILLE corpus for training and evaluating the stemmer’s performance. The use of hand-crafted suffixes boosted the accuracy of our stemmer by about 17% and helped us achieve an accuracy of 67.86%.

Discovering suffixes: A Case Study for Marathi Language

by Mudassar M. Majgaonker
Cited by 3 (0 self)
Suffix stripping is a pre-processing step required in a number of natural language processing applications. A stemmer is a tool used to perform this step. This paper presents and evaluates a rule-based and an unsupervised Marathi stemmer. The rule-based stemmer uses a set of manually extracted suffix-stripping rules, whereas the unsupervised approach learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of both stemmers has been compared on a test dataset consisting of 1500 manually stemmed words. Keywords: Marathi morphology, Marathi stemmer, unsupervised stemmer, rule-based stemmer, natural language processing

Citation Context

...flectional suffixes. In [13] a statistical Hindi stemmer was developed and used for evaluating the performance of the Hindi information retrieval system. Similar work has been done by Dasgupta and Ng [14] for Bengali morphological analyzer. In [5] an unsupervised Hindi stemmer has been discussed. An approach based on “observable paradigms” for Hindi morphological analyzer is proposed in [15]. A rule b...

Punjabi Language Stemmer for nouns and proper names

by Vishal Gupta, Gurpreet Singh Lehal
Cited by 3 (3 self)
This paper concentrates on Punjabi language noun and proper name stemming. The purpose of stemming is to obtain the stem or radix of those words which are not found in the dictionary. If the stemmed word is present in the dictionary, then it is a genuine word; otherwise it may be a proper name or some invalid word. In Punjabi language stemming for nouns and proper names, an attempt is made to obtain the stem or radix of a Punjabi word, and then the stem or radix is checked against a Punjabi noun and proper name dictionary. An in-depth analysis of a Punjabi news corpus was made, and various possible noun suffixes were identified, such as ੀਆਂ (īāṃ), ਿਆਂ (iāṃ), ੂਆਂ (ūāṃ), ਾਂ (āṃ), ੀਏ (īē), etc., and various rules for noun and proper name stemming were generated. The Punjabi language stemmer for nouns and proper names is applied to Punjabi text summarization. The efficiency of the Punjabi language noun and proper name stemmer is 87.37%.

Morpheme Segmentation for Kannada Standing on the Shoulder of Giants

by Suma Bhat
Cited by 1 (0 self)
This paper studies the applicability of a set of state-of-the-art unsupervised morphological segmentation algorithms for the problem of morpheme boundary detection in Kannada, a resource-poor language with highly inflectional and agglutinative morphology. The choice of the algorithms for the experiment is based in part on their performance with highly inflected languages such as Finnish and Bengali (complex morphology similar to that of Kannada). When trained on a corpus of about 990K words, the best performing algorithm had an F-measure of 73% on a test set. The performance was better on a set of inflected nouns than on a set of inflected verbs. Key advantages of the algorithms conducive to efficient morphological analysis of Kannada were identified. An important by-product of this study is an empirical analysis of some aspects of vocabulary growth in Kannada based on the word frequency distribution of the words in the reference corpus.

Citation Context

...ith’s method of unsupervised learning of morphology (Goldsmith, 2001), 2. Morfessor Categories-MAP (Creutz and Lagus, 2007), and, 3. High-Performance, Language-Independent Morphological Segmentation (Dasgupta and Ng, 2006, 2007). 3.1 Linguistica Goldsmith’s method of unsupervised learning of morphology (popularly known by the name of the tool, Linguistica, that implements this technique) is centered around the idea o...

Optimal Stem Identification in Presence of Suffix List

by Vasudevan N, Pushpak Bhattacharyya
Cited by 1 (1 self)
Stemming is considered crucial in many NLP and IR applications. In the absence of any linguistic information, stemming is a challenging task. Stemming of words using the suffixes of a language as linguistic information is, in comparison, an easier problem. In this work we considered stemming as a process of obtaining the minimum number of lexicon entries from an unannotated corpus by using a suffix set. We proved that the exact lexicon reduction problem is NP-hard and came up with a polynomial-time approximation. A probabilistic model that minimizes the stem distributional entropy is also proposed for stemming. The performance of these models is analyzed using an unannotated corpus and a suffix set of Malayalam, a morphologically rich language of India belonging to the Dravidian family.

Citation Context

...ntification model [9,10] ParaMor system for paradigm learning [11] are also relevant works in the same area. Full morpheme segmentation and automatic induction of orthographic rules by Sajib Dasgupta [12,13] is also a relevant work. We never found any model which use the information from suffix list. To the best of our knowledge, this is the first attempt for stemming in presence of suffix list. We found...

DCU-Lingo24 Participation in WMT 2014 Hindi-English Translation Task

by Xiaofeng Wu, Rejwanul Haque, Tsuyoshi Okita, Piyush Arora, Andy Way, Qun Liu - In Proceedings of the Ninth Workshop on Statistical Machine Translation, 2014
Cited by 1 (0 self)
This paper describes the DCU-Lingo24 submission to WMT 2014 for the Hindi-English translation task. We exploit miscellaneous methods in our system,

A Survey of Common Stemming Techniques and Existing Stemmers for Indian Languages

by Vishal Gupta, Gurpreet Singh Lehal
Cited by 1 (0 self)
Stemming is an operation that relates morphological variants of a word. The purpose of stemming is to obtain the stem or radix of those words which are not found in the dictionary. If the stemmed word is present in the dictionary, then it is a genuine word; otherwise it may be a proper name or some invalid word. Stemming is the process of reducing inflected or sometimes derived words to their stem, base or root form, generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Stemming is used in Information Retrieval systems to improve performance. The design of stemmers is language specific, and requires some to significant linguistic expertise in the language, as well as an understanding of the needs of a spelling checker for that language. A stemmer’s performance and effectiveness in applications such as spelling checkers vary across languages. A typical simple stemmer algorithm involves removing suffixes using a list of frequent suffixes, while a more complex one would use morphological knowledge to derive a stem from the words. In this paper a survey of common stemming techniques and existing stemmers for Indian languages is presented.
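
The simple approach mentioned in the abstract above, removing suffixes using a list of frequent suffixes, can be sketched roughly as follows. This is a minimal illustration rather than any of the surveyed stemmers: the function name `strip_suffix`, the `min_stem_len` guard, the longest-match-first policy, and the small English suffix list are all assumptions made for the example.

```python
def strip_suffix(word, suffixes, min_stem_len=2):
    """Remove the longest listed suffix from word.

    The suffix list and min_stem_len are illustrative assumptions;
    a real stemmer would use a language-specific suffix list.
    """
    # Try longer suffixes first so that "es" wins over "s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if not suffix:
            continue  # ignore empty entries in the list
        stem_len = len(word) - len(suffix)
        # Only strip when the remaining stem is long enough.
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len]
    return word  # no suffix matched; return the word unchanged


# Hypothetical suffix list, for demonstration only.
SUFFIXES = ["ing", "ed", "es", "s"]

if __name__ == "__main__":
    for w in ["walking", "walked", "boxes", "is"]:
        print(w, "->", strip_suffix(w, SUFFIXES))
```

As the abstract notes, the resulting stem need not be a valid root; it is enough that related word forms map to the same string.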

Preface

by unknown authors, 2010
Proceedings of the 1st Workshop on South and southeast

Statistical Stemming for Kannada

by Suma Bhat
Stemming is a process that groups morphologically related words into the same class and is widely used in information retrieval for improving recall rate. Here we study a set of statistical stemmers for Kannada, a resource-poor language with highly inflectional and agglutinative morphology. We compare stemming using simple truncation, clustering and an unsupervised morpheme segmentation algorithm on a sample from a text collection. We observe that a distance measure that rewards longest prefix matches is the best performing clustering-based stemmer. However, using a reasonably performing unsupervised morpheme segmentation seems to outperform the other stemming schemes considered.

Citation Context

...ctiveness. In (Bhat, 2012), a preliminary study of unsupervised algorithms for morpheme segmentation in Kannada is available which observed that a statistical morpheme segmentation algorithm such as (Dasgupta and Ng, 2006) shows a reasonable performance for morpheme segmentation in Kannada. Taking the results of prior studies further, the current study is cast in the knowledge-base gained and compares stemming using a...

