Contextual dependencies in unsupervised word segmentation
- In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics
, 2006
Cited by 43 (12 self)
Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic models), demonstrating the importance of such dependencies for word segmentation. We also show that previous probabilistic models rely crucially on suboptimal search procedures.
Mostly-Unsupervised Statistical Segmentation of Japanese: Applications to Kanji
, 2000
Cited by 32 (1 self)
Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and grammar or on pre-segmented data. In contrast, we introduce a novel statistical method utilizing unsegmented training data, with performance on kanji sequences comparable to and sometimes surpassing that of morphological analyzers over a variety of error metrics.
Chinese word segmentation and named entity recognition: a pragmatic approach
- Computational Linguistics
, 2005
Cited by 14 (1 self)
This paper presents a pragmatic approach to Chinese word segmentation. It differs from most previous approaches mainly in three respects. First, while theoretical linguists have defined Chinese words with various linguistic criteria, Chinese words in this study are defined pragmatically as segmentation units whose definition depends on how they are used and processed in realistic computer applications. Second, we propose a pragmatic mathematical framework in which segmenting known words and detecting unknown words of different types (i.e. morphologically derived words, factoids, named entities, and other unlisted words) can be performed simultaneously in a unified way. These tasks are usually conducted separately in other systems. Finally, we do not assume the existence of a universal, application-independent word segmentation standard. Instead, we argue for the necessity of multiple segmentation standards due to the pragmatic fact that different NLP applications might require different granularities of Chinese words. These pragmatic approaches have been implemented in an adaptive Chinese word segmenter, called MSRSeg (access
Do we need Chinese word segmentation for statistical machine translation?
- In Proceedings of the Third SIGHAN Workshop on Chinese Language Learning
, 2004
Cited by 12 (2 self)
In Chinese texts, words are not separated by white spaces. This is problematic for many natural language processing tasks. The standard approach is to segment the Chinese character sequence into words. Here, we investigate Chinese word segmentation for statistical machine translation. We pursue two goals: the first is the maximization of the final translation quality; the second is the minimization of the manual effort for building a translation system. The commonly used method for getting the word boundaries is based on a word segmentation tool and a predefined monolingual dictionary. To avoid the dependence of the translation system on an external dictionary, we have developed a system that learns a domain-specific dictionary from the parallel training corpus. This method produces results that are comparable with the predefined dictionary. Furthermore, our translation system is able to work without word segmentation with only a minor loss in translation quality.
A Bottom-up Merging Algorithm for Chinese Unknown Word Extraction
, 2003
Cited by 8 (1 self)
Statistical methods for extracting Chinese unknown words usually suffer from the problem that superfluous character strings with strong statistical associations are extracted as well. To solve this problem, this paper proposes a set of general morphological rules to broaden coverage; the rules are augmented with linguistic and statistical constraints to increase the precision of the representation. To disambiguate rule applications and reduce the complexity of rule matching, a bottom-up merging algorithm for extraction is proposed, which merges possible morphemes recursively by consulting the general rules and dynamically decides which rule should be applied first according to the rules' priorities. The effects of different priority strategies are compared experimentally, and the results show that the performance of the proposed method is very promising.
Unsupervised Segmentation of Chinese Corpus Using Accessor Variety (Extended Abstract)
Cited by 6 (1 self)
Haodi Feng (City University of Hong Kong, fenghaodi@hotmail.com); Kang Chen ([...] and Technology, TsingHua University, Beijing, PRC); Chunyu Kit (Department of Chinese, Translation and Linguistics); Xiaotie Deng (City University of Hong Kong)
Chinese texts are different from English texts in that they have no spaces to mark the boundaries of words. This makes segmentation a special issue in Chinese text processing. Since the amount of Chinese text is growing rapidly, especially with the fast growth of the Internet, the number of Chinese words is also increasing fast. Segmentation methods that depend on an existing dictionary thus have an obvious defect when they are used to segment texts that may contain words unknown to the dictionary.
Unsupervised statistical segmentation of Japanese Kanji strings
- Journal of Natural Language Engineering
, 1999
Cited by 5 (1 self)
Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of both standard and novel error metrics.
A method for word segmentation in Vietnamese
- In: Proceedings of the Corpus Linguistics 2003 Conference
, 2003
Cited by 3 (0 self)
Word segmentation is the very first step in natural language processing for languages such as Vietnamese. Given that un-annotated corpora are the only widely available resources, we propose a method of word segmentation for Vietnamese that uses only n-gram information. We calculate the probabilities of different combinations of n-grams in a chunk and choose the one with the maximum probability. To calculate these probabilities, we built a 10M-word corpus from two years of newspaper articles. The results, while not very impressive, show that the method works.
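The idea of scoring every way of splitting a chunk into n-grams and keeping the most probable split can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `best_segmentation`, the `max_n` and `unseen` parameters, the toy probabilities, and the use of dynamic programming (instead of enumerating all combinations explicitly) are all assumptions made for illustration. Diacritics are omitted from the toy syllables to keep the example ASCII-only.

```python
import math


def best_segmentation(syllables, ngram_prob, max_n=3, unseen=1e-8):
    """Return (log_prob, words) for the most probable split of `syllables`.

    ngram_prob maps a space-joined syllable n-gram to a probability.
    n-grams missing from the table get the small `unseen` probability
    (a hypothetical smoothing choice, not taken from the paper).
    """
    n = len(syllables)
    # best[i] holds the best (log-probability, words) for syllables[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        # Try every candidate word that ends at `end` and has at most max_n syllables.
        for start in range(max(0, end - max_n), end):
            word = " ".join(syllables[start:end])
            score = best[start][0] + math.log(ngram_prob.get(word, unseen))
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n]


# Toy probabilities, invented for illustration only.
toy_probs = {
    "hoc sinh": 0.02,
    "hoc": 0.01,
    "sinh": 0.005,
    "sinh hoc": 0.004,
    "gioi": 0.01,
}

log_prob, words = best_segmentation(["hoc", "sinh", "gioi"], toy_probs)
print(words)
```

With these toy numbers the two-word split wins because the product of its n-gram probabilities is larger than that of any other split. In practice the probabilities would be estimated from n-gram counts in a corpus such as the one the abstract describes.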

