A Stochastic Finite-State Word-Segmentation Algorithm For Chinese
- Computational Linguistics, 1996
- Cited by 99 (9 self)
Chinese text into dictionary entries and productively derived words, and providing pronunciations for these words; the method incorporates a class-based model in its treatment of personal names. We also evaluate the system's performance, taking into account the fact that people often do not agree on a single segmentation.
An Unsupervised Iterative Method for Chinese New Lexicon Extraction
- International Journal of Computational Linguistics & Chinese Language Processing, 1997
- Cited by 32 (3 self)
An unsupervised iterative approach for extracting a new lexicon (or unknown words) from a Chinese text corpus is proposed in this paper. Instead of using a non-iterative segmentation-merging-filtering-and-disambiguation approach, the proposed method iteratively integrates the contextual constraints (among word candidates) and a joint character association metric to progressively improve the segmentation results of the input corpus (and thus the new word list). An augmented dictionary, which includes potential unknown words (in addition to known words), is used to segment the input corpus, unlike traditional approaches, which use only known words for segmentation. In the segmentation process, the augmented dictionary is used to impose contextual constraints over known words and potential unknown words within input sentences; an unsupervised Viterbi Training process is then applied to ensure that the selected potential unknown words (and known words) maximize the likelihood of the input ...
Statistical models for word segmentation and unknown resolution
- In Proceedings of ROCLING-92, 1992
- Cited by 21 (3 self)
In a Chinese sentence, there are no word delimiters, like blanks, between the “words”. Therefore, it is important to identify the word boundaries before processing Chinese text. Traditional approaches tend to use dictionary lookup, morphological rules and heuristics to identify the word boundaries. Such approaches may not scale to a large system because of the complicated linguistic phenomena involved in Chinese morphology and syntax. In this paper, the various available features in a sentence are used to construct a generalized word segmentation model; the various probabilistic models for word segmentation are then derived based on the generalized model. In general, the likelihood measure adopted in a probabilistic model does not provide a scoring mechanism that directly indicates the real ranks of the various candidate segmentation patterns. To enhance the baseline models, a robust adaptive learning algorithm is proposed to adjust the parameters of the baseline models so as to increase the discrimination power and robustness of the models. The simulation shows that cost-effective word segmentation could be achieved under various contexts with the proposed models. It is possible to achieve a word recognition rate of 99.39% and a sentence recognition rate of 97.65% on the testing corpus by incorporating word length information into a context-independent word model and applying a robust adaptive learning algorithm in the segmentation process. Since not all lexical items can be found in the system dictionary in real applications, the performance of most word segmentation methods in the literature may degrade significantly when unknown words are encountered. Such an “unknown word problem” is also examined in this paper. An error recovery mechanism based on the segmentation model is proposed. Preliminary experiments show that the error rates introduced by unknown words could be reduced significantly.
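To make the idea of a context-independent word model concrete, the sketch below picks the segmentation that maximizes the product of independent word probabilities using dynamic programming. It is only an illustration under stated assumptions: the function name `best_segmentation`, the `max_len` cutoff, and the toy probabilities are hypothetical, and the word-length feature and adaptive learning step described in the abstract are not modeled.

```python
import math

def best_segmentation(text, word_prob, max_len=4):
    """Return the most probable segmentation of text under a word-unigram model.

    word_prob maps each known word to an (assumed) probability; words are
    treated as independent of their context. Returns None if the text cannot
    be covered by known words.
    """
    n = len(text)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last word in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob.get(text[start:end])
            if p is None or best[start] == neg_inf:
                continue
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    if best[n] == neg_inf:
        return None
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]

# Toy probabilities, invented for illustration only.
toy_prob = {"研究": 0.3, "研究生": 0.1, "生命": 0.2, "生": 0.05, "命": 0.05, "起源": 0.3}
segmented = best_segmentation("研究生命起源", toy_prob)
```

With these toy values, the path 研究 / 生命 / 起源 outscores the competing path that starts with 研究生, which shows how the likelihood ranks candidate segmentations.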
Chinese Word Segmentation as LMR Tagging
- In Proc. of SIGHAN Workshop, 2003
- Cited by 15 (0 self)
In this paper we present Chinese word segmentation algorithms based on the so-called LMR tagging. Our LMR taggers are implemented with the Maximum Entropy Markov Model and we then use Transformation-Based Learning to combine the results of the two LMR taggers that scan the input in opposite directions. Our system
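The reduction of segmentation to per-character tagging can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tag names (`L` for the left edge of a word, `M` for a middle character, `R` for the right edge, and `S` for a single-character word) and both function names are assumptions, since the abstract does not list the tag set. The Maximum Entropy Markov Model tagger and the Transformation-Based Learning combiner are not shown.

```python
def words_to_tags(words):
    """Convert a segmented sentence into one position tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # assumed tag for a single-character word
        else:
            tags.extend(["L"] + ["M"] * (len(word) - 2) + ["R"])
    return tags

def tags_to_words(chars, tags):
    """Rebuild the words from characters and their position tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("R", "S"):  # these tags close a word
            words.append(current)
            current = ""
    if current:                # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words

example_words = ["上海", "计划", "到", "本世纪"]
example_tags = words_to_tags(example_words)
round_trip = tags_to_words("".join(example_words), example_tags)
```

Because the mapping is reversible, any sequence tagger that predicts these labels yields a segmentation directly.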
Critical Tokenization and its Properties
- Computational Linguistics, 1997
- Cited by 13 (1 self)
This paper sets out to study critical tokenization, a distinctive type of tokenization following the principle of maximum tokenization. The objective in this paper is to develop its mathematical description and understanding
Recognizing Unregistered Names for Mandarin Word Identification
- Proc. of COLING92, 1992
- Cited by 9 (0 self)
Word Identification has been an important and active issue in Chinese Natural Language Processing. In this paper, a new mechanism, based on the concept of sublanguage, is proposed for identifying unknown words, especially personal names, in Chinese newspapers. The proposed mechanism includes title-driven name recognition, adaptive dynamic word formation, identification of 2-character and 3-character Chinese names without title. We will show the experimental results for two corpora and compare them with the results by the NTHU's statistic-based system, the only system that we know has attacked the same problem. The experimental results have shown significant improvements over the WI systems without the name identification capability.
Combination and boundary detection approaches on Chinese indexing
- Journal of the American Society for Information Science, 2000
- Cited by 7 (4 self)
Digital libraries store materials in electronic format. Research and development in digital libraries includes content creation, conversion, indexing, organization, and dissemination. The key technological issues are how to search and display desired selections from and across large collections effectively [Schatz & Chen, 1996]. Digital library research projects (DLI-1) sponsored by NSF/DARPA/NASA have a common theme of bringing search to the net, which is the flagship research effort for the National Information Infrastructure (NII) in the United States. A repository is an indexed collection of objects. Indexing is an important task for searching. The better the indexing, the better the searching result. Developing a universal digital library has been the dream of many researchers; however, there are still many problems to
Chinese Word Segmentation without Using Lexicon and Hand-crafted Training Data
- In Proceedings of ACL, 1998
- Cited by 7 (0 self)
Chinese word segmentation is the first step in any Chinese NLP system. This paper presents a new algorithm for segmenting Chinese texts without making use of any lexicon or hand-crafted linguistic resource. The statistical data required by the algorithm, that is, mutual information and the difference of t-score between characters, is derived automatically from raw Chinese corpora. The preliminary experiment shows that the segmentation accuracy of our algorithm is acceptable. We hope the gains from this approach will help improve the performance (especially the ability to cope with unknown words and to adapt to various domains) of existing segmenters, though the algorithm itself can also be used as a stand-alone segmenter in some NLP applications.
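A rough sketch of lexicon-free segmentation from raw-corpus statistics is given below. It uses only pointwise mutual information between adjacent characters and a fixed threshold, both of which are simplifying assumptions: the difference of t-score mentioned in the abstract is omitted, and the names `pmi_table` and `segment`, the threshold value, and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def pmi_table(corpus):
    """Estimate pointwise mutual information for adjacent character pairs."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

def segment(text, table, threshold=1.0):
    """Join adjacent characters whose PMI reaches the threshold; split otherwise."""
    if not text:
        return []
    words = [text[0]]
    for a, b in zip(text, text[1:]):
        # Unseen pairs get -inf, so they always produce a boundary.
        if table.get((a, b), float("-inf")) >= threshold:
            words[-1] += b
        else:
            words.append(b)
    return words

# Toy corpus of Latin letters standing in for characters, for illustration only.
toy_table = pmi_table(["abab", "cdcd", "abcd"])
toy_words = segment("abcd", toy_table, threshold=1.0)
```

In the toy corpus the pairs `ab` and `cd` co-occur often and `bc` rarely, so the boundary falls between `b` and `c`.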
Integrating Word Boundary Identification With Sentence Understanding
- Ohio State University, 1995
- Cited by 3 (1 self)
such as space to indicate word boundaries. Existing Chinese NLP systems therefore employ preprocessors to segment sentences into words. Contrary to the conventional wisdom of separating this issue from the task of sentence understanding, we propose an integrated model that performs word boundary identification in lockstep with sentence understanding. In this approach, there is no distinction between rules for word boundary identification and rules for sentence understanding. These two functions are combined. Word boundary ambiguities, especially the fallacious ones, are detected when they block the primary task of discovering the inter-relationships among the various constituents of a sentence, which is the essence of the understanding process. In this approach, statistical information is also incorporated, giving the system a quick and fairly reliable starting point for the primary task of relationship-building.

