Beesley, K. R. Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association.
....in section 5, a summary and future work are presented. 2 Background and Related Work. 2.1 The problem of language distinction. Many attempts have been made towards building an effective language identification tool. Methods that have been proposed earlier include those using n-grams [2], trigrams [3], frequent short words [4], diacritics and special characters [5], and syllable characteristics [6]. These methods, in general, achieve very good results, so much so that the author of [7] has deemed written language identification a solved problem, with the n-grams ....
Beesley, K. R. Language Identifier: A Computer Program for Automatic Natural-Language Identification of Online Text. In Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association.
....modules in detail. Figure 2. An example of the result page. 4 Language Identification. Automatic language identification has been discussed in the field of document processing. Several statistical models have been tried, including using the n-gram of characters [6], diacritics and special characters [9], and using the word unigram with heuristics [10]. Among these methods, the result of n-gram statistics [6] shows the best accuracy level, over 95%. These methods, however, are not sufficient for documents on the WWW. First, they presuppose that the input document is correctly decoded, which is ....
Beesley, Kenneth R., "Language Identifier: a Computer Program for Automatic Natural Language Identification of On-line Text", Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, October 1988, pp 47-54.
....domain where a lot of other coding systems are potential candidates. Automatic language identification has been discussed in the field of document processing. Several statistical models have been tried, including using the n-gram of characters (Cavnar, 1994), diacritics and special characters (Beesley, 1988), and using the word unigram with heuristics (Henrich, 1989). Among these methods, the result by (Cavnar, 1994) shows the best accuracy, over 95%. (Giguet, 1995) achieved over 99% accuracy by using a rule-based (i.e., non-statistical) method. These methods, however, cannot handle East Asian ....
Beesley, Kenneth R., "Language Identifier: a Computer Program for Automatic Natural Language Identification of On-line Text", Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, October 1988, pp 47-54.
.... Kulikowski, 1991; Batchelder, 1992; Souter et al., 1994), the presence of particular character n-grams (Henrich, 1989; Ziegler, 1991; Souter et al., 1994), and particularly shaped words from images (Nakayama and Spitz, 1993; Sibun and Spitz, 1994). The frequency of character n-grams was used by Beesley (1988), Henrich (1989), Cavnar and Trenkle (1994), Dunning (1994), Souter et al. (1994), and Damashek (1995). A number of analytic techniques have been employed, ranging from completely manual (Ingle, 1976; Newman, 1987) to semiautomatic (Kulikowski, 1991) to fully automatic. Batchelder (1992) ....
....to fully automatic. Batchelder (1992) trained a neural network to distinguish languages. Both Henrich (1989) and Ziegler (1991) incorporated a diversity of knowledge into expert systems. Mustonen (1965), Nakayama and Spitz (1993), and Sibun and Spitz (1994) employed forms of discriminant analysis. Beesley (1988) used language-modeling techniques originally developed for cryptanalysis. Markov models were used by Dunning (1994). One of the methods developed by Souter and his colleagues (1994) tested for the presence of unique character sequences. Henrich (1989), Cavnar and Trenkle (1994), and Souter et ....
[Article contains additional citation context not shown here]
Beesley, Kenneth R. "Language Identifier: A Computer Program for Automatic Natural-Language Identification of On-line Text." In Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, 12-16 Oct 1988, pp. 47-54.
.... words (Kulikowski, 1991; Ingle, 1991), the independent probability of letters and the joint probability of various letter combinations (Rau, 1974, who used English and Spanish text to devise an identification system for the two languages), n-grams of words (Batchelder, 1992), n-grams of characters (Beesley, 1988; Cavnar and Trenkle, 1994), diacritics and special characters (Newman, 1987), syllable characteristics (Mustonen, 1965), morphology and syntax (Ziegler, 1991). More specifically, Henrich (1989) evaluated two language ID approaches (one using statistics of letter combinations and the other using ....
Beesley, K. R. (1988). Language identifier: A computer program for automatic natural-language identification of on-line text. In Proceedings of the 29th Annual Conference of the American Translators Association, pages 47-54.