The Mathematics of Statistical Machine Translation: Parameter Estimation
- COMPUTATIONAL LINGUISTICS
, 1993
Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora
, 1997
A Program for Aligning Sentences in Bilingual Corpora
, 1993
Cited by 529 (5 self)
This paper will describe a method and a program (align) for aligning sentences based on a simple statistical model of character lengths. The program uses the fact that longer sentences in one language tend to be translated into longer sentences in the other language, and that shorter sentences tend to be translated into shorter sentences. A probabilistic score is assigned to each proposed correspondence of sentences, based on the scaled difference of lengths of the two sentences (in characters) and the variance of this difference. This probabilistic score is used in a dynamic programming framework to find the maximum likelihood alignment of sentences. It is remarkable that such a simple approach works as well as it does. An evaluation was performed based on a trilingual corpus of economic reports issued by the Union Bank of Switzerland (UBS) in English, French, and German. The method correctly aligned all but 4% of the sentences. Moreover, it is possible to extract a large subcorpus that has a much smaller error rate. By selecting the best-scoring 80% of the alignments, the error rate is reduced from 4% to 0.7%. There were more errors on the English-French subcorpus than on the English-German subcorpus, showing that error rates will depend on the corpus considered; however, both were small enough to hope that the method will be useful for many language pairs. To further research on bilingual corpora, a much larger sample of Canadian Hansards (approximately 90 million words, half in English and half in French) has been aligned with the align program and will be available through the Data Collection Initiative of the Association for Computational Linguistics (ACL/DCI). In addition, in order to facilitate replication of the align program, an appendix is provided with ...
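The length-based idea in this abstract can be illustrated with a minimal Python sketch. It is not the align program itself: the function names `length_cost` and `align`, the bead priors in `BEAD_PRIORS`, and the default parameters `c` and `s2` are illustrative assumptions, not values taken from the abstract. Each candidate pairing is scored from the scaled difference of character lengths, and dynamic programming selects the lowest-cost (most likely, under these assumptions) sequence of pairings.

```python
import math

# Assumed prior probabilities for each bead type (source count, target count).
# These numbers are illustrative placeholders, not taken from the abstract.
BEAD_PRIORS = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
               (2, 1): 0.045, (1, 2): 0.045}


def length_cost(len_a: int, len_b: int, c: float = 1.0, s2: float = 6.8) -> float:
    """Penalty (negative log probability) for pairing texts of these lengths.

    c is the assumed expected number of target characters per source
    character and s2 the assumed variance per character; both are
    hypothetical defaults for this sketch.
    """
    if len_a == 0 and len_b == 0:
        return 0.0
    mean = (len_a + len_b / c) / 2.0
    delta = (len_b - len_a * c) / math.sqrt(mean * s2)
    # Two-sided tail probability of a standard normal deviate.
    p = math.erfc(abs(delta) / math.sqrt(2.0))
    return -math.log(max(p, 1e-300))


def align(src: list[str], tgt: list[str]) -> list[tuple[list[int], list[int]]]:
    """Return the minimum-cost alignment as (source indices, target indices) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in BEAD_PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                la = sum(len(s) for s in src[i:ni])
                lb = sum(len(s) for s in tgt[j:nj])
                total = cost[i][j] + length_cost(la, lb) - math.log(prior)
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Trace the best path back from the end of both texts.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(source_sentences, target_sentences)` on two lists of sentence strings returns index groups such as `([0, 1], [0])` for two source sentences paired with one target sentence. Working with summed negative log probabilities keeps the search additive, which is what lets dynamic programming recover the best overall path.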
Measures of Distributional Similarity
- In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics
, 1999
Cited by 297 (2 self)
We study distributional similarity measures for the purpose of improving probability estimation for unseen cooccurrences. Our contributions are three-fold: an empirical comparison of a broad range of measures; a classification of similarity functions based on the information that they incorporate; and the introduction of a novel function that is superior at evaluating potential proxy distributions.
Automatic Identification of Word Translations from Unrelated English and German Corpora
, 1999
Cited by 244 (2 self)
Algorithms for the alignment of words in translated texts are well established. However, only recently new approaches have been proposed to identify word translations from non-parallel or even unrelated texts. This task is
Aligning Sentences in Parallel Corpora
- In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91)
, 1991
Cited by 223 (3 self)
In this paper we describe a statistical technique for aligning sentences with their translations in two parallel corpora. In addition to certain anchor points that are available in our data, the only information about the sentences that we use for calculating alignments is the number of tokens that they contain. Because we make no use of the lexical details of the sentence, the alignment computation is fast and therefore practical for application to very large collections of text. We have used this technique to align several million sentences in the English-French Hansard corpora and have achieved an accuracy in excess of 99% in a randomly selected set of 1000 sentence pairs that we checked by hand. We show that even without the benefit of anchor points the correlation between the lengths of aligned sentences is strong enough that we should expect to achieve an accuracy of between 96% and 97%. Thus, the technique may be applicable to a wider variety of texts than we have yet tried.
Aligning Sentences In Bilingual Corpora Using Lexical Information
, 1993
Cited by 136 (1 self)
In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown et al., 1991b; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately 0.4% on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.
An Algorithm For Finding Noun Phrase Correspondences In Bilingual Corpora
, 1993
Cited by 132 (0 self)
The paper describes an algorithm that employs English and French text taggers to associate noun phrases in an aligned bilingual corpus. The taggers provide part-of-speech categories which are used by finite-state recognizers to extract simple noun phrases for both languages. Noun phrases are then mapped to each other using an iterative re-estimation algorithm that bears similarities to the Baum-Welch algorithm which is used for training the taggers. The algorithm provides an alternative to other approaches for finding word correspondences, with the advantage that linguistic structure is incorporated. Improvements to the basic algorithm are described, which enable context to be accounted for when constructing the noun phrase mappings.
TERMIGHT: Identifying and Translating Technical Terminology
- 4th Conference on Applied Natural Language Processing
, 1994
Fast and Accurate Sentence Alignment of Bilingual Corpora
- In Stephen D
, 2002
Cited by 109 (1 self)
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual lexicon. Our method adapts and combines these approaches, achieving high accuracy at a modest computational cost, and requiring no knowledge of the languages or the corpus beyond division into words and sentences.