Results 11 - 20 of 529
Aligning Sentences In Bilingual Corpora Using Lexical Information
, 1993
"... In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ig- nore word identities and only consider sentence length (Brown et al., 1991b; Gale and Church, 1991). Our algorithm constructs a simple statisti- cal wor ..."
Abstract
-
Cited by 136 (1 self)
- Add to MetaCart
In this paper, we describe a fast algorithm for aligning sentences with their translations in a bilingual corpus. Existing efficient algorithms ignore word identities and only consider sentence length (Brown et al., 1991b; Gale and Church, 1991). Our algorithm constructs a simple statistical word-to-word translation model on the fly during alignment. We find the alignment that maximizes the probability of generating the corpus with this translation model. We have achieved an error rate of approximately 0.4% on Canadian Hansard data, which is a significant improvement over previous results. The algorithm is language independent.
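The search this abstract describes can be pictured as dynamic programming over alignment "beads". Below is a minimal sketch, not the paper's actual model: it assumes a word-to-word probability table `t` is already available (the paper re-estimates one on the fly), scores 1-1 beads with a crude Model-1-style average, and uses an arbitrary `skip_penalty` for 1-0 and 0-1 beads.

```python
import math

def lex_score(src_sent, tgt_sent, t):
    """Log-probability-style score for a 1-1 bead under a word-to-word
    table t[(src_word, tgt_word)] -> probability (an IBM-Model-1-style
    average; an illustrative stand-in for the paper's model, which is
    re-estimated on the fly during alignment)."""
    total = 0.0
    for tw in tgt_sent:
        p = sum(t.get((sw, tw), 1e-6) for sw in src_sent) / max(len(src_sent), 1)
        total += math.log(p)
    return total

def align(src, tgt, t, skip_penalty=-12.0):
    """Monotone DP alignment of two sentence lists allowing 1-1 beads
    plus 1-0 / 0-1 skips; returns (score, beads), where a ('1-1', i, j)
    bead pairs src[i] with tgt[j]."""
    n, m = len(src), len(tgt)
    NEG = float("-inf")
    best = [[NEG] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == NEG:
                continue
            if i < n and j < m:  # 1-1 bead
                s = best[i][j] + lex_score(src[i], tgt[j], t)
                if s > best[i + 1][j + 1]:
                    best[i + 1][j + 1], back[i + 1][j + 1] = s, ("1-1", i, j)
            if i < n:            # 1-0 bead: source sentence left unmatched
                s = best[i][j] + skip_penalty
                if s > best[i + 1][j]:
                    best[i + 1][j], back[i + 1][j] = s, ("1-0", i, j)
            if j < m:            # 0-1 bead: target sentence left unmatched
                s = best[i][j] + skip_penalty
                if s > best[i][j + 1]:
                    best[i][j + 1], back[i][j + 1] = s, ("0-1", i, j)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):      # trace the best path back to the origin
        kind, pi, pj = back[i][j]
        beads.append((kind, pi, pj))
        i, j = pi, pj
    return best[n][m], list(reversed(beads))
```

A faithful implementation would alternate a pass like this with re-estimation of `t` from the alignments it produces, and would also allow 2-1 and 1-2 beads.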
An Algorithm For Finding Noun Phrase Correspondences In Bilingual Corpora
, 1993
"... The paper describes an algorithm that employs English and French text taggers to associate noun phrases in an aligned bilingual corpus. The taggers provide part-of-speech categories which are used by finite-state recognizers to extract simple noun phrases for both languages. Noun phrases are then ma ..."
Abstract
-
Cited by 132 (0 self)
- Add to MetaCart
The paper describes an algorithm that employs English and French text taggers to associate noun phrases in an aligned bilingual corpus. The taggers provide part-of-speech categories, which are used by finite-state recognizers to extract simple noun phrases for both languages. Noun phrases are then mapped to each other using an iterative re-estimation algorithm that bears similarities to the Baum-Welch algorithm used for training the taggers. The algorithm provides an alternative to other approaches for finding word correspondences, with the advantage that linguistic structure is incorporated. Improvements to the basic algorithm are described that enable context to be taken into account when constructing the noun phrase mappings.
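The iterative re-estimation can be illustrated with a small EM-style loop over co-occurring noun phrases. This is a minimal sketch under assumptions: the noun phrases are taken as already extracted per aligned region, initialization is uniform, and the iteration count is fixed; the paper's actual procedure and its context-sensitive improvements are more involved.

```python
from collections import defaultdict

def em_np_correspondences(regions, iterations=10):
    """regions: list of (english_nps, french_nps) pairs, one per aligned
    region, each a list of noun-phrase strings.  Returns t[(e, f)]: the
    estimated probability that English NP e corresponds to French NP f,
    re-estimated EM-style (expected counts, then renormalization)."""
    t = defaultdict(lambda: 1.0)              # uniform start
    for _ in range(iterations):
        count = defaultdict(float)            # expected pair counts
        total = defaultdict(float)            # expected counts per English NP
        for e_nps, f_nps in regions:
            for f in f_nps:
                z = sum(t[(e, f)] for e in e_nps)   # normalizer for f
                for e in e_nps:
                    c = t[(e, f)] / z               # fractional count
                    count[(e, f)] += c
                    total[e] += c
        for key, c in count.items():
            t[key] = c / total[key[0]]        # M-step: renormalize
    return t
```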
TERMIGHT: Identifying and translating technical terminology
- 4th Conference on Applied Natural Language Processing
, 1994
"... ..."
Fast and Accurate Sentence Alignment of Bilingual Corpora
- In Stephen D
, 2002
"... Abstract. We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence ..."
Abstract
-
Cited by 109 (1 self)
- Add to MetaCart
(Show Context)
We present a new method for aligning sentences with their translations in a parallel bilingual corpus. Previous approaches have generally been based either on sentence length or word correspondences. Sentence-length-based methods are relatively fast and fairly accurate. Word-correspondence-based methods are generally more accurate but much slower, and usually depend on cognates or a bilingual lexicon. Our method adapts and combines these approaches, achieving high accuracy at a modest computational cost, and requiring no knowledge of the languages or the corpus beyond division into words and sentences.
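To make the sentence-length idea concrete, the sketch below shows a Gale-and-Church-style length cost of the kind such a first pass could use; the constants `c` and `s2` are illustrative, and the paper's own length model differs in detail. High-scoring 1-1 pairs from a pass like this would then supply training data for the word-correspondence model used in a second pass.

```python
import math

def length_cost(len_s, len_t, c=1.0, s2=6.8):
    """Cost of pairing a source sentence of len_s characters with a
    target sentence of len_t characters: -log of a two-tailed Gaussian
    match probability on the normalized length difference.  c (expected
    target/source length ratio) and s2 (variance per character) are
    corpus-level constants; the values here are illustrative."""
    if len_s == 0 and len_t == 0:
        return 0.0
    mean = (len_s + len_t / c) / 2.0
    delta = (len_t - len_s * c) / math.sqrt(mean * s2)
    # two-tailed probability of a deviation at least this large
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(delta) / math.sqrt(2.0))))
    return -math.log(max(p, 1e-12))   # floor avoids log(0)
```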
Bitext Maps and Alignment via Pattern Recognition
- Computational Linguistics
, 1999
"... This article advances the state of the art ofbitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective ..."
Abstract
-
Cited by 105 (0 self)
- Add to MetaCart
(Show Context)
This article advances the state of the art of bitext mapping by formulating the problem in terms of pattern recognition. From this point of view, the success of a bitext mapping algorithm hinges on how well it performs three tasks: signal generation, noise filtering, and search. The Smooth Injective Map Recognizer (SIMR) algorithm presented here integrates innovative approaches to each of these tasks. Objective evaluation has shown that SIMR's accuracy is consistently high for language pairs as diverse as French/English and Korean/English. If necessary, SIMR's bitext maps can be efficiently converted into segment alignments using the Geometric Segment Alignment (GSA) algorithm, which is also presented here. SIMR has produced bitext maps for over 200 megabytes of French-English bitexts. GSA has converted these maps into alignments. Both the maps and the alignments are available from the Linguistic Data Consortium.
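The noise-filtering task can be pictured as deciding whether a small chain of candidate correspondence points in bitext space could belong to a smooth injective map. The sketch below uses assumed criteria (injectivity in both axes, monotonicity, low deviation from a rising least-squares line); SIMR's actual point-dispersal and slope tests differ in detail.

```python
import statistics

def plausible_chain(points, max_rmsd=15.0):
    """Accept a chain of candidate correspondence points, where each
    point is (x = source position, y = target position), only if it is
    injective in both axes, monotonically rising, and fits a rising
    least-squares line with low root-mean-square deviation.  The
    threshold is an illustrative assumption."""
    if len(points) < 2:
        return False
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    if len(set(xs)) < len(xs) or len(set(ys)) < len(ys):
        return False                      # not injective
    if sorted(xs) != xs or sorted(ys) != ys:
        return False                      # not monotonically rising
    # fit a least-squares line y = a*x + b
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return False
    a = sum((x - mx) * (y - my) for x, y in points) / sxx
    b = my - a * mx
    rmsd = (sum((y - (a * x + b)) ** 2 for x, y in points) / len(points)) ** 0.5
    return a > 0 and rmsd <= max_rmsd
```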
BabelNet: The automatic construction, evaluation and application of a ...
- Artificial Intelligence
, 2012
"... ..."
The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages
- In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006)
, 2006
"... We are presenting a new and unique parallel corpus available in all 2 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 Million words per language. The UTF-8-encoded corpus has been manually classified according ..."
Abstract
-
Cited by 96 (9 self)
- Add to MetaCart
We present a new and unique parallel corpus available in all 20 official European Union (EU) languages, with additional documents available for some EU candidate countries. The average size is about 10 million words per language. The UTF-8-encoded corpus has been manually classified according to the EUROVOC (1994) subject domains and is available in XML format. Pair-wise paragraph alignment information is available for all 190+ language pair combinations. The corpus is accompanied by a tool to produce a bilingual paragraph-aligned parallel corpus for all possible language pair combinations. Motivation to compile the parallel corpus: parallel corpora are extremely useful for training and evaluating automatic text analysis systems and for generating new linguistic resources such as subject-specific monolingual and multilingual terminology lists, and more. In the full paper, we elaborate on the many uses of parallel corpora, summarise the specific interest of the Joint Research Centre in this corpus, and stress the importance of having such a corpus especially for languages with fewer linguistic resources. What is the JRC Collection of the 'Acquis Communautaire' (short: the JRC-Acquis)?
A hierarchical Dirichlet language model
- Natural Language Engineering
, 1994
"... We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as 'smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new ..."
Abstract
-
Cited by 94 (3 self)
- Add to MetaCart
We discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as 'smoothing'. A number of interesting differences from smoothing emerge. The insights gained from a probabilistic view of this problem point towards new directions for language modelling. The ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and DNA and protein sequences in molecular biology. The new algorithm is compared with smoothing on a two-million-word corpus. The methods prove to be about equally accurate, with the hierarchical model using fewer computational resources.
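The model's predictions can be sketched as a bigram estimate under a Dirichlet prior whose mean is the unigram distribution. The sketch below fixes the hyperparameters `alpha` and `beta` for illustration, whereas the paper is concerned with inferring such quantities from the data.

```python
from collections import Counter

class HierarchicalDirichletBigram:
    """Minimal sketch of Dirichlet-smoothed bigram prediction: the
    bigram distribution for context h gets a Dirichlet prior whose mean
    is the (itself smoothed) unigram distribution, so
    P(w|h) = (c(h,w) + alpha * P(w)) / (c(h) + alpha)."""

    def __init__(self, tokens, alpha=100.0, beta=1.0):
        self.alpha = alpha        # prior strength at the bigram level
        self.beta = beta          # prior strength at the unigram level
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.context = Counter(tokens[:-1])
        self.vocab = len(self.unigrams)
        self.total = len(tokens)

    def p_unigram(self, w):
        # unigram level, smoothed toward the uniform distribution
        return (self.unigrams[w] + self.beta / self.vocab) / (self.total + self.beta)

    def p_bigram(self, w, h):
        # bigram counts shrunk toward the unigram distribution
        return ((self.bigrams[(h, w)] + self.alpha * self.p_unigram(w))
                / (self.context[h] + self.alpha))
```

Both levels sum to one over the vocabulary, and sparse contexts fall back smoothly on the unigram estimate, which is the behaviour 'smoothing' approximates.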
A survey of statistical machine translation
, 2007
"... Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular tec ..."
Abstract
-
Cited by 93 (6 self)
- Add to MetaCart
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
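As a concrete instance of the mathematical modeling and decoding subproblems, the classic noisy-channel formulation (one standard approach among those such a survey covers) chooses a target sentence e for a source sentence f by

```latex
\hat{e} \;=\; \arg\max_{e} \, P(e \mid f)
        \;=\; \arg\max_{e} \, \underbrace{P(f \mid e)}_{\text{translation model}} \; \underbrace{P(e)}_{\text{language model}} ,
```

where parameter estimation fits the two factors from bitext and monolingual text respectively, and decoding is the search for this arg max.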
A Word-to-Word Model of Translational Equivalence
, 1997
"... Many multilingual NLP applications need to translate words between different languages, but cannot afford the computational expense of inducing or applying a full translation model. For these applications, we have designed a fast algorithm for estimating a partial translation model, which accounts f ..."
Abstract
-
Cited by 91 (6 self)
- Add to MetaCart
(Show Context)
Many multilingual NLP applications need to translate words between different languages, but cannot afford the computational expense of inducing or applying a full translation model. For these applications, we have designed a fast algorithm for estimating a partial translation model, which accounts for translational equivalence only at the word level. The model's precision/recall trade-off can be directly controlled via one threshold parameter. This feature makes the model more suitable for applications that are not fully statistical. The model's hidden parameters can be easily conditioned on information extrinsic to the model, providing an easy way to integrate pre-existing knowledge such as part of speech, dictionaries, word order, etc. Our model can link word tokens in parallel texts as well as other translation models in the literature can. Unlike other translation models, it can automatically produce dictionary-sized translation lexicons, and it can do so with over 99% accuracy.
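The one-threshold precision/recall control pairs naturally with a greedy, competitive style of linking: score candidate word pairs, link the best-scoring pairs first with each token used at most once, and stop below the threshold. The sketch below is illustrative rather than the paper's exact procedure; the contents of `scores` and the `threshold` value are assumptions.

```python
def competitive_linking(scores, threshold=0.0):
    """Greedy competitive linking.  scores maps
    (src_word, tgt_word) -> association score; candidate pairs are
    considered in order of decreasing score, each word is linked at most
    once, and linking stops once scores fall below the threshold, which
    is the single knob trading precision against recall."""
    links = []
    linked_src, linked_tgt = set(), set()
    for (s, t), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break                          # everything below is too weak
        if s not in linked_src and t not in linked_tgt:
            links.append((s, t, score))
            linked_src.add(s)
            linked_tgt.add(t)
    return links
```

Raising the threshold keeps only the strongest links (higher precision, lower recall); lowering it admits weaker ones.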