
Europarl: A Parallel Corpus for Statistical Machine Translation

by P Koehn
Venue: Proceedings of MT Summit X

Results 1 - 10 of 519

Posterior regularization for structured latent variable models

by Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar - Journal of Machine Learning Research, 2010
Cited by 138 (8 self)
We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction,

Citation Context

...ansfer a parser from English to Bulgarian, using the OpenSubtitles corpus (Tiedemann, 2007). The Spanish experiments transfer from English to Spanish using the Spanish portion of the Europarl corpus (Koehn, 2005). For both corpora, we performed word alignments with the open source PostCAT (Graça et al., 2009b) toolkit. We used the Tokyo tagger (Tsuruoka and Tsujii, 2005) to POS tag the English tokens, and ge...

Re-evaluating the role of BLEU in machine translation research

by Chris Callison-Burch, Miles Osborne - In EACL, 2006
Cited by 122 (3 self)
We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu’s correlation with human judgments of quality. This offers new potential for research which was previously deemed unpromising by an inability to improve upon Bleu scores.

Citation Context

...ments of fluency, with R² = 0.002 when the outlier entry is included rectly ranked the systems. We used Systran for the rule-based system, and used the French-English portion of the Europarl corpus (Koehn, 2005) to train the SMT systems and to evaluate all three systems. We built the first phrase-based SMT system with the complete set of Europarl data (14-15 million words per language), and optimized its fea...

Improved statistical machine translation using paraphrases

by Chris Callison-Burch, Miles Osborne - In Proceedings of HLT/NAACL-2006, 2006
Cited by 117 (3 self)
Parallel corpora are crucial for training SMT systems. However, for many language pairs they are available only in very limited quantities. For these language pairs a huge portion of phrases encountered at run-time will be unknown. We show how techniques from paraphrasing can be used to deal with these otherwise unknown source language phrases. Our results show that augmenting a state-of-the-art SMT system with paraphrases leads to significantly improved coverage and translation quality. For a training corpus with 10,000 sentence pairs we increase the coverage of unique test set unigrams from 48% to 90%, with more than half of the newly covered items accurately translated, as opposed to none in current approaches.

Citation Context

...gn We examined the application of paraphrases to deal with unknown phrases when translating from Spanish and French into English. We used the publicly available Europarl multilingual parallel corpus (Koehn, 2005) to create six training corpora for the two language pairs, and used the standard Europarl development and test sets. 4.1 Baseline For a baseline system we produced a phrase-based statistical machine...

BabelNet: The automatic construction, evaluation and application of a . . .

by Roberto Navigli, et al. - Artificial Intelligence, 2012
Cited by 96 (31 self)

A survey of statistical machine translation

by Adam Lopez, 2007
Cited by 93 (6 self)
Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.

Syntactic Constraints on Paraphrases Extracted from Parallel Corpora

by Chris Callison-Burch
Cited by 65 (10 self)
We improve the quality of paraphrases extracted from parallel corpora by requiring that phrases and their paraphrases be the same syntactic type. This is achieved by parsing the English side of a parallel corpus and altering the phrase extraction algorithm to extract phrase labels alongside bilingual phrase pairs. In order to retain broad coverage of non-constituent phrases, complex syntactic labels are introduced. A manual evaluation indicates a 19% absolute improvement in paraphrase quality over the baseline method.

Citation Context

...f their original phrases and whether they remained grammatical when they replaced the original phrase in a sentence. 4.1 Training materials Our paraphrase model was trained using the Europarl corpus (Koehn, 2005). We used ten parallel corpora between English and (each of) Danish, Dutch, Finnish, French, German, Greek, Italian, Portuguese, Spanish, and Swedish, with approximately 30 million words per language...

WIT3: Web inventory of transcribed and translated talks.

by Mauro Cettolo, Christian Girardi, Marcello Federico - In Proc. EAMT, 2012
Cited by 52 (3 self)
We describe here a Web inventory named WIT3 that offers access to a collection of transcribed and translated talks. The core of WIT3 is the TED Talks corpus, that basically redistributes the original content published by the TED Conference website (http://www.ted.com). Since 2007, the TED Conference, based in California, has been posting all video recordings of its talks together with subtitles in English and their translations in more than 80 languages. Aside from its cultural and social relevance, this content, which is published under the Creative Commons BY-NC-ND license, also represents a precious language resource for the machine translation research community, thanks to its size, variety of topics, and covered languages. This effort repurposes the original content in a way which is more convenient for machine translation researchers.

Citation Context

...cal machine translation (SMT), learning is performed on parallel texts, i.e. documents, sentences or even fragments of sentences with their translation(s). Large amounts of in-domain parallel data are usually required to properly train translation and reordering models. Unfortunately, parallel data are a scarce resource, which are freely available only for some language pairs and for few, very specific domains. For example, MultiUN (Eisele and Chen, 2010) provides large parallel texts (300 million words) but for only 6 languages; Europarl (Koehn, 2005) consists of the translation into most European languages of the proceedings of the European Parliament (at most 50 million words); JRC-Acquis comprises the total body of European Union law applicable to the Member States, written in 22 European languages (35 million words); other smaller parallel corpora in specific domains are included in OPUS (Tiedemann, 2009) for various languages. On the other hand, it is unfeasible for research laboratories to cover all possible needs in terms of parallel texts by resorting to professional translators, given their high cost. The data available at the TE...

Multilingual Topic Models for Unaligned Text

by Jordan Boyd-Graber, David M. Blei
Cited by 44 (4 self)
We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be

Citation Context

... the corpus. Using parallel corpora also guarantees that similar themes will be discussed, one of our key assumptions. First, we analyzed the German and English proceedings of the European Parliament [15], where each chapter is considered to be a distinct document. Each document on the English side of the corpus has a direct translation on the German side; we used a sample of 2796 documents. Another c...

Cross-lingual annotation projection for semantic roles

by Sebastian Padó, Mirella Lapata - Journal of Artificial Intelligence Research, 2009
Cited by 38 (3 self)
This article considers the task of automatically inducing role-semantic annotations in the FrameNet paradigm for new languages. We propose a general framework that is based on annotation projection, phrased as a graph optimization problem. It is relatively inexpensive and has the potential to reduce the human effort involved in creating role-semantic resources. Within this framework, we present projection models that exploit lexical and syntactic information. We provide an experimental evaluation on an English-German parallel corpus which demonstrates the feasibility of inducing high-precision German semantic role annotation both for manually and automatically annotated English data.

Citation Context

...e of divergence provides a natural upper bound for the accuracy attainable with annotation projection. 2.1 Sample Selection English-German bi-sentences were drawn from the second release of Europarl (Koehn, 2005), a corpus of professionally translated proceedings of the European Parliament. Europarl is aligned at the document and sentence level and is available in 11 languages. The English–German section co...

Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases

by Yuval Marton, Chris Callison-Burch, Philip Resnik
Cited by 37 (2 self)
Untranslated words still constitute a major problem for Statistical Machine Translation (SMT), and current SMT systems are limited by the quantity of parallel training texts. Augmenting the training data with paraphrases generated by pivoting through other languages alleviates this problem, especially for the so-called “low density” languages. But pivoting requires additional parallel texts. We address this problem by deriving paraphrases monolingually, using distributional semantic similarity measures, thus providing access to larger training resources, such as comparable and unrelated monolingual corpora. We present what is to our knowledge the first successful integration of a collocational approach to untranslated words with an end-to-end, state-of-the-art SMT system, demonstrating significant translation improvements in a low-resource setting.

Citation Context

...e also experimented with Spanish to English (S2E) translation, following Callison-Burch et al. (2006). For baseline we used the Spanish and English sides of the Europarl multilingual parallel corpus (Koehn, 2005), with the standard training, development, and test sets. We created training subset models of 10,000, 20,000, and 80,000 aligned sentences, as described in Callison-Burch et al. (2006). For better c...
