Results 1–10 of 65
cdec: A decoder, alignment, and learning framework for finite-state and context-free translation models
 In Proceedings of ACL System Demonstrations, 2010
Abstract

Cited by 134 (53 self)
We present cdec, an open source framework for decoding, aligning with, and training a number of statistical machine translation models, including word-based models, phrase-based models, and models based on synchronous context-free grammars. Using a single unified internal representation for translation forests, the decoder strictly separates model-specific translation logic from general rescoring, pruning, and inference algorithms. From this unified representation, the decoder can extract not only the 1- or k-best translations, but also alignments to a reference, or the quantities necessary to drive discriminative training using gradient-based or gradient-free optimization techniques. Its efficient C++ implementation means that memory use and runtime performance are significantly better than comparable decoders.
Online Large-Margin Training of Syntactic and Structural Translation Features
Abstract

Cited by 124 (12 self)
Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik’s soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 Bleu on a subset of the NIST 2006 Arabic-English evaluation data.
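The MIRA update this line of work builds on can be sketched in a few lines. The following is a simplified single-constraint, single-example version for linear feature weights; the function name, the clipping constant `C`, and the toy vectors are illustrative, not taken from the paper's implementation:

```python
# Simplified single-constraint MIRA update for linear feature weights.
# Moves w toward the oracle hypothesis's features just enough that the
# oracle outscores the model's prediction by at least `loss`.

def mira_update(w, f_oracle, f_pred, loss, C=1.0):
    delta = [fo - fp for fo, fp in zip(f_oracle, f_pred)]
    margin = sum(wi * di for wi, di in zip(w, delta))   # current score gap
    norm_sq = sum(d * d for d in delta)
    if norm_sq == 0.0:
        return w  # oracle and prediction have identical features
    # Step size, clipped at C so the update stays conservative.
    tau = min(C, max(0.0, loss - margin) / norm_sq)
    return [wi + tau * di for wi, di in zip(w, delta)]

w = mira_update([0.0, 0.0], f_oracle=[1.0, 0.0], f_pred=[0.0, 1.0], loss=1.0)
```

The clipping constant `C` is what makes MIRA tolerant of noisy oracles: a single bad example can move the weights only a bounded amount, which matters when many weights are tuned at once.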
11,001 new features for statistical machine translation
 In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), 2009
Abstract

Cited by 117 (2 self)
We use the Margin Infused Relaxed Algorithm of Crammer et al. to add a large number of new features to two machine translation systems: the Hiero hierarchical phrase-based translation system and our syntax-based translation system. On a large-scale Chinese-English translation task, we obtain statistically significant improvements of +1.5 Bleu and +1.1 Bleu, respectively. We analyze the impact of the new features and the performance of the learning algorithm.
Hierarchical phrase-based translation with weighted finite state transducers and . . .
 In Proceedings of HLT/NAACL, 2010
Abstract

Cited by 48 (20 self)
In this article we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also provide insights as to how to control the size of the search space defined by hierarchical rules. We show that shallow-n grammars, low-level rule catenation, and other search constraints can help to match the power of the translation system to specific language pairs.
Structured Ramp Loss Minimization for Machine Translation
Abstract

Cited by 37 (4 self)
This paper seeks to close the gap between training algorithms used in statistical machine translation and machine learning, specifically the framework of empirical risk minimization. We review well-known algorithms, arguing that they do not optimize the loss functions they are assumed to optimize when applied to machine translation. Instead, most have implicit connections to particular forms of ramp loss. We propose to minimize ramp loss directly and present a training algorithm that is easy to implement and that performs comparably to others. Most notably, our structured ramp loss minimization algorithm, RAMPION, is less sensitive to initialization and random seeds than standard approaches.
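One common form of structured ramp loss over a k-best list can be sketched as the score gap between a "fear" hypothesis (high model score plus high cost) and a "hope" hypothesis (high model score minus cost). This is a minimal illustrative sketch, not the paper's exact objective; `ramp_loss` and the toy inputs are assumed names:

```python
# Structured ramp loss over a k-best list:
#   loss = max_y [w.f(y) + cost(y)] - max_y [w.f(y) - cost(y)]
# i.e. score of the "fear" hypothesis minus score of the "hope" hypothesis.
# cost(y) could be, e.g., 1 - sentence-level BLEU.

def dot(w, f):
    return sum(wi * fi for wi, fi in zip(w, f))

def ramp_loss(w, hyps):
    """hyps: list of (feature_vector, cost) pairs from a k-best list."""
    fear = max(dot(w, f) + c for f, c in hyps)
    hope = max(dot(w, f) - c for f, c in hyps)
    return fear - hope
```

Because the fear term always dominates the hope term for the same hypothesis set, the loss is non-negative, and it is zero only when the model's preferred hypothesis is also low-cost.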
Training Phrase Translation Models with Leaving-One-Out
Abstract

Cited by 33 (16 self)
Several attempts have been made to learn phrase translation probabilities for phrase-based statistical machine translation that go beyond pure counting of phrases in word-aligned training data. Most approaches report problems with overfitting. We describe a novel leaving-one-out approach to prevent overfitting that allows us to train phrase models that show improved translation performance on the WMT08 Europarl German-English task. In contrast to most previous work where phrase models were trained separately from other models used in translation, we include all components such as single word lexica and reordering models in training. Using this consistent training of phrase models we are able to achieve improvements of up to 1.4 points in BLEU. As a side effect, the phrase table size is reduced by more than 80%.
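The leave-one-out idea for phrase probabilities can be illustrated with a toy relative-frequency estimate: when scoring phrase pairs for a given training sentence, that sentence's own counts are first subtracted, so phrase pairs supported only by the sentence itself get zero probability. The function and variable names below are illustrative, not the paper's:

```python
# Toy leave-one-out phrase translation probability.
# pair_counts: corpus-wide counts of (src, tgt) phrase pairs.
# src_counts:  corpus-wide counts of src phrases.
# sent_pair_count / sent_src_count: counts contributed by the held-out sentence.

from collections import Counter

def loo_phrase_prob(pair_counts, src_counts, pair, sent_pair_count, sent_src_count):
    num = pair_counts[pair] - sent_pair_count
    den = src_counts[pair[0]] - sent_src_count
    if num <= 0 or den <= 0:
        return 0.0  # pair has no support outside this sentence
    return num / den
```

This is what counters overfitting: a long, sentence-specific phrase pair extracted once looks perfect under plain relative frequency, but under leave-one-out its probability collapses to zero for the very sentence it came from.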
Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation
Abstract

Cited by 31 (0 self)
We propose a novel probabilistic synchronous context-free grammar formalism for statistical machine translation, in which syntactic nonterminal labels are represented as “soft” preferences rather than as “hard” matching constraints. This formalism allows us to efficiently score unlabeled synchronous derivations without forgoing traditional syntactic constraints. Using this score as a feature in a log-linear model, we are able to approximate the selection of the most likely unlabeled derivation. This helps reduce fragmentation of probability across differently labeled derivations of the same translation. It also allows the importance of syntactic preferences to be learned alongside other features (e.g., the language model) and for particular labeling procedures. We show improvements in translation quality on small and medium-sized Chinese-to-English translation tasks.
Semantic role features for machine translation
 In Proceedings of the 23rd International Conference on Computational Linguistics, 2010
Abstract

Cited by 29 (2 self)
We propose semantic role features for a Tree-to-String transducer to model the reordering/deletion of source-side semantic roles. These semantic features, as well as the Tree-to-String templates, are trained based on a conditional log-linear model and are shown to significantly outperform systems trained based on maximum likelihood and EM. We also show significant improvement in sentence fluency by using the semantic role features in the log-linear model, based on manual evaluation.
Variational Decoding for Statistical Machine Translation
Abstract

Cited by 26 (1 self)
Statistical models in machine translation exhibit spurious ambiguity. That is, the probability of an output string is split among many distinct derivations (e.g., trees or segmentations). In principle, the goodness of a string is measured by the total probability of its many derivations. However, finding the best string (e.g., during decoding) is then computationally intractable. Therefore, most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation. Instead, we develop a variational approximation, which considers all the derivations but still allows tractable decoding. Our particular variational distributions are parameterized as n-gram models. We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU (Tromble et al., 2008). Experiments show that our approach improves the state of the art.
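The core idea, collapsing a distribution over derivations into an n-gram model over output strings and then scoring candidate strings under that model, can be sketched over an explicit list of weighted derivations. This is a toy version under that assumption; the actual method computes the same expectations over the full translation forest, and all names below are illustrative:

```python
# Toy variational-style approximation: estimate a conditional n-gram model
# from probability-weighted derivations, then score strings under it.

from collections import defaultdict

def ngram_model(derivations, n=2):
    """derivations: list of (token_list, prob). Returns {(context, word): p}."""
    ctx_counts = defaultdict(float)
    ng_counts = defaultdict(float)
    for tokens, p in derivations:
        padded = ["<s>"] * (n - 1) + tokens
        for i in range(len(tokens)):
            ctx = tuple(padded[i:i + n - 1])
            ng_counts[(ctx, padded[i + n - 1])] += p  # expected n-gram count
            ctx_counts[ctx] += p
    return {ng: c / ctx_counts[ng[0]] for ng, c in ng_counts.items()}

def string_score(q, tokens, n=2):
    """Probability of a string under the collapsed n-gram model q."""
    padded = ["<s>"] * (n - 1) + tokens
    score = 1.0
    for i in range(len(tokens)):
        score *= q.get((tuple(padded[i:i + n - 1]), padded[i + n - 1]), 0.0)
    return score
```

The point of the collapse is that two derivations yielding the same string reinforce the same n-grams, so probability mass fragmented across derivations is pooled before the argmax over strings.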
Joint Feature Selection in Distributed Stochastic Learning for LargeScale Discriminative Training in SMT
Abstract

Cited by 25 (10 self)
With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy local features for SCFG-based SMT that can be read off from rules at runtime, and present a learning algorithm that applies ℓ1/ℓ2 regularization for joint feature selection over distributed stochastic learning processes. We present experiments on learning on 1.5 million training sentences, and show significant improvements over tuning discriminative models on small development sets.
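The ℓ1/ℓ2 mechanism for joint feature selection can be sketched as group soft-thresholding: each feature owns one weight per distributed learner, the group's ℓ2 norm is shrunk by a threshold λ, and features weak on every shard are zeroed out jointly. This is a generic sketch of the technique, not the paper's code; `group_shrink` and the toy weights are assumed names:

```python
# Group (l1/l2) soft-thresholding for joint feature selection.
# weights_by_feature: {feature: [w_shard1, w_shard2, ...]} - one weight per
# distributed learner. Features whose cross-shard l2 norm is below lam are
# dropped on every shard at once; survivors are shrunk proportionally.

import math

def group_shrink(weights_by_feature, lam):
    out = {}
    for feat, ws in weights_by_feature.items():
        norm = math.sqrt(sum(w * w for w in ws))
        if norm <= lam:
            continue  # feature eliminated jointly across all shards
        scale = (norm - lam) / norm
        out[feat] = [w * scale for w in ws]
    return out
```

Thresholding the group norm, rather than each shard's weight separately, is what makes the selection "joint": a feature survives only if the shards collectively give it enough weight, which keeps the selected feature set consistent across learners.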