Results 1-10 of 121
Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability.
 In Proceedings of the Association for Computational Linguistics
, 2011
Abstract

Cited by 123 (15 self)
In statistical machine translation, a researcher seeks to determine whether some innovation (e.g., a new feature, model, or inference algorithm) improves translation quality in comparison to a baseline system. To answer this question, he runs an experiment to evaluate the behavior of the two systems on held-out data. In this paper, we consider how to make such experiments more statistically reliable. We provide a systematic analysis of the effects of optimizer instability, an extraneous variable that is seldom controlled for, on experimental outcomes, and make recommendations for reporting results more accurately.
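The abstract above recommends controlling for optimizer instability and testing significance when comparing systems. One standard tool in this setting is a paired approximate-randomization test over per-sentence scores; the sketch below is a generic illustration (the function name and inputs are assumptions, not the paper's exact protocol):

```python
import random

def approximate_randomization(scores_a, scores_b, trials=2000, seed=0):
    """Paired approximate-randomization test: p-value for the null
    hypothesis that systems A and B are interchangeable.
    scores_a / scores_b: per-sentence scores, aligned by sentence."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    exceed = 0
    for _ in range(trials):
        sa = sb = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the pair's labels
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)  # smoothed p-value
```

A small p-value means the observed score gap is unlikely under random relabeling of which system produced which output.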
11,001 new features for statistical machine translation
 In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT)
, 2009
Abstract

Cited by 117 (2 self)
We use the Margin Infused Relaxed Algorithm of Crammer et al. to add a large number of new features to two machine translation systems: the Hiero hierarchical phrase-based translation system and our syntax-based translation system. On a large-scale Chinese-English translation task, we obtain statistically significant improvements of +1.5 Bleu and +1.1 Bleu, respectively. We analyze the impact of the new features and the performance of the learning algorithm.
Adaptive Regularization of Weight Vectors
 In Advances in Neural Information Processing Systems 22
, 2009
Abstract

Cited by 69 (17 self)
We present AROW, a new online learning algorithm that combines several useful properties: large margin training, confidence weighting, and the capacity to handle non-separable data. AROW performs adaptive regularization of the prediction function upon seeing each new instance, allowing it to perform especially well in the presence of label noise. We derive a mistake bound, similar in form to the second-order perceptron bound, that does not assume separability. We also relate our algorithm to recent confidence-weighted online learning techniques and show empirically that AROW achieves state-of-the-art performance and notable robustness in the case of non-separable data.
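The AROW update described above can be sketched with a diagonal covariance approximation (the published algorithm maintains a full covariance matrix; the diagonal simplification and all names here are illustrative):

```python
import numpy as np

def arow_train(X, y, r=1.0, epochs=1):
    """Diagonal-covariance AROW sketch. X: (n, d) array, y in {-1, +1}."""
    d = X.shape[1]
    mu = np.zeros(d)           # mean of the weight distribution
    sigma = np.ones(d)         # diagonal of the covariance matrix
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * np.dot(mu, x)
            if margin < 1.0:                      # hinge-loss violation
                v = np.dot(sigma * x, x)          # confidence x^T Sigma x
                beta = 1.0 / (v + r)
                alpha = max(0.0, 1.0 - margin) * beta
                mu += alpha * label * (sigma * x)
                sigma -= beta * (sigma * x) ** 2  # shrink confident dims
    return mu, sigma

def arow_predict(mu, X):
    return np.sign(X @ mu)
```

Dimensions that fire often have their variance driven down, so later updates touch them less aggressively, which is the adaptive-regularization effect the abstract describes.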
Batch tuning strategies for statistical machine translation
 In HLT-NAACL
, 2012
Abstract

Cited by 62 (10 self)
There has been a proliferation of recent work on SMT tuning algorithms capable of handling larger feature sets than the traditional MERT approach. We analyze a number of these algorithms in terms of their sentence-level loss functions, which motivates several new approaches, including a Structured SVM. We perform empirical comparisons of eight different tuning strategies, including MERT, in a variety of settings. Among other results, we find that a simple and efficient batch version of MIRA performs at least as well as training online, and consistently outperforms other options.
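The MIRA variants compared here build on a margin-constrained update; a minimal single-constraint sketch (a textbook simplification, not the paper's batch algorithm) looks like:

```python
import numpy as np

def mira_update(w, feat_oracle, feat_pred, loss, C=1.0):
    """Single-constraint MIRA step: the smallest change to w that scores
    the oracle above the rival prediction by a margin of `loss`, with the
    step size capped at the aggressiveness parameter C."""
    delta = feat_oracle - feat_pred
    violation = loss - np.dot(w, delta)       # margin shortfall
    if violation <= 0:
        return w                              # constraint already holds
    step = min(C, violation / np.dot(delta, delta))
    return w + step * delta
```

After one update the constraint is satisfied, so re-applying it with the same inputs leaves the weights unchanged.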
Better word alignments with supervised ITG models
 In Proceedings of the Association for Computational Linguistics
, 2009
Abstract

Cited by 50 (7 self)
This work investigates supervised word alignment methods that exploit inversion transduction grammar (ITG) constraints. We consider maximum margin and conditional likelihood objectives, including the presentation of a new normal form grammar for canonicalizing derivations. Even for non-ITG sentence pairs, we show that it is possible to learn ITG alignment models by simple relaxations of structured discriminative learning objectives. For efficiency, we describe a set of pruning techniques that together allow us to align sentences two orders of magnitude faster than naive bitext CKY parsing. Finally, we introduce many-to-one block alignment features, which significantly improve our ITG models. Altogether, our method results in the best reported AER numbers for Chinese-English and a performance improvement of 1.1 BLEU over GIZA++ alignments.
Structured Ramp Loss Minimization for Machine Translation
Abstract

Cited by 37 (4 self)
This paper seeks to close the gap between training algorithms used in statistical machine translation and machine learning, specifically the framework of empirical risk minimization. We review well-known algorithms, arguing that they do not optimize the loss functions they are assumed to optimize when applied to machine translation. Instead, most have implicit connections to particular forms of ramp loss. We propose to minimize ramp loss directly and present a training algorithm that is easy to implement and that performs comparably to others. Most notably, our structured ramp loss minimization algorithm, RAMPION, is less sensitive to initialization and random seeds than standard approaches.
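The hope/fear structure of ramp loss can be sketched as a single update over an n-best list, in the spirit of (but far simpler than) RAMPION; the plain gradient step, names, and cost convention below are illustrative:

```python
import numpy as np

def ramp_update(w, feats, costs, eta=0.1):
    """One ramp-loss step over an n-best list.
    feats: (n, d) candidate feature vectors; costs: (n,) task costs
    (e.g. 1 - BLEU). The 'hope' candidate maximizes model score minus
    cost; the 'fear' candidate maximizes model score plus cost."""
    scores = feats @ w
    hope = int(np.argmax(scores - costs))
    fear = int(np.argmax(scores + costs))
    return w + eta * (feats[hope] - feats[fear])
```

The update pushes the weights toward the good-and-high-scoring candidate and away from the bad-and-high-scoring one, which is the ramp-loss gradient direction.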
Synchronous Tree Adjoining Machine Translation
 In Proceedings of EMNLP
, 2009
Abstract

Cited by 27 (2 self)
Tree Adjoining Grammars have well-known advantages, but are typically considered too difficult for practical systems. We demonstrate that, when done right, adjoining improves translation quality without becoming computationally intractable. Using adjoining to model optionality allows general translation patterns to be learned without the clutter of endless variations of optional material. The appropriate modifiers can later be spliced in as needed. In this paper, we describe a novel method for learning a type of Synchronous Tree Adjoining Grammar and associated probabilities from aligned tree/string training data. We introduce a method of converting these grammars to a weakly equivalent tree transducer for decoding. Finally, we show that adjoining results in an end-to-end improvement of +0.8 BLEU over a baseline statistical syntax-based MT model on a large-scale Arabic/English MT task.
Fast consensus decoding over translation forests
 In The Annual Conference of the Association for Computational Linguistics
, 2009
Abstract

Cited by 24 (2 self)
The minimum Bayes risk (MBR) decoding objective improves BLEU scores for machine translation output relative to the standard Viterbi objective of maximizing model score. However, MBR targeting BLEU is prohibitively slow to optimize over k-best lists for large k. In this paper, we introduce and analyze an alternative to MBR that is equally effective at improving performance, yet is asymptotically faster, running 80 times faster than MBR in experiments with 1000-best lists. Furthermore, our fast decoding procedure can select output sentences based on distributions over entire forests of translations, in addition to k-best lists. We evaluate our procedure on translation forests from two large-scale, state-of-the-art hierarchical machine translation systems. Our forest-based decoding objective consistently outperforms k-best list MBR, giving improvements of up to 1.0 BLEU.
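The MBR objective being accelerated here can be illustrated over a plain k-best list; the sketch below substitutes a toy unigram-overlap gain for BLEU, so it shows only the expected-gain selection, not the paper's fast forest-based procedure:

```python
from collections import Counter

def overlap(hyp, ref):
    """Multiset unigram overlap between two token lists (a toy stand-in
    for a BLEU-style similarity)."""
    h, r = Counter(hyp), Counter(ref)
    return sum(min(c, r[t]) for t, c in h.items())

def mbr_decode(kbest, probs):
    """Minimum Bayes risk over a k-best list: choose the hypothesis with
    the highest expected gain under the posterior over candidates."""
    best, best_gain = None, float("-inf")
    for hyp in kbest:
        gain = sum(p * overlap(hyp, ref) for ref, p in zip(kbest, probs))
        if gain > best_gain:
            best, best_gain = hyp, gain
    return best
```

Note the O(k^2) pairwise loop: this is exactly the cost that makes exact MBR slow for large k and that the paper's method avoids.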
Optimizing for sentence-level BLEU+1 yields short translations
 In Proceedings of the 24th International Conference on Computational Linguistics, ser. COLING ’12
, 2012
Abstract

Cited by 19 (9 self)
We study a problem with pairwise ranking optimization (PRO): that it tends to yield too short translations. We find that this is partially due to the inadequate smoothing in PRO’s BLEU+1, which boosts the precision component of BLEU but leaves the brevity penalty unchanged, thus destroying the balance between the two, compared to BLEU. It is also partially due to PRO optimizing for a sentence-level score without a global view on the overall length, which introduces a bias towards short translations; we show that letting PRO optimize a corpus-level BLEU yields a perfect length. Finally, we find some residual bias due to the interaction of PRO with BLEU+1: such a bias does not exist for a version of MIRA with sentence-level BLEU+1. We propose several ways to fix the length problem of PRO, including smoothing the brevity penalty, scaling the effective reference length, grounding the precision component, and unclipping the brevity penalty, which yield sizable improvements in test BLEU on two Arabic-English datasets: IWSLT (+0.65) and NIST (+0.37).
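The smoothing imbalance described above can be made concrete with a minimal sentence-level BLEU+1 sketch: add-one smoothing is applied to the higher-order n-gram precisions (one common variant), while the brevity penalty is left untouched. All details below are illustrative and real implementations differ:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU+1 sketch: smoothed higher-order precisions,
    but an unsmoothed brevity penalty, which is the imbalance analyzed
    in the paper above."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())
        total = max(sum(h.values()), 1)            # guard empty n-gram sets
        if n == 1:
            p = match / total                      # unigrams unsmoothed
        else:
            p = (match + 1) / (total + 1)          # add-one smoothing
        log_p += math.log(p) if p > 0 else float("-inf")
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_p / max_n)
```

Because only the precisions are smoothed, a short hypothesis loses far less from the precision terms than it gains relative to BLEU, and the unsmoothed brevity penalty alone must carry the length pressure.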
Jointly Learning to Extract and Compress
Abstract

Cited by 18 (0 self)
We learn a joint model of sentence extraction and compression for multi-document summarization. Our model scores candidate summaries according to a combined linear model whose features factor over (1) the n-gram types in the summary and (2) the compressions used. We train the model using a margin-based objective whose loss captures end summary quality. Because of the exponentially large set of candidate summaries, we use a cutting-plane algorithm to incrementally detect and add active constraints efficiently. Inference in our model can be cast as an ILP and thereby solved in reasonable time; we also present a fast approximation scheme which achieves similar performance. Our jointly extracted and compressed summaries outperform both unlearned baselines and our learned extraction-only system on both ROUGE and Pyramid, without a drop in judged linguistic quality. We achieve the highest published ROUGE results to date on the TAC 2008 data set.