Online Large-Margin Training of Syntactic and Structural Translation Features
Cited by 118 (12 self)
Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik’s soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 Bleu on a subset of the NIST 2006 Arabic-English evaluation data.
Global inference for sentence compression: An integer linear programming approach
 Journal of Artificial Intelligence Research (JAIR)
, 2008
Cited by 106 (7 self)
Sentence compression holds promise for many applications ranging from summarization to subtitle generation. Our work views sentence compression as an optimization problem and uses integer linear programming (ILP) to infer globally optimal compressions in the presence of linguistically motivated constraints. We show how previous formulations of sentence compression can be recast as ILPs and extend these models with novel global constraints. Experimental results on written and spoken texts demonstrate improvements over state-of-the-art models.
Exploiting dictionaries in named entity extraction: Combining semi-Markov extraction processes and data integration methods
 In Proceedings of the ACM SIGKDD Conference
, 2004
Cited by 93 (6 self)
We consider the problem of improving named entity recognition (NER) systems by using external dictionaries, and more specifically the problem of extending state-of-the-art NER systems by incorporating information about the similarity of extracted entities to entities in an external dictionary. This is difficult because most high-performance named entity recognition systems operate by sequentially classifying words as to whether or not they participate in an entity name; however, the most useful similarity measures score entire candidate names. To correct this mismatch we formalize a semi-Markov extraction process which relaxes the usual Markov assumptions. This process is based on sequentially classifying segments of several adjacent words, rather than single words. In addition to allowing a natural way of coupling NER and high-performance record linkage methods, this formalism also allows the direct use of other useful entity-level features, and provides a more natural formulation of the NER problem than sequential word classification. Experiments in multiple domains show that the new model can substantially improve extraction performance, relative to previously published methods for using external dictionaries in NER.
Constraint classification: A new approach to multiclass classification and ranking
 In Advances in Neural Information Processing Systems 15
, 2002
Cited by 89 (6 self)
We introduce constraint classification, a framework capturing many flavors of multiclass classification including multilabel classification and ranking, and present a meta-algorithm for learning in this framework. We provide generalization bounds when using a collection of k linear functions to represent each hypothesis. We also present empirical and theoretical evidence that constraint classification is more powerful than existing methods of multiclass classification.
Structured Models for Fine-to-Coarse Sentiment Analysis
 Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics
, 2007
Cited by 88 (6 self)
In this paper we investigate a structured model for jointly classifying the sentiment of text at varying levels of granularity. Inference in the model is based on standard sequence classification techniques using constrained Viterbi to ensure consistent solutions. The primary advantage of such a model is that it allows classification decisions from one level in the text to influence decisions at another. Experiments show that this method can significantly reduce classification error relative to models trained in isolation.
Label Ranking by Learning Pairwise Preferences
Cited by 87 (20 self)
Preference learning is an emerging topic that appears in different guises in the recent literature. This work focuses on a particular learning scenario called label ranking, where the problem is to learn a mapping from instances to rankings over a finite number of labels. Our approach for learning such a mapping, called ranking by pairwise comparison (RPC), first induces a binary preference relation from suitable training data using a natural extension of pairwise classification. A ranking is then derived from the preference relation thus obtained by means of a ranking procedure, whereby different ranking methods can be used for minimizing different loss functions. In particular, we show that a simple (weighted) voting strategy minimizes risk with respect to the well-known Spearman rank correlation. We compare RPC to existing label ranking methods, which are based on scoring individual labels instead of comparing pairs of labels. Both empirically and theoretically, it is shown that RPC is superior in terms of computational efficiency, and at least competitive in terms of accuracy.
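The second stage described in this abstract, deriving a ranking from pairwise preferences by weighted voting, can be illustrated with a minimal sketch. It assumes the pairwise models have already been trained and are summarized by a hypothetical dictionary `pref`, where `pref[(a, b)]` is an estimated degree in [0, 1] to which label `a` is preferred to label `b`; the function name and the toy values are illustrative and not taken from the paper.

```python
from itertools import combinations


def rank_by_weighted_voting(labels, pref):
    """Rank labels by summing pairwise preference degrees (weighted voting).

    pref[(a, b)] is an assumed, precomputed estimate in [0, 1] that
    label a is preferred to label b; the reverse vote is 1 - pref[(a, b)].
    """
    scores = {label: 0.0 for label in labels}
    for a, b in combinations(labels, 2):
        p = pref[(a, b)]
        scores[a] += p          # weighted vote in favour of a
        scores[b] += 1.0 - p    # complementary vote in favour of b
    # Highest accumulated vote first.
    return sorted(labels, key=lambda label: -scores[label])


# Toy preference degrees for three labels (hypothetical values).
labels = ["A", "B", "C"]
pref = {("A", "B"): 0.9, ("A", "C"): 0.6, ("B", "C"): 0.2}
ranking = rank_by_weighted_voting(labels, pref)
```

Here the accumulated votes are 1.5 for A, 1.2 for C, and 0.3 for B, so the sketch ranks A first, then C, then B.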
First-order probabilistic models for coreference resolution
 In HLT/NAACL
, 2007
Cited by 84 (20 self)
Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over sets of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.
Tuning as ranking
 In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing
, 2011
Cited by 83 (0 self)
We offer a simple, effective, and scalable method for statistical machine translation parameter tuning based on the pairwise approach to ranking (Herbrich et al., 1999). Unlike the popular MERT algorithm (Och, 2003), our pairwise ranking optimization (PRO) method is not limited to a handful of parameters and can easily handle systems with thousands of features. Moreover, unlike recent approaches built upon the MIRA algorithm of Crammer and Singer (2003) (Watanabe et al., 2007; Chiang et al., 2008b), PRO is easy to implement. It uses off-the-shelf linear binary classifier software and can be built on top of an existing MERT framework in a matter of hours. We establish PRO’s scalability and effectiveness by comparing it to MERT and MIRA and demonstrate parity on both phrase-based and syntax-based systems in a variety of language pairs, using large-scale data scenarios.
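The general idea of reducing tuning to binary classification over candidate pairs can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: the helper names (`make_pairs`, `train_perceptron`), the toy candidates, and the sample count are hypothetical, and a plain perceptron stands in for the off-the-shelf linear classifier the abstract mentions.

```python
import random


def make_pairs(candidates, n_samples, rng):
    """Turn sampled candidate pairs into binary classification examples.

    candidates: list of (feature_vector, metric_score) for one source sentence.
    Each example is a feature-difference vector labelled by which candidate
    has the higher metric score (hypothetical simplification).
    """
    examples = []
    for _ in range(n_samples):
        (f1, g1), (f2, g2) = rng.sample(candidates, 2)
        if g1 == g2:
            continue  # ties carry no ranking information
        diff = [a - b for a, b in zip(f1, f2)]
        label = 1 if g1 > g2 else -1
        examples.append((diff, label))
        examples.append(([-d for d in diff], -label))  # mirrored example
    return examples


def train_perceptron(examples, dim, epochs=20):
    """Plain perceptron, used here only as a stand-in linear classifier."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


rng = random.Random(0)
# Toy n-best list: (features, metric score); values are illustrative only.
candidates = [([1.0, 0.0], 0.9), ([0.5, 0.5], 0.6), ([0.0, 1.0], 0.2)]
examples = make_pairs(candidates, 50, rng)
weights = train_perceptron(examples, dim=2)
best = max(candidates, key=lambda c: sum(w * f for w, f in zip(weights, c[0])))
```

With these toy candidates the learned weights favour the first feature, so the model's top-scoring candidate is also the one with the highest metric score.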
Bundle Methods for Regularized Risk Minimization
Cited by 78 (4 self)
A wide variety of machine learning problems can be described as minimizing a regularized risk functional, with different algorithms using different notions of risk and different regularizers. Examples include linear Support Vector Machines (SVMs), Gaussian Processes, Logistic Regression, Conditional Random Fields (CRFs), and Lasso amongst others. This paper describes the theory and implementation of a scalable and modular convex solver which solves all these estimation problems. It can be parallelized on a cluster of workstations, allows for data locality, and can deal with regularizers such as L1 and L2 penalties. In addition to the unified framework we present tight convergence bounds, which show that our algorithm converges in O(1/ε) steps to ε precision for general convex problems and in O(log(1/ε)) steps for continuously differentiable problems. We demonstrate the performance of our general purpose solver on a variety of publicly available datasets.
Guided Learning for Bidirectional Sequence Classification
, 2007
Cited by 77 (2 self)
In this paper, we propose guided learning, a new learning framework for bidirectional sequence classification. The tasks of learning the order of inference and training the local classifier are dynamically incorporated into a single Perceptron-like learning algorithm. We apply this novel learning algorithm to POS tagging. It obtains an error rate of 2.67% on the standard PTB test set, which represents 3.3% relative error reduction over the previous best result on the same data set, while using fewer features.