Results 1–10 of 36
A survey of statistical machine translation
2007
Cited by 93 (6 self)
Abstract: Statistical machine translation (SMT) treats the translation of natural language as a machine learning problem. By examining many samples of human-produced translation, SMT algorithms automatically learn how to translate. SMT has made tremendous strides in less than two decades, and many popular techniques have only emerged within the last few years. This survey presents a tutorial overview of state-of-the-art SMT at the beginning of 2007. We begin with the context of the current research, and then move to a formal problem description and an overview of the four main subproblems: translational equivalence modeling, mathematical modeling, parameter estimation, and decoding. Along the way, we present a taxonomy of some different approaches within these areas. We conclude with an overview of evaluation and notes on future directions.
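The mathematical-modeling subproblem surveyed here is classically cast as noisy-channel decoding: choose the translation e maximizing P(e) · P(f | e) for the observed foreign sentence f. A minimal sketch, assuming hypothetical toy probability tables (not taken from the survey):

```python
import math

# Hypothetical toy models for one foreign sentence f:
# a language model P(e) and a translation model P(f | e).
lm = {"the house": 0.6, "house the": 0.1, "a house": 0.3}
tm = {"the house": 0.5, "house the": 0.5, "a house": 0.2}

def decode(candidates):
    # Noisy-channel decoding: argmax over e of log P(e) + log P(f | e).
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

best = decode(lm.keys())
print(best)  # "the house": 0.6 * 0.5 = 0.30 beats 0.05 and 0.06
```

Real decoders search over exponentially many candidate strings with beam search rather than enumerating a fixed list, but the scoring decomposition is the same.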
Shared logistic normal distributions for soft parameter tying in unsupervised grammar induction
In Proceedings of NAACL-HLT 2009
2009
Cited by 65 (11 self)
Abstract: We present a family of priors over probabilistic grammar weights, called the shared logistic normal distribution. This family extends the partitioned logistic normal distribution, enabling factored covariance between the probabilities of different derivation events in the probabilistic grammar, providing a new way to encode prior knowledge about an unknown grammar. We describe a variational EM algorithm for learning a probabilistic grammar based on this family of priors. We then experiment with unsupervised dependency grammar induction and show significant improvements using our model for both monolingual learning and bilingual learning with a non-parallel, multilingual corpus.
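The logistic normal prior underlying this family draws a Gaussian vector and maps it to the simplex with a softmax, so correlation in the Gaussian induces covariance between rule probabilities (which a Dirichlet cannot express). A minimal two-dimensional sketch with illustrative parameter values:

```python
import math
import random

def sample_logistic_normal(mu, sigma, rho, rng):
    # Draw z ~ N(mu, Sigma) for a 2-D example with correlation rho,
    # then softmax z onto the simplex. Correlated z's yield correlated
    # rule probabilities.
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    z = [mu[0] + sigma[0] * z1, mu[1] + sigma[1] * z2]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
theta = sample_logistic_normal(mu=[0.0, 1.0], sigma=[0.5, 0.5], rho=0.9, rng=rng)
assert abs(sum(theta) - 1.0) < 1e-9 and all(t > 0 for t in theta)
```

The "shared" variant of the paper goes further by tying Gaussian components across partitions of the grammar's parameters; this sketch shows only the basic softmax-of-Gaussian construction.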
Improving Unsupervised Dependency Parsing with Richer Contexts and Smoothing
Cited by 46 (1 self)
Abstract: Unsupervised grammar induction models tend to employ relatively simple models of syntax when compared to their supervised counterparts. Traditionally, the unsupervised models have been kept simple due to tractability and data sparsity concerns. In this paper, we introduce basic valence frames and lexical information into an unsupervised dependency grammar inducer and show how this additional information can be leveraged via smoothing. Our model produces state-of-the-art results on the task of unsupervised grammar induction, improving over the best previous work by almost 10 percentage points.
Why Doesn’t EM Find Good HMM POS-Taggers?
In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics
2007
Cited by 35 (2 self)
Abstract: This paper investigates why the HMMs estimated by Expectation-Maximization (EM) produce such poor results as Part-of-Speech (POS) taggers. We find that the HMMs estimated by EM generally assign a roughly equal number of word tokens to each hidden state, while the empirical distribution of tokens to POS tags is highly skewed. This motivates a Bayesian approach using a sparse prior to bias the estimator toward such a skewed distribution. We investigate Gibbs Sampling (GS) and Variational Bayes (VB) estimators and show that VB converges faster than GS for this task and that VB significantly improves 1-to-1 tagging accuracy over EM. We also show that EM does nearly as well as VB when the number of hidden HMM states is dramatically reduced. We also point out the high variance in all of these estimators, and that they require many more iterations to approach convergence than is usually thought.
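The paper's core observation, that EM spreads tokens near-uniformly over hidden states while real POS tag frequencies are highly skewed, can be quantified by comparing distribution entropies. A toy illustration with made-up counts (the Zipf-like shape, not the exact numbers, is the point):

```python
import math

def entropy(counts):
    # Shannon entropy (bits) of the normalized count distribution.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Hypothetical tokens-per-tag counts: skewed, as in real corpora,
# versus the near-uniform assignment EM tends to produce.
skewed = [500, 200, 100, 50, 25, 12, 6, 3]
uniform = [112] * 8

print(entropy(skewed), entropy(uniform))  # skewed is well below log2(8) = 3
assert entropy(skewed) < entropy(uniform)
```

A sparse prior (e.g. a Dirichlet with concentration below 1 on the emission distributions) biases the estimator toward the low-entropy, skewed regime.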
Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Cited by 31 (5 self)
Abstract: We describe a method for prediction of linguistic structure in a language for which only unlabeled data is available, using annotated data from a set of one or more helper languages. Our approach is based on a model that locally mixes between supervised models from the helper languages. Parallel data is not used, allowing the technique to be applied even in domains where human-translated texts are unavailable. We obtain state-of-the-art performance for two tasks of structure prediction: unsupervised part-of-speech tagging and unsupervised dependency parsing.
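The "locally mixes" idea can be sketched as p(y | x) = Σ_l w_l(x) · p_l(y | x), where each p_l is a supervised model from a helper language and the weights are estimated on the target language. A toy sketch with hypothetical predictive distributions and mixture weights:

```python
# Two hypothetical helper-language taggers' distributions for one
# target-language token, plus locally estimated mixture weights.
helper_preds = [
    {"NOUN": 0.7, "VERB": 0.3},  # e.g. a tagger trained on helper language A
    {"NOUN": 0.2, "VERB": 0.8},  # e.g. a tagger trained on helper language B
]
weights = [0.6, 0.4]  # local mixing coefficients, learned without parallel data

def mix(preds, w):
    # Convex combination of the helpers' predictive distributions.
    tags = preds[0].keys()
    return {t: sum(wi * p[t] for wi, p in zip(w, preds)) for t in tags}

mixed = mix(helper_preds, weights)
assert abs(sum(mixed.values()) - 1.0) < 1e-9
# mixed["NOUN"] = 0.6 * 0.7 + 0.4 * 0.2 = 0.50
```

All names and numbers here are illustrative; the paper's model learns the mixing structure jointly with the target-language analysis.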
Variational Inference for Adaptor Grammars
Cited by 18 (3 self)
Abstract: Adaptor grammars extend probabilistic context-free grammars to define prior distributions over trees with “rich get richer” dynamics. Inference for adaptor grammars seeks to find parse trees for raw text. This paper describes a variational inference algorithm for adaptor grammars, providing an alternative to Markov chain Monte Carlo methods. To derive this method, we develop a stick-breaking representation of adaptor grammars, a representation that enables us to define adaptor grammars with recursion. We report experimental results on a word segmentation task, showing that variational inference performs comparably to MCMC. Further, we show a significant speedup when parallelizing the algorithm. Finally, we report promising results for a new application of adaptor grammars: dependency grammar induction.
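The stick-breaking representation mentioned here expresses an adapted nonterminal's weights over cached subtrees via successive Beta draws, each breaking off a fraction of the remaining stick mass. A minimal truncated GEM sketch for the Dirichlet-process case, assuming a concentration parameter α (the paper's full construction also handles recursion):

```python
import random

def stick_breaking(alpha, k, rng):
    # Truncated GEM(alpha): pi_i = beta_i * prod_{j<i} (1 - beta_j),
    # with beta_i ~ Beta(1, alpha). The pi_i play the role of an
    # adaptor's weights over cached subtrees.
    weights, remaining = [], 1.0
    for _ in range(k):
        b = rng.betavariate(1.0, alpha)
        weights.append(b * remaining)
        remaining *= 1.0 - b
    return weights

rng = random.Random(1)
pi = stick_breaking(alpha=2.0, k=10, rng=rng)
assert all(w > 0 for w in pi) and sum(pi) < 1.0
```

Truncating the stick at a fixed k is exactly what makes a variational treatment tractable, as opposed to the unbounded representation MCMC samplers work with.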
Semi-supervised learning of dependency parsers using generalized expectation criteria
In Proc. ACL
2009
Cited by 17 (3 self)
Abstract: In this paper, we propose a novel method for semi-supervised learning of non-projective log-linear dependency parsers using directly expressed linguistic prior knowledge (e.g. a noun’s parent is often a verb). Model parameters are estimated using a generalized expectation (GE) objective function that penalizes the mismatch between model predictions and linguistic expectation constraints. In a comparison with two prominent “unsupervised” learning methods that require indirect biasing toward the correct syntactic structure, we show that GE can attain better accuracy with as few as 20 intuitive constraints. We also present positive experimental results on longer sentences in multiple languages.
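A generalized expectation term can be sketched as a penalty on the distance between a model expectation and a target expectation stated by the linguist, e.g. "a noun's parent is a verb about 70% of the time". A minimal squared-distance version (the paper uses KL-style criteria; all numbers here are hypothetical):

```python
def ge_penalty(model_expectation, target, weight=1.0):
    # Squared-distance GE term: penalize mismatch between the model's
    # expected fraction of noun->verb attachments and the stated target.
    # Added to the (negative) log-likelihood during training.
    return weight * (model_expectation - target) ** 2

target = 0.7             # hypothetical linguistic constraint
model_expectation = 0.4  # model's current expected noun->verb attachment rate
print(ge_penalty(model_expectation, target))  # (0.4 - 0.7)^2 ≈ 0.09
```

The gradient of such a term with respect to the parser's parameters pushes the model's marginal attachment statistics toward the constraint, which is how roughly 20 such statements can substitute for labeled trees.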
Posterior Sparsity in Unsupervised Dependency Parsing
2010
Cited by 15 (2 self)
Abstract: A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed accuracy over the standard expectation maximization (EM) baseline for 9 of the languages, with an average accuracy improvement of 6%. Further, we show that for 8 out of 12 languages, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors, with an average improvement of 4%. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.
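The sparsity penalty in this line of work is typically an ℓ1/ℓ∞ norm over parent-child POS-pair posteriors: sum over pair types of the maximum posterior any edge instance of that type receives, so the objective is small only when few distinct pair types are used at all. A toy computation with hypothetical posteriors:

```python
# Posterior probabilities for candidate dependency edges, grouped by
# (parent POS, child POS) type; all values are hypothetical.
edge_posteriors = {
    ("VERB", "NOUN"): [0.9, 0.8, 0.95],
    ("NOUN", "VERB"): [0.05, 0.1],
    ("ADJ", "NOUN"): [0.6],
}

def l1_linf(posteriors):
    # l1/l-infinity: sum over pair types of the max posterior within the
    # type. A type costs (up to) 1 whether it is used once or often, so
    # minimizing this encourages reusing few types - the sparsity bias.
    return sum(max(vs) for vs in posteriors.values())

print(l1_linf(edge_posteriors))  # 0.95 + 0.1 + 0.6 ≈ 1.65
```

In the PR framework this norm constrains the posterior distribution directly at each E-step, rather than being encoded as a prior on the grammar's parameters.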
Covariance in Unsupervised Learning of Probabilistic Grammars
Cited by 13 (5 self)
Abstract: Probabilistic grammars offer great flexibility in modeling discrete sequential data like natural language text. Their symbolic component is amenable to inspection by humans, while their probabilistic component helps resolve ambiguity. They also permit the use of well-understood, general-purpose learning algorithms. There has been an increased interest in using probabilistic grammars in the Bayesian setting. To date, most of the literature has focused on using a Dirichlet prior. The Dirichlet prior has several limitations, including that it cannot directly model covariance between the probabilistic grammar’s parameters. Yet, various grammar parameters are expected to be correlated because the elements in language they represent share linguistic properties. In this paper, we suggest an alternative to the Dirichlet prior, a family of logistic normal distributions. We derive an inference algorithm for this family of distributions and experiment with the task of dependency grammar induction, demonstrating performance improvements with our priors on a set of six treebanks in different natural languages. Our covariance framework permits soft parameter tying within grammars and across grammars for text in different languages, and we show empirical gains in a novel learning setting using bilingual, non-parallel data.
Joint Bilingual Sentiment Classification with Unlabeled Parallel Corpora
Cited by 12 (1 self)
Abstract: Most previous work on multilingual sentiment analysis has focused on methods to adapt sentiment resources from resource-rich languages to resource-poor languages. We present a novel approach for joint bilingual sentiment classification at the sentence level that augments available labeled data in each language with unlabeled parallel data. We rely on the intuition that the sentiment labels for parallel sentences should be similar and present a model that jointly learns improved monolingual sentiment classifiers for each language. Experiments on multiple data sets show that the proposed approach (1) outperforms the monolingual baselines, significantly improving the accuracy for both languages by 3.44% to 8.12%; (2) outperforms two standard approaches for leveraging unlabeled data; and (3) produces (albeit smaller) performance gains when employing pseudo-parallel data from machine translation engines.