Results 1 - 10 of 19
Contrastive estimation: Training log-linear models on unlabeled data
In Proc. of ACL, 2005
"... Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and namedentity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabele ..."
Abstract - Cited by 160 (16 self)
Conditional random fields (Lafferty et al., 2001) are quite effective at sequence labeling tasks like shallow parsing (Sha and Pereira, 2003) and named-entity extraction (McCallum and Li, 2003). CRFs are log-linear, allowing the incorporation of arbitrary features into the model. To train on unlabeled data, we require unsupervised estimation methods for log-linear models; few exist. We describe a novel approach, contrastive estimation. We show that the new technique can be intuitively understood as exploiting implicit negative evidence and is computationally efficient. Applied to a sequence labeling problem—POS tagging given a tagging dictionary and unlabeled text—contrastive estimation outperforms EM (with the same feature set), is more robust to degradations of the dictionary, and can largely recover by modeling additional features.
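As a compact sketch of the objective (our notation, simplified from the paper): given observed sentences x_i with hidden label sequences y, a log-linear model with weights θ and features f, and a neighborhood N(x_i) of perturbed versions of x_i serving as implicit negative examples, contrastive estimation maximizes

\[
\mathcal{L}_{\mathrm{CE}}(\theta) \;=\; \sum_i \log \frac{\sum_{y} \exp\{\theta^\top f(x_i, y)\}}{\sum_{x' \in N(x_i)} \sum_{y} \exp\{\theta^\top f(x', y)\}},
\]

so probability mass is shifted from the neighbors onto the observed sentence, and the normalizer ranges only over the small neighborhood rather than over all possible sentences, which is what keeps unsupervised training of the log-linear model tractable.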
Generalized expectation criteria for semi-supervised learning of conditional random fields
In Proc. of ACL, pages 870-878, 2008
"... This paper presents a semi-supervised training method for linear-chain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distri ..."
Abstract - Cited by 108 (11 self)
This paper presents a semi-supervised training method for linear-chain conditional random fields that makes use of labeled features rather than labeled instances. This is accomplished by using generalized expectation criteria to express a preference for parameter settings in which the model’s distribution on unlabeled data matches a target distribution. We induce target conditional probability distributions of labels given features from both annotated feature occurrences in context and ad hoc feature-majority label assignment. The use of generalized expectation criteria allows for a dramatic reduction in annotation time by shifting from traditional instance-labeling to feature-labeling, and the methods presented outperform traditional CRF training and other semi-supervised methods when limited human effort is available.
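In rough outline (our notation, and a simplification of the paper's linear-chain formulation): for each labeled feature k with a human-supplied target label distribution \(\hat{p}_k\), let \(U_k\) be the set of unlabeled tokens on which k fires; a generalized expectation term prefers parameters whose predicted label distribution over \(U_k\) matches the target, for example

\[
\mathcal{O}(\theta) \;=\; -\sum_{k} D_{\mathrm{KL}}\!\Big( \hat{p}_k \,\Big\|\, \tfrac{1}{|U_k|}\textstyle\sum_{x \in U_k} p_\theta(\,\cdot \mid x)\Big) \;-\; \frac{\|\theta\|_2^2}{2\sigma^2},
\]

which is maximized by gradient methods; the regularizer and the particular divergence shown here are illustrative choices rather than a transcription of the paper's exact criterion.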
Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text
2006
"... This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likel ..."
Abstract - Cited by 40 (11 self)
This thesis is about estimating probabilistic models to uncover useful hidden structure in data; specifically, we address the problem of discovering syntactic structure in natural language text. We present three new parameter estimation techniques that generalize the standard approach, maximum likelihood estimation, in different ways. Contrastive estimation maximizes the conditional probability of the observed data given a “neighborhood” of implicit negative examples. Skewed deterministic annealing locally maximizes likelihood using a cautious parameter search strategy that starts with an easier optimization problem than likelihood, and iteratively moves to harder problems, culminating in likelihood. Structural annealing is similar, but starts with a heavy bias toward simple syntactic structures and gradually relaxes the bias. Our estimation methods do not make use of annotated examples. We consider their performance in both an unsupervised model selection setting, where models trained under different initialization and regularization settings are compared by evaluating the training objective on a small set of unseen, unannotated development data, and supervised model selection, where the most accurate model on the development set (now with annotations)
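As a loose picture of the third technique (our paraphrase, not the thesis's exact formulation), structural annealing can be thought of as maximizing a biased likelihood

\[
\max_{\theta}\; \sum_i \log \sum_{y} p_\theta(x_i, y)\, \exp\{\delta \cdot b(y)\},
\]

where b(y) measures how simple or local the syntactic structure y is, δ initially weights that bias heavily, and the bias is gradually relaxed over successive rounds of training so that the model eventually becomes free to prefer less local structures.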
Annealing techniques for unsupervised statistical language learning
In Proc. of ACL, 2004
"... Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (Rose et al., 1990) as an appealing alternative to the Expectation-Maximization algorithm (Dempster et al., 1977). Seeking to avoid search error, DA beg ..."
Abstract - Cited by 35 (8 self)
Exploiting unannotated natural language data is hard largely because unsupervised parameter estimation is hard. We describe deterministic annealing (DA; Rose et al., 1990) as an appealing alternative to the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). Seeking to avoid search error, DA begins by globally maximizing an easy concave function and maintains a local maximum as it gradually morphs the function into the desired non-concave likelihood function. Applying DA to parsing and tagging models is shown to be straightforward; significant improvements over EM are shown on a part-of-speech tagging task. We describe a variant, skewed DA, which can incorporate a good initializer when it is available, and show significant improvements over EM on a grammar induction task.
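A minimal, generic illustration of the annealing schedule (a DA-EM sketch for a toy two-component Gaussian mixture, not the paper's tagging or parsing models): the E-step posteriors are flattened by a temperature beta, and beta is swept from near zero toward one so the objective gradually morphs from an easy, nearly concave surface into the true non-concave likelihood.

```python
import numpy as np

def da_em_gmm(x, inner_iters=30, betas=np.linspace(0.05, 1.0, 10), seed=0):
    """Deterministic-annealing EM for a toy 1-D, two-component Gaussian mixture."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=2, replace=False)      # component means
    sigma, pi = np.full(2, x.std()), np.full(2, 0.5)

    for beta in betas:
        mu = mu + rng.normal(scale=1e-3, size=2)   # tiny jitter to break symmetry
        for _ in range(inner_iters):
            # E-step: component log-densities, flattened by the temperature beta
            log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * sigma**2)
                     - 0.5 * (x[:, None] - mu)**2 / sigma**2)
            log_r = beta * log_p
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)

            # M-step: the usual responsibility-weighted updates
            nk = r.sum(axis=0)
            pi = nk / nk.sum()
            mu = (r * x[:, None]).sum(axis=0) / nk
            sigma = np.sqrt((r * (x[:, None] - mu)**2).sum(axis=0) / nk) + 1e-8
    return pi, mu, sigma

# Toy usage: a clearly bimodal sample.
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])
print(da_em_gmm(data))
```

At small beta the responsibilities are nearly uniform and both components sit near the overall mean; as beta approaches one they split apart, which is the "easy problem first, hard problem last" behavior the abstract describes.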
Guiding unsupervised grammar induction using contrastive estimation
In Proc. of IJCAI Workshop on Grammatical Inference Applications, 2005
"... We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization ..."
Abstract - Cited by 34 (9 self)
We describe a novel training criterion for probabilistic grammar induction models, contrastive estimation (CE) [Smith and Eisner, 2005], which can be interpreted as exploiting implicit negative evidence and includes a wide class of likelihood-based objective functions. This criterion is a generalization of the function maximized by the Expectation-Maximization algorithm [Dempster et al., 1977]. CE is a natural fit for log-linear models, which can include arbitrary features but for which EM is computationally difficult. We show that, using the same features, log-linear dependency grammar models trained using CE can drastically outperform EM-trained generative models on the task of matching human linguistic annotations (the MATCHLINGUIST task). The selection of an implicit negative evidence class—a “neighborhood”—appropriate to a given task has strong implications, but with a good neighborhood one can target the objective of grammar induction to a specific application.
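As a toy illustration of where the implicit negative evidence comes from, the sketch below (function name ours) generates an adjacent-transposition ("TRANS1"-style) neighborhood of a sentence; each perturbed variant competes with the observed sentence in the contrastive objective.

```python
def trans1_neighborhood(sentence):
    """Variants of `sentence` obtained by swapping one adjacent pair of words,
    plus the sentence itself; the variants act as implicit negative examples."""
    words = sentence.split()
    neighbors = [sentence]
    for i in range(len(words) - 1):
        swapped = words[:i] + [words[i + 1], words[i]] + words[i + 2:]
        neighbors.append(" ".join(swapped))
    return neighbors

print(trans1_neighborhood("the dog barked loudly"))
# ['the dog barked loudly', 'dog the barked loudly',
#  'the barked dog loudly', 'the dog loudly barked']
```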
Leveraging the Margin More Carefully
In International Conference on Machine Learning (ICML), 2004
"... Boosting is a popular approach for building accurate classifiers. Despite the initial popular belief, boosting algorithms do exhibit overfitting and are sensitive to label noise. Part of the sensitivity of boosting algorithms to outliers and noise can be attributed to the unboundedness of the ..."
Abstract - Cited by 20 (0 self)
Boosting is a popular approach for building accurate classifiers. Despite initial popular belief, boosting algorithms do exhibit overfitting and are sensitive to label noise. Part of the sensitivity of boosting algorithms to outliers and noise can be attributed to the unboundedness of the margin-based loss functions they employ.
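To make the unboundedness point concrete (a generic numerical illustration; the bounded surrogate below is our illustrative choice, not the paper's loss): a single noisy example with margin -5 contributes exp(5) ≈ 148 to AdaBoost's exponential loss and therefore dominates the objective, whereas a bounded surrogate caps any one example's influence.

```python
import math

def exp_loss(margin):
    """AdaBoost's exponential surrogate: unbounded as the margin goes to -inf."""
    return math.exp(-margin)

def sigmoid_loss(margin):
    """One possible bounded surrogate (illustrative only): never exceeds 1."""
    return 1.0 / (1.0 + math.exp(margin))

margins = [2.0, 1.0, 0.5, -5.0]   # the last point is a badly mislabeled outlier
print(sum(exp_loss(m) for m in margins))      # ~149.5, dominated by the outlier
print(sum(sigmoid_loss(m) for m in margins))  # ~1.76, the outlier adds at most 1
```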
Convex relaxation methods for graphical models: Lagrangian and maximum entropy approaches
2008
"... Graphical models provide compact representations of complex probability distributions of many random variables through a collection of potential functions defined on small subsets of these variables. This representation is defined with respect to a graph in which nodes represent random variables and ..."
Abstract - Cited by 17 (2 self)
Graphical models provide compact representations of complex probability distributions of many random variables through a collection of potential functions defined on small subsets of these variables. This representation is defined with respect to a graph in which nodes represent random variables and edges represent the interactions among those random variables. Graphical models provide a powerful and flexible approach to many problems in science and engineering, but also present serious challenges owing to the intractability of optimal inference and estimation over general graphs. In this thesis, we consider convex optimization methods to address two central problems that commonly arise for graphical models. First, we consider the problem of determining the most probable configuration—also known as the maximum a posteriori (MAP) estimate—of all variables in a graphical model, conditioned on (possibly noisy) measurements of some variables. This general problem is intractable, so we consider a Lagrangian relaxation (LR) approach to obtain a tractable dual problem. This involves using the Lagrangian decomposition technique
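A schematic of the Lagrangian relaxation idea in generic notation (not a transcription of the thesis): split the MAP objective into tractable pieces defined on copies of the variables, relax the requirement that the copies agree using multipliers λ, and minimize the resulting upper bound,

\[
\max_{x} \sum_{k} f_k(x)
\;\le\;
\min_{\lambda:\,\sum_k \lambda_k = 0}\;\; \sum_{k} \max_{x^{k}} \Big( f_k(x^{k}) + \lambda_k^{\top} x^{k} \Big),
\]

where \(x^k\) is the (indicator-encoded) copy of the variables used by piece k and each inner maximization is over a tractable subgraph such as a tree. The outer minimization can be carried out by subgradient or related methods, and the bound is tight, recovering the MAP configuration, whenever the subproblem solutions agree.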
Mixtures of Conditional Maximum Entropy Models
In Proc. of ICML, 2003
"... Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classi cation problems to better handle the case where complex data distributions arise from a mixture of simpler u ..."
Abstract - Cited by 14 (8 self)
Driven by successes in several application areas, maximum entropy modeling has recently gained considerable popularity. We generalize the standard maximum entropy formulation of classification problems to better handle the case where complex data distributions arise from a mixture of simpler underlying (latent) distributions. We develop a theoretical framework for characterizing data as a mixture of maximum entropy models. We formulate a maximum-likelihood interpretation of mixture model learning, and derive a generalized EM algorithm to solve the corresponding optimization problem. We present empirical results for a number of data sets showing that modeling the data as a mixture of latent maximum entropy models gives significant improvement over the standard, single-component maximum entropy approach.
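In outline (our notation, omitting details of the paper's formulation), the model is a convex combination of conditional maximum entropy components,

\[
p(y \mid x) \;=\; \sum_{m=1}^{M} \pi_m\, p_m(y \mid x),
\qquad
p_m(y \mid x) \;=\; \frac{\exp\{\theta_m^{\top} f(x, y)\}}{\sum_{y'} \exp\{\theta_m^{\top} f(x, y')\}},
\]

and generalized EM alternates between computing, for each training pair, the posterior responsibility of each component, and re-fitting each component's maxent weights (and the mixing proportions π) on the responsibility-weighted data.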
Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling via Directed Markov Random Fields
In Proceedings of ICML, 2005
"... We present a directed Markov random field (MRF) model that combines n-gram models, probabilistic context free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponenti ..."
Abstract - Cited by 9 (3 self)
We present a directed Markov random field (MRF) model that combines n-gram models, probabilistic context-free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA) for the purpose of statistical language modeling. Even though the composite directed MRF model potentially has an exponential number of loops and becomes a context-sensitive grammar, we are nevertheless able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models. We generalize various smoothing techniques to alleviate the sparseness of n-gram counts in cases where there are hidden variables. We also derive an analogous algorithm to calculate the probability of an initial subsequence of a sentence generated by the composite language model. Our experimental results on the Wall Street Journal corpus show that we obtain significant reductions in perplexity compared to the state-of-the-art baseline trigram model with Good-Turing and Kneser-Ney smoothing.
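Schematically, and only as a loose simplification in our notation (the paper's directed-MRF construction is more involved), the composite model can be pictured as a normalized product of the three component scores,

\[
p(W) \;\propto\; p_{n\text{-gram}}(W)\;\, p_{\mathrm{PCFG}}(W)\;\, p_{\mathrm{PLSA}}(W),
\]

with the generalized inside-outside algorithm supplying, in cubic time, the expectations that EM-style parameter estimation of such a combination requires.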