Results 1 - 10
of
97
Convolution Kernels for Natural Language
- Advances in Neural Information Processing Systems 14
, 2001
"... We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural ..."
Abstract
-
Cited by 204 (7 self)
- Add to MetaCart
We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron
, 2002
"... This paper introduces new learning algorithms for natural language processing based on the perceptron algorithm. We show how the algorithms can be efficiently applied to exponential sized representations of parse trees, such as the "all subtrees" (DOP) representation described by (Bod 98), or a r ..."
Abstract
-
Cited by 164 (5 self)
- Add to MetaCart
This paper introduces new learning algorithms for natural language processing based on the perceptron algorithm. We show how the algorithms can be efficiently applied to exponential sized representations of parse trees, such as the "all subtrees" (DOP) representation described by (Bod 98), or a representation tracking all sub-fragments of a tagged sentence. We give experimental results showing significant improvements on two tasks: parsing Wall Street Journal text, and named-entity extraction from web data.
The Negotiation and Acquisition of Recursive Grammars as a Result of Competition Among Exemplars
- Linguistic Evolution through Language Acquisition: Formal and Computational Models
, 1999
"... this paper is an investigation of how recursive communication systems can come to be. In particular, the investigation explores the possibility that such a system could emerge among the members of a population as the result of a process I characterize as "negotiation," because each individual both c ..."
Abstract
-
Cited by 55 (0 self)
- Add to MetaCart
this paper is an investigation of how recursive communication systems can come to be. In particular, the investigation explores the possibility that such a system could emerge among the members of a population as the result of a process I characterize as "negotiation," because each individual both contributes to, and conforms with, the system 1
Parameter Estimation for Statistical Parsing Models: Theory and Practice of Distribution-Free Methods
, 2001
"... A fundamental problem in statistical parsing is the choice of criteria and algorithms used to estimate the parameters in a model. The predominant approach in computational linguistics has been to use a parametric model with some variant of maximum-likelihood estimation. The assumptions under which m ..."
Abstract
-
Cited by 45 (9 self)
- Add to MetaCart
A fundamental problem in statistical parsing is the choice of criteria and algorithms used to estimate the parameters in a model. The predominant approach in computational linguistics has been to use a parametric model with some variant of maximum-likelihood estimation. The assumptions under which maximum-likelihood estimation is justified are arguably quite strong. This paper discusses the statistical theory underlying various parameter-estimation methods, and gives algorithms which depend on alternatives to (smoothed) maximumlikelihood estimation. We first give an overview of results from statistical learning theory. We then show how important concepts from the classification literature -- specifically, generalization results based on margins on training data -- can be derived for parsing models. Finally, we describe parameter estimation algorithms which are motivated by these generalization bounds.
Parsing with the Shortest Derivation
- Proceedings COLING-2000
, 2000
"... tens @ scs.lecd s.ac.uk Common wisdom has it that tile bias of stochastic grammars in favor of shorter deriwttions of a sentence is hamfful and should be redressed. We show that the common wisdom is wrong for stochastic grammars that use elementary trees instead o1 ' conlext-l'ree rules, such as Sto ..."
Abstract
-
Cited by 34 (12 self)
- Add to MetaCart
tens @ scs.lecd s.ac.uk Common wisdom has it that tile bias of stochastic grammars in favor of shorter deriwttions of a sentence is hamfful and should be redressed. We show that the common wisdom is wrong for stochastic grammars that use elementary trees instead o1 ' conlext-l'ree rules, such as Stochastic Tree-Substitution Grammars used by Data-Oriented Parsing models. For such grammars a non-probabilistic metric based on tile shortest derivation outperforms a probabilistic metric on the ATIS and OVIS corpora, while it obtains competitive results on the Wall Street Journal (WSJ) corpus. This paper also contains the first publislmd experiments with DOP on the WSJ. 1.
The DOP estimation method is biased and inconsistent
- Computational Linguistics
, 1998
"... this paper, an estimator is unbiased iff P E # # (#(X)) = P # # for all # , i.e., its expected parameter estimate specifies the same distribution as the true parameters. Similarly, the loss function is mean squared difference between the "true" and estimated distributions, i.e., if# is the eve ..."
Abstract
-
Cited by 30 (0 self)
- Add to MetaCart
this paper, an estimator is unbiased iff P E # # (#(X)) = P # # for all # , i.e., its expected parameter estimate specifies the same distribution as the true parameters. Similarly, the loss function is mean squared difference between the "true" and estimated distributions, i.e., if# is the event space (in DOP1, the space of all phrase-structure trees) then: ### P # #(#)(P # #(#) P #(x) (#))
ABL: Alignment-Based Learning
, 2000
"... This pal)or int;roduces a new type of grammar learning algorit;hm, insl)ircd l)y sl,ring edii, dis- tan(;c (Wagner an(t Fis(:hcr, 1974). The algorithm takes a (:oft)us of fiat senl,en(:cs as intml, and rcLurns a corpus of labelled, 1)ra(:keted senl, en(:es. Th( lnel,hod works on pairs of Lured sellt ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
This pal)or int;roduces a new type of grammar learning algorit;hm, insl)ircd l)y sl,ring edii, dis- tan(;c (Wagner an(t Fis(:hcr, 1974). The algorithm takes a (:oft)us of fiat senl,en(:cs as intml, and rcLurns a corpus of labelled, 1)ra(:keted senl, en(:es. Th( lnel,hod works on pairs of Lured sellt,ellCeS l,ha[, have oBe o1: illore words in (:ommon. When t, wo sentences are (tivi(led int,o t)arLs i;haL m'e Lhc same in 1)ol, h s(mLen(:es and t)arLs that m:e (litlrenL, this interreal,ion is used to find ])m'Ls l, haL are hd;cr(:hmgeablc. These t)arLs m'e tak(m as possible (:onsLii, uenLs same type. Afi,er this aligmnent learning step, the sele(:tion learning s(,c 1) s(l(z(:l,s i,he mosL at)le (:onsl;ihmnl;s fi'om all possible (:onsLiLuent,s. This method was used 1,o booLsLra t) stru(:hrc on the A.TIS (:oftres (Mm'(:us et, al., 1993) and on the OVI'S 1 corpus (Bornmina eL al., 1997). While Lhc results are en(:om'aging (we o})l, aincd up t,o 89.25 % non-crossing l)ra(:ket,s 1)rc(:ision), this paper will 1)oini; ouL some of the shorl,COlnings of our apl)rom:h and will suggest 1)ossible sohd,ions.
Probabilistic Syntax
, 2002
"... istic methods for syntax, just as for a long time McCarthy and Hayes (1969) discouraged exploration of probabilistic methods in Artificial Intelligence. Among his arguments were that: (i) Probabilistic models wrongly mix in world knowledge (New York occurs more in text than Dayton, Ohio, but for no ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
istic methods for syntax, just as for a long time McCarthy and Hayes (1969) discouraged exploration of probabilistic methods in Artificial Intelligence. Among his arguments were that: (i) Probabilistic models wrongly mix in world knowledge (New York occurs more in text than Dayton, Ohio, but for no linguistic reason), (ii) Probabilistic models don't model grammaticality (neither Colorless green ideas sleep furiously nor Furiously sleep ideas green colorless have previously been uttered -- and hence must be estimated to have probability zero, Chomsky wrongly assumes -- but the former is grammatical while the latter is not, and (iii) Use of probabilities does not meet the goal of describing the mind-internal I-language as opposed to the observed-in-the-world E-language. This chapter is not meant to be a detailed critique of Chomsky's arguments -- Abney (1996) provides a survey and a rebuttal, and Pereira (2000) has further useful discussion -- but some of these concerns are still importa
An Efficient Implementation of a New DOP Model
- In EACL
, 2003
"... Two apparently opposing DOP models exist in the literature: one which computes the parse tree involving the most frequent subtrees from a treebank and one which computes the parse tree involving the fewest subtrees from a treebank. This paper proposes an integration of the two models which ou ..."
Abstract
-
Cited by 27 (6 self)
- Add to MetaCart
Two apparently opposing DOP models exist in the literature: one which computes the parse tree involving the most frequent subtrees from a treebank and one which computes the parse tree involving the fewest subtrees from a treebank. This paper proposes an integration of the two models which outperforms each of them separately. Together with a PCFGreduction of DOP we obtain improved accuracy and efficiency on the Wall Street Journal treebank. Our results show an 11% relative reduction in error rate over previous models, and an average processing time of 3.6 seconds per WSJ sentence.

