Building a Large Annotated Corpus of English: The Penn Treebank
- Computational Linguistics, 1993
Cited by 1654 (9 self)
There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora. Such corpora are beginning to serve as important research tools for investigators in natural language processing, speech recognition, and integrated spoken language systems, as well as in theoretical linguistics. Annotated corpora promise to be valuable for enterprises as diverse as the automatic construction of statistical models for the grammar of the written and the colloquial spoken language, the development of explicit formal theories of the differing grammars of writing and speech, the investigation of prosodic phenomena in speech, and the evaluation and comparison of the adequacy of parsing models.
In this paper, we review our experience with constructing one such large annotated corpus: the Penn Treebank, a corpus consisting of over 4.5 million words of American English. During the first three-year phase of the Penn Treebank Project (1989-1992), this corpus has been annotated for part-of-speech (POS) information. In addition, over half of it has been annotated for skeletal syntactic structure. These materials are available to members of the Linguistic Data Consortium; for details, see Section 5.1.
Toward a Connectionist Model of Recursion in Human Linguistic Performance
- 1999
Cited by 90 (7 self)
Naturally occurring speech contains only a limited amount of complex recursive structure, and this is reflected in the empirically documented difficulties that people experience when processing such structures. We present a connectionist model of human performance in processing recursive language structures. The model is trained on simple artificial languages. We find that the qualitative performance profile of the model matches human behavior, both on the relative difficulty of center-embedding and cross-dependency, and between the processing of these complex recursive structures and right-branching recursive constructions. We analyze how these differences in performance are reflected in the internal representations of the model by performing discriminant analyses on these representations both before and after training. Furthermore, we show how a network trained to process recursive structures can also generate such structures in a probabilistic fashion. This work suggests a novel expla...
Finite-State Approximation Of Phrase Structure Grammars
- 1991
Cited by 42 (1 self)
Phrase-structure grammars are an effective representation for important syntactic and semantic aspects of natural languages, but are computationally too demanding for use as language models in real-time speech recognition. An algorithm is described that computes finite-state approximations for context-free grammars and equivalent augmented phrase-structure grammar formalisms. The approximation is exact for certain context-free grammars generating regular languages, including all left-linear and right-linear context-free grammars. The algorithm has been used to construct finite-state language models for limited-domain speech recognition tasks.
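To illustrate why right-linear grammars can be represented exactly by a finite-state machine, here is a minimal sketch of the standard textbook construction. It is not the approximation algorithm described in the abstract. The rule encoding, the `FINAL` marker, the function names, and the toy grammar are all assumptions made for this example.

```python
FINAL = "<final>"  # hypothetical accepting state, not taken from the paper


def right_linear_to_nfa(rules):
    """Build NFA transitions from right-linear rules.

    Each rule is (lhs, terminal, next_nonterminal_or_None):
      A -> a B  becomes a transition A --a--> B
      A -> a    becomes a transition A --a--> FINAL
    """
    transitions = {}
    for lhs, terminal, nxt in rules:
        target = nxt if nxt is not None else FINAL
        transitions.setdefault((lhs, terminal), set()).add(target)
    return transitions


def accepts(transitions, start, words):
    """Simulate the NFA on a word sequence, tracking the set of live states."""
    states = {start}
    for word in words:
        states = {t for s in states for t in transitions.get((s, word), ())}
    return FINAL in states


# Toy grammar (illustrative only): S -> the N, N -> old N, N -> dog
rules = [("S", "the", "N"), ("N", "old", "N"), ("N", "dog", None)]
nfa = right_linear_to_nfa(rules)
```

With this toy grammar, `accepts(nfa, "S", ["the", "old", "old", "dog"])` holds, while the incomplete string `["the", "old"]` is rejected.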
Two Principles of Parse Preference
- 1990
Cited by 42 (4 self)
this paper is first to present a compendium of many of these heuristics and secondly to propose two principles that seem to underlie the heuristics. The first will be useful to researchers engaged in building grammars of similarly broad coverage. The second is of psychological interest and may be a guide for estimating parse preferences for newly discovered ambiguities for which we lack the experience to decide among on a more empirical basis. The mechanism for implementing parse preference heuristics is quite simple. Terminal nodes of a parse tree acquire a score (usually 0) from the lexical entry for the word sense. When a nonterminal node of a parse tree is constructed, it is given an initial score which is the sum of the scores of its child nodes. Various conditions are checked during the construction of the node and, as a result, a score of 20, 10, 3, -3, -10, or -20 may be added to the initial score. The score of the parse is the score of its root node. The parses of ambiguous sentences are ranked according to their scores. Although simple, this method has been very successful. In this paper, however, rather than describe the heuristics in terms this detailed, we will describe them in terms of the preferences among the alternate structures that motivated our scoring schemes.
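The scoring mechanism described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `Node` class, the function names, the two example trees, and the particular adjustments of +3 and -3 are assumptions chosen only to show how scores propagate to the root and how parses are ranked.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    lexical_score: int = 0  # terminal nodes: score from the lexical entry (usually 0)
    adjustment: int = 0     # nonterminal nodes: heuristic bonus or penalty


def score(node):
    """Terminal: lexical score. Nonterminal: sum of child scores plus adjustment."""
    if not node.children:
        return node.lexical_score
    return sum(score(child) for child in node.children) + node.adjustment


def rank_parses(roots):
    """Order competing parses by the score of their root node, best first."""
    return sorted(roots, key=score, reverse=True)


def leaf(word):
    return Node(word)


# Two hypothetical analyses of "saw the man with the telescope".
# The +3 / -3 adjustments are invented for illustration only.
verb_attach = Node("S-verb-attach", [
    Node("VP", [leaf("saw"),
                Node("NP", [leaf("the"), leaf("man")]),
                Node("PP", [leaf("with"), leaf("the"), leaf("telescope")])],
         adjustment=3),
])
noun_attach = Node("S-noun-attach", [
    Node("VP", [leaf("saw"),
                Node("NP", [leaf("the"), leaf("man"),
                            Node("PP", [leaf("with"), leaf("the"), leaf("telescope")])],
                     adjustment=-3)]),
])

ranked = rank_parses([noun_attach, verb_attach])
```

Because each nonterminal adds its adjustment to the sum of its children, a bonus or penalty applied deep in the tree carries through to the root, which is what makes ranking by root score sufficient.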
The finite connectivity of linguistic structure
- In
, 1994
Cited by 39 (2 self)
While there is no interesting limitation on the degree of right-embedding in acceptable sentences, center-embedding is quite severely restricted. Similarly, while there is no interesting bound on the number of nouns that can occur in acceptable noun compounds, there is a very low bound on the number of causative morphemes that can occur in the verb compounds of agglutinative languages. Turning to the clause-final verb clusters of West Germanic languages, we find another similar bound. A cluster including verbs from one embedded clause may be acceptable, but clusters formed from the verbs of two or three or even more deeply embedded clauses are much more awkward (regardless of whether the subject-verb dependencies are crossing or nested). And in languages that allow multiple wh-extractions from a single clause, extractions of more than one element with a given case quickly become unacceptable. More careful experimental study of the nature of these limitations is needed, in a range of languages, but here a preliminary attempt is made to subsume them all under a single generalization, a version of the familiar idea that the human parsing
Connectionist Syntactic Parsing Using Temporal Variable Binding
- Journal of Psycholinguistic Research
Cited by 27 (3 self)
Recent developments in connectionist architectures for symbolic computation have made it possible to investigate parsing in a connectionist network while still taking advantage of the large body of work on parsing in symbolic frameworks. The work discussed here investigates syntactic parsing in the temporal synchrony variable binding model of symbolic computation in a connectionist network. This computational architecture solves the basic problem with previous connectionist architectures, while keeping their advantages. However, the architecture does have some limitations, which impose constraints on parsing in this architecture. Despite these constraints, the architecture is computationally adequate for syntactic parsing. In addition, the constraints make some significant linguistic predictions. These arguments are made using a specific parsing model. The extensive use of partial descriptions of phrase structure trees is crucial to the ability of this model to recover the syntactic st...
Parsing with soft and hard constraints on dependency length
- In Proceedings of the International Workshop on Parsing Technologies (IWPT), 2005
Cited by 27 (4 self)
In lexicalized phrase-structure or dependency parses, a word’s modifiers tend to fall near it in the string. We show that a crude way to use dependency length as a parsing feature can substantially improve parsing speed and accuracy in English and Chinese, with more mixed results on German. We then show similar improvements by imposing hard bounds on dependency length and (additionally) modeling the resulting sequence of parse fragments. This simple “vine grammar” formalism has only finite-state power, but a context-free parameterization with some extra parameters for stringing fragments together. We exhibit a linear-time chart parsing algorithm with a low grammar constant.
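As a rough illustration of what a dependency length is and what a hard bound does to a parse, the sketch below measures lengths on a head-index list and detaches any dependent whose head is too far away. It is a post hoc illustration, not the vine parsing algorithm from the paper. The head-list encoding, the function names, and the example sentence are assumptions.

```python
def dependency_lengths(heads):
    """heads[i] is the 1-based position of the head of word i+1; 0 marks a root.

    Returns the string distance between each non-root word and its head.
    """
    return [abs(h - pos) for pos, h in enumerate(heads, start=1) if h != 0]


def apply_hard_bound(heads, bound):
    """Detach every dependent whose head is more than `bound` words away.

    Detached words become roots of their own fragments (head 0).
    """
    return [h if h != 0 and abs(h - pos) <= bound else 0
            for pos, h in enumerate(heads, start=1)]


def fragment_roots(heads):
    """Positions of the words that head a parse fragment."""
    return [pos for pos, h in enumerate(heads, start=1) if h == 0]


# Hypothetical analysis of "The dog that chased cats barked":
# The->dog, dog->barked, that->chased, chased->dog, cats->chased, barked=root
heads = [2, 6, 4, 2, 4, 0]
lengths = dependency_lengths(heads)      # [1, 4, 1, 2, 1]
bounded = apply_hard_bound(heads, 2)     # the length-4 arc dog->barked is cut
roots = fragment_roots(bounded)          # [2, 6]
```

In this example only one dependency exceeds the bound, so the sentence splits into two fragments, headed by "dog" and "barked"; most dependencies are short, which matches the abstract's observation that modifiers tend to fall near their heads.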
Memory limitations and structural forgetting: the perception of complex ungrammatical sentences as grammatical
- Language and Cognitive Processes, 1999
Cited by 19 (1 self)
Results from an English acceptability-rating experiment are presented which demonstrate that people find doubly nested relative clause structures just as acceptable when only two verb phrases are included instead of the grammatically required three. Furthermore, the experiment shows that such sentences are acceptable only when the intermediate verb phrase is omitted. A number of specific accounts of forgetting are considered. Two early proposed theories of this effect, the disappearing syntactic nodes hypothesis (Frazier, 1985) and the least recent nodes hypothesis (Gibson, 1991), are not consistent with the experimental results. The results, together with other acceptability patterns, suggest that the representations that are retained (and subsequently forgotten) in processing sentences consist of the lexical wordstrings processed thus far. Three possible accounts of the results are considered: (1) the high memory cost pruning hypothesis within the framework of Gibson (1998); (2) a recency/primacy account; and (3) a connectionist account (Christiansen & Chater, in press).

