12 citations found.
Baayen, R. H., R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA.


This paper is cited in the following contexts:
TiMBL: Tilburg Memory-Based Learner - version 4.0.. - Daelemans, Zavrel.. (2001)

....nouns on the basis of their form. For these experiments, we collect a representation of nouns in terms of their syllable structure as training material (footnote 1). For each of the last three syllables of the noun, four different features are collected: whether the syllable is stressed or not ....

Footnote 1: These words were collected from the CELEX lexical database (Baayen, Piepenbrock, and van Rijn, 1993).

Table 4.1: Allomorphic variation in Dutch diminutives.

Noun             Form        Suffix
huis (house)     huisje      je
man (man)        mannetje    etje
raam (window)    raampje     pje
woning (house)   woninkje    kje
baan (job)       baantje     tje

Baayen, R. H., R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA.


Weighted Probability Distribution Voting, an introduction - van Halteren (1999)

....for comparison, with TiMBL (Daelemans et al. 1999b). In the grapheme to phoneme with stress task (GS) the system has to suggest the pronunciation of an English grapheme in a specific word and indicate whether it should be stressed. The Case collection is derived from the CELEX database (cf. Baayen et al. 1993). The Indicators are the grapheme in question and up to three previous and three next graphemes (see Table 1). These Indicators have up to 42 different values (see Table 4 below). The output consists of one of 159 Classes in which phoneme and stress information are combined. The training set ....

R. H. Baayen, R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CDROM. Linguistic Data Consortium, Philadelphia, PA.


Bootstrapping a Tagged Corpus through Combination of.. - Zavrel, Daelemans (2000)

....WOTAN 1 (347 tags) or WOTANLITE (both with 641424 tokens of training data) or WOTAN 2 (1256 tags, and a slightly more modest 126803 tokens of training data) (Berghmans, 1994; Van Halteren, 1999) tagsets. Furthermore we will use the ambiguous lexical categories (footnote 2) of words taken from the CELEX (Baayen et al. 1993) lexical database. The section of this database that we use contains 300837 distinct word forms. Footnote 1: These were annotated by manually correcting tags produced by the first COMBI BOOTSTRAP taggers. Footnote 2: Not including function words like determiners, pronouns, etc., i.e. adjective, adverb, noun, number, ....

Baayen, R. H., R. Piepenbrock, and H. van Rijn, 1993. The CELEX lexical data base on CD-ROM. Philadelphia, PA: Linguistic Data Consortium.


Integrating Seed Names and Ngrams for a Named Entity List .. - Buchholz, van den Bosch (2000)   (1 citation)

....these words do not directly denote named entities in a strict sense, they are nevertheless important for information extraction: a text that talks about American politics talks about America in a way. The seed list for the adjectival class was taken from the electronic Dutch dictionary CELEX (Baayen et al. 1993) from which we extracted all adjectives starting with a capital letter. 2.4. Training instances For each type that appears on any of the seed lists, we extracted all tokens together with a context of four words to the left and four words to the right from the corpus. 3 This yields 1,513,939 ....

....case letter stances that are classified correctly. more often than with a upper case. Along the same reasoning we can determine that in the sentence beginning "Maar Jan ..." ("But John ...") only Jan and neither Maar nor Maar Jan is a name. If we use a list of closed class words (e.g. from CELEX, (Baayen et al. 1993)) we can even filter out similar cases in which the beginning of the sentence is not clearly marked (footnote 7). A special problem is formed by last name prefixes (e.g. van den, cf. the second author's name) that are very common in Dutch and have to be written with a lower case letter if a first name ....

Baayen, R. H., R. Piepenbrock, and H. van Rijn, 1993. The CELEX lexical data base on CD-ROM. Philadelphia, PA: Linguistic Data Consortium.


Instance-Family Abstraction in Memory-Based Language Learning - van den Bosch

....instance's middle letter. An example instance and its classifications is h e a r t s , mapping to class A: which denotes an elongated short a sound to which the middle letter a maps. The data used in the experiments described here are derived from the CELEX lexical data base of English (Baayen, Piepenbrock, and van Rijn, 1993). Grapheme phoneme conversion combined with stress assignment (henceforth gs) is similar to the gp task, but differs in two respects: (i) the windows only span seven letters, and (ii) the class represents a combined phoneme and a stress marker. Except for the data (derived from CELEX) the ....

.... A GRAPHEME PHONEME SUBSET. First, we performed a series of experiments concerning the application of a range of careful abstracting methods to grapheme phoneme conversion (gp). From an original instance base of 77,565 word pronunciation pairs extracted from the CELEX lexical data base of English (Baayen, Piepenbrock, and van Rijn, 1993) we created ten equal sized data sets each containing 7,757 word pronunciation pairs. Using windowing and partitioning of this data in 90% training and 10% test instances, ten training and test sets are derived containing on average 60,813 and 6,761 instances, respectively. These are token counts; ....

Baayen, R. H., R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA.


Machine Learning Of Word Pronunciation: The Case.. - Busser, Daelemans.. (1999)

....maps to class label 0A: denoting an elongated short a sound which is not the first phoneme of a syllable receiving primary stress. In this study, we chose a fixed window width of seven letters, which offers sufficient context information for adequate generalization performance [16]. From celex [2] we extracted, on the basis of its lexical data base of 77,565 words with their corresponding phonemic transcription with stress markers, a data base containing 675,745 cases. The number of classes (i.e. all possible combinations of phonemes and stress markers) occurring in this data base is ....

R. H. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA, 1993.


Careful Abstraction from Instance Families in Memory-Based.. - van den Bosch (1999)

....task is known for its sensitivity to abstraction, so that it is likely that any differences in abstraction methods show up most clearly in results obtained with this task. From an original instance base of 77,565 word pronunciation pairs extracted from the CELEX lexical data base of Dutch (Baayen, Piepenbrock, and van Rijn, 1993) we created ten equal sized data sets each containing 7,757 word pronunciation pairs. Using windowing (cf. Section 2) and partitioning of this data in 90% training and 10% test instances, ten training and test sets are derived containing on average 60,813 and 6,761 instances, respectively. These ....

Baayen, R. H., R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA.


Forgetting Exceptions is Harmful in Language Learning - Daelemans, van den Bosch.. (1999)   (24 citations)

....is not the first phoneme of a syllable receiving primary stress. In this study, we chose a fixed window width of seven letters, which offers sufficient context information for adequate performance (in terms of the upper bound on error demanded by applications in speech technology). From celex (Baayen, Piepenbrock, and van Rijn, 1993) we extracted, on the basis of the standard word base of 77,565 words with their corresponding transcription, a data base containing 675,745 instances. The number of classes (i.e. all possible combinations of phonemes and stress markers) occurring in this data base is 159. 3.2. POS: ....

Baayen, R. H., R. Piepenbrock, and H. van Rijn. 1993. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA.


Rapid Development of NLP Modules with Memory-Based.. - Daelemans, van den.. (1998)   (7 citations)

....takes some time, depending on the completeness of annotation of the source data. For many languages, well annotated electronic pronunciation dictionaries are available that can be used for this purpose directly; for example, our English and Dutch data was extracted from the CELEX lexical data base [4]. Preprocessing of this type of data typically does not take more than a few days. 3.2. MBT: Part of speech tagging. The MBT tagger generator [15, 14] takes an annotated corpus as input, and produces a lexicon and memory based part of speech (POS) tagger as output. The problem of POS tagging ....

R. H. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA, 1993.


TiMBL: Tilburg Memory-Based Learner - version 3.0.. - Daelemans, Zavrel.. (2000)

No context found.

R. H. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA, 1993.


TiMBL: Tilburg Memory Based Learner - version 2.0 -.. - Daelemans, Zavrel.. (1999)

No context found.

R. H. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA, 1993.


TiMBL: Tilburg Memory-Based Learner - version 1.0 - .. - Daelemans, Zavrel, .. (1998)

No context found.

R. H. Baayen, R. Piepenbrock, and H. van Rijn. The CELEX lexical data base on CD-ROM. Linguistic Data Consortium, Philadelphia, PA, 1993.


CiteSeer.IST - Copyright Penn State and NEC