Unsupervised Language Acquisition: Theory and Practice
, 2001
Cited by 32 (0 self)
In this thesis I present various algorithms for the unsupervised machine learning of aspects of natural languages using a variety of statistical models. The scientific object of the work is to examine the validity of the so-called Argument from the Poverty of the Stimulus advanced in favour of the proposition that humans have language-specific innate knowledge. I start by examining an a priori argument based on Gold's theorem that purports to prove that natural languages cannot be learned, and some formal issues related to the choice of statistical grammars rather than symbolic grammars. I present three novel algorithms for learning various parts of natural languages: first, an algorithm for the induction of syntactic categories from unlabelled text using distributional information, which can deal with ambiguous and rare words; secondly, a set of algorithms for learning morphological processes in a variety of languages, including languages such as Arabic with nonconcatenative morphology; thirdly, an algorithm for the unsupervised induction of a context-free grammar from tagged text. I carefully examine the interaction between the various components, and show how these algorithms can form the basis for an empiricist model of language acquisition. I therefore conclude that the Argument from the Poverty of the Stimulus is unsupported by the evidence.
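As a rough illustration of what "distributional information" can mean for category induction, the sketch below represents each word by counts of its immediate left and right neighbours and compares words by cosine similarity. This is a generic textbook-style construction, not the thesis's algorithm; the function names, the boundary markers and the toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each word to a Counter of (side, neighbour) context features."""
    vectors = defaultdict(Counter)
    # Hypothetical boundary markers so the first and last tokens have two neighbours.
    padded = ["<s>"] + list(tokens) + ["</s>"]
    for i in range(1, len(padded) - 1):
        word = padded[i]
        vectors[word][("L", padded[i - 1])] += 1
        vectors[word][("R", padded[i + 1])] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Toy corpus (illustrative only): "cat" and "dog" occur in the same contexts.
tokens = "the cat sleeps . the dog sleeps . a cat runs . a dog runs .".split()
vecs = context_vectors(tokens)
print(cosine(vecs["cat"], vecs["dog"]), cosine(vecs["cat"], vecs["sleeps"]))
```

Words whose vectors are close would then be candidates for the same category; handling ambiguous and rare words, which the abstract highlights, needs more than this simple similarity.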
The Grammar of Sense: Using part-of-speech tags as a first step in semantic disambiguation
, 1997
Cited by 30 (8 self)
This paper describes two experiments: one exploring the amount of information relevant to sense disambiguation contained in the part-of-speech field of entries in a Machine Readable Dictionary (MRD); the other, more practical, experiment attempts sense disambiguation of all content words in a text, assigning MRD homographs as sense tags using only part-of-speech information. We have implemented a simple sense tagger which successfully tags 94% of words using this method. A plan to extend this work and implement an improved sense tagger is included.
Contents: 1 Introduction; 2 Work so far; 3 Experiments using part-of-speech; 3.1 The Structure of a Lexicon: A Gedankenexperiment; 3.2 Using a Tagger: A Practical Experiment; 4 Conclusion; 5 Further work; References.
1 Introduction. Sense tagging is the process of assigning the appropriate sense from a lexicon to each word token in a text, similar to the way a grammatical category is assigned in part...
The grammar of sense: Is word-sense tagging much more than part-of-speech tagging? cmp-lg/9607028
, 1996
Cited by 25 (1 self)
This squib claims that Large-scale Automatic Sense Tagging of text (LAST) can be done at a high level of accuracy and with far less complexity and computational effort than has been believed until now. Moreover, it can be done for all open class words, and not just carefully selected opposed pairs as in some recent work. We describe two experiments. The first explores the amount of information relevant to sense disambiguation that is contained in the part-of-speech field of entries in the Longman Dictionary of Contemporary English (LDOCE). The second, more practical, experiment attempts sense disambiguation of all open class words in a text, assigning LDOCE homographs as sense tags using only part-of-speech information. We report that 92% of open class words can be successfully tagged in this way. We plan to extend this work and to implement an improved large-scale tagger, a description of which is included here.
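The idea of using only part-of-speech information to pick a dictionary homograph can be sketched as a lookup: given a word and the tag assigned by a tagger, keep the homographs whose part-of-speech field matches. The lexicon below is a hypothetical toy, not LDOCE data, and the identifiers and tag names are invented for illustration. The abstract does not say how ties between same-category homographs are resolved, so this sketch simply leaves them undecided.

```python
# Hypothetical toy lexicon: word -> list of (homograph id, part-of-speech field).
LEXICON = {
    "bank": [("bank_1", "n"), ("bank_2", "v")],
    "light": [("light_1", "n"), ("light_2", "adj"), ("light_3", "v")],
    "bat": [("bat_1", "n"), ("bat_2", "n"), ("bat_3", "v")],
}


def candidate_homographs(word, pos):
    """Return the homographs of `word` whose part-of-speech field equals `pos`."""
    return [hom for hom, hom_pos in LEXICON.get(word, []) if hom_pos == pos]


def tag_sense(word, pos):
    """Return a homograph id if the POS tag leaves exactly one candidate, else None."""
    candidates = candidate_homographs(word, pos)
    return candidates[0] if len(candidates) == 1 else None


print(tag_sense("bank", "v"), tag_sense("bat", "n"))
```

In this toy lexicon "bank" as a verb is fully disambiguated by its tag, while "bat" as a noun still has two candidates and would need further evidence.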
Exploiting Stylistic Idiosyncrasies for Authorship Attribution
- In IJCAI’03 Workshop on Computational Approaches to Style Analysis and Synthesis
, 2003
Cited by 23 (3 self)
Early researchers in authorship attribution used a variety of statistical methods to identify stylistic discriminators, that is, characteristics which remain approximately invariant within the works of a given author but which tend to vary from author to author (Holmes 1998, McEnery & Oakes 2000). In recent years machine learning methods have been applied to authorship attribution. A few examples include (Matthews & Merriam 1993, Holmes & Forsyth 1995, Stamatatos et al 2001, de Vel et al 2001). Both the earlier "stylometric" work and the more recent machine-learning work have tended to focus on initial sets of candidate discriminators which are fairly ubiquitous. For example, the classical work of Mosteller and Wallace (1964) on the Federalist Papers used a set of several hundred function words, that is, words that are context-independent and hence unlikely to be biased towards specific topics. Other features used in even earlier work (Yule 1938) are complexity-base
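A function-word feature set of the kind mentioned above can be sketched as a vector of relative frequencies, one per function word. The short word list and the tokenisation rule here are assumptions chosen for brevity; the abstract only says that several hundred function words were used in the classical work.

```python
import re
from collections import Counter

# Illustrative, deliberately tiny function-word list (an assumption).
FUNCTION_WORDS = ("the", "of", "and", "to", "upon", "by", "while")


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


profile = function_word_profile("The cat sat upon the mat")
print(dict(zip(FUNCTION_WORDS, profile)))
```

Profiles computed this way for texts of known authorship could then be fed to any classifier; the choice of classifier is outside what this sketch shows.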
Learning Information Extraction Rules: An Inductive Logic Programming approach
, 2002
Cited by 22 (0 self)
The objective of this work is to learn information extraction rules by applying Inductive Logic Programming (ILP) techniques to natural language data. The approach is ontology-based, which means that the extraction rules conclude with specific ontology relations that characterise the meaning of sentences in the text. An existing ILP system, FOIL, is used to learn attribute-value relations. This enables instances of these relations to be identified in the text. Specifically, we explore the linguistic preprocessing of the data, the use of background knowledge in the learning process, and the practical considerations of applying a supervised learning approach to rule induction, namely the human effort in creating the data set and the inherent biases in the use of small data sets.
Musical Query-by-Description as a Multiclass Learning Problem
- In Proc. IEEE Multimedia Signal Processing Conference (MMSP
, 2002
Cited by 20 (1 self)
We present the query-by-description (QBD) component of "Kandem," a time-aware music retrieval system. The QBD system we describe learns a relation between descriptive text concerning a musical artist and their actual acoustic output, making such queries as "Play me something loud with an electronic beat" possible by merely analyzing the audio content of a database. We show a novel machine learning technique based on Regularized Least-Squares Classification (RLSC) that can quickly and efficiently learn the non-linear relation between descriptive language and audio features by treating the problem as a large number of possible output classes linked to the same set of input features. We show how the RLSC training can easily eliminate irrelevant labels.
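The point about many output classes sharing one set of input features can be illustrated with a small regularized least-squares sketch: a single linear system (K + λI) C = Y is solved once, with one column of Y per label, so every label reuses the same kernel matrix. This is a minimal pure-Python sketch under assumptions, not the paper's implementation: it uses a linear kernel for simplicity (the paper targets non-linear relations), and the function names, the value of λ and the toy data are invented.

```python
def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.

    a is an n x n list of lists, b is an n x m list of lists (m right-hand sides).
    """
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            factor = aug[r][col]
            if r != col and factor != 0.0:
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def linear_kernel(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))


def train_rlsc(xs, ys, lam=0.1, kernel=linear_kernel):
    """Return the n x num_labels coefficient matrix C solving (K + lam*I) C = Y."""
    n = len(xs)
    k = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(k, ys)


def predict(xs, coeffs, x, kernel=linear_kernel):
    """Return one real-valued score per label for a new feature vector x."""
    kx = [kernel(xi, x) for xi in xs]
    return [sum(kx[i] * coeffs[i][j] for i in range(len(xs)))
            for j in range(len(coeffs[0]))]


# Toy data (assumed): two feature vectors, two labels encoded as +1 / -1.
xs = [[1.0, 0.0], [0.0, 1.0]]
ys = [[1, -1], [-1, 1]]
coeffs = train_rlsc(xs, ys, lam=0.1)
print(predict(xs, coeffs, [1.0, 0.0]))
```

A positive score is read as the label applying to the input. Adding another descriptive term only adds a column to Y; the left-hand matrix is unchanged.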
Sense Tagging: Semantic Tagging with a Lexicon
- In Proceedings of the SIGLEX Workshop
, 1997
Cited by 20 (6 self)
Sense tagging, the automatic assignment of the appropriate sense from some lexicon to each of the words in a text, is a specialised instance of the general problem of semantic tagging by category or type. We discuss which recent word sense disambiguation algorithms are appropriate for sense tagging. It is our belief that sense tagging can be carried out effectively by combining several simple, independent methods, and we include the design of such a tagger. A prototype of this system has been implemented, correctly tagging 86% of polysemous word tokens in a small test set, providing evidence that our hypothesis is correct.
Generalized unknown morpheme guessing for hybrid POS tagging of Korean
- In Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL 98
, 1998
Cited by 18 (13 self)
Most errors in Korean morphological analysis and POS (Part-of-Speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown morpheme handling method with POSTAG (POStech TAGger), which is a statistical/rule-based hybrid POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary, which encodes general lexical patterns of Korean morphemes, with a posteriori syllable tri-gram estimation. The syllable tri-grams help to calculate lexical probabilities of the unknown morphemes and are utilized to search for the best tagging result. In our scheme, we can guess the POS tags of unknown morphemes regardless of their number and position in an eojeol, which was not possible before in Korean tagging systems. In a series of experiments using corpora from three different domains, we achieve 97% tagging accuracy despite many unknown morphemes in the test corpora.
Text Mining - Knowledge extraction from unstructured textual data
- 6th Conference of International Federation of Classification Societies (IFCS-98
, 1998
Cited by 14 (0 self)
In the general context of Knowledge Discovery, specific techniques, called Text Mining techniques, are necessary to extract information from unstructured textual data. The extracted information can then be used for the classification of the content of large textual bases. In this paper, we present two examples of information that can be automatically extracted from text collections: probabilistic associations of key-words and prototypical document instances. The Natural Language Processing (NLP) tools necessary for such extractions are also presented.
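One simple way to read "probabilistic associations of key-words" is as a conditional probability estimated from document co-occurrence: among documents indexed with one set of keywords, what fraction also carry another set? The sketch below computes that estimate. It is an assumed formalisation for illustration, and the function name and toy collection are invented rather than taken from the paper.

```python
def association_confidence(documents, antecedent, consequent):
    """Estimate P(consequent | antecedent) over documents given as keyword sets."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [set(d) for d in documents if antecedent <= set(d)]
    if not with_antecedent:
        return 0.0
    with_both = sum(1 for d in with_antecedent if consequent <= d)
    return with_both / len(with_antecedent)


# Toy collection (assumed): each document is reduced to its set of keywords.
docs = [
    {"oil", "price", "opec"},
    {"oil", "price"},
    {"oil", "export"},
    {"wheat", "price"},
]
print(association_confidence(docs, {"oil"}, {"price"}))
```

In practice such an estimate is usually paired with a minimum-frequency threshold so that associations supported by only a handful of documents are ignored.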
Phonological Parsing for Bi-directional Letter-to-Sound/Sound-to-Letter Generation
- Journal of Speech Communication
, 1995
Cited by 14 (2 self)
In this paper, we describe a reversible letter-to-sound/sound-to-letter generation system based on an approach which combines a rule-based formalism with data-driven techniques. We adopt a probabilistic parsing strategy to provide a hierarchical lexical analysis of a word, including information such as morphology, stress, syllabification, phonemics and graphemics. Long-distance constraints are propagated by enforcing local constraints throughout the hierarchy. Our training and testing corpora are derived from the high-frequency portion of the Brown Corpus (10,000 words), augmented with markers indicating stress and word morphology. We evaluated our performance based on an unseen test set. The percentages of nonparsable words for letter-to-sound and sound-to-letter generation were 6% and 5% respectively. Of the remaining words, our system achieved a word accuracy of 71.8% and a phoneme accuracy of 92.5% for letter-to-sound generation, and a word accuracy of 55.8% and a letter accuracy of 89.4% for sound-to-letter generation. We also compared our hierarchical approach with an alternative, single-layer approach to demonstrate how the hierarchy provides a parsimonious description for English orthographic-phonological regularities, while simultaneously attaining competitive generation accuracy.

