Learning Algorithms for Keyphrase Extraction
Information Retrieval, 2000
Cited by 94 (3 self)
Many academic journals ask their authors to provide a list of about five to fifteen keywords, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a wide variety of tasks for which keyphrases are useful, as we discuss in this paper. We approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. We evaluate the performance of nine different configurations of C4.5. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for automatically extracting keyphrases from text. The experimental results support the claim that a custom-designed algorithm (GenEx)...
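The framing in this abstract, where a document is a set of phrases to be classified as positive or negative examples of keyphrases, can be sketched as below. This is a minimal illustration, not the paper's feature set or the GenEx algorithm: the helper names (candidate_phrases, labeled_examples), the three-word limit, and the two features (frequency and relative first position) are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def candidate_phrases(tokens: List[str], max_len: int = 3) -> Dict[str, List[int]]:
    """Map every word n-gram of up to max_len words to its start positions."""
    positions: Dict[str, List[int]] = {}
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n <= len(tokens):
                phrase = " ".join(tokens[i:i + n])
                positions.setdefault(phrase, []).append(i)
    return positions


def labeled_examples(
    text: str, keyphrases: List[str], max_len: int = 3
) -> List[Tuple[str, Dict[str, float], bool]]:
    """Turn one document into (phrase, features, label) training examples.

    A phrase is a positive example if it matches an author-supplied keyphrase.
    The features here are illustrative assumptions only.
    """
    tokens = text.lower().split()
    gold = {k.lower() for k in keyphrases}
    examples = []
    for phrase, starts in candidate_phrases(tokens, max_len).items():
        features = {
            "frequency": float(len(starts)),
            "first_position": starts[0] / len(tokens),
        }
        examples.append((phrase, features, phrase in gold))
    return examples


if __name__ == "__main__":
    doc = "decision tree induction builds a decision tree"
    for phrase, features, label in labeled_examples(doc, ["decision tree"]):
        if label:
            print(phrase, features)
```

The resulting examples could be handed to any supervised classifier; a decision tree learner such as C4.5 is what the first set of experiments in the abstract uses.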
Information Extraction Using Hidden Markov Models
1997
Cited by 76 (0 self)
This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose. In particular, the thesis presents an HMM that classifies and parses natural language assertions about genes being located at particular positions on chromosomes. The facts extracted by this HMM can be inserted into biological databases. The HMM is trained on a small set of sentence fragments chosen from the collected scientific abstracts in the OMIM (On-Line Mendelian Inheritance in Man) database and judged to contain the target binary relationship between gene names and gene locations. Given a novel sentence, all contiguous fragments are ranked by log-odds score, i.e., the log of the ratio of the probability of the fragment according to the target HMM to that according to a "null" HMM trained on all OMIM sentences. The most probable path through the HMM gives bindings for the annotations with precision as high as 80%. In contrast with traditional natural language processing methods, this stochastic approach makes no use of either part-of-speech taggers or dictionaries, instead employing non-emitting states to assemble modules roughly corresponding to noun, verb, and prepositional phrases. Algorithms for reestimating parameters for HMMs with non-emitting states are presented in detail. The ability to tolerate new words and recognize a wide variety of syntactic forms arises from the judicious use of "gap" states.
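The log-odds ranking of contiguous fragments described in this abstract can be sketched as follows. To keep the example short, simple unigram models stand in for the target and "null" HMMs, so this is not the thesis's model; the function names, the probability floor, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

LogProb = Callable[[List[str]], float]


def contiguous_fragments(tokens: List[str]) -> List[List[str]]:
    """Enumerate every non-empty contiguous fragment of a sentence."""
    n = len(tokens)
    return [tokens[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def unigram_logprob(probs: Dict[str, float], floor: float = 1e-6) -> LogProb:
    """Stand-in scorer (assumption): a unigram model instead of an HMM."""
    def score(fragment: List[str]) -> float:
        return sum(math.log(probs.get(tok, floor)) for tok in fragment)
    return score


def rank_by_log_odds(
    tokens: List[str], target: LogProb, null: LogProb
) -> List[Tuple[float, List[str]]]:
    """Rank fragments by log P_target(fragment) - log P_null(fragment)."""
    scored = [(target(f) - null(f), f) for f in contiguous_fragments(tokens)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy probabilities, invented for the example.
    target = unigram_logprob({"gene": 0.3, "maps": 0.2, "to": 0.2, "chromosome": 0.3})
    null = unigram_logprob({"the": 0.4, "gene": 0.1, "maps": 0.05, "to": 0.2,
                            "chromosome": 0.05, "was": 0.2})
    best_score, best_fragment = rank_by_log_odds(
        "the gene maps to chromosome".split(), target, null)[0]
    print(best_fragment, round(best_score, 3))
```

Working in log space turns the probability ratio into a difference, which avoids underflow on long fragments; a real system would replace the unigram scorers with the likelihoods of the two trained HMMs.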
Description of the UMass system as used for MUC-6
In Proceedings of the 6th Message Understanding Conference, 1995
Learning Information Extraction Patterns From Examples
Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, 1995
Cited by 63 (2 self)
A growing population of users want to extract a growing variety of information from on-line texts. Unfortunately, current information extraction systems typically require experts to hand-build dictionaries of extraction patterns for each new type of information to be extracted. This paper presents a system that can learn dictionaries of extraction patterns directly from user-provided examples of texts and events to be extracted from them. The system, called LIEP, learns patterns that recognize relationships between key constituents based on local syntax. Sets of patterns learned by LIEP for a sample extraction task perform nearly at the level of a hand-built dictionary of patterns.
1 Introduction
Although significant progress has been made on information extraction systems in recent years (for instance through the MUC conferences [MUC, 1992; MUC, 1993]), coding the knowledge these systems need to extract new kinds of information and events is an arduous and time-consuming process [Ril...
Toward General-Purpose Learning for Information Extraction
1998
Cited by 42 (4 self)
Two trends are evident in the recent evolution of the field of information extraction: a preference for simple, often corpus-driven techniques over linguistically sophisticated ones; and a broadening of the central problem definition to include many non-traditional text domains. This development calls for information extraction systems which are as retargetable and general as possible. Here, we describe SRV, a learning architecture for information extraction which is designed for maximum generality and flexibility.
Learning to Extract Keyphrases from Text
1999
Cited by 39 (4 self)
Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft's Word 97 and Verity's Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. T...
Relational Learning via Propositional Algorithms: An Information Extraction Case Study
2001
Cited by 39 (11 self)
This paper develops a new paradigm for relational learning which allows for the representation and learning of relational information using propositional means. This paradigm suggests different tradeoffs than those in the traditional approach to this problem -- the ILP approach -- and as a result it enjoys several significant advantages over it. In particular, the new paradigm is more flexible and allows the use of any propositional algorithm, including probabilistic algorithms, within it. We evaluate the new approach on an important and relation-intensive task -- Information Extraction -- and show that it outperforms existing methods while being orders of magnitude more efficient.
Data Mining on Symbolic Knowledge Extracted from the Web
2000
Cited by 35 (2 self)
Information extractors and classifiers operating on unrestricted, unstructured texts are an errorful source of large amounts of potentially useful information, especially when combined with a crawler which automatically augments the knowledge base from the World Wide Web. At the same time, there is much structured information on the World Wide Web. Wrapping the web sites which provide this kind of information provides us with a second source of information, possibly less up-to-date, but reliable as facts. We give a case study of combining information from these two kinds of sources in the context of learning facts about companies. We provide results of association rules, propositional and relational learning, which demonstrate that data mining can help us improve our extractors, and that using information from two kinds of sources improves the reliability of data-mined rules.
1. INTRODUCTION
The World Wide Web has become a significant source of information. Most of this computer-retri...
Learning Text Analysis Rules For Domain-Specific Natural Language Processing
1997
Cited by 32 (5 self)
An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a specific domain, which is a corpus of texts together with a predefined set of concepts that are of interest to that domain. Two widely different domains are used to illustrate this domain-specific approach. One domain is a collection of Wall Street Journal articles in which the target concept is management succession events: identifying persons moving into corporate management positions or moving out. A second domain is a collection of hospital discharge summaries in which the target concepts are various classes of diagnosis or symptom.
Inductive Logic Programming for Natural Language Processing
In Muggleton, S. (Ed.), Inductive Logic Programming: Selected Papers from the 6th International Workshop, 1997
Cited by 23 (1 self)
This paper reviews our recent work on applying inductive logic programming to the construction of natural language processing systems. We have developed a system, Chill, that learns a parser from a training corpus of parsed sentences by inducing heuristics that control an initial overly-general shift-reduce parser. Chill learns syntactic parsers as well as ones that translate English database queries directly into executable logical form. The ATIS corpus of airline information queries was used to test the acquisition of syntactic parsers, and Chill performed competitively with recent statistical methods. English queries to a small database on U.S. geography were used to test the acquisition of a complete natural language interface, and the parser that Chill acquired was more accurate than an existing hand-coded system. The paper also includes a discussion of several issues this work has raised regarding the capabilities and testing of ILP systems as well as a summary of our current research directions.

