Results 1 - 10
of
39
Wrapper Induction: Efficiency and Expressiveness
- Artificial Intelligence
, 2000
"... The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatt ..."
Abstract
-
Cited by 191 (12 self)
- Add to MetaCart
The Internet presents numerous sources of useful information---telephone directories, product catalogs, stock quotes, event listings, etc. Recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. However, these resources are usually formatted for use by people (e.g., the relevant content is embedded in HTML pages), so extracting their content is difficult. Most systems use customized wrapper procedures to perform this extraction task. Unfortunately, writing wrappers is tedious and error-prone. As an alternative, we advocate wrapper induction, a technique for automatically constructing wrappers. In this article, we describe six wrapper classes, and use a combination of empirical and analytical techniques to evaluate the computational tradeoffs among them. We first consider expressiveness: how well the classes can handle actual Internet resources, and the extent to which wrappers in one class can mimic those in another. We then...
Information Extraction Using Hidden Markov Models
, 1997
"... This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose. In particular, the thesis presents a HMM that classifies and parses natural language assertions about genes being located at particular positions on chromoso ..."
Abstract
-
Cited by 76 (0 self)
- Add to MetaCart
This thesis shows how to design and tune a hidden Markov model to extract factual information from a corpus of machine-readable English prose. In particular, the thesis presents a HMM that classifies and parses natural language assertions about genes being located at particular positions on chromosomes. The facts extracted by this HMM can be inserted into biological databases. The HMM is trained on a small set of sentence fragments chosen from the collected scientific abstracts in the OMIM (On-Line Mendelian Inheritance in Man) database and judged to contain the target binary relationship between gene names and gene locations. Given a novel sentence, all contiguous fragments are ranked by log-odds score, i.e. the log of the ratio of the probability of the fragment according to the target HMM to that according to a "null" HMM trained on all OMIM sentences. The most probable path through the HMM gives bindings for the annotations with precision as high as 80%. In contrast with traditional natural language processing methods, this stochastic approach makes no use either of part-of-speech taggers or dictionaries, instead employing non-emitting states to assemble modules roughly corresponding to noun, verb, and prepostional phrases. Algorithms for reestimating parameters for HMMs with non-emitting states are presented in detail. The ability to tolerate new words and recognize a wide variety of syntactic forms arises from the judicious use of "gap" states.
An Information Extraction Core System for Real World German Text Processing
- In 5th International Conference of Applied Natural Language
, 1997
"... This paper describes SMES, an information extraction core system for real world German text processing. The basic design criterion of the system is of providing a set of basic powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be cus ..."
Abstract
-
Cited by 73 (17 self)
- Add to MetaCart
This paper describes SMES, an information extraction core system for real world German text processing. The basic design criterion of the system is of providing a set of basic powerful, robust, and efficient natural language components and generic linguistic knowledge sources which can easily be customized for processing different tasks in a flexible manner.
Two applications of information extraction to biological science journal articles: Enzyme interactions and protein structures
, 2000
"... Information extraction technology, as de ned and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this paper we consider the application ..."
Abstract
-
Cited by 73 (3 self)
- Add to MetaCart
Information extraction technology, as de ned and developed through the U.S. DARPA Message Understanding Conferences (MUCs), has proved successful at extracting information primarily from newswire texts and primarily in domains concerned with human activity. In this paper we consider the application of this technology to the extraction of information from scienti c journal papers in the area of molecular biology. In particular, we describe how an information extraction system designed to participate in the MUC exercises has been modi ed for two bioinformatics applications: EMPathIE, concerned with enzyme and metabolic pathways; and PASTA, concerned with protein structure. Progress to date provides convincing grounds for believing that IE techniques will deliver novel and e ective ways for scientists to make use of the core literature which de nes their disciplines. 1
Learning Information Extraction Patterns From Examples
- Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing
, 1995
"... A growing population of users want to extract a growing variety of information from on-line texts. Unfortunately, current information extraction systems typically require experts to hand-build dictionaries of extraction patterns for each new type of information to be extracted. This paper presents a ..."
Abstract
-
Cited by 63 (2 self)
- Add to MetaCart
A growing population of users want to extract a growing variety of information from on-line texts. Unfortunately, current information extraction systems typically require experts to hand-build dictionaries of extraction patterns for each new type of information to be extracted. This paper presents a system that can learn dictionaries of extraction patterns directly from user-provided examples of texts and events to be extracted from them. The system, called LIEP, learns patterns that recognize relationships between key constituents based on local syntax. Sets of patterns learned by LIEP for a sample extraction task perform nearly at the level of a hand-built dictionary of patterns. 1 Introduction Although significant progress has been made on information extraction systems in recent years (for instance through the MUC conferences [MUC, 1992; MUC, 1993]), coding the knowledge these systems need to extract new kinds of information and events is an arduous and time-consuming process [Ril...
Information Extraction: Beyond Document Retrieval
- COMPUTATIONAL LINGUISTICS AND CHINESE LANGUAGE PROCESSING
, 1998
"... In this paper we give a synoptic view of the growth text processing technology of information extraction (IE) whose function is to extract information about a pre-specified set of entities, relations or events from natural language textsand to record this information in structured representations ..."
Abstract
-
Cited by 48 (10 self)
- Add to MetaCart
In this paper we give a synoptic view of the growth text processing technology of information extraction (IE) whose function is to extract information about a pre-specified set of entities, relations or events from natural language textsand to record this information in structured representations called templates. Here we describe the nature of the IE task, review the history of the area from its origins in AI work in the 1960's and 70's till the present, discuss the techniques being used to carry out the task, describe application areas where IE systems are or are about to be at work, and conclude with a discussion of the challenges facing the area. What emerges is a picture of an exciting new text processing technology with a host of new applications, both on its own and in conjunction with other technologies, such as information retrieval, machine translation and data mining.
Programming By Demonstration Using Version Space Algebra
, 2001
"... Programming by demonstration enables users to easily personalize their applications, automating repetitive tasks simply by executing a few examples. We formalize programming by demonstration as a machine learning problem: given the changes in the application state that result from the user's demonst ..."
Abstract
-
Cited by 37 (7 self)
- Add to MetaCart
Programming by demonstration enables users to easily personalize their applications, automating repetitive tasks simply by executing a few examples. We formalize programming by demonstration as a machine learning problem: given the changes in the application state that result from the user's demonstrated actions, learn the general program that maps from one application state to the next. We present a methodology for learning in this space of complex functions. First we extend version spaces to learn arbitrary functions, not just concepts. Then we introduce the version space algebra, a method for composing simpler version spaces to construct more complex spaces. Finally, we apply our version space algebra to the text-editing domain and describe an implemented system called SMARTedit that learns repetitive text-editing procedures by example. We evaluate our approach by measuring the number of examples required for the system to learn a procedure that works on the remainder of examples, and by an informal user study measuring the e#ort users spend using our system versus performing the task by hand. The results show that SMARTedit is capable of generalizing correctly from as few as one or two examples, and that users generally save a significant amount of e#ort when completing tasks with SMARTedit's help.
Text mining and ontologies in biomedicine: making sense of raw text
- Briefings in Bioinformatics
, 2005
"... The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill info rmation, extract facts, discover implicit links and generate hypotheses relevant to use ..."
Abstract
-
Cited by 24 (0 self)
- Add to MetaCart
The volume of biomedical literature is increasing at such a rate that it is becoming difficult to locate, retrieve and manage the reported information without text mining, which aims to automatically distill info rmation, extract facts, discover implicit links and generate hypotheses relevant to user needs. Ontologies, as conceptual models, provide the necessary framework for semantic representation of textual information. The principal link between text and an ontology is terminology, which maps terms to domain-specific concepts. In this article, we summarize different approaches in which ontologies have been used for text mining applications in biomedicine.
Software Architecture for Language Engineering
, 2000
"... This thesis defines the boundaries of Software Architecture for Language Engineering (SALE), an area formed by the intersection of human language computation and software engineering. SALE covers all areas of the provision of infrastructural systems to support research and development of language pr ..."
Abstract
-
Cited by 21 (7 self)
- Add to MetaCart
This thesis defines the boundaries of Software Architecture for Language Engineering (SALE), an area formed by the intersection of human language computation and software engineering. SALE covers all areas of the provision of infrastructural systems to support research and development of language processing software. In order to demonstrate the theory developed in relation to SALE, we present the design, implementation and evaluation of GATE, a General Architecture for Text Engineering, which illustrates in practice many of the theoretical points made.

