Results 1 - 10 of 44
Learning Hidden Markov Models for Information Extraction Actively from Partially Labeled Text
2002
"... A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. Consequently, it is of enormous importance to construct tools that allow users to extract information from textual documents as easily as it can be extracted from struc ..."
Cited by 53 (0 self)
A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. Consequently, it is of enormous importance to construct tools that allow users to extract information from textual documents as easily as it can be extracted from structured databases. Information Extraction (IE)...
Interactive Wrapper Generation with Minimal User Effort
In Proc. of WWW, 2003
"... this paper, we describe a semi-automatic wrapper induction system with a powerful wrapper language that helps to capture sophisticated extraction scenarios. The main contributions of our work is the combination of a flexible user interface and algorithmic techniques to minimize the number of interac ..."
Cited by 25 (0 self)
In this paper, we describe a semi-automatic wrapper induction system with a powerful wrapper language that helps to capture sophisticated extraction scenarios. The main contributions of our work are the combination of a flexible user interface and algorithmic techniques to minimize the number of interactions required in the training process. We also give preliminary evaluation results.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval
In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005
"... This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML ..."
Cited by 17 (5 self)
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification for HTML titles, and utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline of taking the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvements).
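The largest-font-size baseline that this paper compares against can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `Line` record and function names are hypothetical, and a real system would extract the font features from rendered HTML.

```python
# Hypothetical sketch of the largest-font-size baseline: each candidate
# line from an HTML body carries simple format features, and the line
# with the largest font size is taken as the title.

from dataclasses import dataclass

@dataclass
class Line:
    text: str       # visible text of the line
    font_size: int  # rendered font size in points
    bold: bool      # font-weight feature (unused by the baseline)
    position: int   # line index from the top of the page

def baseline_title(lines):
    """Return the text of the line with the largest font size."""
    return max(lines, key=lambda ln: ln.font_size).text

lines = [
    Line("Navigation bar", 10, False, 0),
    Line("Title Extraction from HTML Bodies", 24, True, 1),
    Line("Body paragraph text ...", 12, False, 2),
]
print(baseline_title(lines))  # -> Title Extraction from HTML Bodies
```

The paper's learned method improves on this by combining font size with position and font weight as features of a supervised classifier.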
Adaptive Information Extraction
"... The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition from textual data has lead to an increase in Information Extraction (IE) research. Some examples of these applications are the generation of data bases from documents, as well as the ..."
Cited by 14 (1 self)
The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition from textual data have led to an increase in Information Extraction (IE) research. Some examples of these applications are the generation of databases from documents, as well as the acquisition of knowledge useful for emerging technologies like question answering, information integration, and others related to text mining. However, one of the main drawbacks of applying IE is its intrinsic domain dependence. To reduce the high cost of manually adapting IE applications to new domains, the research community has carried out experiments with different Machine Learning (ML) techniques. This survey describes and compares the main approaches to IE and the different ML techniques used to achieve Adaptive IE technology.
Active Learning of Partially Hidden Markov Models
In Proceedings of the ECML/PKDD Workshop on Instance Selection, 2001
"... We consider the task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled observation sequences are available for training. This setting is motivated by the information extraction problem, where only few tokens in the training documents are given a semantic tag while most t ..."
Cited by 14 (2 self)
We consider the task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled observation sequences are available for training. This setting is motivated by the information extraction problem, where only a few tokens in the training documents are given a semantic tag while most tokens are unlabeled. We first describe the partially hidden Markov model together with an algorithm for learning HMMs from partially labeled data. We then present an active learning algorithm that selects "difficult" unlabeled tokens and asks the user to label them. We study empirically by how much active learning reduces the required data labeling effort, or increases the quality of the learned model achievable with a given amount of user effort.
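The query-selection step of such an active learner can be sketched as below. This is a hedged illustration, not the paper's algorithm: it assumes per-token posterior tag distributions (e.g. from HMM forward-backward) are already available, and uses posterior entropy as the "difficulty" criterion, which is one common choice; the paper's exact measure may differ.

```python
# Sketch of uncertainty-based token selection for active learning over
# sparsely labeled sequences: query the unlabeled tokens whose posterior
# tag distribution has the highest entropy, i.e. where the model is
# least certain. All names here are illustrative, not from the paper.

import math

def entropy(dist):
    """Shannon entropy of a tag distribution given as {tag: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def select_queries(posteriors, k):
    """Return indices of the k tokens with the most uncertain posteriors.

    posteriors: list of {tag: prob} dicts, one per unlabeled token.
    """
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: entropy(posteriors[i]),
                    reverse=True)
    return ranked[:k]

posteriors = [
    {"NAME": 0.98, "OTHER": 0.02},  # model is confident
    {"NAME": 0.55, "OTHER": 0.45},  # model is uncertain -> good query
    {"NAME": 0.90, "OTHER": 0.10},
]
print(select_queries(posteriors, 1))  # -> [1]
```

After the user labels the selected tokens, the partially hidden Markov model is retrained and the cycle repeats.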
Exploiting ASP for Semantic Information Extraction
In Proceedings of ASP05 - Answer Set Programming: Advances in Theory and Implementation, 2005
"... WWW home page:http://www.exeura.it Abstract. The paper describesHıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic,HıLεX combines both semantic and syntactic knowledge for a powerful information extraction. ..."
Cited by 13 (5 self)
WWW home page: http://www.exeura.it

The paper describes HıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic, HıLεX combines both semantic and syntactic knowledge for powerful information extraction. In particular, the exploitation of background knowledge, stored in a domain ontology, significantly empowers the information extraction mechanisms. HıLεX is founded on a new two-dimensional representation of documents, and heavily exploits DLP+, an extension of disjunctive logic programming for ontology representation and reasoning which has recently been implemented on top of DLV. The domain ontology is represented in DLP+, and the extraction patterns are encoded by DLP+ reasoning modules, whose execution yields the actual extraction of information from the input document. HıLεX can extract information from both HTML and flat text documents.
Ontology-Based Extraction of RDF Data from the World Wide Web
2003
"... The simplicity and proliferation of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hin ..."
Cited by 11 (0 self)
The simplicity and proliferation of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hindrance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular, the BYU ontology-based data-extraction system, to extract information from the WWW based on a Semantic Web ontology, producing Semantic Web data with respect to that ontology. As an example of how the generated Semantic Web data can be used, we provide an application to browse the extracted data and the source documents together. In this sense, the extracted data is superimposed over, or is an index over, the source documents. Our experiments with ontologies in four application domains show that our approach can indeed extract Semantic Web data from the WWW with precision and recall similar to that achieved by the underlying information extraction system, and make that data accessible to Semantic Web applications.
Automated Information Extraction from Web Sources: a Survey
"... Abstract. The Web contains an enormous quantity of information which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years some authors have addressed the problem to convert Web documents from unstructured or semi-struct ..."
Cited by 8 (0 self)
The Web contains an enormous quantity of information which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years, some authors have addressed the problem of converting Web documents from unstructured or semi-structured formats into structured and therefore machine-understandable formats such as XML. In this paper we briefly survey some of the most promising recently developed extraction tools.
An agent-based approach to mailing list knowledge management
In Agent-Mediated Knowledge Management, Lecture Notes in Artificial Intelligence, 2004
"... The widespread use of computers and of the internet have brought about human information overload, particularly in the areas of internet searches and email management. This has made Knowledge Management a necessity, particularly in a business context. Agent technology – with its metaphor of agent as ..."
Cited by 6 (0 self)
The widespread use of computers and of the internet has brought about human information overload, particularly in the areas of internet searches and email management. This has made Knowledge Management a necessity, particularly in a business context. Agent technology, with its metaphor of the agent as assistant, has shown promise in the area of information overload and is therefore a good candidate for Knowledge Management solutions. This paper illustrates a mailing list Knowledge Management tool that is centred around the concept of a mailing list assistant. We envisage this system as the first step towards a comprehensive agent-based Knowledge Management solution.