Information Extraction from World Wide Web: A Survey. Norwegian Computing Center (1999)

by L Eikvil

Results 1 - 10 of 44

Learning Hidden Markov Models for Information Extraction Actively from Partially Labeled Text

by Tobias Scheffer, Stefan Wrobel, Borislav Popov, Damyan Ognianov, Christian Decomain, Susanne Hoche, 2002
"... A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. Consequently, it is of enormous importance to construct tools that allow users to extract information from textual documents as easily as it can be extracted from struc ..."
Abstract - Cited by 53 (0 self) - Add to MetaCart
A vast range of information is expressed in unstructured or semi-structured text, in a form that is hard to decipher automatically. Consequently, it is of enormous importance to construct tools that allow users to extract information from textual documents as easily as it can be extracted from structured databases. Information Extraction (IE)...

Mining and Modeling the Open . . .

by Jin Xu, 2007
"... ..."
Abstract - Cited by 28 (0 self) - Add to MetaCart
Abstract not found

Citation Context

...f web documents useful for a research objective. Our web mining process takes web documents as input, identifies a core fragment, and transforms that fragment into a structured and unambiguous format [28]. 2.5.1 Web Crawler Implementation In our Open Source Software study, a web crawler is developed to help extract useful information from OSS web resources [119]. A web crawler is a program which automa...
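
The context above sketches a crawl-then-extract pipeline. As a rough illustration of that idea only (not the implementation described in the cited work), the Python sketch below fetches pages breadth-first, hands each page to a pluggable extraction hook, and follows outgoing links; the extract() hook and the URL handling are assumptions.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href targets from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=10, extract=lambda url, html: None):
        seen, queue = set(), deque([seed_url])
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue                              # skip unreachable or malformed URLs
            extract(url, html)                        # hook for the structuring step
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))      # resolve relative links
        return seen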

Interactive Wrapper Generation with Minimal User Effort

by Utku Irmak, Torsten Suel - In Proc. of WWW, 2003
"... this paper, we describe a semi-automatic wrapper induction system with a powerful wrapper language that helps to capture sophisticated extraction scenarios. The main contributions of our work is the combination of a flexible user interface and algorithmic techniques to minimize the number of interac ..."
Abstract - Cited by 25 (0 self) - Add to MetaCart
In this paper, we describe a semi-automatic wrapper induction system with a powerful wrapper language that helps to capture sophisticated extraction scenarios. The main contribution of our work is the combination of a flexible user interface and algorithmic techniques to minimize the number of interactions required in the training process. We also give preliminary evaluation results.

Citation Context

...ld on average be helpful. 8. RELATED WORK Data extraction from the web has been studied extensively over the last few years. Detailed discussions of various approaches can be found in several surveys [7, 20, 17, 15, 14]. We now discuss some of the most closely related work. Semi-automatic wrapper induction tools such as WIEN [16], SoftMealy [11], and Stalker [22] represent documents as sequences of tokens or charact...

Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval

by Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li - In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005
"... This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML ..."
Abstract - Cited by 17 (5 self) - Add to MetaCart
This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; however, in reality HTML titles are often bogus. It is desirable to conduct automatic extraction of titles from the bodies of HTML documents. This is an issue which does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to address the problem. We propose a specification on HTML titles. We utilize format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval. We use the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that the use of both extracted titles and title fields is almost always better than the use of title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (23.1%-29.0% improvements).

Citation Context

...xperimental results. We make concluding remarks in Section 7. 2. RELATED WORK Web information extraction has become a popular research area recently and many issues have been intensively investigated [10]. Automatic extraction of web information has been studied for different information types. For instance, Liu et al. proposed a method of extracting data records from web pages [16]. Reis et al. inves...
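
The abstract above describes scoring candidate text blocks with format features (font size, position, font weight). A minimal sketch of that feature-plus-classifier idea, assuming pre-parsed text blocks and using a generic logistic-regression classifier rather than the authors' actual model:

    from sklearn.linear_model import LogisticRegression

    def block_features(block):
        # block: dict with hypothetical keys 'font_size', 'bold', 'position'
        return [block["font_size"], 1.0 if block["bold"] else 0.0, block["position"]]

    def train_title_model(labeled_blocks):
        # labeled_blocks: list of (block, label) pairs, label 1 = title, 0 = not
        X = [block_features(b) for b, _ in labeled_blocks]
        y = [label for _, label in labeled_blocks]
        return LogisticRegression().fit(X, y)

    def extract_title(model, blocks):
        # pick the block the classifier considers most title-like
        scores = model.predict_proba([block_features(b) for b in blocks])[:, 1]
        return blocks[int(scores.argmax())]

The baseline mentioned in the abstract would amount to returning the block with the largest font_size instead of a learned score.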

Adaptive Information Extraction

by Jordi Turmo, Alicia Ageno
"... The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition from textual data has lead to an increase in Information Extraction (IE) research. Some examples of these applications are the generation of data bases from documents, as well as the ..."
Abstract - Cited by 14 (1 self) - Add to MetaCart
The growing availability of on-line textual sources and the potential number of applications of knowledge acquisition from textual data have led to an increase in Information Extraction (IE) research. Some examples of these applications are the generation of databases from documents, as well as the acquisition of knowledge useful for emerging technologies like question answering, information integration, and others related to text mining. However, one of the main drawbacks of the application of IE refers to its intrinsic domain dependence. For the sake of reducing the high cost of manually adapting IE applications to new domains, experiments with different Machine Learning (ML) techniques have been carried out by the research community. This survey describes and compares the main approaches to IE and the different ML techniques used to achieve Adaptive IE technology.

Active Learning of Partially Hidden Markov Models

by Tobias Scheffer, Stefan Wrobel - In Proceedings of the ECML/PKDD Workshop on Instance Selection, 2001
"... We consider the task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled observation sequences are available for training. This setting is motivated by the information extraction problem, where only few tokens in the training documents are given a semantic tag while most t ..."
Abstract - Cited by 14 (2 self) - Add to MetaCart
We consider the task of learning hidden Markov models (HMMs) when only partially (sparsely) labeled observation sequences are available for training. This setting is motivated by the information extraction problem, where only a few tokens in the training documents are given a semantic tag while most tokens are unlabeled. We first describe the partially hidden Markov model together with an algorithm for learning HMMs from partially labeled data. We then present an active learning algorithm that selects "difficult" unlabeled tokens and asks the user to label them. We study empirically by how much active learning reduces the required data labeling effort, or increases the quality of the learned model achievable with a given amount of user effort.
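
As a rough illustration of the token-selection step described above (not the authors' exact criterion), the sketch below assumes per-token posterior label distributions from a partially hidden Markov model are already available (e.g. via forward-backward) and picks the least certain unlabeled tokens to hand to the user:

    import numpy as np

    def select_query_tokens(posteriors, labeled_mask, n_queries=5):
        """posteriors: (n_tokens, n_labels) array of P(label | sequence);
        labeled_mask: boolean array, True where a token already has a tag."""
        sorted_p = np.sort(posteriors, axis=1)
        margin = sorted_p[:, -1] - sorted_p[:, -2]        # small margin = "difficult" token
        margin = np.where(labeled_mask, np.inf, margin)   # never re-query labeled tokens
        return np.argsort(margin)[:n_queries]             # token indices to ask the user about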

Exploiting ASP for Semantic Information Extraction

by Massimo Ruffolo, Nicola Leone, Marco Manna, Domenico Saccà, Amedeo Zavatto, Exeura S.R.L. - In Proceedings ASP05 - Answer Set Programming: Advances in Theory and Implementation, 2005
"... WWW home page:http://www.exeura.it Abstract. The paper describesHıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic,HıLεX combines both semantic and syntactic knowledge for a powerful information extraction. ..."
Abstract - Cited by 13 (5 self) - Add to MetaCart
The paper describes HıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic, HıLεX combines both semantic and syntactic knowledge for powerful information extraction. In particular, the exploitation of background knowledge, stored in a domain ontology, significantly empowers the information extraction mechanisms. HıLεX is founded on a new two-dimensional representation of documents, and heavily exploits DLP+, an extension of disjunctive logic programming for ontology representation and reasoning which has recently been implemented on top of DLV. The domain ontology is represented in DLP+, and the extraction patterns are encoded by DLP+ reasoning modules, whose execution yields the actual extraction of information from the input document. HıLεX can extract information from both HTML and flat text documents.

Citation Context

...In the recent literature a number of approaches for information extraction from unstructured documents have been proposed. An overview of the large body of existing literature and systems is given in [1,2,3]. Existing systems are mainly purely syntactic, and they are not aware of the semantics of the information they extract. ⋆ This work was supported by the European Commission under project IST-2001-3700...

Ontology-Based Extraction of RDF Data from the World Wide Web

by Tim Chartrand, 2003
"... The simplicity and proliferation of the World Wide Web (WWW) has taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hin ..."
Abstract - Cited by 11 (0 self) - Add to MetaCart
The simplicity and proliferation of the World Wide Web (WWW) have taken the availability of information to an unprecedented level. The next generation of the Web, the Semantic Web, seeks to make information more usable by machines by introducing a more rigorous structure based on ontologies. One hindrance to the Semantic Web is the lack of existing semantically marked-up data. Until there is a critical mass of Semantic Web data, few people will develop and use Semantic Web applications. This project helps promote the Semantic Web by providing content. We apply existing information-extraction techniques, in particular, the BYU ontology-based data-extraction system, to extract information from the WWW based on a Semantic Web ontology to produce Semantic Web data with respect to that ontology. As an example of how the generated Semantic Web data can be used, we provide an application to browse the extracted data and the source documents together. In this sense, the extracted data is superimposed over or is an index over the source documents. Our experiments with ontologies in four application domains show that our approach can indeed extract Semantic Web data from the WWW with precision and recall similar to that achieved by the underlying information extraction system and make that data accessible to Semantic Web applications.

Citation Context

...There is an entire field of research called Information Extraction or Data Extraction that tries to extract unstructured or semistructured Web content so it can be stored and queried more efficiently [Eik99][LRNdST02]. The BYU Data Extraction Group (DEG) [Hom02] has developed an ontology-based data-extraction system called Ontos [ECJ+99]. Ontos uses a data-extraction ontology written in an extension of ...
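
As an illustration of the kind of output described in the abstract only (not the BYU Ontos system itself), the sketch below turns already-extracted field/value records into RDF triples with rdflib; the ontology namespace and record shape are assumptions:

    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    EX = Namespace("http://example.org/ontology#")   # hypothetical ontology namespace

    def records_to_rdf(records):
        # records: list of dicts like {"uri": ..., "type": ..., "name": ...}
        g = Graph()
        g.bind("ex", EX)
        for rec in records:
            subject = URIRef(rec["uri"])
            g.add((subject, RDF.type, EX[rec["type"]]))
            for field, value in rec.items():
                if field not in ("uri", "type"):
                    g.add((subject, EX[field], Literal(value)))
        return g.serialize(format="turtle")          # Semantic Web applications can consume this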

Automated Information Extraction from Web Sources: a Survey

by Giacomo Fiumara
"... Abstract. The Web contains an enormous quantity of information which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years some authors have addressed the problem to convert Web documents from unstructured or semi-struct ..."
Abstract - Cited by 8 (0 self) - Add to MetaCart
The Web contains an enormous quantity of information which is usually formatted for human users. This makes it difficult to extract relevant content from various sources. In the last few years some authors have addressed the problem of converting Web documents from an unstructured or semi-structured format into a structured and therefore machine-understandable format such as, for example, XML. In this paper we briefly survey some of the most promising and recently developed extraction tools.

Citation Context

...export the relevant text to a structured format, normally XML. Wrappers consist of a series of rules and some code to apply those rules and, generally speaking, are specific to a source. According to [6, 16] a classification of Web wrappers can be made on the basis of the kind of HTML pages that each wrapper is able to deal with. Three different types of Web pages can be distinguished: • unstructured page...
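
In the spirit of the wrapper description above (hand-written rules plus code to apply them, specific to one source, exporting to XML), a toy sketch with a made-up page layout and field names:

    import re
    import xml.etree.ElementTree as ET

    # Source-specific extraction rules; these regexes assume a hypothetical page layout.
    RULES = {
        "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
        "price": re.compile(r'class="price"[^>]*>([^<]+)<'),
    }

    def wrap(html):
        record = ET.Element("record")
        for field, pattern in RULES.items():
            match = pattern.search(html)
            if match:
                ET.SubElement(record, field).text = match.group(1).strip()
        return ET.tostring(record, encoding="unicode")  # structured XML output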

An agent-based approach to mailing list knowledge management

by Emanuela Moreale, Stuart Watt - Agent-Mediated Knowledge Management, Lecture Notes in Artificial Intelligence, 2004
"... The widespread use of computers and of the internet have brought about human information overload, particularly in the areas of internet searches and email management. This has made Knowledge Management a necessity, particularly in a business context. Agent technology – with its metaphor of agent as ..."
Abstract - Cited by 6 (0 self) - Add to MetaCart
The widespread use of computers and of the internet has brought about human information overload, particularly in the areas of internet searches and email management. This has made Knowledge Management a necessity, particularly in a business context. Agent technology, with its metaphor of agent as assistant, has shown promise in the area of information overload and is therefore a good candidate for Knowledge Management solutions. This paper illustrates a mailing list Knowledge Management tool that is centred around the concept of a mailing list assistant. We envisage this system as the first step towards a comprehensive agent-based Knowledge Management solution.

Citation Context

...nt is thus a must in today's organisations. Most work in this area has focused on web pages. These efforts range from information retrieval (IR) to information extraction (IE) and wrapper generation (Eikvil 1999). One of the most important types of document is email. According to a survey commissioned by BT Cellnet (Sturgeon 2001), UK employees spend up to eight hours per week on email. Most of us feel that ...
