Visual Web Information Extraction with Lixto
In The VLDB Journal, 2001
Cited by 157 (26 self)
We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user in semi-automatically creating wrapper programs by providing a fully visual and interactive user interface. In this convenient user interface, very expressive extraction programs can be created. Internally, this functionality is reflected by the new logic-based declarative language Elog. Users never have to deal with Elog, and even familiarity with HTML is not required. Lixto can be used to create an "XML-Companion" for an HTML web page with changing content, containing the continually updated XML translation of the relevant information.
WebViews: Accessing Personalized Web Content and Services
2001
Cited by 31 (1 self)
The ability to take information, entertainment and e-commerce on the go has great promise. However, the existing Web infrastructure and content were designed for desktop computers and are not well-suited for other types of accesses, e.g., devices that have less processing power and memory, small screens, and limited input facilities, or through wireless data networks with low bandwidth and high latency. Thus, there is a growing need for techniques that provide alternative means to access Web content and services, be it the ability to browse the Web through a wireless PDA or smart phone, or hands-free access through voice interfaces. In this paper, we discuss issues involved in making existing Web content and services available for diverse environments, and describe WebViews, a system that allows casual Web users to easily create customized views of Web sites that are well-suited for different types of terminals. In particular, we describe our approach to provide voice access to these Web views and experiences in building the system.
Keywords: content transcoding, dynamic content, electronic commerce, information delivery, personalization, smart bookmarks, voice interfaces, Web clipping, wrappers
Declarative Information Extraction, Web Crawling, and Recursive Wrapping with Lixto
In Proc. LPNMR’01, 2001
Cited by 27 (8 self)
Lixto is a system and method for the visual and interactive generation of wrappers for Web pages under the supervision of a human developer, for automatically extracting information from Web pages using such wrappers, and for translating the extracted content into XML.
Automatic creation and simplified querying of semantic Web content: An approach based on information-extraction ontologies
In Proceedings of the First Asian Semantic Web Conference (ASWC 2006), LNCS 4185, 2006
Cited by 20 (9 self)
The semantic web represents a major advance in web utility, but it is difficult to create semantic-web content because pages must be semantically annotated through processes that are mostly manual and require a high degree of engineering skill. Furthermore, users need an effective way to query the semantic web, but any burden we place on users to learn a query language is unlikely to garner sufficient user support and interest. If we want users to take advantage of the semantic web, we must devise a means for transforming existing (non-semantic) web pages into semantic web pages, and we must provide a simple and unrestricted interface for processing user queries. We propose using information extraction ontologies to handle both of these challenges. We show how a successful ontology-based data-extraction technique can (1) automatically generate semantic annotations for ordinary web pages, and (2) support free-form, textual queries. Our approach demonstrates how the semantic web can be created for and used by ordinary people. We have created an initial prototype to demonstrate that our proposal works.
Reverse engineering for web data: From visual to semantic structures
In Intl. Conf. on Data Engineering (ICDE), 2002
Cited by 18 (0 self)
Despite the advancement of XML, the majority of documents on the Web is still marked up with HTML for visual rendering purposes only, thus building a huge amount of "legacy" data. In order to facilitate querying Web based data in a way more efficient and effective than just keyword based retrieval, enriching such Web documents with both structure and semantics is necessary. This paper describes a novel approach to the integration of topic specific HTML documents into a repository of XML documents. In particular, we describe how topic specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques and rules for document conversion and majority schema discovery. We finally demonstrate the feasibility and effectiveness of our approach by applying it to a set of résumé HTML documents gathered by a Web crawler.
Information Extraction from Tree Documents by Learning Subtree Delimiters
In Proc. IIWeb’03, 2003
Cited by 7 (0 self)
Information extraction from HTML pages has been conventionally treated as plain text documents extended with HTML tags. However, the growing maturity and correct usage of HTML/XHTML formats open an opportunity to treat Web pages as trees, to mine the rich structural context in the trees and to learn accurate extraction rules. In this paper, we generalize the notion of delimiter developed for the string information extraction to tree documents.
Design and implementation of the physical layer in webbases: The XRover experience
In First International Conference on Computational Logic, DOOD’2000 Stream, 2000
"... , and I.V. Ramakrishnan 2 1 ..."
Semantic bookmarking for non-visual web access
In ACM Conf. on Assistive Technologies (ASSETS), 2004
Cited by 7 (2 self)
Bookmarks are shortcuts that enable quick access of the desired Web content. They have become a standard feature in any browser and recent studies have shown that they can be very useful for non-visual Web access as well. Current bookmarking techniques in assistive Web browsers are rigidly tied to the structure of Web pages. Consequently they are susceptible to even slight changes in the structure of Web pages. In this paper we propose semantic bookmarking for non-visual Web access. With the help of an ontology that represents concepts in a domain, content in Web pages can be semantically associated with bookmarks. As long as these associations can be identified, semantic bookmarks are resilient in the face of structural changes to the Web page. The use of ontologies allows semantic bookmarks to span multiple Web sites covered by a common domain. This contributes to the ease of information retrieval and bookmark maintenance. In this paper we describe highly automated techniques for creating and retrieving semantic bookmarks. These techniques have been incorporated into an assistive Web browser. Preliminary experimental evidence suggests the effectiveness of semantic bookmarks for non-visual Web access.
Quixote: Building XML Repositories from Topic Specific Web Documents
In Fourth Int. Workshop on the Web and Databases (WebDB'2001), 2001
Cited by 2 (1 self)
Despite major advancements in information retrieval techniques employed by today's Web search engines, building applications that allow users to efficiently manage, query, and utilize large collections of related Web documents from diverse, highly heterogeneous sources is still a hard problem. Even in the case where potentially related documents that pertain to the same topic can be gathered efficiently using, e.g., a focused Web crawler, the documents are still heterogeneous both in terms of structure and presentation, due to different authorship. More importantly, the documents are marked up in HTML for visual rendering purposes, thus hampering sophisticated query schemes different from simple keyword-based searches. In this paper, we outline the concepts and methods underlying Quixote, a system that allows users to rapidly build XML document repositories from large collections of topic specific HTML documents. Such documents are assumed to be gathered by a top...
Using Wrappers for Device Independent Web Access: Opportunities, Challenges and Limitations (Extended Abstract)
Cited by 2 (0 self)
The availability of technologies that enable mobile access to data has brought great expectations that users would be able to access information, entertainment and e-commerce any time, anywhere. However, the existing Web infrastructure and content were designed for desktop computers and are not well-suited for other types of accesses, e.g., devices that have less processing power and memory, small screens, and limited input facilities, or through wireless data networks with low bandwidth and high latency. Thus, there is a growing need for techniques that provide alternative means to access Web content and services, be it the ability to browse the Web through a wireless PDA or smart phone, or hands-free access through voice interfaces. In this paper, we discuss issues involved in providing ubiquitous access to Web data. We present techniques and systems for building wrappers that address these issues, and discuss their features and limitations in different application scenarios.

