Visual Web Information Extraction with Lixto
In The VLDB Journal, 2001
Cited by 157 (26 self)
We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user in semi-automatically creating wrapper programs by providing a fully visual and interactive user interface. With this convenient user interface, very expressive extraction programs can be created. Internally, this functionality is reflected by the new logic-based declarative language Elog. Users never have to deal with Elog, and even familiarity with HTML is not required. Lixto can be used to create an "XML-Companion" for an HTML web page with changing content, containing the continually updated XML translation of the relevant information.
Monadic Datalog and the Expressive Power of Languages for Web Information Extraction
J. ACM, 2002
Cited by 64 (10 self)
Research on information extraction from Web pages (wrapping) has seen much activity in recent times (particularly systems implementations), but little work has been done on formally studying the expressiveness of the formalisms proposed or on the theoretical foundations of wrapping. In this paper, we first study monadic datalog as a wrapping language (over ranked or unranked tree structures). Using previous work by Neven and Schwentick, we show that this simple language is equivalent to full monadic second-order logic (MSO) in its ability to specify wrappers. We believe that MSO has the right expressiveness required for Web information extraction and thus propose MSO as a yardstick for evaluating and comparing wrappers. Using the above result, we study the kernel fragment Elog⁻ of the Elog wrapping language used in the Lixto system (a visual wrapper generator). The striking fact here is that Elog⁻ exactly captures MSO, yet is easier to use. Indeed, programs in this language can be entirely visually specified. We also formally compare Elog to other wrapping languages proposed in the literature.
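The idea of a monadic datalog wrapper described in this abstract can be illustrated with a minimal sketch. It is not taken from the paper and is not Elog or the Lixto implementation. It assumes a toy tree type (`Node`), a hypothetical rule encoding `(head, body_predicates, step)`, and a naive fixpoint loop. Every derived predicate is unary (it selects a set of nodes), and the only binary relations used are `firstchild` and `nextsibling`, a common signature for unranked trees.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False keeps nodes hashable by identity
class Node:
    label: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def evaluate(root, rules):
    """Naive fixpoint evaluation of unary (monadic) rules over a tree.

    Each rule is (head, body, step), an illustrative encoding:
      step is None         : head(x) :- p1(x), ..., pk(x)
      step is "firstchild" : head(y) :- p1(x), ..., pk(x), firstchild(x, y)
      step is "nextsibling": head(y) :- p1(x), ..., pk(x), nextsibling(x, y)
    """
    nodes, firstchild, nextsibling = [], {}, {}
    stack = [root]
    while stack:
        n = stack.pop()
        nodes.append(n)
        if n.children:
            firstchild[n] = n.children[0]
        for left, right in zip(n.children, n.children[1:]):
            nextsibling[left] = right
        stack.extend(n.children)
    step_relation = {"firstchild": firstchild, "nextsibling": nextsibling}

    # Extensional facts: one unary predicate label_<tag> per node label.
    facts = {}
    for n in nodes:
        facts.setdefault("label_" + n.label, set()).add(n)

    changed = True
    while changed:  # repeat until no rule derives a new fact
        changed = False
        for head, body, step in rules:
            for n in nodes:
                if not all(n in facts.get(p, ()) for p in body):
                    continue
                target = n if step is None else step_relation[step].get(n)
                if target is None:
                    continue
                derived = facts.setdefault(head, set())
                if target not in derived:
                    derived.add(target)
                    changed = True
    return facts


# Select every td node that lies strictly below some table node.
rules = [
    ("below", ("label_table",), "firstchild"),
    ("below", ("below",), "firstchild"),
    ("below", ("below",), "nextsibling"),
    ("cell", ("below", "label_td"), None),
]

tree = Node("html", children=[
    Node("td", "outside"),
    Node("table", children=[
        Node("tr", children=[Node("td", "10"), Node("td", "20")]),
    ]),
])

selected = evaluate(tree, rules).get("cell", set())
```

Here `selected` holds the two `td` nodes inside the table and excludes the stray `td` outside it. The naive loop is chosen for readability, not efficiency.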
Monadic Queries over Tree-Structured Data
2002
Cited by 62 (7 self)
Monadic query languages over trees currently receive considerable interest in the database community, as the problem of selecting nodes from a tree is the most basic and widespread database query problem in the context of XML. Partly a survey of recent work done by the authors and their group on logical query languages for this problem and their expressiveness, this paper provides a number of new results related to the complexity of such languages over so-called axis relations (such as "child" or "descendant") which are motivated by their presence in the XPath standard or by their utility for data extraction (wrapping).
Information Integration Using Contextual Knowledge and Ontology Merging
2003
Cited by 39 (5 self)
With the advances in telecommunications, and the introduction of the Internet, information systems achieved physical connectivity, but have yet to establish logical connectivity. Lack of logical connectivity often invites disaster, as in the case of Mars Orbiter, which was lost because one team used metric units and the other English units while exchanging critical maneuver data. In this thesis, we focus on the two intertwined subproblems of logical connectivity, namely data extraction and data interpretation in the domain of heterogeneous information systems. The first challenge, data extraction, is about making it possible to easily exchange data among semi-structured and structured information systems. We describe the design and implementation of a general-purpose, regular-expression-based Caméléon wrapper engine with an integrated capabilities-aware planner/optimizer/executioner. The second challenge, data interpretation, deals with the existence of heterogeneous contexts, whereby each source of information and potential receiver of that information may operate with a different context, leading to large-scale semantic heterogeneity. We extend the existing formalization of the COIN framework with new logical formalisms and features to handle larger
Deriving marketing intelligence from online discussion
In KDD, 2005
Cited by 34 (4 self)
Weblogs and message boards provide online forums for discussion that record the voice of the public. Woven into this mass of discussion is a wide range of opinion and commentary about consumer products. This presents an opportunity for companies to understand and respond to the consumer by analyzing this unsolicited feedback. Given the volume, format and content of the data, the appropriate approach to understand this data is to use large-scale web and text data mining technologies. This paper argues that applications for mining large volumes of textual data for marketing intelligence should provide two key elements: a suite of powerful mining and visualization technologies and an interactive analysis environment which allows for rapid generation and testing of hypotheses. This paper presents such a system that gathers and annotates online discussion relating to consumer products using a wide variety of state-of-the-art techniques, including crawling, wrapping, search, text classification and computational linguistics. Marketing intelligence is derived through an interactive analysis framework uniquely configured to leverage the connectivity and content of annotated online discussion.
The Lixto Data Extraction Project -- Back and Forth between Theory and Practice
PODS 2004, 2004
Cited by 32 (1 self)
We present the Lixto project, which is both a research project in database theory and a commercial enterprise that develops Web data extraction (wrapping) and Web service definition software. We discuss the project's main motivations and ideas, in particular the use of a logic-based framework for wrapping. Then we present theoretical results on monadic datalog over trees and on Elog, its close relative which is used as the internal wrapper language in the Lixto system. These results include a characterization of both the expressive power and the complexity of these languages. We describe the visual wrapper specification process in Lixto and various practical aspects of wrapping. We discuss work on the complexity of query languages for trees that was stimulated by our theoretical study of logic-based languages for wrapping. Then we return to the practice of wrapping and the Lixto Transformation Server, which allows for streaming integration of data extracted from Web pages. This is a natural requirement in complex services based on Web wrapping. Finally, we discuss industrial applications of Lixto and point to open problems for future study.
The Personal Reader: Personalizing and Enriching Learning Resources using Semantic Web Technologies
Proc. of the AH 2004, 2004
Cited by 31 (7 self)
Traditional adaptive hypermedia systems have focused on providing adaptation functionality on a closed corpus, while Web search interfaces have delivered non-personalized information to users. In this paper, we show how we integrate closed corpus adaptation and global context provision in a Personal Reader environment. The local context consists of individually optimized recommendations to learning materials within the given corpus; the global context provides individually optimized recommendations to resources found on the Web, e.g., FAQs, student exercises, simulations, etc. The adaptive local context of a learning resource is generated by applying methods from adaptive educational hypermedia in a semantic web setting. The adaptive global context is generated by constructing appropriate queries, enriching them based on available user profile information, and, if necessary, relaxing them during the querying process according to available metadata.
Clip, connect, clone: combining application elements to build custom interfaces for information access
Proc. UIST 2004, 2004
Cited by 17 (3 self)
Many applications provide a form-like interface for requesting information: the user fills in some fields, submits the form, and the application presents corresponding results. Such a procedure becomes burdensome if (1) the user must submit many different requests, for example in pursuing a trial-and-error search, (2) results from one application are to be used as inputs for another, requiring the user to transfer them by hand, or (3) the user wants to compare results, but only the results from one request can be seen at a time. We describe how users can reduce this burden by creating custom interfaces using three mechanisms: clipping of input and result elements from existing applications to form cells on a spreadsheet; connecting these cells using formulas, thus enabling result transfer between applications; and cloning cells so that multiple requests can be handled side by side. We demonstrate a prototype of these mechanisms, initially specialised for handling Web applications, and show how it lets users build new interfaces to suit their individual needs.
Reasoning Methods for Personalization on the Semantic Web
Annals of Mathematics, Computing & Teleinformatics, 2004
Cited by 7 (4 self)
The Semantic Web vision of a next-generation Web, in which machines are enabled to understand the meaning of information in order to better interoperate and better support humans in carrying out their tasks, is very appealing and fosters the imagination of smarter applications that can retrieve, process and present information in enhanced ways. In this vision, particular attention should be devoted to personalization: by bringing the user's needs into the center of interaction processes, personalized Web systems overcome the one-size-fits-all paradigm and provide individually optimized access to Web data and information. In this paper, we provide an overview of recent trends for establishing personalization on the Semantic Web: based on a discussion of reasoning with rule and query languages for the Semantic Web, we outline an architecture for service-based personalization, and show results in personalizing Web applications.
Exploiting ASP for Semantic Information Extraction
In Proceedings ASP05 - Answer Set Programming: Advances in Theory and Implementation, 2005
Cited by 7 (4 self)
The paper describes HıLεX, a new ASP-based system for the extraction of information from unstructured documents. Unlike previous systems, which are mainly syntactic, HıLεX combines both semantic and syntactic knowledge for powerful information extraction. In particular, the exploitation of background knowledge, stored in a domain ontology, makes the information extraction mechanisms significantly more powerful. HıLεX is founded on a new two-dimensional representation of documents, and heavily exploits DLP+, an extension of disjunctive logic programming for ontology representation and reasoning which has been recently implemented on top of DLV. The domain ontology is represented in DLP+, and the extraction patterns are encoded by DLP+ reasoning modules, whose execution yields the actual extraction of information from the input document. HıLεX can extract information from both HTML and flat text documents.

