Results 1 - 10
of
249
Data Integration: A Theoretical Perspective
- Symposium on Principles of Database Systems
, 2002
"... Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interestin ..."
Abstract
-
Cited by 585 (35 self)
- Add to MetaCart
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents on overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
Information integration using logical views
, 1997
"... Abstract. A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algo-rithms for conju ..."
Abstract
-
Cited by 395 (4 self)
- Add to MetaCart
Abstract. A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algo-rithms for conjunctive queries and/or Datalog programs. Then we com-pare the approaches taken by AT&T Labs ' "Information Manifold " and the Stanford "Tsimmis " project in these terms. 1 Theoretical Background Before addressing information-integration issues, let us review some of the basic ideas concerning conjunctive queries, Datalog programs, and their containment. To begin, we use the logical rule notation from [Ull88]. Example 1. The following: p(X,Z):- a(X,Y) & a(Y,Z). is a rule that talks about a, an EDB predicate ("Extensional DataBase, " or stored relation), and p, an IDB predicate ("Intensional DataBase, " or predicate whose relation is constructed by rules). In this and several other examples, it is useful to think of a as an "arc " predicate defining a graph, while other predicates define certain structures that might exist in the graph. That is, a(X, Y) means there is an arc from node X to node Y. In this case, the rule says "p(X, Z) is true if there is an arc from node X to node Y and also an arc from Y to Z." That is, p represents paths of length 2. In general, there is one atom, the head, on the left of the "if " sign,:- and zero of more atoms, called subgoals, on the right side (the body). The head always has an IDB predicate; the subgoals can have IDB or EDB predicates. Thus, here p(X, Z) is the head, while a(X, Y) and a(Y, Z) are subgoals. We assume that each variable appearing in the head also appears somewhere in the body. This "safety " requirement assures that when we use a rule, we are not left with undefined variables in the head when we try to infer a fact about the head's predicate. We also assume that atoms consist of a predicate and zero or more arguments. An argument can be either a variable or a constant. However, we exclude function symbols from arguments.
PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment
, 2000
"... Researchers in the ontology-design field have developed the content for ontologies in many domain areas. Recently, ontologies have become increasingly common on the WorldWide Web where they provide semantics for annotations in Web pages. This distributed nature of ontology development has led t ..."
Abstract
-
Cited by 336 (10 self)
- Add to MetaCart
Researchers in the ontology-design field have developed the content for ontologies in many domain areas. Recently, ontologies have become increasingly common on the WorldWide Web where they provide semantics for annotations in Web pages. This distributed nature of ontology development has led to a large number of ontologies covering overlapping domains. In order for these ontologies to be reused, they first need to be merged or aligned to one another. The processes of ontology alignment and merging are usually handled manually and often constitute a large and tedious portion of the sharing process. We have developed and implemented PROMPT, an algorithm that provides a semi-automatic approach to ontology merging and alignment. PROMPT performs some tasks automatically and guides the user in performing other tasks for which his intervention is required.
Semistructured data
, 1997
"... In semistructured data, the information that is normally as-sociated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semi-structured data there is no separate schema, in others it exists but only places loose constraints on the data. Semi-str ..."
Abstract
-
Cited by 230 (0 self)
- Add to MetaCart
In semistructured data, the information that is normally as-sociated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semi-structured data there is no separate schema, in others it exists but only places loose constraints on the data. Semi-structured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it. as semistructured for the purposes of browsing. This tu-torial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expres-sive language for querying and transformation, and opti-mizat,ion problems. 1 The motivation The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the conver-gence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul’s recent survey pa-per PI. The slides for this tutorial will be made available from a section of the Penn database home page
Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity
, 1998
"... Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. ..."
Abstract
-
Cited by 193 (13 self)
- Add to MetaCart
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
- In ICDE
, 2000
"... The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text les. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophist ..."
Abstract
-
Cited by 130 (7 self)
- Add to MetaCart
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text les. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applications need a smart way of extracting data from these web sources. One of the popular approaches is to write wrappers around the sources, either manually or with software assistance, to bring the web data within the reach of more sophisticated query tools and general mediator-based information integration systems. In this paper, we describe the methodology and the software development of an XML-enabled wrapper construction system- XWRAP for semi-automatic generation of wrapper programs. By XML-enabled we mean that the metadata about information content that are implicit in the original web pages will be extracted and encoded explicitly as XML tags in the wrapped documents. In addition, the query-based content ltering process is performed against the XML documents. The XWRAP wrapper generation framework has three distinct features. First, it explicitly separates
LARKS: Dynamic Matchmaking Among Heterogeneous Software Agents in Cyberspace
- in Cyberspace. Autonomous Agents and Multi-Agent Systems
, 2002
"... Introduction Theaeb{@ of servicesar deployed softwad aftwa in the mostfatb@ offspring of the Internet, the World Wide Web, isexponentiakp increatia Inabp{q}}b the Internet is a open environment, where infor maorb sources,communica}}{ links an anksb themselvesma aems a disaselv unpredicta}{} Thus,a ..."
Abstract
-
Cited by 114 (10 self)
- Add to MetaCart
Introduction Theaeb{@ of servicesar deployed softwad aftwa in the mostfatb@ offspring of the Internet, the World Wide Web, isexponentiakp increatia Inabp{q}}b the Internet is a open environment, where infor maorb sources,communica}}{ links an anksb themselvesma aems a disaselv unpredicta}{} Thus,a effective, affectiv seaec aa selection ofrelevaq services orab@pk isessentia forhuma usersae aersba well. We distinguish threegenera aner cara{p}fi in theCyberspaC{ service providers, service requester,a{ middle agents. Service providers provide some type of service, sucha finding informaorb} or performing somepab@}@v@b doma@ specific problem solving. Requesterauest need provideraovid to perform some service for them. Agentstha helploca{ othersah caers middle addle [6]. Matchmaking is the # Thisreseapv ha been sponsored inpa} by Office ofNa@{ Resea@} gra N-00014-96-16-1-1222, aby DARPAgraD F-30602-98-2-0138. 174 sycara et al. process of findinga aingb{fi@{v provider fora requester througha
Recursive Query Plans for Data Integration
- Journal of Logic Programming
, 1999
"... Generating query-answering plans for data integration systems requires to translate a user query, formulated in terms of a mediated schema, to a query that uses relations that are actually stored in data sources. Previous solutions to the translation problem produced sets of conjunctive plans, and w ..."
Abstract
-
Cited by 95 (4 self)
- Add to MetaCart
Generating query-answering plans for data integration systems requires to translate a user query, formulated in terms of a mediated schema, to a query that uses relations that are actually stored in data sources. Previous solutions to the translation problem produced sets of conjunctive plans, and were therefore limited in their ability to handle recursive queries and to exploit data sources with binding-pattern limitations and functional dependencies that are known to hold in the mediated schema. As a result, these plans were incomplete w.r.t. sources encountered in practice (i.e., produced only a subset of the possible answers). We describe the novel class of recursive query answering plans, which enables us to settle three open problems. First, we describe an algorithm for finding a query plan that produces the maximal set of answers from the sources for arbitrary recursive queries. Second, we extend this algorithm to use the presence of functional and full dependencies in the media...
InfoSleuth: Agent-Based Semantic Integration of Information in Open and Dynamic Environments
, 1997
"... The goM of the InfoSleuth project at MCC is to exploit and synthesize new technologies into a unified system that retrieves and processes information in an ever-changing net- work of information sources. InfoSleuth has its roots in the Carnot project at MCC, which specialized in integrating heteroge ..."
Abstract
-
Cited by 89 (4 self)
- Add to MetaCart
The goM of the InfoSleuth project at MCC is to exploit and synthesize new technologies into a unified system that retrieves and processes information in an ever-changing net- work of information sources. InfoSleuth has its roots in the Carnot project at MCC, which specialized in integrating heterogeneous information bases. However, recent emerging technologies such as internetworking and the World Wide Web have significantly expanded the types, availability, and volume of data available to an information management system. Furthermore, in these new environments, there is no formal control over the registration of new information sources, and applications tend to be developed without complete knowledge of the resources that will be available when they are run. Federated database projects such as Carnot that do static data integration do not scale up and do not cope well with this ever-changing environment. On the other hand, recent Web technologies, based on keyword search engines, are scalable but, unlike federated databases, are incapable of accessing information based on concepts. In this experience paper, we describe the architecture, design, and implementation of a working version of InfoSleuth. We show how InfoSleuth integrates new technological developments such as agent technology, domain ontologies, brokerage, and internet computing, in support of mediated interoperation of data and services in a dynamic and open environment. We demonstrate the use of information brokering and domain ontologies as key elements for sea/ability.
Learning Object Identification Rules for Information Integration
- Information Systems
, 2001
"... When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it di#cult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' ..."
Abstract
-
Cited by 77 (8 self)
- Add to MetaCart
When integrating information from multiple websites, the same data objects can exist in inconsistent text formats across sites, making it di#cult to identify matching objects using exact text match. We have developed an object identification system called Active Atlas, which compares the objects' shared attributes in order to identify matching objects. Certain attributes are more important for deciding if a mapping should exist between two objects. Previous methods of object identification have required manual construction of object identification rules or mapping rules for determining the mappings between objects. This manual process is time consuming and error-prone.

