Florescu, D., Koller, D., Levy, A.: Using probabilistic information in data integration. In: Proceedings of the International Conference on Very Large Databases (VLDB), Athens,
....first. Query execution can then be aborted as soon as the user has found a satisfactory answer, or when allotted resource limits have been reached. Example 1.2 Consider plan coverage, defined as the number of tuples returned by a plan that haven't been returned by any plan executed previously [6, 7, 12]. If sources have equal access cost, then executing query plans in the decreasing order of their coverage returns as many answers as possible as soon as possible. Consequently, it maximizes the likelihood of obtaining a satisfactory answer early [6]. If sources have differing access cost, ....
....by any plan executed previously [6, 7, 12]. If sources have equal access cost, then executing query plans in the decreasing order of their coverage returns as many answers as possible as soon as possible. Consequently, it maximizes the likelihood of obtaining a satisfactory answer early [6]. If sources have differing access cost, however, then preferences over coverage and cost can be modeled with a utility measure u(p) that combines coverage(p) and cost(p) using two constants that specify the tradeoffs [18]. Executing plans in the decreasing order of this utility value balances ....
[Article contains additional citation context not shown here]
D. Florescu, D. Koller, and A. Y. Levy. Using probabilistic information in data integration. In Proc. of VLDB '97.
....interacts with a mediator system via a mediated schema. A mediated schema is a set of virtual relations, which are effectively stored across multiple and potentially overlapping data sources, each of which contains only a partial extension of the relation. Query optimization in data integration [FKL97, NLF99, NK01, DH02] thus requires the ability to figure out what sources are most relevant to the given query, and in what order those sources should be accessed. For this purpose, the query optimizer needs to access statistics about the coverage of the individual sources with respect to the ....
....the query frequency of our experimental system. We set up a one-second delay for answering each query sent to a source to simulate the probing cost. In order to evaluate the effectiveness of our learned statistics, we implemented the Simple Greedy and Greedy Select algorithms described in [FKL97] to generate query plans using the learned source coverage and overlap statistics. Simple Greedy generates plans by greedily selecting the top k sources ranked according to their coverages, while Greedy Select selects sources with high residual coverages calculated using both the coverage and ....
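The contrast drawn in this excerpt between ranking by plain coverage and ranking by residual coverage can be illustrated with a small sketch. It is not the algorithm from [FKL97]: it assumes, purely for illustration, that each source is available as an explicit set of tuple ids, whereas the cited systems work from estimated coverage and overlap statistics. The function names and the toy data are hypothetical.

```python
def simple_greedy(sources, k):
    """Pick the k sources with the largest individual coverage.

    `sources` maps a source name to the set of tuple ids it returns
    (an illustrative assumption, not how the cited systems store statistics).
    """
    ranked = sorted(sources, key=lambda name: (-len(sources[name]), name))
    return ranked[:k]


def greedy_select(sources, k):
    """Pick up to k sources one at a time by residual coverage.

    Residual coverage here means the number of tuples a source adds
    beyond those already covered by the sources chosen so far.
    """
    chosen, covered = [], set()
    remaining = dict(sources)
    while remaining and len(chosen) < k:
        # Ties are broken by name so the result is deterministic.
        best = min(remaining, key=lambda n: (-len(remaining[n] - covered), n))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


# Hypothetical toy data: B is large but almost entirely contained in A.
sources = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 4},
    "C": {6, 7, 8},
}
top_by_coverage = simple_greedy(sources, 2)
top_by_residual = greedy_select(sources, 2)
```

On this toy input the coverage-only ranking picks A and B, although B adds nothing new after A, while the residual ranking picks A and C.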
[Article contains additional citation context not shown here]
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proceeding of the International Conference on Very Large Data Bases (VLDB), 1997.
....and potentially overlapping data sources, each of which may only contain a partial extension of the relation. Query optimization in data integration thus requires the ability to figure out what sources are most relevant to the given query, and in what order those sources should be accessed [FKL97, NLF99, NK01, DH02]. For this purpose, the query optimizer needs access to statistics about the coverage of the individual sources with respect to the given query, as well as the degree to which the answers they export overlap. We illustrate the need for these statistics with an example. BibFinder Example: We have ....
....not be authorized, it is impractical to assume that the sources will automatically export coverage and overlap statistics. Consequently, data integration systems should be able to learn the necessary statistics. Although previous work has addressed the issue of how to model these statistics (c.f. [FKL97]) and how to use them as part of query optimization (c.f. [NLF99, NK01, DH02]), there has not been any work on effectively learning the statistics in the first place. In our research, we address the problem of learning the coverage and overlap statistics for sources with respect to user ....
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proceeding of the International Conference on Very Large Data Bases (VLDB), 1997.
....and only then capture the true value of information sources. To the best of our knowledge the density criterion as we define it has never before been addressed in literature. Florescu et al. quantitatively describe the content of distributed autonomous document sources using probabilistic measures [4]. In their model, the authors calculate two values: Coverage of data sources, determining the probability that a given document is found in the source, and overlap between two data sources, determining the probability that an arbitrary document is found in both sources. These probabilities are ....
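The two probabilities described in this excerpt can be made concrete with a minimal sketch. It assumes a known, finite universe of documents and explicit set membership for each source, which is an illustrative simplification rather than the estimation procedure of the cited paper; all names and data are hypothetical.

```python
def coverage(source, universe):
    """Fraction of the universe's documents that the source contains."""
    if not universe:
        raise ValueError("universe must be non-empty")
    return len(source & universe) / len(universe)


def overlap(source_a, source_b, universe):
    """Fraction of the universe's documents found in both sources."""
    if not universe:
        raise ValueError("universe must be non-empty")
    return len(source_a & source_b & universe) / len(universe)


# Hypothetical universe of ten documents and two partially overlapping sources.
universe = set(range(10))
s1 = {0, 1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}

cov_s1 = coverage(s1, universe)      # 6 of 10 documents
cov_s2 = coverage(s2, universe)      # 4 of 10 documents
ovl = overlap(s1, s2, universe)      # documents 4 and 5 are in both
```

Read as probabilities, a document drawn uniformly from this universe is in s1 with probability 0.6, in s2 with probability 0.4, and in both with probability 0.2.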
D. Florescu, D. Koller, and A. Levy, "Using probabilistic information in data integration," in Proceedings of the International Conference on Very Large Databases (VLDB), Athens, Greece, 1997, pp. 216--225.
....first. Query execution can then be aborted as soon as the user has found a satisfactory answer, or when allotted resource limits have been reached. Example 1.2 Consider plan coverage, defined as the number of tuples returned by a plan that haven't been returned by any plan executed previously [6, 7, 12]. If sources have equal access cost, then executing query plans in the decreasing order of their coverage returns as many answers as possible as soon as possible. Consequently, it maximizes the likelihood of obtaining a satisfactory answer early [6]. If sources have differing access cost, however, ....
....returned by any plan executed previously [6, 7, 12]. If sources have equal access cost, then executing query plans in the decreasing order of their coverage returns as many answers as possible as soon as possible. Consequently, it maximizes the likelihood of obtaining a satisfactory answer early [6]. If sources have differing access cost, however, then preferences over coverage and cost can be modeled with a utility measure u(p) that combines coverage(p) and cost(p) using two constants that specify the tradeoffs [18]. Executing plans in the decreasing order of this utility value ....
[Article contains additional citation context not shown here]
D. Florescu, D. Koller, and A. Y. Levy. Using probabilistic information in data integration. In Proc. of VLDB '97.
....bindings. However, the computed answer may not be the complete answer. As we will see in Section 7, we can sometimes use the approach in these papers to compute the complete answer to a nonstable conjunctive query. Other works on computing answers to queries given incomplete source data include [FKL97, GGH98, KL88, Lev96, MM01], and these studies do not consider limited access patterns to relations. The dynamic case of computing a complete answer to a nonstable query, as illustrated in Section 7, is different from the case of dynamic mediators discussed in [YLGMU99]. In [YLGMU99] source ....
Daniela Florescu, Daphne Koller, and Alon Y. Levy. Using probabilistic information in data integration. In Proc. of VLDB, pages 216--225, 1997.
....interacts with a mediator system via a mediated schema. A mediated schema is a set of virtual relations, which are effectively stored across multiple and potentially overlapping data sources, each of which contains only a partial extension of the relation. Query optimization in data integration [FKL97, DL99, NLF99, NK01] thus requires the ability to figure out what sources are most relevant to the given query, and in what order those sources should be accessed. For this purpose, the query optimizer needs to access statistics about the coverage of the individual sources with respect to the ....
....be authorized, it is impractical to assume that the sources will automatically export coverage and overlap statistics. Consequently, Web data integration systems should be able to learn the necessary statistics. Although previous work has addressed the issue of how to model these statistics (c.f. [FKL97]) and how to use them as part of query optimization (c.f. [DL99, NLF99, NK01]), there has not been any work on effectively learning the statistics in the first place. 1.1 The StatMiner approach In this paper, we address the problem of learning the coverage and overlap statistics for sources with ....
[Article contains additional citation context not shown here]
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proceeding of the International Conference on Very Large Data Bases (VLDB), 1997.
....is of increasing importance. The extraction of the schema from semistructured data follows either unsupervised categorization or supervised categorization. The former we call clustering, the latter classification. Many researchers focus on classification of semistructured data or documents [4, 10]. One example of classification is summarization of documents [8]. The approach presented in this paper is instead to cluster data. In general, semistructured data sets are clustered according to XML elements. After a schema implicit in the cluster of semistructured data is extracted, ....
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proc. Intl. Conf on Very Large Data Bases, 1997.
....database can be undesirable for other reasons aside from the obvious issues of inefficiency: redundant copies of the same information can lead to redundant answers (or redundant explanations for the same answer). Correct, robust treatment of redundancy is a difficult issue, and previous research (e.g. [5]) has addressed only some of the issues. A complication associated with the use of WHIRL is that since object identities can be uncertain, determining if even a single pair of facts is redundant can be difficult. 3.3 Developing an integration system Both applications had a similar development ....
Daniela Florescu, Daphne Koller, and Alon Levy. Using probabilistic information in data integration. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997.
....is of increasing importance. The extraction of the schema from semistructured documents follows either unsupervised categorization or supervised categorization. The former we call clustering, the latter classification. Many researchers focus on classification of semistructured data or documents [4, 12]. One example of classification is summarization of documents [8]. The approach presented in this paper is instead to cluster documents. In general, semistructured documents are clustered according to tagged elements. After a schema implicit in the cluster of semistructured documents is ....
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proc. Intl. Conf. on Very Large Data Bases,
....the bucket of any subgoal, the worst-case complexity of this approach (in terms of planning time) is ### # # # #, as there can be # # distinct linear plans, and the cost of finding a feasible order for them using the approach in [LRO96] is ### # #. Executing top N Plans: More recent work [FKL97; NLF99; DL99] tried to make up for the prohibitive execution cost of the enumeration strategy used in [LRO96] by first ranking the enumerated plans in the order of their coverage (or, more broadly, quality) and then executing the top N plans, for some arbitrarily chosen N. The idea is to ....
....set up latency; 3. For each mediated relation # # , its coverage in the source ## denoted by # #### # #, for example, # ############## denotes that source # stores 80% of the tuples of the mediated relation ############ ###### of all the sources in the data integration system. Following [NLF99, FKL97], we also make the simplifying assumption that the sources are independent in that the probability that a tuple is present in source # # is independent of the probability that the same tuple is present in # # . These assumptions are in line with the types of statistics used by previous work ....
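Under the independence assumption stated in this excerpt, per-source coverages combine by elementary probability. The sketch below works through that arithmetic; the function names and the coverage values 0.8 and 0.5 are hypothetical and are not taken from the cited papers.

```python
import math


def joint_coverage(coverages):
    """Probability that a tuple appears in at least one of the sources,
    assuming presence in each source is an independent event."""
    miss_all = math.prod(1.0 - c for c in coverages)
    return 1.0 - miss_all


def pairwise_overlap(c1, c2):
    """Probability that a tuple appears in both sources under independence."""
    return c1 * c2


# Hypothetical coverages: one source stores 80% of a relation, another 50%.
combined = joint_coverage([0.8, 0.5])   # 1 - (0.2 * 0.5)
both = pairwise_overlap(0.8, 0.5)       # 0.8 * 0.5
```

With these numbers, querying both sources is expected to reach about 90% of the relation, and about 40% of its tuples would be returned twice.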
[Article contains additional citation context not shown here]
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proceeding of the International Conference on Very Large Data Bases (VLDB), 1997.
No context found.
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proc. of the Int. Conf. on Very Large Data Bases (VLDB), pages 216--225, Athens, Greece, 1997.
....judgments if they could be automatically produced. A logical formulation of the (conditional or local) completeness of information sources is considered in [2,35,39,40,50,77,90], while a probabilistic formalism is developed in [45]. For the most part, these papers focus on algorithms for choosing optimally between sources, leaving the construction of such resource descriptions as an open problem. Motro and Rakov's work [89] is an exception; they suggest a combined manual statistical approach to rating databases, resulting ....
....since a negative answer from a complete source is meaningful, the data integration system can prune access to other sources. The problem of describing completeness of Web sources and using this information for query processing is addressed in [2,35,39,40,50,77,90]. The work in [45] describes a probabilistic formalism for describing the contents and overlaps among information sources, and presents algorithms for choosing optimally between sources. Differing query processing capabilities. From the perspective of the Web data integration system, the Web sources appear to have ....
D. Florescu, D. Koller, A. Levy, Using probabilistic information in data integration, in: Proc. International Conference on Very Large Data Bases (VLDB), Athens, Greece, 1997, pp. 216--225.
No context found.
Florescu, D., Koller, D., Levy, A.: Using probabilistic information in data integration. In: Proceedings of the International Conference on Very Large Databases (VLDB), Athens,
No context found.
Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. In: The VLDB Journal. (1997) 216--225
No context found.
Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. In: The VLDB Journal. (1997) 216--225
No context found.
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proc. of VLDB, 1997.
No context found.
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. In Proceeding of the International Conference on Very Large Data Bases (VLDB), 1997.
No context found.
Daniela Florescu, Daphne Koller, Alon Y. Levy, and Avi Pfeffer. Using probabilistic information in data integration. In Proceedings of VLDB-97, 1997.
No context found.
D. Florescu, D. Koller, and A. Levy. Using probabilistic information in data integration. Proceeding of VLDB., 1997.
No context found.
Daniela Florescu, Daphne Koller, Alon Y. Levy, and Avi Pfeffer. Using probabilistic information in data integration. In Proceedings of VLDB-97, 1997.
No context found.
Daniela Florescu, Daphne Koller, and Alon Levy. Using probabilistic information in data integration. In Proc. of VLDB, pages 216-225, Athens, Greece, 1997.
No context found.
Daniela Florescu, Daphne Koller, and Alon Y. Levy. Using Probabilistic Information in Data Integration. In Proceedings of the 23rd International Conference on Very Large Databases (VLDB), 1997.
No context found.
Daniela Florescu, Daphne Koller, and Alon Levy. Using probabilistic information in data integration. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 216--225, 1997.
No context found.
Daniela Florescu, Daphne Koller, Alon Y. Levy, and Avi Pfeffer. Using probabilistic information in data integration. In Proceedings of VLDB-97, 1997.