## Probabilistic data exchange (2010)

Venue: Proc. ICDT

Citations: 30 (5 self)

### BibTeX

@INPROCEEDINGS{Fagin10probabilisticdata,
    author = {Ronald Fagin and Benny Kimelfeld and Phokion G. Kolaitis},
    title = {Probabilistic data exchange},
    booktitle = {Proc. ICDT},
    year = {2010}
}

### Abstract

The work reported here lays the foundations of data exchange in the presence of probabilistic data. This requires rethinking the very basic concepts of traditional data exchange, such as solution, universal solution, and the certain answers of target queries. We develop a framework for data exchange over probabilistic databases, and make a case for its coherence and robustness. This framework applies to arbitrary schema mappings, and finite or countably infinite probability spaces on the source and target instances. After establishing this framework and formulating the key concepts, we study the application of the framework to a concrete and practical setting where probabilistic databases are compactly encoded by means of annotations formulated over random Boolean variables. In this setting, we study the problems of testing for the existence of solutions and universal solutions, materializing such solutions, and evaluating target queries (for unions of conjunctive queries) in both the exact sense and the approximate sense. For each of the problems, we carry out a complexity analysis based on properties of the annotation, in various classes of dependencies. Finally, we show that the framework and results easily and completely generalize to allow not only the data, but also the schema mapping itself to be probabilistic.
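As a rough illustration of the compact encoding mentioned in the abstract, the sketch below annotates facts with conditions over independent random Boolean variables and computes the marginal probability of each fact by enumerating all truth assignments. The fact names and conditions follow the paper's running example of annotated instances; the variable probabilities (0.5 each), the function name `fact_probability`, and the dictionary layout are assumptions made only for this example, and the brute-force enumeration is not the paper's algorithm.

```python
from itertools import product


def fact_probability(condition, var_probs):
    """Marginal probability that `condition` holds.

    `condition` maps an assignment (dict: variable name -> bool) to a bool.
    `var_probs` maps each independent Boolean variable to P(variable = True).
    The enumeration is exponential in the number of variables.
    """
    names = sorted(var_probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        assignment = dict(zip(names, values))
        if condition(assignment):
            # Independence: the weight of an assignment is a product.
            weight = 1.0
            for name in names:
                p = var_probs[name]
                weight *= p if assignment[name] else 1.0 - p
            total += weight
    return total


# Hypothetical probabilities; the source does not state them.
var_probs = {"e1": 0.5, "e2": 0.5, "e3": 0.5, "e4": 0.5}

# Facts annotated with conditions over the random variables.
annotated = {
    "Researcher(Emma, UCSD)": lambda a: True,
    "RArea(Emma, IR)": lambda a: a["e1"] or a["e2"],
    "RArea(Emma, DB)": lambda a: not a["e1"] and not a["e2"],
}

marginals = {
    fact: fact_probability(cond, var_probs)
    for fact, cond in annotated.items()
}
```

Under the assumed probabilities, `RArea(Emma, IR)` and `RArea(Emma, DB)` have complementary conditions, so their marginals sum to 1. Enumeration is only practical for a handful of variables, which is one reason the approximate evaluation studied in the paper is of interest.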

### Citations

2429 | Computational complexity - Papadimitriou - 1994 |

1194 | Geometric Algorithms and Combinatorial Optimization - Grötschel, Lovász, et al. - 1988 |

789 | Data integration: A theoretical perspective - Lenzerini |

528 | The complexity of computing the permanent - Valiant - 1979
Citation Context ... T. A fully polynomial randomized approximation scheme (ab9 A similar construction is used in [22] for the task of propagating trust conditions through data exchange between peers in a network. 10 #P [47] is the class of functions that count the number of accepting paths of the input of an NP machine. 11 Using an oracle to a #P-hard (or FP #P -hard) function, one can efficiently solve every problem in... |

474 | Optimal implementation of conjunctive queries in relational data bases - Chandra, Merlin - 1977
Citation Context ...inite 5 set of possible answers, which can be given to the user (along with the 4 Recall that there are fixed schemas over which testing ˜ J1 mat −→ is NP-hard even if ˜ J1 and ˜ J2 are deterministic [8]. 5 This is the case when ˜ K is a finite p-space. ˜ J2 confidence values); alternatively, the user may request k answers with the top probabilities [38]. Let (S,T,Σ) be a schema mapping and let Q be ... |

371 | Efficient Query Evaluation on Probabilistic Databases - Dalvi, Suciu - 2004
Citation Context ... even an FPAS if source instances are tuple-independent. To prove these results, we use techniques for approximating the number of satisfying assignments for a DNF formula [29, 34] (as done in, e.g., [12, 30]). For the rest of the studied cases, there is always a schema mapping Σ and a UCQ Q such that no FPRAS exists unless RP = NP. Actually, this holds even if we fix the approximation ratio ɛ (that is, t... |

355 | Incomplete information in relational databases - Imielinski, Lipski Jr. - 1984
Citation Context ...icitly specifying the whole probability space. 5.1 Annotated Instances We consider p-instances that are represented by means of Boolean pc-tables [24] (which are the probabilistic version of c-tables [27]) Fact f I α p re Researcher(Emma, UCSD) true Condition α(f) rj Researcher(John, UCSD) e1 ∨ e2 ∨ e3 ∨ e4 aeir RArea(Emma, IR) e1 ∨ e2 aedb RArea(Emma, DB) ¬e1 ∧ ¬e2 ajdb RArea(John, DB) e1 ∨ (¬e2 ∧ ¬e... |

355 | Stochastic Orders and Their Applications - Shaked, Shanthikumar - 1993
Citation Context ...ional concept (existence of a homomorphism) to p-instances. One definition is (again) in terms of a bivariate distribution, and the other two are based on the notion of a stochastic order (see, e.g., [45]). We show that the three are different from one another (and moreover, in the finite case, testing whether they hold belong to different complexity classes). So, we do not have one robust formalizati... |

340 | On representatives of subsets - Hall - 1935 |

338 | Data exchange: semantics and query answering - Fagin, Kolaitis, et al. - 2005
Citation Context ...how next when we consider the concept of answering target queries, a universal p-solution is also characterized by its usefulness in answering target conjunctive queries (as in the deterministic case [15]). These results indicate that the concept of a universal p-solution is very robust. Since a solution in our framework (namely, a p-solution) is inherently probabilistic, evaluating target queries amo... |

283 | Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science 43 - Jerrum, Valiant, et al. - 1986
Citation Context ...the (countable) set of all finite subsets Σ of Dep ST . 12 Note that the choice of the reliability factor 2/3 is arbitrary, since one can improve it to (1 − δ) by taking the median of O(log δ) trials [28]. Up until now, we considered schema mappings that are specified by triples (S,T,Σ) where Σ ∈ Dep ∗ ST. Here, as a starting point, we are interested in replacing the fixed Σ with a p-space ˜ Σ over De... |

235 | Graphs and Homomorphisms - Hell, Nešetřil - 2004 |

231 | Applying model management to classical meta-data problems - Bernstein |

230 | The Management of Probabilistic Data - Barbará, Garcia-Molina, et al. - 1992
Citation Context ... source instance are called solutions. Traditional data exchange is based on the assumption that source data are certain. However, the need to account for uncertainty in data has long been recognized [4, 19]. In view of the advent of the Web and related modern applications, models of uncertain data (typically probabilistic databases) have recently gained significant renewed focus [9–11,24,31,33,43,44]. I... |

201 | Translating web data - Popa, Velegrakis, et al. - 2002 |

189 | A probabilistic relational algebra for the integration of information retrieval and database systems - Fuhr, Rölleke - 1997
Citation Context ... source instance are called solutions. Traditional data exchange is based on the assumption that source data are certain. However, the need to account for uncertainty in data has long been recognized [4, 19]. In view of the advent of the Web and related modern applications, models of uncertain data (typically probabilistic databases) have recently gained significant renewed focus [9–11,24,31,33,43,44]. I... |

150 | Limits to Parallel Computation. P-Completeness Theory - Greenlaw, Hoover, et al. - 1995 |

148 | Efficient top-k query evaluation on probabilistic data - Re, Dalvi, et al. - 2007
Citation Context ...NP-hard even if ˜ J1 and ˜ J2 are deterministic [8]. 5 This is the case when ˜ K is a finite p-space. ˜ J2 confidence values); alternatively, the user may request k answers with the top probabilities [38]. Let (S,T,Σ) be a schema mapping and let Q be a k-ary query over T. In the deterministic case, answering Q means that, given a (deterministic) source instance I, we produce the certain answers, namel... |

146 | Data exchange: Getting to the core - Fagin, Kolaitis, et al.
Citation Context ...) are source and target p-instances, respectively. 5.2 Tuple/Equality-Generating Dependencies We consider two specific types of dependencies that were studied in past research on data exchange (e.g., [15, 16]); each dependency is a tuple-generating dependency (tgd) or an equality-generating dependency (egd) [5]. More particularly, let (S,T,Σ) be a schema mapping. A source-to-target tgd (st-tgd) is a formu... |

142 | Composing schema mappings: Second-order dependencies to the rescue - Fagin, Kolaitis, et al. - 2005 |

140 | Provenance semirings - Green, Karvounarakis, et al. - 2007
Citation Context ...ing where dependencies are in the form of tgds and egds [5, 15] (the formal definitions are in Section 5.2), and p-instances are represented compactly by annotating facts with probabilistic conditions [19,23,24] rather than explicitly specifying the whole probability space. 5.1 Annotated Instances We consider p-instances that are represented by means of Boolean pc-tables [24] (which are the probabilistic ver... |

122 | A proof procedure for data dependencies - Beeri, Vardi - 1984
Citation Context ...tations of probabilistic databases. 5. COMPACT REPRESENTATION In this section, we explore complexity aspects of data exchange in a concrete setting where dependencies are in the form of tgds and egds [5, 15] (the formal definitions are in Section 5.2), and p-instances are represented compactly by annotating facts with probabilistic conditions [19,23,24] rather than explicitly specifying the whole probabil... |

116 | Testing implications of data dependencies - Maier, Mendelzon, et al. - 1979
Citation Context ... candidate universal p-solutions, the intractable cases for source DNF instances remain intractable for tuple-independent instances. The positive results are obtained by combining the chase algorithm [5,15,35] with the known concept of maintaining conditions (or provenance) in relational operators, which is used in [23, 7 FP is the class of polynomial-time computable functions. 8 RP comprises the sets that... |

97 | Sur les tableaux de corrélation dont les marges sont données. Annales de l’Université de Lyon, Sect - Fréchet - 1951
Citation Context ...stance ˜J on the other. Our definition of a p-solution is based on the classical concept of a bivariate (joint) probability space with given marginals (research of this concept goes back to the 1950s [18,36]), but with the additional requirement that the support (i.e., the set of samples with a nonzero probability) is contained in a fixed relation (in this case, the source-solution relation). To explore ... |

93 | The Complexity of Facets (and Some Facets of Complexity) - Papadimitriou, Yannakakis - 1984
Citation Context ...he notation ≼sp −→ and ≼ge ←− (rather than, e.g., ≼ ′ sp and ≼ ′ ge) is for clarity of presentation. 3 Recall that DP is the class of problems that can be formed as a difference of two problems in NP [37]. 3. Testing ˜ J1 mat −→ is in NP. 4 4. Testing each of ˜ J1 ≼sp −→ instances ˜ J1 and ˜ ˜ J2, given two finite p-instances ˜ J1 and ˜ J2, ˜ J2 and ˜ J2 ≽ge −→ ˜ J1, given finite pJ2, is in EXPTIME an... |

93 | Data integration under integrity constraints - Calì, Calvanese, et al. - 2002 |

92 | Inclusion dependencies and their interaction with functional dependencies - Casanova, Fagin, et al. - 1984
Citation Context ...re transformations of individual facts. In 13 In [42], only the by-table type is considered. particular, the mappings of [13, 14, 41] for the by-tuple semantics are essentially inclusion dependencies [7]. 7. CONCLUSIONS In this paper, we developed a broad and flexible framework for data exchange over probabilistic data. For that, we had to consider the fundamental notions of traditional data exchange... |

90 | Monte-carlo approximation algorithms for enumeration problems - Karp, Luby, et al. - 1989
Citation Context ...nal algebra. 9 The lower bounds are proved using the inapproximability of determining the number of assignments satisfying a monotone 2-CNF formula (see, e.g., [49]), and the Monte-Carlo algorithm of [29] as a reduction technique. Answering target UCQs. The fourth problem is that of evaluating unions of conjunctive queries, and it corresponds to the rows of Table 1 entitled “Target UCQ: Exact.” Formal... |

90 | Reformulation of XML queries and constraints - Deutsch, Tannen - 2003 |

81 | Recursive unsolvability of a problem of Thue - Post - 1947 |

79 | MYSTIQ: a system for finding more answers by using probabilities - Boulos, Dalvi, et al. - 2005
Citation Context ...k to a concrete and practical setting, where the dependencies are from widely-studied classes, and where the probabilistic databases are compactly encoded in various conventional manners (e.g., as in [2, 6, 10, 31, 43]). Furthermore, in Section 6, we extend the framework and the results to allow the schema mapping (and the data) to be probabilistic. In principle, we could use this extended setting right from the be... |

79 | Data integration with uncertainty - Dong, Halevy, et al. - 2007
Citation Context ...llowing correlations between the probabilistic source data and mappings to be represented. To the best of our knowledge, this work is the first to study data exchange over probabilistic databases. In [13, 14, 41, 42], the problem of data exchange (and specifically data integration) for deterministic databases and probabilistic mappings is studied. The relationship between that work and this paper is discussed in ... |

77 | Trio: A system for data, uncertainty, and lineage - Agrawal, Benjelloun, et al. - 2006 |

76 | The Complexity of Relational Query Languages (Extended Abstract) - Vardi |

72 | XML data exchange: consistency and query answering - Arenas, Libkin - 2005 |

70 | Models for incomplete and probabilistic information - Green, Tannen - 2006
Citation Context ...h each fact. Such a representation (along with some statistical assumptions) is typically logarithmic-scale compact. So, following existing representations (e.g., ULDBs [2,43], probabilistic c-tables [24] and probabilistic trees [44]), we explore a setting where the source p-instance is represented compactly by annotating facts with conditions, which are formulas over a set of (Boolean and probabilist... |

68 | Counting classes are at least as hard as the polynomial-time hierarchy - Toda, Ogiwara - 1992
Citation Context ... that count the number of accepting paths of the input of an NP machine. 11 Using an oracle to a #P-hard (or FP #P -hard) function, one can efficiently solve every problem in the polynomial hierarchy [46]. brev. FPRAS) for Q is a randomized algorithm A that gets as input a DNF instance I α p over S, a tuple a, and a number ɛ > 0, and returns a (random) value A(I α p ,a) such that ( p PrA 1 + ɛ ≤ A(Iα ... |

59 | Update exchange with mappings and provenance - Green, Karvounarakis, et al. - 2007
Citation Context ... Table 1 entitled “Target UCQ: Approx.” Formally, let (S,T,Σ) be a schema mapping, and let Q be a UCQ over T. A fully polynomial randomized approximation scheme (ab9 A similar construction is used in [22] for the task of propagating trust conditions through data exchange between peers in a network. 10 #P [47] is the class of functions that count the number of accepting paths of the input of an NP mach... |

52 | Bootstrapping pay-as-you-go data integration systems - Sarma, Dong, et al. - 2008
Citation Context ...llowing correlations between the probabilistic source data and mappings to be represented. To the best of our knowledge, this work is the first to study data exchange over probabilistic databases. In [13, 14, 41, 42], the problem of data exchange (and specifically data integration) for deterministic databases and probabilistic mappings is studied. The relationship between that work and this paper is discussed in ... |

52 | The implication problem for data dependencies - Beeri, Vardi - 1981 |

49 | The dichotomy of conjunctive queries on probabilistic structures - Dalvi, Suciu - 2007
Citation Context ...ion is in disjunctive normal form; in a tuple-independent instance different facts are probabilistically independent, and the annotation effectively specifies the probability of each fact, as done in [6, 10, 11]). Our analysis is based on data complexity, which is common in studying the complexity aspects of data exchange, (e.g., [15–17, 20]). Thus, we hold fixed a schema mapping and a query (when relevant),... |

46 | The complexity of query reliability - Grädel, Gurevich, et al. - 1998
Citation Context ...where a trivial UCQ is a Boolean UCQ that is equivalent to true. To show this lower bound, we use hardness results of [11,12]; membership in FP #P is shown by adapting some of the techniques given in [21]. Given this intractability, the best that one can hope for when looking for tractable classes of schema mappings (in terms of target-query evaluation) is an evaluation in an approximate manner; in pra... |

44 | Composition of mappings given by embedded dependencies - Nash, Bernstein, et al. - 2005 |

38 | Functional and Inclusion Dependencies: A Graph Theoretic Approach - Cosmadakis, Kanellakis - 1986 |

38 | Universal Algebra, Second Edition - Grätzer - 1979 |

37 | On Unapproximable Versions of NP-Complete Problems - Zuckerman - 1996
Citation Context ...ure of annotated databases under relational algebra. 9 The lower bounds are proved using the inapproximability of determining the number of assignments satisfying a monotone 2-CNF formula (see, e.g., [49]), and the Monte-Carlo algorithm of [29] as a reduction technique. Answering target UCQs. The fourth problem is that of evaluating unions of conjunctive queries, and it corresponds to the rows of Tabl... |

35 | Horn Clauses and Database Dependencies - Fagin - 1982 |

31 | Exploiting lineage for confidence computation in uncertain and probabilistic databases - Sarma, Theobald, et al. - 2008
Citation Context ...k to a concrete and practical setting, where the dependencies are from widely-studied classes, and where the probabilistic databases are compactly encoded in various conventional manners (e.g., as in [2, 6, 10, 31, 43]). Furthermore, in Section 6, we extend the framework and the results to allow the schema mapping (and the data) to be probabilistic. In principle, we could use this extended setting right from the be... |

30 | Materialized views in probabilistic databases for information exchange and query optimization - Re, Suciu - 2007 |

30 | On the complexity of managing probabilistic XML data - Senellart, Abiteboul
Citation Context ...ation (along with some statistical assumptions) is typically logarithmic-scale compact. So, following existing representations (e.g., ULDBs [2,43], probabilistic c-tables [24] and probabilistic trees [44]), we explore a setting where the source p-instance is represented compactly by annotating facts with conditions, which are formulas over a set of (Boolean and probabilistically independent) random ev... |