## Approximate Lineage for Probabilistic Databases

Citations: 34 (9 self)

### Citations

2197 | Randomized Algorithms.
- Motwani, Raghavan
- 1995
Citation Context ... $\tilde\lambda^S_1$ and $\tilde\lambda^S_2$ are sufficient $\varepsilon$-approximations, then both $\tilde\lambda^S_1 \wedge \tilde\lambda^S_2$ and $\tilde\lambda^S_1 \vee \tilde\lambda^S_2$ are $2\varepsilon$ sufficient approximations. This proposition is essentially an application of a union bound =-=[42]-=-. From this proposition and the fact that a query q that produces n tuples and has k subgoals has kn logical operations, we can conclude that if all lineage functions are εS approximations, then µ(q) ... |

1903 |
Causality: models, reasoning and inference.
- Pearl
- 2000
Citation Context ... λt(A ⊕ {i})] (4) where ⊕ denotes the symmetric difference. This definition, or a closely related one, has appeared in a wide variety of work, e.g. underlying causality in the AI literature =-=[31, 44]-=-, influential variables in the learning literature [40], and critical tuples in the database literature [41, 47]. Example 2.12 What influence does x2 have on tuple t6 presence, i.e. what is the value ... |

1881 |
Foundations of Databases.
- Abiteboul, Hull, et al.
- 1995
Citation Context ...nt in Dr. X’s proclamations. An important special case is when p(a) = 1, which indicates absolute certainty. Definition 2.4. Fix a set of atoms A. A probabilistic assignment p is a function from A to =-=[0, 1]-=- that assigns a probability score to each atom a ∈ A. A probabilistic database W is a probabilistic assignment p and a lineage function λ that represents a distribution µ over worlds defined as: ⎛ ⎞ ⎛... |

1127 | An introduction to variational methods for graphical models.
- Jordan, Ghahramani, et al.
- 1999
Citation Context ...ed [47], but only with an exact semantics. Sen et al. [49] consider approximate processing of relational queries using graphical models, but not approximate lineage. In the graphical model literature =-=[15, 37]-=- approximate representation is considered, where the goal is to compress the model for improved performance. However, the data and query models of the our approaches is different. Specifically, our a... |

773 |
Probabilistic Networks and Expert Systems. Statistics for Engineering and Information Science.
- Cowell, Dawid, et al.
- 1999
Citation Context ...ed [47], but only with an exact semantics. Sen et al. [49] consider approximate processing of relational queries using graphical models, but not approximate lineage. In the graphical model literature =-=[15, 37]-=- approximate representation is considered, where the goal is to compress the model for improved performance. However, the data and query models of the our approaches is different. Specifically, our a... |

748 | The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res.
- Boeckmann, Bairoch, et al.
- 2003
Citation Context ... GO is a set of associations between proteins and their functions. These associations are integrated by GO from many sources, such as PubMed articles [45], raw experimental data, data from SWISS-PROT =-=[9]-=-, and automatically inferred matchings. GO tracks the provenance of each association, using what we call atoms. An atom is simply a tuple that contains a description of the source of a statement. An e... |

456 | Efficient query evaluation on probabilistic databases.
- Dalvi, Suciu
- 2004
Citation Context ...UCTION In probabilistic databases, lineage is fundamental to processing probabilistic queries and understanding the data. Many state-of-the-art systems use a complete approach, e.g. Trio [7] or Mystiq =-=[16, 46]-=-, in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In this paper, we observe that for many applications, it is often unnecessary for the system to painsta... |

417 |
Incomplete information in relational databases.
- Imielinski, Lipski
- 1984
Citation Context ...d a constant ‘a’. For a relational database W, we write W |= q to denote that W entails q. 2.2 Lineage and Probabilistic Databases In this section, we adopt a viewpoint of lineage similar to c-tables =-=[29, 36]-=-; we think of lineage as a constraint that tells us which worlds are possible. This viewpoint results in the standard possible worlds semantics for probabilistic databases [16, 20, 29]. Definition 2.1... |

359 | The Merge/Purge Problem for Large Databases.
- Hernandez, Stolfo
- 1995
Citation Context ...lication (2): Managing Similarity Scores Applications that manage similarity scores can benefit from approximate lineage. Such applications include managing data from object reconciliation procedures =-=[3, 34]-=- or similarity scores between users, such as iLike.com. In iLike, the system automatically assigns a music compatibility score between friends. The similarity score between two users, e.g. Bob and Joe... |

327 |
Constant depth circuits, Fourier transform, and learnability.
- Linial, Mansour, et al.
- 1993
Citation Context ... atoms induced by the probability function p. Our goal is to ensure that the lineage function for every tuple in the database has an ε-approximation. Def. 2.10 is used in computational learning, e.g. =-=[40, 50]-=-, because an ε-approximation of a function disagrees on only a few inputs: Example 2.11 Let y1 and y2 be atoms such that p(yi) = 0.5 for i = 1, 2. Consider the lineage function for some t, λt(y1, y2) = y... |

310 | ULDBs: Databases with Uncertainty and Lineage.
- Benjelloun, Sarma, et al.
- 2006
Citation Context ... a tuple t, that is, a monotone k-DNF formula λt, our goal is to efficiently find a sufficient lineage ˜λ S t that is small and is an ε-approximation of λt (Def. 2.10). This differs from L-minimality =-=[6]-=- that looks for a formula that is equivalent, but smaller. In contrast, we look for a formula that may only approximate the original formula. More formally, the size of a sufficient lineage ˜λ S t is ... |

268 | Trio: A System for Integrated Management of Data, Accuracy, and Lineage.
- Widom
- 2005
Citation Context .... However, this approach is not lineage aware and so cannot extract explanations from the compressed data. In probabilistic databases, lineage is used for query processing in Mystiq [16, 46] and Trio =-=[54]-=-. However, neither considers approximate lineage. Ré et al. [46] consider approximately computing the probability of a query answer, but do not consider the problem of storing the lineage of a query a... |

258 | A logic for reasoning about probabilities.
- Fagin, Halpern, et al.
- 1990
Citation Context ...similar to c-tables [29, 36]; we think of lineage as a constraint that tells us which worlds are possible. This viewpoint results in the standard possible worlds semantics for probabilistic databases =-=[16, 20, 29]-=-. Definition 2.1 (Lineage Function). An atom is a Boolean proposition about the real world, e.g. Bob likes Herbie Hancock. Fix a relational schema σ and a set of atoms A. A lineage function, λ, assign... |

227 |
C-Store: a column-oriented DBMS.
- Stonebraker, Abadi, et al.
- 2005
Citation Context ...systems [13]. Of these, only [30] considers probabilistic data, but not approximate semantics. There is long, successful line of work that compresses (deterministic) data to speed up query processing =-=[18, 23, 25, 51, 53]-=-. In wavelet approaches, probabilistic techniques are used to achieve a higher quality synopses, [18]. In contrast, lineage in our setting contains probabilities, which must be captured. The fact that... |

212 | A probabilistic relational algebra for the integration of information retrieval and database systems.
- Fuhr, Rolleke
- 1997
Citation Context ...lineage formula. Processing a query q on a database with lineage boils down to building a lineage expression for q by combining the lineage functions of individual tuples, i.e. intensional evaluation =-=[22, 46]-=-. For example, a join producing a tuple t from t1 and t2 produces lineage for t, λt = λt1 ∧ λt2 . We first prove that the error in processing a query q is upper bounded by the number of lineage functi... |

207 | Learning decision trees using the Fourier spectrum.
- Kushilevitz, Mansour
- 1993
Citation Context ... λt is the s largest coefficients in absolute value, ties broken arbitrarily. 4.2 Constructing Lineage We construct polynomial lineage by searching for the largest coefficients using the KM algorithm =-=[39]-=-. The KM algorithm is complete in the sense that if there is an (s, ε) sparse approximation it finds an only slightly worse (s, ε + ε 2 /s) approximation. The key technical insight, is that k-DNFs do ... |

201 | Causes and Explanations: A Structural-Model Approach: Part 1. The British Journal for the Philosophy of Science.
- Halpern, Pearl
- 2005
Citation Context ... λt(A ⊕ {i})] (4) where ⊕ denotes the symmetric difference. This definition, or a closely related one, has appeared in a wide variety of work, e.g. underlying causality in the AI literature =-=[31, 44]-=-, influential variables in the learning literature [40], and critical tuples in the database literature [41, 47]. Example 2.12 What influence does x2 have on tuple t6 presence, i.e. what is the value ... |

201 |
Computational Limitations of Small-depth Circuits.
- Håstad
- 1987
Citation Context ...omial lineage is based on computational learning techniques, such as the seminal paper by Linial et al. [40], and others, [8, 11, 43]. A key ingredient underlying these results are switching lemmata, =-=[5, 32, 48]-=-. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage [4, 24]. A difference between our approach and this prior art is that we do not discar... |

198 | Approximate computation of multidimensional aggregates of sparse data using wavelets.
- Vitter, Wang
- 1999
Citation Context ...systems [13]. Of these, only [30] considers probabilistic data, but not approximate semantics. There is long, successful line of work that compresses (deterministic) data to speed up query processing =-=[18, 23, 25, 51, 53]-=-. In wavelet approaches, probabilistic techniques are used to achieve a higher quality synopses, [18]. In contrast, lineage in our setting contains probabilities, which must be captured. The fact that... |

196 | Provenance semirings.
- Green, Karvounarakis, et al.
- 2007
Citation Context ...TED WORK Lineage systems and provenance are important topics in data management [12, 13, 17, 28]. Compressing lineage is cited as an important technique for scaling these systems [13]. Of these, only =-=[30]-=- considers probabilistic data, but not approximate semantics. There is a long, successful line of work that compresses (deterministic) data to speed up query processing [18, 23, 25, 51, 53]. In wavelet ... |

182 | Efficient top-k query evaluation on probabilistic data.
- Ré, Dalvi, et al.
- 2007
Citation Context ...UCTION In probabilistic databases, lineage is fundamental to processing probabilistic queries and understanding the data. Many state-of-the-art systems use a complete approach, e.g. Trio [7] or Mystiq =-=[16, 46]-=-, in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In this paper, we observe that for many applications, it is often unnecessary for the system to painsta... |

145 | Eliminating fuzzy duplicates in data warehouses.
- Ananthakrishna, Chaudhuri, et al.
- 2002
Citation Context ...lication (2): Managing Similarity Scores Applications that manage similarity scores can benefit from approximate lineage. Such applications include managing data from object reconciliation procedures =-=[3, 34]-=- or similarity scores between users, such as iLike.com. In iLike, the system automatically assigns a music compatibility score between friends. The similarity score between two users, e.g. Bob and Joe... |

144 |
Monte-Carlo algorithms for enumeration and reliability problems.
- Karp, Luby
- 1983
Citation Context ... , a sufficient ε-approximation. For simplicity, we assume that we can compute the expectation of monotone formula exactly. In practice, we estimate this quantity using sampling, e.g. using Luby-Karp =-=[38]-=-. The algorithm has two cases: In case (I) on lines 2-4, there is a large matching, that is, a set of monomials M such that distinct monomials in M do not contain common variables. For example, in the... |

142 | Representing and Querying Correlated Tuples in Probabilistic Databases.
- Sen, Deshpande
- 2007
Citation Context ...tabase using sufficient lineage. Approximate lineage is used to materialize views of probabilistic data; this problem has been previously considered [47], but only with an exact semantics. Sen et al. =-=[49]-=- consider approximate processing of relational queries using graphical models, but not approximate lineage. In the graphical model literature [15, 37] approximate representation is considered, where t... |

130 | Weakly learning DNF and characterizing statistical query learning using Fourier analysis.
- Blum, Furst, et al.
- 1994
Citation Context ...we can answer queries with low error, $10^{-3}$, 2 orders of magnitude more quickly than a complete approach. For polynomial lineage, we are able to directly adapt techniques from the literature, such as =-=[8]-=-. 2.4 Discussion The acquisition of atoms and trust policies is an interesting future research direction. Since our focus is on large databases, it is impractical to require users to label each atom m... |

122 | Provenance Management in Curated Databases.
- Buneman, Chapman, et al.
- 2006
Citation Context ...e, e.g. 100s of MB to 1MB, while providing high-quality explanations. Application (1): Large Scientific databases In large scientific databases, lineage is used to integrate data from several sources =-=[12]-=-. These sources are combined by both large consortia, e.g. [14], and single research groups. A key challenge faced by scientists is that facts from different sources may not be trusted equally. For ex... |

106 | A Formal Analysis of Information Disclosure in Data Exchange.
- Miklau, Suciu
- 2006

98 | Selectivity estimation using probabilistic models.
- Getoor, Taskar, et al.
- 2001
Citation Context ... 11, 43]. A key ingredient underlying these results are switching lemmata, [5, 32, 48]. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage =-=[4, 24]-=-. A difference between our approach and this prior art is that we do not discard any tuples, but may discard lineage. Explanation is a well-studied topic in the Artificial Intelligence community, see ... |

96 |
MYSTIQ: a system for finding more answers by using probabilities.
- Boulos, Dalvi, et al.
- 2005
Citation Context ...Gb of RAM. Our prototype implementation of the compression algorithms was written in approximately 2000 lines of Caml. Query performance was done using a modified C++/caml version of the Mystiq engine=-=[10]-=- backed by databases running SQL Server 2005. The implementation was not heavily optimized. 5.2 Compression We verify that our compression algorithms produce small approximate lineage, even for string... |

83 | Models for incomplete and probabilistic information.
- Green, Tannen
- 2006
Citation Context ...d a constant ‘a’. For a relational database W, we write W |= q to denote that W entails q. 2.2 Lineage and Probabilistic Databases In this section, we adopt a viewpoint of lineage similar to c-tables =-=[29, 36]-=-; we think of lineage as a constraint that tells us which worlds are possible. This viewpoint results in the standard possible worlds semantics for probabilistic databases [16, 20, 29]. Definition 2.1... |

61 | Compressing relations and indexes.
- Goldstein, Ramakrishnan, et al.
- 1998
Citation Context ...systems [13]. Of these, only [30] considers probabilistic data, but not approximate semantics. There is long, successful line of work that compresses (deterministic) data to speed up query processing =-=[18, 23, 25, 51, 53]-=-. In wavelet approaches, probabilistic techniques are used to achieve a higher quality synopses, [18]. In contrast, lineage in our setting contains probabilities, which must be captured. The fact that... |

60 | On the Fourier spectrum of monotone functions.
- Bshouty, Tamon
- 1996
Citation Context ...ct that lineage is database is often internal. Our approach to computing polynomial lineage is based on computational learning techniques, such as the seminal paper by Linial et al. [40], and others, =-=[8, 11, 43]-=-. A key ingredient underlying these results are switching lemmata, [5, 32, 48]. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage [4, 24].... |

50 | The Complexity of Query Reliability.
- Grädel, Gurevich, et al.
- 1998
Citation Context ...xponential in the size of the data. Probabilistic query evaluation can be reduced to calculating a single coefficient of the transform, which implies exact computation of the transform is intractable =-=[16, 27]-=-. Aref et al. [19] advocate an approach to operate directly on compressed data to optimize queries on biological sequences. However, this approach is not lineage-aware and so cannot extract explanation... |

49 |
A Switching Lemma Primer.
- Beame
- 1994
Citation Context ...omial lineage is based on computational learning techniques, such as the seminal paper by Linial et al. [40], and others, [8, 11, 43]. A key ingredient underlying these results are switching lemmata, =-=[5, 32, 48]-=-. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage [4, 24]. A difference between our approach and this prior art is that we do not discar... |

44 | Extended wavelets for multiple measures
- Deligiannakis, Garofalakis, et al.

44 | A switching lemma for small restrictions and lower bounds for k-DNF resolution.
- Segerlind, Buss, et al.
- 2002
Citation Context ...omial lineage is based on computational learning techniques, such as the seminal paper by Linial et al. [40], and others, [8, 11, 43]. A key ingredient underlying these results are switching lemmata, =-=[5, 32, 48]-=-. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage [4, 24]. A difference between our approach and this prior art is that we do not discar... |

41 | An introduction to ULDBs and the Trio system
- Benjelloun, Sarma, et al.
- 2006
Citation Context ...set. 1. INTRODUCTION In probabilistic databases, lineage is fundamental to processing probabilistic queries and understanding the data. Many state-ofthe-art systems use a complete approach, e.g. Trio =-=[7]-=- or Mystiq [16, 46], in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In this paper, we observe that for many applications, it is often unnecessary for th... |

40 |
A SAGE approach to discovery of genes involved in autophagic cell death.
- Gorski, Chittaranjan, et al.
- 2003
Citation Context ...rticular annotation. Fig. 1 illustrates such a database. Example 1.1 A statement derivable from GO is, “Dr. X claimed in PubMed PMID:12593804 that the gene Argonaute2 (AGO2) is involved in cell death”=-=[26]-=-. In our model, one way to view this is that there is a fact, the gene Argonaute2 is involved in cell death and there is an atom, Dr. X made the claim in PubMed PMID:12593804. If we trust Dr. X, then ... |

38 |
SPARTAN: A Model-Based Semantic Compression System for Massive Data Tables
- Babu, Garofalakis, et al.
- 2001
Citation Context ... 11, 43]. A key ingredient underlying these results are switching lemmata, [5, 32, 48]. So far, learning techniques have only been applied to compressing the data, but have not compressed the lineage =-=[4, 24]-=-. A difference between our approach and this prior art is that we do not discard any tuples, but may discard lineage. Explanation is a well-studied topic in the Artificial Intelligence community, see ... |

35 | Orchestra: facilitating collaborative data sharing.
- Green, Karvounarakis, et al.
- 2007
Citation Context ...ive time performance gain. In this example both running times scale approximately with the size of compression. 6. RELATED WORK Lineage systems and provenance are important topics in data management =-=[12, 13, 17, 28]-=-. Compressing lineage is cited as an important technique for scaling these systems [13]. Of these, only [30] considers probabilistic data, but not approximate semantics. There is a long, successful line ... |

33 | Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization
- Ré, Suciu
- 2007
Citation Context ... the top-k query probabilities from the Level II database using sufficient lineage. Approximate lineage is used to materialize views of probabilistic data; this problem has been previously considered =-=[47]-=-, but only with an exact semantics. Sen et al. [49] consider approximate processing of relational queries using graphical models, but not approximate lineage. In the graphical model literature [15, 37... |

32 | On learning monotone DNF under product distributions
- Servedio
- 2001
Citation Context ... atoms induced by the probability function p. Our goal is to ensure that the lineage function for every tuple in the database has an ε-approximation. Def. 2.10 is used in computational learning, e.g. =-=[40, 50]-=-, because an ε-approximation of a function disagrees on only a few inputs: Example 2.11 Let y1 and y2 be atoms such that p(yi) = 0.5 for i = 1, 2. Consider the lineage function for some t, λt(y1, y2) = y... |

29 | bdbms: A database management system for biological data.
- Eltabakh, Ouzzani, et al.
- 2007
Citation Context ... of the data. Probabilistic query evaluation can be reduced to calculating a single coefficient of the transform, which implies exact computation of the transform is intractable [16, 27]. Aref et al. =-=[19]-=- advocate an approach to operate directly on compressed data to optimize queries on biological sequences. However, this approach is not lineage-aware and so cannot extract explanations from the compres... |

28 |
Probabilistic Wavelet Synopses
- Garofalakis, Gibbons
- 2004

25 | On the complexity of approximating k-dimensional matching
- Hazan, Safra, et al.
- 2003
Citation Context ...λt is NP-Hard, even if k = 3. The reduction is from the problem of finding a matching in a k-uniform k-regular hypergraph. The greedy algorithm is essentially an optimal approximation for this hypergraph matching =-=[33]-=-. Since our problem appears to be more difficult, this suggests – but does not prove – that our greedy algorithm may be close to optimal. 3.3 Understanding Sufficient Lineage Both Prob. 2, finding suf... |

23 | Computational applications of noise sensitivity
- O’Donnell
- 2003
Citation Context ...for i = 1, . . . , n, $\sigma_i^2 = p(x_i)(1 - p(x_i))$ (the variance of $x_i$). Our goal is to get a small but good approximation; we make this goal precise using sparse Fourier series. (Footnote 1: For more details, see =-=[40, 43]-=-.) Definition 4.4. An s-sparse series is a Fourier series with at most s non-zero coefficients. We say λ has an (s, ε) approximation if there exists an s-sparse approximation $\tilde\lambda^P_t$ such that λt ... |

13 | Issues in Building Practical Provenance Systems
- Chapman, Jagadish
Citation Context ...value in a complete approach, we may be forced to process all the lineage for a given tuple. This is challenging, because the lineage can be very large. This problem is not unique to GO. For example, =-=[13]-=- reports that a 250MB biological database has 6GB of lineage. In this work, we show how to use approximate lineage to effectively compress the lineage more than two orders of magnitude, even for extre... |

6 | Representing uncertain data: Uniqueness, equivalence, minimization, and approximation
- Sarma, Nabar, et al.
- 2005
Citation Context ...ent lineage. Approximate lineage is used to materialize views of probabilistic data; this problem has been previously considered [47], but only with an exact semantics. The notion of approximation in =-=[48]-=- is not the same as ours: They do not consider error guarantees, nor explanations. Sen et al. [50] consider approximate processing of relational queries using graphical models, but not approximate lin... |

6 | A note on deterministic approximate counting for k-DNF
- Trevisan
- 2004
Citation Context ...s, [8, 11, 43]. A key ingredient underlying these results are switching lemmata, [5, 32, 49]. For the problem of sufficient lineage, we use use the implicit in both Segerlind et al. [49] and Trevisan =-=[54]-=- that either a few variables in a DNF matter (hit every clause) or the formula is ε large. The most famous (and sharpest) switching lemma due to Håstad [32] underlies the Fourier results. So far, lear... |

2 |
database v
- Go
Citation Context ...tude even with very low error, (Sec. 5.2), provide high quality explanations (Sec. 5.3) and provide large performance improvements (Sec. 5.4). Our experiments use data from the Gene Ontology database =-=[14, 52]-=- and a probabilistic database of IMDB [35] linked with reviews from Amazon. We discuss related work in Sec. 6 and conclude in Sec. 7. 2. STATEMENT OF RESULTS We first give some background on lineage a... |

1 |
Approximate lineage for probabilistic databases
- Ré, Suciu
- 2008
Citation Context ...ean formula ˜λ N t , such that λt =⇒ ˜λ N t is a tautology. Similar properties hold for necessary lineage, e.g. ˜µ N is an upper bound for µ, but we found that it is less well-suited for explanations =-=[46]-=-. Polynomial Lineage In contrast to both standard and sufficient lineages that map each tuple to a Boolean function, polynomial approximate lineage maps each tuple to a real-valued function. This gene... |