Results 1–10 of 41
Trio: A system for data, uncertainty, and lineage
In VLDB, 2006
Cited by 108 (5 self)
Abstract
In the Trio project at Stanford, we are building a new kind of database management system: one in which data, uncertainty of the data, and data lineage are all first-class citizens in an extended relational model and SQL-based query language.
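A minimal sketch of the kind of model the abstract describes, where uncertain alternatives, confidences, and lineage travel together through a query. The relation names, tuple contents, and the `join_on_car` helper are illustrative assumptions, not Trio's actual implementation or API.

```python
# Hypothetical uncertain relations: each x-tuple holds mutually exclusive
# alternatives, each with a confidence value.
# saw(witness, car): Amy saw either a Honda (0.6) or a Mazda (0.4).
saw = [
    {"id": "s1", "alts": [(("Amy", "Honda"), 0.6), (("Amy", "Mazda"), 0.4)]},
]
drives = [
    {"id": "d1", "alts": [(("Jimmy", "Honda"), 1.0)]},
]

def join_on_car(r, s):
    """Join two uncertain relations on the car attribute, recording
    the lineage (contributing x-tuple ids) of each derived tuple."""
    out = []
    for x in r:
        for (t1, c1) in x["alts"]:
            for y in s:
                for (t2, c2) in y["alts"]:
                    if t1[1] == t2[1]:
                        out.append({
                            "tuple": (t1[0], t2[0], t1[1]),
                            "conf": c1 * c2,  # assumes independence
                            "lineage": [x["id"], y["id"]],
                        })
    return out

suspects = join_on_car(saw, drives)
print(suspects)
# one derived tuple: ('Amy', 'Jimmy', 'Honda'), confidence 0.6,
# with lineage pointing back to x-tuples s1 and d1
```

The point is that the result is not just data: each answer tuple carries both a confidence and a record of which base tuples produced it.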
Conditioning Probabilistic Databases
Cited by 65 (13 self)
Abstract
Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases, however, often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus to transform a probabilistic database of priors into a posterior probabilistic database which is materialized for subsequent query processing or further refinement. It turns out that the conditioning problem is closely related to the problem of computing exact tuple confidence values. It is known that exact confidence computation is an NP-hard problem. This has led researchers to consider approximation techniques for confidence computation. However, neither conditioning nor exact confidence computation can be solved using such techniques. In this paper we present efficient techniques for both problems. We study several problem decomposition methods and heuristics that are based on the most successful search techniques from constraint satisfaction, such as the variable elimination rule of the Davis-Putnam algorithm. We complement this with a thorough experimental evaluation of the algorithms proposed. Our experiments show that our exact algorithms scale well to realistic database sizes and can in some scenarios compete with the most efficient previous approximation algorithms.
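The conditioning problem itself is easy to state on a toy scale: restrict the set of possible worlds to those satisfying the evidence and renormalize. The following brute-force enumeration is a sketch of the semantics only (the paper's contribution is doing this efficiently via decomposition, not by enumeration); the variables and probabilities are made up for illustration.

```python
from itertools import product

# Two independent Boolean tuple variables with prior probabilities.
priors = {"x": 0.7, "y": 0.5}

def worlds():
    """Enumerate all possible worlds (truth assignments)."""
    names = sorted(priors)
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))

def world_prob(world):
    p = 1.0
    for v, prob in priors.items():
        p *= prob if world[v] else (1.0 - prob)
    return p

def confidence(formula, evidence=lambda w: True):
    """P(formula | evidence): sum satisfying worlds, renormalize."""
    num = sum(world_prob(w) for w in worlds() if evidence(w) and formula(w))
    den = sum(world_prob(w) for w in worlds() if evidence(w))
    return num / den

# Prior confidence of the tuple with lineage (x OR y): 1 - 0.3*0.5 = 0.85
prior = confidence(lambda w: w["x"] or w["y"])
# Condition on new evidence "not (x and y)", then re-ask the same query.
post = confidence(lambda w: w["x"] or w["y"],
                  evidence=lambda w: not (w["x"] and w["y"]))
print(prior, post)
```

Conditioning here is exactly the renormalization step; the exponential enumeration is what the paper's decomposition methods avoid.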
Community information management
2006
Cited by 52 (11 self)
Abstract
We introduce Cimple, a joint project between the University of Illinois and the University of Wisconsin. Cimple aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. We first describe the envisioned working of such a software platform and our prototype, DBLife, which is a community portal being developed for the database research community. We then describe the technical challenges in Cimple and our solution approach. Finally, we discuss managing uncertainty and provenance, a crucial task in making our software platform practical.
From Complete to Incomplete Information and Back
 In Proc. SIGMOD
Cited by 37 (11 self)
Abstract
Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a non-standard semantics such as certain or ranked possible answers. There are now also languages with special features to deal with uncertainty. However, to the standards of the data management community, to date no language proposal has been made that can be considered a natural analog to SQL or relational algebra for the case of incomplete information. In this paper we propose such a language, World-set Algebra, which satisfies the robustness criteria and analogies to relational algebra that we expect. The language supports the contemplation of alternatives and can thus map from a complete database to an incomplete one comprising several possible worlds. We show that World-set Algebra is conservative over relational algebra in the sense that any query that maps from a complete database to a complete database (a complete-to-complete query) is equivalent to a relational algebra query. Moreover, we give an efficient algorithm for effecting this translation. We then study algebraic query optimization of such queries. We argue that query languages with explicit constructs for handling uncertainty allow for a more natural and simple expression of many real-world decision support queries. The results of this paper not only suggest a language for specifying queries in this way, but also allow for their efficient evaluation in any relational database management system.
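The "complete to incomplete and back" idea can be sketched concretely: a choice operator maps one complete database to a set of possible worlds, ordinary operations apply world-by-world, and a certain answer is one that holds in every world. The operator names (`choice`, `select`, `certain`) are illustrative stand-ins, not the paper's formal algebra.

```python
# A world-set is represented as a list of possible worlds,
# each world being a set of tuples.

def choice(relation):
    """Complete -> incomplete: one possible world per chosen tuple."""
    return [{t} for t in relation]

def select(world_set, pred):
    """Apply an ordinary selection worldwise."""
    return [{t for t in w if pred(t)} for w in world_set]

def certain(world_set):
    """Certain answers: tuples present in every possible world."""
    worlds = [set(w) for w in world_set]
    return set.intersection(*worlds) if worlds else set()

r = {("a", 1), ("b", 2), ("a", 3)}
ws = select(choice(r), lambda t: t[0] == "a")  # three worlds, filtered
print(certain(ws))   # nothing is certain: the worlds disagree
print(certain([{("a", 1), ("b", 2)}, {("a", 1)}]))  # ('a', 1) is certain
```

A complete-to-complete query is one whose final world-set always collapses back to a single world, which is where the paper's conservativity result over relational algebra applies.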
Approximate Lineage for Probabilistic Databases
Cited by 34 (9 self)
Abstract
In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems such as Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, a much smaller formula that keeps track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage, formally prove their error bounds, and validate our approach experimentally on a real data set.
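Sufficient lineage has a simple core intuition that can be shown on a toy example: full lineage is a monotone DNF over independent tuple variables (one conjunct per derivation), and dropping all but the dominant derivations yields a smaller formula whose probability is a guaranteed lower bound. The variables, probabilities, and brute-force evaluator below are illustrative assumptions, not the paper's algorithms.

```python
from itertools import product

# Independent base-tuple variables with their probabilities.
probs = {"a": 0.9, "b": 0.2, "c": 0.1}

def dnf_prob(dnf):
    """Exact probability of a monotone DNF, by world enumeration."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        w = dict(zip(names, bits))
        if any(all(w[v] for v in clause) for clause in dnf):
            p = 1.0
            for v in names:
                p *= probs[v] if w[v] else 1.0 - probs[v]
            total += p
    return total

full = [["a"], ["b", "c"]]   # two derivations of the same answer tuple
suff = [["a"]]               # sufficient lineage: keep only the dominant one
exact, lower = dnf_prob(full), dnf_prob(suff)
assert lower <= exact        # dropping derivations can only lose probability
print(exact, lower)          # 0.902 vs. the lower bound 0.9
```

Here the single retained derivation already captures almost all of the confidence, which is the compression opportunity the paper exploits.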
MayBMS: A System for Managing Large Uncertain and Probabilistic Databases
In Managing and Mining Uncertain Data, chapter 6, 2008
Cited by 34 (4 self)
Abstract
MayBMS is a state-of-the-art probabilistic database management system that has been built as an extension of Postgres, an open-source relational database management system. MayBMS follows a principled approach to leveraging the strengths of previous database research for achieving scalability. This article describes the main goals of this project, the design of its query and update languages, efficient exact and approximate query processing, and algorithmic and systems aspects.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting
In ICDE ’09: Proceedings of the 25th International Conference on Data Engineering, 2009
Cited by 32 (8 self)
Abstract
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can thus be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, inducing only a small overhead on normal operations.
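The rewriting idea can be demonstrated with plain SQL: instead of a separate provenance model, the rewritten query simply carries the contributing source tuples as extra columns of each result row, so provenance lives in the same relational model as the data. The schema, data, and the hand-written rewrite below are illustrative, not Perm's actual rewrite rules.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp(name TEXT, dept TEXT);
    CREATE TABLE dept(dept TEXT, city TEXT);
    INSERT INTO emp VALUES ('ann','db'),('bob','os');
    INSERT INTO dept VALUES ('db','Zurich'),('os','Berlin');
""")

# Original query: which employee works in which city?
original = "SELECT e.name, d.city FROM emp e JOIN dept d ON e.dept = d.dept"

# Rewritten query: same join, but each answer tuple is annotated with the
# source tuples that produced it (prov_* columns), in the same data model.
rewritten = """
    SELECT e.name, d.city,
           e.name AS prov_emp_name, e.dept AS prov_emp_dept,
           d.dept AS prov_dept_dept, d.city AS prov_dept_city
    FROM emp e JOIN dept d ON e.dept = d.dept
"""
rows = con.execute(rewritten).fetchall()
print(rows)
```

Because the annotated result is an ordinary relation, it can itself be filtered, joined, stored, and optimized by the unmodified query engine, which is the key design point of the paper.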
Approximate Confidence Computation in Probabilistic Databases
Cited by 28 (5 self)
Abstract
This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization. With each decomposition step, lower and upper bounds on the probability of the partially compiled formula can be quickly computed and checked against the allowed error. This algorithm can be effectively used to compute approximate confidence values of answer tuples to positive relational algebra queries on general probabilistic databases (c-tables with discrete probability distributions). We further tune our algorithm so as to capture all known tractable conjunctive queries without self-joins on tuple-independent probabilistic databases: in this case, the algorithm requires time polynomial in the input size even for exact computation. We implemented the algorithm as an extension of the SPROUT query engine. An extensive experimental effort shows that it consistently outperforms state-of-the-art approximation techniques by several orders of magnitude.
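Shannon expansion, the first of the three decompositions, can be sketched directly on a monotone DNF over independent Boolean variables, together with trivially cheap probability bounds. This is illustrative only: the paper's algorithm interleaves all three decompositions on decision diagrams and refines its bounds incrementally, whereas the bounds below are just the best single clause and the union bound.

```python
import math

probs = {"x": 0.5, "y": 0.4, "z": 0.3}   # independent tuple variables

def prob(dnf):
    """Exact P(dnf) by recursive Shannon expansion (clauses are sets)."""
    if any(len(c) == 0 for c in dnf):
        return 1.0                        # an empty clause is already true
    if not dnf:
        return 0.0                        # no clause left: formula is false
    v = next(iter(dnf[0]))                # branch on some variable v
    pos = [c - {v} for c in dnf]          # cofactor with v = true
    neg = [c for c in dnf if v not in c]  # cofactor with v = false
    return probs[v] * prob(pos) + (1 - probs[v]) * prob(neg)

def bounds(dnf):
    """Cheap lower/upper bounds without full compilation."""
    clause_p = [math.prod(probs[v] for v in c) for c in dnf]
    return max(clause_p), min(1.0, sum(clause_p))

lineage = [{"x", "y"}, {"y", "z"}]        # two derivations sharing y
lo, hi = bounds(lineage)
p = prob(lineage)
assert lo <= p <= hi
print(lo, p, hi)
```

When the gap between the bounds already fits within the allowed error, an approximation algorithm of this flavor can stop before compiling the formula completely.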
Trio-One: Layering uncertainty and lineage on a conventional DBMS
In Proc. of Conference on Innovative Data Systems Research (CIDR), 2007
Cited by 25 (10 self)
Abstract
Trio is a new kind of database system that supports data, uncertainty, and lineage in a fully integrated manner. The first Trio prototype, dubbed Trio-One, is built on top of a conventional DBMS using data and query translation techniques together with a small number of stored procedures. This paper describes Trio-One’s translation scheme and system architecture, showing how it efficiently and easily supports the Trio data model and query language.
Anonymized data: generation, models, usage
In 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2009
Cited by 12 (5 self)
Abstract
Data anonymization techniques have been the subject of intense investigation in recent years, for many kinds of structured data, including tabular, item set and graph data. They enable publication of detailed information, which permits ad hoc queries and analyses, while guaranteeing the privacy of sensitive information in the data against a variety of attacks. In this tutorial, we aim to present a unified framework of data anonymization techniques, viewed through the lens of data uncertainty. Essentially, anonymized data describes a set of possible worlds, one of which corresponds to the original data. We show that anonymization approaches such as suppression, generalization, perturbation and permutation generate different working models of uncertain data, some of which have been well studied, while others open new directions for research. We demonstrate that the privacy guarantees offered by methods such as k-anonymization and ℓ-diversity can be naturally understood in terms of similarities and differences in the sets of possible worlds that correspond to the anonymized data. We describe how the body of work in query evaluation over uncertain databases can be used for answering ad hoc queries over anonymized data in a principled manner. A key benefit of the unified approach is the identification of a rich set of new problems for both the Data Anonymization and the Uncertain Data communities.
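The tutorial's central correspondence can be sketched on a toy generalized table: replacing a value with a range makes the anonymized relation describe a set of possible worlds, one per consistent choice of concrete values, and a query answer is certain only if it holds in every world. The table contents and the `possible_worlds` helper are illustrative assumptions.

```python
from itertools import product

# A generalized table: ages have been coarsened to the range [20, 22).
anonymized = [("Ann", range(20, 22)), ("Bob", range(20, 22))]

def possible_worlds(table):
    """All concrete tables consistent with the generalized one."""
    domains = [[(name, age) for age in ages] for name, ages in table]
    return [list(w) for w in product(*domains)]

worlds = possible_worlds(anonymized)
print(len(worlds))                  # 2 * 2 = 4 possible worlds
# A certain answer must hold in all of them, e.g. "Ann is under 22":
assert all(dict(w)["Ann"] < 22 for w in worlds)
```

This is exactly the shape of input that uncertain-database query evaluation is built for, which is the bridge between the two communities that the tutorial argues for.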