Results 1–10 of 41
Trio: A system for data, uncertainty, and lineage
In VLDB, 2006
Cited by 108 (5 self)
Abstract
In the Trio project at Stanford, we are building a new kind of database management system: one in which data, uncertainty of the data, and data lineage are all first-class citizens in an extended relational model and SQL-based query language.
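A minimal sketch of the kind of model the abstract describes, where uncertain alternatives, confidences, and lineage travel together through a query. The relation names, tuple contents, and the `join_on_car` helper are illustrative assumptions, not Trio's actual implementation or API.

```python
# Hypothetical uncertain relations: each x-tuple holds mutually exclusive
# alternatives, each with a confidence value.
# saw(witness, car): Amy saw either a Honda (0.6) or a Mazda (0.4).
saw = [
    {"id": "s1", "alts": [(("Amy", "Honda"), 0.6), (("Amy", "Mazda"), 0.4)]},
]
drives = [
    {"id": "d1", "alts": [(("Jimmy", "Honda"), 1.0)]},
]

def join_on_car(r, s):
    """Join two uncertain relations on the car attribute, recording
    the lineage (contributing x-tuple ids) of each derived tuple."""
    out = []
    for x in r:
        for (t1, c1) in x["alts"]:
            for y in s:
                for (t2, c2) in y["alts"]:
                    if t1[1] == t2[1]:
                        out.append({
                            "tuple": (t1[0], t2[0], t1[1]),
                            "conf": c1 * c2,  # assumes independence
                            "lineage": [x["id"], y["id"]],
                        })
    return out

suspects = join_on_car(saw, drives)
print(suspects)
# one derived tuple: ('Amy', 'Jimmy', 'Honda'), confidence 0.6,
# with lineage pointing back to x-tuples s1 and d1
```

The point is that the result is not just data: each answer tuple carries both a confidence and a record of which base tuples produced it.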
Conditioning Probabilistic Databases
Cited by 65 (13 self)
Abstract
Past research on probabilistic databases has studied the problem of answering queries on a static database. Application scenarios of probabilistic databases, however, often involve the conditioning of a database using additional information in the form of new evidence. The conditioning problem is thus to transform a probabilistic database of priors into a posterior probabilistic database which is materialized for subsequent query processing or further refinement. It turns out that the conditioning problem is closely related to the problem of computing exact tuple confidence values. It is known that exact confidence computation is an NP-hard problem. This has led researchers to consider approximation techniques for confidence computation. However, neither conditioning nor exact confidence computation can be solved using such techniques. In this paper we present efficient techniques for both problems. We study several problem decomposition methods and heuristics that are based on the most successful search techniques from constraint satisfaction, such as the variable elimination rule of the Davis-Putnam algorithm. We complement this with a thorough experimental evaluation of the algorithms proposed. Our experiments show that our exact algorithms scale well to realistic database sizes and can in some scenarios compete with the most efficient previous approximation algorithms.
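The conditioning problem itself is easy to state on a toy scale: restrict the set of possible worlds to those satisfying the evidence and renormalize. The following brute-force enumeration is a sketch of the semantics only (the paper's contribution is doing this efficiently via decomposition, not by enumeration); the variables and probabilities are made up for illustration.

```python
from itertools import product

# Two independent Boolean tuple variables with prior probabilities.
priors = {"x": 0.7, "y": 0.5}

def worlds():
    """Enumerate all possible worlds (truth assignments)."""
    names = sorted(priors)
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))

def world_prob(world):
    p = 1.0
    for v, prob in priors.items():
        p *= prob if world[v] else (1.0 - prob)
    return p

def confidence(formula, evidence=lambda w: True):
    """P(formula | evidence): sum satisfying worlds, renormalize."""
    num = sum(world_prob(w) for w in worlds() if evidence(w) and formula(w))
    den = sum(world_prob(w) for w in worlds() if evidence(w))
    return num / den

# Prior confidence of the tuple with lineage (x OR y): 1 - 0.3*0.5 = 0.85
prior = confidence(lambda w: w["x"] or w["y"])
# Condition on new evidence "not (x and y)", then re-ask the same query.
post = confidence(lambda w: w["x"] or w["y"],
                  evidence=lambda w: not (w["x"] and w["y"]))
print(prior, post)
```

Conditioning here is exactly the renormalization step; the exponential enumeration is what the paper's decomposition methods avoid.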
Community information management
2006
Cited by 52 (11 self)
Abstract
We introduce Cimple, a joint project between the University of Illinois and the University of Wisconsin. Cimple aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. We first describe the envisioned working of such a software platform and our prototype, DBLife, which is a community portal being developed for the database research community. We then describe the technical challenges in Cimple and our solution approach. Finally, we discuss managing uncertainty and provenance, a crucial task in making our software platform practical.
From Complete to Incomplete Information and Back
 In Proc. SIGMOD
Cited by 37 (11 self)
Abstract
Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a non-standard semantics such as certain or ranked possible answers. There are now also languages with special features to deal with uncertainty. However, to the standards of the data management community, to date no language proposal has been made that can be considered a natural analog to SQL or relational algebra for the case of incomplete information. In this paper we propose such a language, World-set Algebra, which satisfies the robustness criteria and analogies to relational algebra that we expect. The language supports the contemplation of alternatives and can thus map from a complete database to an incomplete one comprising several possible worlds. We show that World-set Algebra is conservative over relational algebra in the sense that any query that maps from a complete database to a complete database (a complete-to-complete query) is equivalent to a relational algebra query. Moreover, we give an efficient algorithm for effecting this translation. We then study algebraic query optimization of such queries. We argue that query languages with explicit constructs for handling uncertainty allow for a more natural and simple expression of many real-world decision support queries. The results of this paper not only suggest a language for specifying queries in this way, but also allow for their efficient evaluation in any relational database management system.
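The "complete to incomplete and back" idea can be sketched concretely: a choice operator maps one complete database to a set of possible worlds, ordinary operations apply world-by-world, and a certain answer is one that holds in every world. The operator names (`choice`, `select`, `certain`) are illustrative stand-ins, not the paper's formal algebra.

```python
# A world-set is represented as a list of possible worlds,
# each world being a set of tuples.

def choice(relation):
    """Complete -> incomplete: one possible world per chosen tuple."""
    return [{t} for t in relation]

def select(world_set, pred):
    """Apply an ordinary selection worldwise."""
    return [{t for t in w if pred(t)} for w in world_set]

def certain(world_set):
    """Certain answers: tuples present in every possible world."""
    worlds = [set(w) for w in world_set]
    return set.intersection(*worlds) if worlds else set()

r = {("a", 1), ("b", 2), ("a", 3)}
ws = select(choice(r), lambda t: t[0] == "a")  # three worlds, filtered
print(certain(ws))   # nothing is certain: the worlds disagree
print(certain([{("a", 1), ("b", 2)}, {("a", 1)}]))  # ('a', 1) is certain
```

A complete-to-complete query is one whose final world-set always collapses back to a single world, which is where the paper's conservativity result over relational algebra applies.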
Approximate Lineage for Probabilistic Databases
Cited by 34 (9 self)
Abstract
In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems such as Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, a much smaller formula that keeps track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage, formally prove their error bounds, and validate our approach experimentally on a real data set.
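Sufficient lineage has a simple core intuition that can be shown on a toy example: full lineage is a monotone DNF over independent tuple variables (one conjunct per derivation), and dropping all but the dominant derivations yields a smaller formula whose probability is a guaranteed lower bound. The variables, probabilities, and brute-force evaluator below are illustrative assumptions, not the paper's algorithms.

```python
from itertools import product

# Independent base-tuple variables with their probabilities.
probs = {"a": 0.9, "b": 0.2, "c": 0.1}

def dnf_prob(dnf):
    """Exact probability of a monotone DNF, by world enumeration."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        w = dict(zip(names, bits))
        if any(all(w[v] for v in clause) for clause in dnf):
            p = 1.0
            for v in names:
                p *= probs[v] if w[v] else 1.0 - probs[v]
            total += p
    return total

full = [["a"], ["b", "c"]]   # two derivations of the same answer tuple
suff = [["a"]]               # sufficient lineage: keep only the dominant one
exact, lower = dnf_prob(full), dnf_prob(suff)
assert lower <= exact        # dropping derivations can only lose probability
print(exact, lower)          # 0.902 vs. the lower bound 0.9
```

Here the single retained derivation already captures almost all of the confidence, which is the compression opportunity the paper exploits.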
MayBMS: A System for Managing Large Uncertain and Probabilistic Databases
In Managing and Mining Uncertain Data, chapter 6, 2008
Cited by 34 (4 self)
Abstract
MayBMS is a state-of-the-art probabilistic database management system that has been built as an extension of Postgres, an open-source relational database management system. MayBMS follows a principled approach to leveraging the strengths of previous database research for achieving scalability. This article describes the main goals of this project, the design of its query and update languages, efficient exact and approximate query processing, and algorithmic and systems aspects.
Perm: Processing Provenance and Data on the same Data Model through Query Rewriting
In ICDE ’09: Proceedings of the 25th International Conference on Data Engineering, 2009
Cited by 32 (8 self)
Abstract
Data provenance is information that describes how a given data item was produced. The provenance includes source and intermediate data as well as the transformations involved in producing the concrete data item. In the context of relational databases, the source and intermediate data items are relations, tuples and attribute values. The transformations are SQL queries and/or functions on the relational data items. Existing approaches capture provenance information by extending the underlying data model. This has the intrinsic disadvantage that the provenance must be stored and accessed using a different model than the actual data. In this paper, we present an alternative approach that uses query rewriting to annotate result tuples with provenance information. The rewritten query and its result use the same model and can thus be queried, stored and optimized using standard relational database techniques. In the paper we formalize the query rewriting procedures, prove their correctness, and evaluate a first implementation of the ideas using PostgreSQL. As the experiments indicate, our approach efficiently provides provenance information, inducing only a small overhead on normal operations.
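The rewriting idea can be demonstrated with plain SQL: instead of a separate provenance model, the rewritten query simply carries the contributing source tuples as extra columns of each result row, so provenance lives in the same relational model as the data. The schema, data, and the hand-written rewrite below are illustrative, not Perm's actual rewrite rules.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emp(name TEXT, dept TEXT);
    CREATE TABLE dept(dept TEXT, city TEXT);
    INSERT INTO emp VALUES ('ann','db'),('bob','os');
    INSERT INTO dept VALUES ('db','Zurich'),('os','Berlin');
""")

# Original query: which employee works in which city?
original = "SELECT e.name, d.city FROM emp e JOIN dept d ON e.dept = d.dept"

# Rewritten query: same join, but each answer tuple is annotated with the
# source tuples that produced it (prov_* columns), in the same data model.
rewritten = """
    SELECT e.name, d.city,
           e.name AS prov_emp_name, e.dept AS prov_emp_dept,
           d.dept AS prov_dept_dept, d.city AS prov_dept_city
    FROM emp e JOIN dept d ON e.dept = d.dept
"""
rows = con.execute(rewritten).fetchall()
print(rows)
```

Because the annotated result is an ordinary relation, it can itself be filtered, joined, stored, and optimized by the unmodified query engine, which is the key design point of the paper.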
Approximate Confidence Computation in Probabilistic Databases
Cited by 28 (5 self)
Abstract
This paper introduces a deterministic approximation algorithm with error guarantees for computing the probability of propositional formulas over discrete random variables. The algorithm is based on an incremental compilation of formulas into decision diagrams using three types of decompositions: Shannon expansion, independence partitioning, and product factorization. With each decomposition step, lower and upper bounds on the probability of the partially compiled formula can be quickly computed and checked against the allowed error. This algorithm can be effectively used to compute approximate confidence values of answer tuples to positive relational algebra queries on general probabilistic databases (c-tables with discrete probability distributions). We further tune our algorithm so as to capture all known tractable conjunctive queries without self-joins on tuple-independent probabilistic databases: in this case, the algorithm requires time polynomial in the input size even for exact computation. We implemented the algorithm as an extension of the SPROUT query engine. An extensive experimental effort shows that it consistently outperforms state-of-the-art approximation techniques by several orders of magnitude.
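Shannon expansion, the first of the three decompositions, can be sketched directly on a monotone DNF over independent Boolean variables, together with trivially cheap probability bounds. This is illustrative only: the paper's algorithm interleaves all three decompositions on decision diagrams and refines its bounds incrementally, whereas the bounds below are just the best single clause and the union bound.

```python
import math

probs = {"x": 0.5, "y": 0.4, "z": 0.3}   # independent tuple variables

def prob(dnf):
    """Exact P(dnf) by recursive Shannon expansion (clauses are sets)."""
    if any(len(c) == 0 for c in dnf):
        return 1.0                        # an empty clause is already true
    if not dnf:
        return 0.0                        # no clause left: formula is false
    v = next(iter(dnf[0]))                # branch on some variable v
    pos = [c - {v} for c in dnf]          # cofactor with v = true
    neg = [c for c in dnf if v not in c]  # cofactor with v = false
    return probs[v] * prob(pos) + (1 - probs[v]) * prob(neg)

def bounds(dnf):
    """Cheap lower/upper bounds without full compilation."""
    clause_p = [math.prod(probs[v] for v in c) for c in dnf]
    return max(clause_p), min(1.0, sum(clause_p))

lineage = [{"x", "y"}, {"y", "z"}]        # two derivations sharing y
lo, hi = bounds(lineage)
p = prob(lineage)
assert lo <= p <= hi
print(lo, p, hi)
```

When the gap between the bounds already fits within the allowed error, an approximation algorithm of this flavor can stop before compiling the formula completely.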
Trio-One: Layering uncertainty and lineage on a conventional DBMS
In Proc. of Conference on Innovative Data Systems Research (CIDR), 2007
Cited by 25 (10 self)
Abstract
Trio is a new kind of database system that supports data, uncertainty, and lineage in a fully integrated manner. The first Trio prototype, dubbed Trio-One, is built on top of a conventional DBMS using data and query translation techniques together with a small number of stored procedures. This paper describes Trio-One’s translation scheme and system architecture, showing how it efficiently and easily supports the Trio data model and query language.
Anonymized data: generation, models, usage
In 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2009
Cited by 12 (5 self)
Abstract
Data anonymization techniques have been the subject of intense investigation in recent years, for many kinds of structured data, including tabular, item set and graph data. They enable publication of detailed information, which permits ad hoc queries and analyses, while guaranteeing the privacy of sensitive information in the data against a variety of attacks. In this tutorial, we aim to present a unified framework of data anonymization techniques, viewed through the lens of data uncertainty. Essentially, anonymized data describes a set of possible worlds, one of which corresponds to the original data. We show that anonymization approaches such as suppression, generalization, perturbation and permutation generate different working models of uncertain data, some of which have been well studied, while others open new directions for research. We demonstrate that the privacy guarantees offered by methods such as k-anonymization and ℓ-diversity can be naturally understood in terms of similarities and differences in the sets of possible worlds that correspond to the anonymized data. We describe how the body of work in query evaluation over uncertain databases can be used for answering ad hoc queries over anonymized data in a principled manner. A key benefit of the unified approach is the identification of a rich set of new problems for both the Data Anonymization and the Uncertain Data communities.
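The tutorial's central correspondence can be sketched on a toy generalized table: replacing a value with a range makes the anonymized relation describe a set of possible worlds, one per consistent choice of concrete values, and a query answer is certain only if it holds in every world. The table contents and the `possible_worlds` helper are illustrative assumptions.

```python
from itertools import product

# A generalized table: ages have been coarsened to the range [20, 22).
anonymized = [("Ann", range(20, 22)), ("Bob", range(20, 22))]

def possible_worlds(table):
    """All concrete tables consistent with the generalized one."""
    domains = [[(name, age) for age in ages] for name, ages in table]
    return [list(w) for w in product(*domains)]

worlds = possible_worlds(anonymized)
print(len(worlds))                  # 2 * 2 = 4 possible worlds
# A certain answer must hold in all of them, e.g. "Ann is under 22":
assert all(dict(w)["Ann"] < 22 for w in worlds)
```

This is exactly the shape of input that uncertain-database query evaluation is built for, which is the bridge between the two communities that the tutorial argues for.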