Results 1 - 10 of 15
MCDB: a Monte Carlo approach to managing uncertain data, 2008
"... To deal with data uncertainty, existing probabilistic database sys-tems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system’s ability to gracefully handle complex or unforese ..."
Abstract
-
Cited by 110 (3 self)
To deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system’s ability to gracefully handle complex or unforeseen types of uncertainty, and does not permit the uncertainty model to be dynamically parameterized according to the current state of the database. We introduce MCDB, a system for managing uncertain data that is based on a Monte Carlo approach. MCDB represents uncertainty via “VG functions,” which are used to pseudorandomly generate realized values for uncertain attributes. VG functions can be parameterized on the results of SQL queries over “parameter tables” that are stored in the database, facilitating what-if analyses. By storing parameters, and not probabilities, and by estimating, rather than exactly computing, the probability distribution over possible query answers, MCDB avoids many of the limitations of prior systems. For example, MCDB can easily handle arbitrary joint probability distributions over discrete or continuous attributes, arbitrarily complex SQL queries, and arbitrary functionals of the query-result distribution such as means, variances, and quantiles. To achieve good performance, MCDB uses novel query processing techniques, executing a query plan exactly once, but over “tuple bundles” instead of ordinary tuples. Experiments indicate that our enhanced functionality can be obtained with acceptable overheads relative to traditional systems.
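A minimal sketch of the Monte Carlo idea described in this abstract, assuming a toy setting: the VG function, the parameter "table", and the revenue query below are illustrative inventions, not MCDB's actual interface. It shows how repeatedly generating realized values for an uncertain attribute and re-running a deterministic query yields an estimated distribution over query answers.

```python
import random
import statistics

# Illustrative "VG function": pseudorandomly generates a realized value for
# an uncertain attribute from stored parameters (here, a Gaussian).
def vg_normal(mean, stddev, rng):
    return rng.gauss(mean, stddev)

# Parameter "table": per-customer mean/stddev for an uncertain revenue attribute.
params = {"c1": (100.0, 15.0), "c2": (250.0, 40.0), "c3": (80.0, 10.0)}

def run_query(instance):
    # The deterministic query over one realized instance: total revenue.
    return sum(instance.values())

# Monte Carlo loop: generate many possible worlds and estimate functionals
# of the query-answer distribution (mean and 95th percentile).
rng = random.Random(42)
answers = []
for _ in range(1000):
    world = {cust: vg_normal(m, s, rng) for cust, (m, s) in params.items()}
    answers.append(run_query(world))

answers.sort()
print("estimated mean:", statistics.mean(answers))
print("estimated 95th percentile:", answers[int(0.95 * len(answers)) - 1])
```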
Database Support for Probabilistic Attributes and Tuples - In IEEE 24th Intl. Conference on Data Engineering, 2008
"... Abstract — The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivate the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous an ..."
Abstract
-
Cited by 33 (6 self)
The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivates the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for handling arbitrary probabilistic uncertain data (both discrete and continuous) natively at the database level. Our approach leads to a natural and efficient representation for probabilistic data. We develop a model that is consistent with possible worlds semantics and closed under basic relational operators. This is the first model that accurately and efficiently handles both continuous and discrete uncertainty. The model is implemented in a real database system (PostgreSQL) and the effectiveness and efficiency of our approach are validated experimentally.
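To make the attribute-level uncertainty concrete, here is a hedged sketch (not the paper's implementation) where a tuple carries either a discrete pmf or a continuous Gaussian distribution for one attribute, and a selection predicate evaluates to a probability per tuple. The tuple names, distribution encoding, and threshold are assumptions for illustration.

```python
import math

def prob_greater(dist, threshold):
    # Probability that the uncertain attribute exceeds the threshold.
    kind = dist[0]
    if kind == "discrete":                 # ("discrete", {value: prob, ...})
        return sum(p for v, p in dist[1].items() if v > threshold)
    if kind == "gaussian":                 # ("gaussian", mean, stddev)
        _, mu, sigma = dist
        # P(X > t) = 1 - Phi((t - mu) / sigma), via the Gaussian CDF.
        return 1.0 - 0.5 * (1.0 + math.erf((threshold - mu) / (sigma * math.sqrt(2))))
    raise ValueError("unknown distribution kind")

tuples = [
    ("sensor1", ("gaussian", 20.0, 2.0)),           # continuous uncertainty
    ("sensor2", ("discrete", {18: 0.3, 22: 0.7})),  # discrete uncertainty
]

for name, dist in tuples:
    print(name, "P(temp > 21) =", round(prob_greater(dist, 21.0), 3))
```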
Data Exchange beyond Complete Data
"... In the traditional data exchange setting, source instances are restricted to be complete in the sense that every fact is either true or false in these instances. Although natural for a typical database translation scenario, this restriction is gradually becoming an impediment to the development of a ..."
Abstract
-
Cited by 24 (14 self)
In the traditional data exchange setting, source instances are restricted to be complete in the sense that every fact is either true or false in these instances. Although natural for a typical database translation scenario, this restriction is gradually becoming an impediment to the development of a wide range of applications that need to exchange objects that admit several interpretations. In particular, we are motivated by two specific applications that go beyond the usual data exchange scenario: exchanging incomplete information and exchanging knowledge bases. In this paper, we propose a general framework for data exchange that can deal with these two applications. More specifically, we address the problem of exchanging information given by representation systems, which are essentially finite descriptions of (possibly infinite) sets of complete instances. We make use of the classical semantics of mappings specified by sets of logical sentences to give a meaningful semantics to the notion of exchanging representatives, from which the standard notions of solution, space of solutions, and universal solution naturally arise. We also introduce the notion of a strong representation system for a class of mappings, which resembles the concept of a strong representation system for a query language. We show the robustness of our proposal by applying it to the two applications mentioned above: exchanging incomplete information and exchanging knowledge bases, both of which are instantiations of the exchange problem for representation systems. We study these two applications in detail, presenting results regarding expressiveness, query answering, and the complexity of computing solutions, as well as algorithms to materialize solutions.
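The following toy sketch illustrates what exchanging a representation system (here, a naive table with labeled nulls) can look like; the relation names, the single rule, and the null-naming scheme are illustrative assumptions and not the paper's formalism. The source instance is itself incomplete, and the exchange produces another incomplete (naive-table) instance representing infinitely many complete target instances.

```python
import itertools

# One source-to-target rule:  Emp(name, dept) -> Works(name, dept) AND DeptCity(dept, Z)
# where Z is a fresh labeled null. Labeled nulls are strings starting with "_".
fresh = itertools.count(1)

def exchange(emp_facts):
    works, dept_city = [], []
    for name, dept in emp_facts:
        works.append((name, dept))
        dept_city.append((dept, f"_Z{next(fresh)}"))  # invent a fresh null per fact
    return works, dept_city

# Incomplete source instance: the department of "bob" is unknown (labeled null _N1).
source = [("alice", "sales"), ("bob", "_N1")]

works, dept_city = exchange(source)
print("Works    :", works)
print("DeptCity :", dept_city)
```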
OLAP over Imprecise Data With Domain Constraints, 2007
"... Several recent works have focused on OLAP over imprecise data, where each fact can be a region, instead of a point, in a multidimensional space. They have provided a multiple-world semantics for such data, and developed efficient solutions to answer OLAP aggregation queries over the imprecise facts. ..."
Abstract
-
Cited by 8 (0 self)
Several recent works have focused on OLAP over imprecise data, where each fact can be a region, instead of a point, in a multidimensional space. They have provided a multiple-world semantics for such data, and developed efficient solutions to answer OLAP aggregation queries over the imprecise facts. These solutions, however, assume that the imprecise facts can be interpreted independently of one another, a key assumption that is often violated in practice. Indeed, imprecise facts in real-world applications are often correlated, and such correlations can be captured as domain integrity constraints (e.g., repairs with the same customer names and models took place in the same city, or a text span can refer to a person or a city, but not both). In this paper we provide a solution to answer OLAP aggregation queries over imprecise data, in the presence of such domain constraints. We first describe a relatively simple yet powerful constraint language, and define what it means to take into account such constraints in query answering. Next, we prove that OLAP queries can be answered efficiently given a database D* of fact marginals. We then exploit the regularities in the constraint space (captured in a constraint hypergraph) and the fact space to efficiently construct D*. Extensive experiments over real-world and synthetic data demonstrate the effectiveness of our approach.
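A brief sketch of why a table of fact marginals suffices for aggregation, assuming the numbers below are made up for illustration: once each imprecise fact has a marginal probability of falling into the queried cell (the role D* plays above), expected COUNT and SUM follow by linearity of expectation, with no enumeration of possible worlds.

```python
# Each fact: (measure value, P[fact lies in the queried cell "city = Austin"]).
# The marginals would come from D*; here they are hard-coded assumptions.
facts = [
    (120.0, 1.0),   # precise fact, definitely in Austin
    (80.0, 0.4),    # imprecise fact: its region overlaps Austin with marginal 0.4
    (50.0, 0.7),
]

# Linearity of expectation: no possible-world enumeration needed.
expected_count = sum(p for _, p in facts)
expected_sum = sum(v * p for v, p in facts)

print("E[COUNT] =", expected_count)   # 2.1
print("E[SUM]   =", expected_sum)     # 120 + 32 + 35 = 187.0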
GraSS: Graph Structure Summarization
"... Large graph databases are commonly collected and analyzed in numerous domains. For reasons related to either space efficiency or for privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details a ..."
Abstract
-
Cited by 5 (0 self)
Large graph databases are commonly collected and analyzed in numerous domains. For reasons related either to space efficiency or to privacy protection (e.g., in the case of social network graphs), it sometimes makes sense to replace the original graph with a summary, which removes certain details about the original graph topology. However, this summarization process leaves the database owner with the challenge of processing queries that are expressed in terms of the original graph, but are answered using the summary. In this paper, we propose a formal semantics for answering queries on summaries of graph structures. At its core, our formulation is based on a random worlds model. We show that important graph-structure queries (e.g., adjacency, degree, and eigenvector centrality) can be answered efficiently and in closed form using these semantics. Further, based on this approach to query answering, we formulate three novel graph partitioning/compression problems. We develop algorithms for finding a graph summary that least affects the accuracy of query results, and we evaluate our proposed algorithms using both real and synthetic data.
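A hedged sketch of the random-worlds idea for adjacency and degree queries, assuming a summary that keeps only the number of edges between supernodes; the group sizes and edge counts below are illustrative, and the closed-form expressions are the standard expectations over graphs consistent with those counts, not the paper's exact algorithms.

```python
from math import comb

# Summary: supernode -> its vertices, and edge counts between supernodes.
groups = {"A": {"u1", "u2", "u3"}, "B": {"v1", "v2"}}
edge_counts = {("A", "A"): 2, ("A", "B"): 3, ("B", "B"): 1}

def count(x, y):
    return edge_counts.get((x, y), edge_counts.get((y, x), 0))

def group_of(v):
    return next(g for g, members in groups.items() if v in members)

def p_adjacent(u, v):
    # Probability that edge (u, v) exists in a uniformly chosen consistent world.
    gu, gv = group_of(u), group_of(v)
    pairs = comb(len(groups[gu]), 2) if gu == gv else len(groups[gu]) * len(groups[gv])
    return count(gu, gv) / pairs

def expected_degree(u):
    g = group_of(u)
    deg = 2 * count(g, g) / len(groups[g])           # edges inside u's own group
    for other in groups:
        if other != g:
            deg += count(g, other) / len(groups[g])  # edges to other groups
    return deg

print("P[u1 ~ v1]      =", round(p_adjacent("u1", "v1"), 3))   # 3 / 6 = 0.5
print("E[degree of u1] =", round(expected_degree("u1"), 3))    # 4/3 + 1 = 2.333
```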
Efficient Recovery of Missing Events
"... For various entering and transmission issues raised by human or system, missing events often occur in event data, which record execution logs of business processes. Without recovering these missing events, applications such as provenance analysis or complex event processing built upon event data are ..."
Abstract
-
Cited by 4 (1 self)
Owing to various data-entry and transmission issues caused by humans or systems, missing events often occur in event data, which record the execution logs of business processes. Without recovering these missing events, applications such as provenance analysis or complex event processing built upon event data are not reliable. Following the minimum change discipline in improving data quality, it is also rational to find a recovery that minimally differs from the original data. Existing recovery approaches fall short of efficiency owing to enumerating and searching over all the possible sequences of events. In this paper, we study efficient techniques for recovering missing events. According to our theoretical results, the recovery problem is proved to be NP-hard. Nevertheless, we are able to concisely represent the space of event sequences in a branching framework. Advanced indexing and pruning techniques are developed to further improve the recovery efficiency. Our proposed efficient techniques make it possible to find top-k recoveries. The experimental results demonstrate that our minimum recovery approach achieves high accuracy, and significantly outperforms the state-of-the-art technique with up to 5 orders of magnitude improvement in time performance.
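To illustrate the minimum-change flavor of recovery (this is a toy sketch, not the paper's branching framework or indexing), the following assumes a process model given as a graph of allowed direct transitions and recovers missing events by inserting, between each pair of consecutive observed events, the shortest valid chain of activities. The activity names and model are invented for the example.

```python
from collections import deque

model = {  # activity -> activities that may directly follow it
    "register": ["check", "notify"],
    "check":    ["approve", "reject"],
    "approve":  ["notify"],
    "reject":   ["notify"],
    "notify":   [],
}

def shortest_chain(src, dst):
    # BFS over the transition graph; returns the path src .. dst (inclusive).
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in model.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # dst unreachable: no valid recovery between these two events

def recover(observed):
    recovered = [observed[0]]
    for a, b in zip(observed, observed[1:]):
        chain = shortest_chain(a, b)
        if chain is None:
            raise ValueError(f"no valid path from {a} to {b}")
        recovered.extend(chain[1:])  # chain[1:-1] are newly inserted events
    return recovered

print(recover(["register", "notify"]))   # direct edge exists: nothing inserted
print(recover(["register", "approve"]))  # inserts the missing "check" event
```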
Query Selectivity Estimation for Uncertain Data - In 20th Intl. Conf. on Scientific and Statistical Database Management, 2008
"... Abstract. Applications requiring the handling of uncertain data have led to the development of database management systems extending the scope of relational databases to include uncertain (probabilistic) data as a native data type. New automatic query optimizations having the ability to estimate the ..."
Abstract
-
Cited by 4 (3 self)
Applications requiring the handling of uncertain data have led to the development of database management systems that extend the scope of relational databases to include uncertain (probabilistic) data as a native data type. New automatic query optimizations, with the ability to estimate the cost of executing a given query plan as in existing databases, need to be developed. For probabilistic data this involves providing selectivity estimates that can handle multiple values for each attribute, as well as new query types with threshold values. This paper presents novel selectivity estimation functions for uncertain data and shows how these functions can be integrated into PostgreSQL to achieve query optimization for probabilistic queries over uncertain data. The proposed methods are able to handle both attribute- and tuple-uncertainty. Our experimental results show that our algorithms are efficient and give good selectivity estimates with low space-time overhead.
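A small sketch of the query type such an estimator must target, under the assumption of per-tuple discrete pmfs (the tuple ids, values, and threshold are illustrative; the paper's estimation functions would approximate this from summary statistics rather than scanning every distribution): a probabilistic threshold range query selects tuples whose probability of lying in a range is at least a threshold tau, and selectivity is the selected fraction.

```python
tuples = {  # tuple id -> discrete pmf over possible attribute values
    "t1": {10: 0.5, 20: 0.5},
    "t2": {18: 0.9, 40: 0.1},
    "t3": {5: 1.0},
}

def prob_in_range(pmf, lo, hi):
    # Probability mass the tuple places inside [lo, hi].
    return sum(p for v, p in pmf.items() if lo <= v <= hi)

def selectivity(lo, hi, tau):
    selected = [t for t, pmf in tuples.items() if prob_in_range(pmf, lo, hi) >= tau]
    return len(selected) / len(tuples), selected

print(selectivity(15, 25, 0.8))   # only t2 qualifies -> (0.333..., ['t2'])
```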
Detecting the Temporal Context of Queries
"... Abstract. Business intelligence and reporting tools rely on a database that accurately mirrors the state of the world. Yet, even if the schema and queries are constructed in exacting detail, assumptions about the data made during extraction, transformation, and schema and query creation of the repor ..."
Abstract
-
Cited by 1 (1 self)
Business intelligence and reporting tools rely on a database that accurately mirrors the state of the world. Yet, even if the schema and queries are constructed in exacting detail, assumptions about the data made during extraction, transformation, and schema and query creation of the reporting database may be (accidentally) ignored by end users, or may change as the database evolves over time. As these assumptions are typically implicit (e.g., assuming that a sales record relation is append-only), it can be hard to even detect that a mistaken assumption has been made. In this paper, we argue that such errors are consequences of unintended contextual dependence, i.e., query outputs that depend on a variable characteristic of the database. We characterize contextual dependence, and explore several strategies for efficiently detecting and quantifying the effects of contextual dependence on query outputs. We present and evaluate our findings in the context of a concrete case study: detecting temporal dependence using a database management system with versioning capabilities.
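The following is a minimal sketch of one way temporal dependence can surface (not the paper's system): run the same report query against two versions of a database and flag it when the output changes. The schema, query, and the "updated old row" scenario are assumptions made for the example.

```python
import sqlite3

def build_version(rows):
    # One in-memory snapshot standing in for a database version.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    db.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return db

report = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

v1 = build_version([("east", 100.0), ("west", 50.0)])
# A later version in which an old row was updated, violating the implicit
# append-only assumption behind the report.
v2 = build_version([("east", 80.0), ("west", 50.0), ("west", 25.0)])

r1 = v1.execute(report).fetchall()
r2 = v2.execute(report).fetchall()

if r1 != r2:
    print("temporally dependent query; outputs differ:", r1, "vs", r2)
else:
    print("output stable across versions")
```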
Continuous Probabilistic Sum Queries in Wireless Sensor Networks with Ranges
"... Data measured in wireless sensor networks are inherently imprecise, due to a number of reasons, and aggregate queries are often used to analyze the collected data in order to alleviate the impact of such imprecision. In this paper we will deal with the imprecision in the measured values explicitly ..."
Abstract
- Add to MetaCart
(Show Context)
Data measured in wireless sensor networks are inherently imprecise, due to a number of reasons, and aggregate queries are often used to analyze the collected data in order to alleviate the impact of such imprecision. In this paper we deal with the imprecision in the measured values explicitly by employing a probabilistic approach, and we focus on one particular type of aggregate query, namely the SUM query. We consider that sensors in the network may operate (all collectively at the same time) in two different modes: (1) returning a finite set of discrete values with a probability attached to each value, or (2) returning a continuous probability density function over a possibly infinite set of possible values. Our foremost concern is to present the first algorithms to efficiently compute the probabilistic SUM according to the possible worlds semantics, i.e., without any loss of information. Furthermore, we show how this query can be efficiently updated in dynamic environments where sensor values change often, and we show techniques to distribute the computation over all network nodes. Our experimental results show that processing queries in-network and incrementally, as opposed to collecting the measured values from all nodes at the base station and computing the answer centrally, can reduce the total number of messages sent by at least 50%, thus saving energy and extending the network’s lifetime, a chief concern regarding wireless sensor networks.
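A compact sketch of the discrete mode of the probabilistic SUM under possible-worlds semantics, assuming independent sensors with small discrete pmfs (the readings and probabilities below are invented): the exact SUM distribution is the convolution of the per-sensor pmfs, and the same combine step can be applied incrementally or in-network as partial results travel toward the base station. The continuous mode would convolve densities instead.

```python
def convolve(pmf_a, pmf_b):
    # Distribution of the sum of two independent discrete readings.
    out = {}
    for va, pa in pmf_a.items():
        for vb, pb in pmf_b.items():
            out[va + vb] = out.get(va + vb, 0.0) + pa * pb
    return out

# Each sensor reports a small discrete pmf over its possible readings.
sensors = [
    {10: 0.2, 12: 0.8},
    {5: 0.5, 7: 0.5},
    {0: 0.1, 3: 0.9},
]

# Probabilistic SUM under possible-world semantics: fold the pmfs together.
total = {0: 1.0}
for pmf in sensors:
    total = convolve(total, pmf)

for value, prob in sorted(total.items()):
    print(f"P(SUM = {value}) = {prob:.3f}")
```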