Results 21–30 of 91
Integrating and ranking uncertain scientific data
, 2008
"... Abstract — Mediatorbased data integration systems resolve exploratory queries by joining data elements across sources. In the presence of uncertainties, such multiple expansions can quickly lead to spurious connections and incorrect results. The BioRank project investigates formalisms for modeling ..."
Abstract

Cited by 8 (6 self)
Mediator-based data integration systems resolve exploratory queries by joining data elements across sources. In the presence of uncertainties, such multiple expansions can quickly lead to spurious connections and incorrect results. The BioRank project investigates formalisms for modeling uncertainty during scientific data integration and for ranking uncertain query results. Our motivating application is protein function prediction. In this paper we show that: (i) explicit modeling of uncertainties as probabilities increases our ability to predict less-known or previously unknown functions (though it does not improve predicting the well-known ones). This suggests that probabilistic uncertainty models offer utility for scientific knowledge discovery; (ii) small perturbations in the input probabilities tend to produce only minor changes in the quality of our result rankings. This suggests that our methods are robust against slight variations in the way uncertainties are transformed into probabilities; and (iii) several techniques allow us to evaluate our probabilistic rankings efficiently. This suggests that probabilistic query evaluation is not as hard for real-world problems as theory indicates.
Scalable Probabilistic Similarity Ranking in Uncertain Databases
"... This paper introduces a scalable approach for probabilistic topk similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutuallyexclusive. The objective is to rank the uncertain data according to their distance to a ref ..."
Abstract

Cited by 8 (5 self)
This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying the Poisson binomial recurrence technique, which has quadratic complexity. In this paper we show, both theoretically and experimentally, that our framework reduces this to linear-time complexity with the same memory requirements, facilitated by accessing the uncertain vector instances incrementally in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
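The quadratic baseline that the abstract refers to can be sketched as follows. Assuming a simplified tuple-level model (one independent existence probability per object, rather than the paper's instance-level model), the Poisson binomial recurrence yields, for each object, the probability of each ranking position:

```python
def rank_distribution(probs):
    """probs[i]: existence probability of the i-th object, sorted by
    increasing distance to the reference object (objects assumed
    independent). Returns R with R[i][j] = probability that object i
    occupies rank j+1, via the Poisson binomial recurrence."""
    n = len(probs)
    R = [[0.0] * n for _ in range(n)]
    # c[j] = P(exactly j of the objects processed so far exist)
    c = [1.0] + [0.0] * n
    for i, p in enumerate(probs):
        for j in range(i + 1):
            R[i][j] = p * c[j]  # object i exists and j closer objects exist
        # fold object i into the count distribution (in place, backwards)
        for j in range(i + 1, 0, -1):
            c[j] = c[j] * (1 - p) + c[j - 1] * p
        c[0] *= (1 - p)
    return R
```

For `probs = [0.5, 1.0]`, the second object lands at rank 1 or rank 2 with probability 0.5 each, depending on whether the closer object exists. The recurrence takes O(n²) time overall, which is exactly the complexity the paper reduces to linear.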
Faster Query Answering in Probabilistic Databases using ReadOnce Functions
"... A boolean expression is in readonce form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in readonce form can be computed in polynomial time (whereas the general problem ..."
Abstract

Cited by 8 (2 self)
A boolean expression is in read-once form if each of its variables appears exactly once. When the variables denote independent events in a probability space, the probability of the event denoted by the whole expression in read-once form can be computed in polynomial time (whereas the general problem for arbitrary expressions is #P-complete). Known approaches to checking the read-once property seem to require putting these expressions in disjunctive normal form. In this paper, we tell a better story for a large subclass of boolean event expressions: those generated by conjunctive queries without self-joins on tuple-independent probabilistic databases. We first show that, given a tuple-independent representation and the provenance graph of an SPJ query plan without self-joins, we can efficiently compute the co-occurrence graph of a result event expression without using its DNF. From this, the read-once form, if it exists, can already be computed efficiently using existing techniques. Our second and key contribution is a complete, efficient, and simple-to-implement algorithm for computing read-once forms (whenever they exist) directly, using a new concept, the cotable graph, which can be significantly smaller than the co-occurrence graph.
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
, 2012
"... In this paper, we investigate a novel approach of computing confidence bounds for topk ranking queries in probabilistic databases with nonmaterialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the topranked query answers according to their marginal probabil ..."
Abstract

Cited by 8 (4 self)
In this paper, we investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules, where the rules themselves may also be uncertain, i.e., valid with some degree of confidence. To our knowledge, this work is the first to address integrated data and confidence computations in the context of probabilistic databases by considering confidence bounds over partially evaluated query answers with first-order lineage formulas. We further extend our query processing techniques with a tool suite of scheduling strategies based on selectivity estimation and the expected impact of subgoals on the final confidence of answer candidates. Experiments with large datasets demonstrate drastic runtime improvements over both sampling- and decomposition-based methods.
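The core pruning idea, stated here in a deliberately simplified form (the paper's algorithm interleaves this with incremental confidence computation over first-order lineage), is that an answer candidate can be discarded as soon as its upper confidence bound drops below the k-th largest lower bound:

```python
def prune_candidates(bounds, k):
    """bounds: dict mapping answer candidates to (lo, hi) bounds on
    their marginal probability. Returns the candidates that provably
    cannot be in the top-k and need no further evaluation: those whose
    upper bound falls below the k-th largest lower bound."""
    lowers = sorted((lo for lo, _ in bounds.values()), reverse=True)
    if len(lowers) < k:
        return set()  # fewer than k candidates: nothing can be pruned
    kth_lower = lowers[k - 1]
    return {a for a, (lo, hi) in bounds.items() if hi < kth_lower}
```

With bounds {a: (0.9, 0.95), b: (0.5, 0.6), c: (0.1, 0.4)} and k = 2, candidate c is pruned: even its best case (0.4) cannot beat the second-best guaranteed confidence (0.5).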
Creating Probabilistic Databases from Imprecise Time-Series Data
"... Abstract—Although efficient processing of probabilistic databases is a wellestablished field, a wide range of applications are still unable to benefit from these techniques due to the lack of means for creating probabilistic databases. In fact, it is a challenging problem to associate concrete prob ..."
Abstract

Cited by 8 (6 self)
Although efficient processing of probabilistic databases is a well-established field, a wide range of applications are still unable to benefit from these techniques due to the lack of means for creating probabilistic databases. In fact, it is a challenging problem to associate concrete probability values with given time-series data to form a probabilistic database, since the probability distributions used for deriving such probability values vary over time. In this paper, we propose a novel approach to creating tuple-level probabilistic databases from (imprecise) time-series data. To the best of our knowledge, this is the first work that introduces a generic solution for creating probabilistic databases from arbitrary time series, which can work in online as well as offline fashion. Our approach consists of two key components. First, dynamic density metrics infer time-dependent probability distributions for time series, based on various mathematical models. Our main metric, called the GARCH metric, can robustly capture such evolving probability distributions regardless of the presence of erroneous values in a given time series. Second, the Ω–View builder creates probabilistic databases from the probability distributions inferred by the dynamic density metrics. For efficient processing, we introduce the σ–cache, which reuses information derived from probability values generated at previous times. Extensive experiments over real datasets demonstrate the effectiveness of our approach.
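The view-building step can be illustrated on a toy scale: take the density a dynamic density metric infers for one time point (here a plain Gaussian stands in for what the GARCH metric would produce) and discretize it into mutually exclusive value/probability alternatives for one tuple of a probabilistic table. The bin edges and function names below are illustrative, not the paper's:

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def tuple_alternatives(mu, sigma, edges):
    """Discretize an inferred density N(mu, sigma^2) over the given
    bin edges into (representative value, probability) alternatives,
    which together form one uncertain tuple of a tuple-level
    probabilistic database."""
    alts = []
    for lo, hi in zip(edges, edges[1:]):
        pr = gaussian_cdf(hi, mu, sigma) - gaussian_cdf(lo, mu, sigma)
        alts.append(((lo + hi) / 2.0, pr))
    return alts
```

Bins symmetric around the mean split the probability mass evenly; in an online setting, the cache described in the abstract would avoid recomputing distribution mass that is unchanged from the previous time step.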
A Temporal-Probabilistic Database Model for Information Extraction
"... Temporal annotations of facts are a key component both for building a highaccuracy knowledge base and for answering queries over the resulting temporal knowledge base with high precision and recall. In this paper, we present a temporalprobabilistic database model for cleaning uncertain temporal fac ..."
Abstract

Cited by 7 (2 self)
Temporal annotations of facts are a key component both for building a high-accuracy knowledge base and for answering queries over the resulting temporal knowledge base with high precision and recall. In this paper, we present a temporal-probabilistic database model for cleaning uncertain temporal facts obtained from information extraction methods. Specifically, we consider a combination of temporal deduction rules, temporal consistency constraints, and probabilistic inference based on the common possible-worlds semantics with data lineage, and we study the theoretical properties of this data model. We further develop a query engine that is capable of scaling to very large temporal knowledge bases, with nearly interactive query response times over millions of uncertain facts and hundreds of thousands of grounded rules. Our experiments over two real-world datasets demonstrate the increased robustness of our approach compared to related techniques based on constraint solving via Integer Linear Programming (ILP) and probabilistic inference via Markov Logic Networks (MLNs). We are also able to show that our runtime performance is more than competitive with current ILP solvers and the fastest available probabilistic (but non-temporal) database engines.
Histograms and wavelets on probabilistic data
In ICDE
, 2009
"... There is a growing realization that uncertain information is a firstclass citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses ..."
Abstract

Cited by 7 (1 self)
There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic-relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar-wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of error metrics, we devise efficient algorithms that construct optimal or near-optimal size-B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
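The deterministic skeleton of such a construction is the classic V-optimal histogram dynamic program; the paper's contribution lies in extending this kind of recurrence to probabilistic error metrics, but the deterministic version (sketched below on a sequence of per-tuple expected values) shows the shape of the computation:

```python
def voptimal(vals, B):
    """Classic V-optimal histogram DP over a value sequence: returns
    the minimum total sum-of-squared-errors achievable with at most B
    buckets, each bucket approximated by its mean. O(n^2 * B) time."""
    n = len(vals)
    # prefix sums of values and squared values, for O(1) bucket SSE
    s = [0.0] * (n + 1)
    ss = [0.0] * (n + 1)
    for i, v in enumerate(vals):
        s[i + 1] = s[i] + v
        ss[i + 1] = ss[i] + v * v

    def sse(i, j):  # SSE of vals[i:j] when replaced by its mean
        m = (s[j] - s[i]) / (j - i)
        return (ss[j] - ss[i]) - (j - i) * m * m

    INF = float('inf')
    # dp[j][b] = best SSE covering the first j values with exactly b buckets
    dp = [[INF] * (B + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for j in range(1, n + 1):
        for b in range(1, B + 1):
            for i in range(j):  # last bucket spans vals[i:j]
                if dp[i][b - 1] < INF:
                    cand = dp[i][b - 1] + sse(i, j)
                    if cand < dp[j][b]:
                        dp[j][b] = cand
    return min(dp[n][1:B + 1])
```

For [1, 1, 5, 5] and B = 2 the optimum is zero error (one bucket per constant run); the probabilistic variants in the paper replace the per-bucket SSE term with error under the tuples' probability distributions.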
Attribute and object selection queries on objects with probabilistic attributes
In ACM Transactions on Database Systems (ACM TODS)
, 2012
"... Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce results consisting of objects whose attributes may contain uncertainty. This uncertainty is frequently captured in the form of a set of multiple mutually exclusive ..."
Abstract

Cited by 7 (6 self)
Modern data processing techniques such as entity resolution, data cleaning, information extraction, and automated tagging often produce results consisting of objects whose attributes may contain uncertainty. This uncertainty is frequently captured as a set of multiple mutually exclusive value choices for each uncertain attribute, along with a measure of probability for the alternative values. However, the lay end-user, as well as some end-applications, might not be able to interpret results output in such a form. Thus, the question is how to present such results to the user in practice, for example, to support the attribute-value selection and object selection queries the user might be interested in. Specifically, in this article we study the problem of maximizing the quality of these selection queries on top of such a probabilistic representation. The quality is measured using standard and commonly used set-based quality metrics. We formalize the problem and then develop efficient approaches that provide high-quality answers for these queries. A comprehensive empirical evaluation over three different domains demonstrates the advantage of our approach over existing techniques.
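As one concrete instance of this selection problem, consider a single uncertain attribute with mutually exclusive alternatives and expected F1 as the set-based quality metric (the metric choice and encoding here are ours, for illustration). Since the F1 of a returned set S against a single true value v is 2·[v ∈ S]/(1 + |S|), the optimal answer set is always a prefix of the alternatives sorted by probability, so it can be found by scanning prefixes:

```python
def best_answer_set(alts):
    """alts: mutually exclusive (value, probability) alternatives for
    one uncertain attribute. Returns (answer set, expected F1) where
    the set maximizes expected F1 against the unknown true value.
    For fixed set size k, top-k probabilities maximize the hit mass,
    so only prefixes of the probability-sorted list need checking."""
    alts = sorted(alts, key=lambda a: -a[1])
    best, best_f1 = [], 0.0
    mass = 0.0  # probability the true value lies in the current prefix
    for k, (_, p) in enumerate(alts, 1):
        mass += p
        f1 = 2.0 * mass / (1 + k)
        if f1 > best_f1:
            best_f1, best = f1, [v for v, _ in alts[:k]]
    return best, best_f1
```

With alternatives (a: 0.6, b: 0.3, c: 0.1), returning just {a} is optimal: adding b raises the hit probability to 0.9 but the larger set size cancels the gain exactly, and adding c strictly hurts.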
Ranking Distributed Probabilistic Data
, 2009
"... Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., ..."
Abstract

Cited by 7 (1 self)
Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, or multiple scientific institutes with inconsistencies in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.
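Under tuple-level uncertainty, the expected-rank computation at a single site reduces to simple sums over existence probabilities. The sketch below uses a simplified variant of the definition, omitting the term that charges a tuple for worlds in which it is absent, to show the basic shape:

```python
def expected_ranks(tuples):
    """tuples: list of (score, prob) pairs with independent tuple
    existence. Returns, per tuple, the expected number of other tuples
    that exist and have a strictly higher score (a simplification of
    the full possible-worlds expected-rank definition). Lower is
    better; the global top-k are the k tuples with the smallest ranks."""
    ranks = []
    for i, (score_i, _) in enumerate(tuples):
        # linearity of expectation: sum P(other tuple exists) over
        # all other tuples that would outrank this one
        r = sum(p for j, (s, p) in enumerate(tuples)
                if j != i and s > score_i)
        ranks.append(r)
    return ranks
```

For tuples (score 10, prob 0.5) and (score 5, prob 1.0), the second tuple's expected rank is 0.5: it is outranked exactly when the higher-scoring tuple exists. The distributed algorithms in the paper compute such ranks while shipping only a small fraction of each site's tuples to the server.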
LIVE: A lineage-supported versioned DBMS
, 2010
"... Abstract — This paper presents LIVE, a complete DBMS designed for applications with many stored derived relations, and with a need for simple versioning capabilities when base data is modified. Target applications include, for example, scientific data management and data integration. A key feature o ..."
Abstract

Cited by 7 (5 self)
This paper presents LIVE, a complete DBMS designed for applications with many stored derived relations and a need for simple versioning capabilities when base data is modified. Target applications include, for example, scientific data management and data integration. A key feature of LIVE is the use of lineage (provenance) to support modifications and versioning in this environment. In our system, lineage significantly facilitates both: (1) efficient propagation of modifications from base to derived data; and (2) efficient execution of a wide class of queries over versioned, derived data. LIVE is fully implemented; we present detailed experimental results that validate our techniques.