Data integration with uncertainties
- In Proc. of VLDB, 2007
Cited by 41 (2 self)
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield imprecise data. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don’t know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting.
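The difference between the two semantics named in the abstract above can be illustrated with a small brute-force sketch. Everything here is an assumption made for illustration: the source table, the two candidate mappings and their probabilities, and the query (a set-valued projection on a mediated attribute `address`). The enumeration is exponential in the number of tuples and is not the paper's algorithm.

```python
from itertools import product

# Hypothetical source table: (name, current_addr, permanent_addr).
SOURCE = [
    ("Alice", "Mountain View", "Sunnyvale"),
    ("Bob", "Sunnyvale", "Mountain View"),
]

# Two hypothetical candidate mappings for the mediated attribute `address`,
# each given as (index of the source column it maps to, probability).
MAPPINGS = [(1, 0.5), (2, 0.5)]


def by_table(source, mappings):
    """One mapping is correct for the whole table; we do not know which."""
    probs = {}
    for col, p in mappings:
        answers = {row[col] for row in source}  # SELECT DISTINCT address
        for a in answers:
            probs[a] = probs.get(a, 0.0) + p
    return probs


def by_tuple(source, mappings):
    """Each source tuple may independently follow a different mapping."""
    probs = {}
    # Enumerate every assignment of one mapping per source tuple.
    for choice in product(mappings, repeat=len(source)):
        weight = 1.0
        answers = set()
        for row, (col, p) in zip(source, choice):
            weight *= p
            answers.add(row[col])
        for a in answers:
            probs[a] = probs.get(a, 0.0) + weight
    return probs


print(by_table(SOURCE, MAPPINGS))  # both cities get probability 1.0
print(by_tuple(SOURCE, MAPPINGS))  # both cities get probability 0.75
```

On this toy input the two semantics disagree: under by-table semantics each city appears in the answer whichever mapping is correct, while under by-tuple semantics one of the four mapping assignments drops each city.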
Query Processing over Incomplete Autonomous Databases
Cited by 10 (1 self)
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework QPIAD that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.
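The rewriting idea in the abstract above can be sketched on an in-memory table. This is a simplified illustration, not the QPIAD implementation: the car data, the attribute names, and the assumed dependency `model ~> body_style` are hypothetical, and the precision of each rewritten query is estimated by a plain relative frequency instead of the mined AFDs and Naïve Bayes classifiers the abstract mentions.

```python
from collections import Counter

# Hypothetical autonomous database with some missing body styles.
CARS = [
    {"id": 1, "model": "Z4", "body_style": "convertible"},
    {"id": 2, "model": "Z4", "body_style": "convertible"},
    {"id": 3, "model": "Z4", "body_style": None},
    {"id": 4, "model": "Civic", "body_style": "sedan"},
    {"id": 5, "model": "Civic", "body_style": "convertible"},
    {"id": 6, "model": "Civic", "body_style": None},
    {"id": 7, "model": "Accord", "body_style": "sedan"},
    {"id": 8, "model": "Accord", "body_style": None},
]


def certain_answers(rows, attr, value):
    """Rows that satisfy the original query predicate attr = value."""
    return [r for r in rows if r[attr] == value]


def rewritten_queries(rows, attr, value, determining_attr):
    """Rank values of the determining attribute by estimated precision."""
    seen = {r[determining_attr] for r in certain_answers(rows, attr, value)}
    known = [r for r in rows if r[attr] is not None]
    totals = Counter(r[determining_attr] for r in known)
    hits = Counter(r[determining_attr] for r in known if r[attr] == value)
    ranked = [(d, hits[d] / totals[d]) for d in seen]
    # Highest estimated precision first; ties broken by name for determinism.
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


def possible_answers(rows, attr, value, determining_attr):
    """Rows with a missing attr retrieved by the rewritten queries."""
    out = []
    for d, precision in rewritten_queries(rows, attr, value, determining_attr):
        for r in rows:
            if r[attr] is None and r[determining_attr] == d:
                out.append((r["id"], precision))
    return out


print(rewritten_queries(CARS, "body_style", "convertible", "model"))
print(possible_answers(CARS, "body_style", "convertible", "model"))
```

Each rewritten query has the form `model = d AND body_style IS NULL`; issuing them in order of estimated precision lets a mediator stop early when the cost budget is exhausted, which is one way to read the cost and relevance tradeoff the abstract describes.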
Making Aggregation Work in Uncertain and Probabilistic Databases
- 2007
Cited by 7 (1 self)
We describe how aggregation is handled in the Trio system for uncertain and probabilistic data. Because “exact” aggregation in uncertain databases can produce exponentially-sized results, we provide three alternatives: a low bound on the aggregate value, a high bound on the value, and the expected value. These variants return a single result instead of a set of possible results, and they are generally very efficient to compute for both full-table and grouped aggregation queries. We provide formal definitions and semantics, a description of our implementation, and some preliminary analytical and experimental results for the one aggregate (expected-average) for which we compute an approximation.
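The three variants named in the abstract above (low bound, high bound, expected value) can be sketched for SUM. The data model is an assumption made for illustration and is not Trio's API: each uncertain tuple is a list of `(value, probability)` alternatives, tuples are treated independently, and a tuple whose probabilities sum to less than 1 may be absent, in which case it contributes 0.

```python
def sum_variants(uncertain_tuples):
    """Return (low, high, expected) for SUM over uncertain tuples."""
    low = high = expected = 0.0
    for alternatives in uncertain_tuples:
        values = [v for v, _ in alternatives]
        total_p = sum(p for _, p in alternatives)
        if total_p < 1.0 - 1e-9:
            # The tuple may be absent; an absent tuple adds 0 to the sum.
            values.append(0.0)
        low += min(values)
        high += max(values)
        # Linearity of expectation: add each tuple's expected contribution.
        expected += sum(v * p for v, p in alternatives)
    return low, high, expected


# Hypothetical data: the first tuple is 10 or 20; the second is 5 or absent.
DATA = [
    [(10.0, 0.5), (20.0, 0.5)],
    [(5.0, 0.5)],
]

print(sum_variants(DATA))  # (10.0, 25.0, 17.5)
```

Each variant is computed in one pass and returns a single number, whereas the exact answer would be a distribution over up to one value per combination of alternatives.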
Query processing over incomplete autonomous databases
- In Proc. ICDE, 2006
Cited by 6 (5 self)
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such relevant uncertain answers and gauge their relevance by assessing their likelihood of being relevant answers to the query. The autonomous nature of the databases poses several challenges in realizing this idea. Such challenges include the restricted access privileges, limited query patterns, and sensitivity of database and network resource consumption in the web environment. We introduce a novel query rewriting and optimization framework that tackles these challenges. Our technique involves reformulating the user query based on approximate functional dependencies (AFDs) among the database attributes. The reformulated queries are aimed at retrieving the relevant uncertain answers in addition to the certain answers. Our query processing framework QPIAD is able to gauge the relevance of such reformulated queries to manage the cost of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of AFDs) and value distributions (using Naïve Bayes Classifiers). We present empirical studies to demonstrate that our approach is effective in retrieving relevant uncertain answers with high precision, high recall, and manageable cost.
Continuous Probabilistic Nearest-Neighbor Queries for Uncertain Trajectories
Cited by 6 (2 self)
This work addresses the problem of processing continuous nearest neighbor (NN) queries for moving-object trajectories when the exact position of a given object at a particular time instant is not known, but is bounded by an uncertainty region. As has already been observed in the literature, the answers to continuous NN-queries in spatio-temporal settings are time parameterized in the sense that the objects in the answer vary over time. Incorporating uncertainty in the model yields additional attributes that affect the semantics of the answer to this type of query. In this work, we formalize the impact of uncertainty on the answers to the continuous probabilistic NN-queries, provide a compact structure for their representation and efficient algorithms for constructing that structure. We also identify syntactic constructs for several qualitative variants of continuous probabilistic NN-queries for uncertain trajectories and present efficient algorithms for their processing.
Similarity search on Bregman divergence: Towards non-metric indexing
- In VLDB, 2009
Cited by 4 (1 self)
In this paper, we examine the problem of indexing over non-metric distance functions. In particular, we focus on a general class of distance functions, namely Bregman Divergence [6], to support nearest neighbor and range queries. Distance functions such as KL-divergence and Itakura-Saito distance are special cases of Bregman divergence, with wide applications in statistics, speech recognition, and time series analysis, among others. Unlike in metric spaces, key properties such as triangle inequality and distance symmetry do not hold for such distance functions. A direct adaptation of existing indexing infrastructure developed for metric spaces is thus not possible. We devise a novel solution to handle this class of distance measures by expanding and mapping points in the original space to a new extended space. Subsequently, we show how state-of-the-art tree-based indexing methods, for low to moderate dimensional datasets, and vector approximation file (VA-file) methods, for high dimensional datasets, can be adapted on this extended space to answer such queries efficiently. Improved distance bounding techniques and distribution-based index optimization are also introduced to improve the performance of query answering and index construction respectively, which can be applied on both the R-trees and VA files. Extensive experiments are conducted to validate our approach on a variety of datasets and a range of Bregman divergence functions.
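As background for the abstract above, the following sketch evaluates the standard definition of a Bregman divergence, D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>, and checks that the generalized KL-divergence and the Itakura-Saito distance arise from particular generators f. It covers only the definition and the lack of symmetry; the paper's extended-space indexing method is not shown. The function names and the sample vectors are chosen for illustration.

```python
import math


def bregman(f, grad_f, x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_f(y), x, y))
    return f(x) - f(y) - inner


# Generator for (generalized) KL-divergence: negative Shannon entropy.
def neg_entropy(v):
    return sum(t * math.log(t) for t in v)


def neg_entropy_grad(v):
    return [math.log(t) + 1.0 for t in v]


# Generator for Itakura-Saito distance: Burg entropy.
def burg(v):
    return -sum(math.log(t) for t in v)


def burg_grad(v):
    return [-1.0 / t for t in v]


# Closed forms, used here only to cross-check the generic definition.
def kl(x, y):
    return sum(xi * math.log(xi / yi) - xi + yi for xi, yi in zip(x, y))


def itakura_saito(x, y):
    return sum(xi / yi - math.log(xi / yi) - 1.0 for xi, yi in zip(x, y))


X = [0.2, 0.8]
Y = [0.5, 0.5]

print(bregman(neg_entropy, neg_entropy_grad, X, Y), kl(X, Y))
print(bregman(burg, burg_grad, X, Y), itakura_saito(X, Y))
print(kl(X, Y), kl(Y, X))  # the two directions differ: no symmetry
```

Because swapping the arguments changes the value, pruning rules that rely on symmetry or the triangle inequality cannot be reused as they stand, which is the obstacle the abstract points to.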
Histograms and wavelets on probabilistic data
- In ICDE, 2009
Cited by 3 (1 self)
There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near-optimal size-B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamic-programming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time.
Taming Data Explosion in Probabilistic Information Integration
- In Proceedings of the International Workshop on Inconsistency and Incompleteness in Databases (IIDB), March 26, 2006
Cited by 1 (0 self)
Data integration has been a challenging problem for decades. In an ambient environment, where many autonomous devices have their own information sources and network connectivity is ad hoc and peer-to-peer, it even becomes a serious bottleneck. To enable devices to exchange information without the need for interaction with a user at data integration time and without the need for extensive semantic annotations, a probabilistic approach seems rather promising. It simply teaches the device how to cope with the uncertainty occurring during data integration. Unfortunately, without any kind of world knowledge, almost everything becomes uncertain, hence maintaining all possibilities produces huge integrated information sources. In this paper, we claim that only very simple and generic rules are enough world knowledge to drastically reduce the amount of uncertainty, hence to tame the data explosion to a manageable size.
Mining Sequential Patterns from Probabilistic Databases by Pattern-Growth
Cited by 1 (1 self)
We propose a pattern-growth approach for mining sequential patterns from probabilistic databases. The uncertainty model we consider covers situations where the association of an event with a source is uncertain, and we address the problem of enumerating all sequences whose expected support satisfies a user-defined threshold θ. An earlier work [Muzammal and Raman, PAKDD’11] adapted representative candidate generate-and-test approaches, GSP (breadth-first sequence lattice traversal) and SPADE/SPAM (depth-first sequence lattice traversal), to the probabilistic case. The authors also noted the difficulties in generalizing PrefixSpan to the probabilistic case (PrefixSpan is a pattern-growth algorithm, considered to be the best performer for deterministic sequential pattern mining). We overcome these difficulties in this note and adapt PrefixSpan to work under probabilistic settings. We then report on an experimental evaluation of the candidate generate-and-test approaches against the pattern-growth approach.
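The notion of expected support used in the abstract above can be made concrete with a brute-force sketch that enumerates possible worlds. This is not PrefixSpan or any algorithm from the paper, and it is exponential in the number of events. The encoding is an assumption made for illustration: each event is a single item with a probability distribution over candidate sources, events are assigned independently, and a source supports a pattern when the pattern is a subsequence of the events assigned to it.

```python
from itertools import product


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def expected_support(events, pattern):
    """Sum over possible worlds of P(world) * number of supporting sources."""
    sources = sorted({s for _, dist in events for s in dist})
    choices = [list(dist.items()) for _, dist in events]
    expected = 0.0
    for world in product(*choices):
        weight = 1.0
        sequences = {s: [] for s in sources}
        # Assign every event to the source chosen for it in this world.
        for (item, _), (src, p) in zip(events, world):
            weight *= p
            sequences[src].append(item)
        support = sum(is_subsequence(pattern, seq) for seq in sequences.values())
        expected += weight * support
    return expected


# Hypothetical time-ordered events: (item, {source: probability}).
EVENTS = [
    ("a", {"X": 0.5, "Y": 0.5}),
    ("b", {"X": 1.0}),
    ("b", {"X": 0.5, "Y": 0.5}),
]

print(expected_support(EVENTS, ("a", "b")))  # 0.75
```

A pattern would be reported as frequent when this value meets the threshold θ; mining algorithms such as those in the abstract aim to reach the same decision without enumerating worlds.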
QUIC: Handling Query Imprecision & Data Incompleteness in Autonomous Databases
Cited by 1 (0 self)
As more and more information from autonomous databases becomes available to lay users, query processing over these databases must adapt to deal with the imprecise nature of user queries as well as incompleteness in the data due to missing attribute values (aka “null values”). In such scenarios, the query processor begins to acquire the role of a recommender system. Specifically, in addition to presenting answers which satisfy the user’s query, the query processor is expected to provide highly relevant answers even though they do not exactly satisfy the query predicates. This broadened view of query processing poses several technical challenges. We propose a decision theoretic model for ranking answers in the order of their expected relevance to the user. This model combines a relevance function that reflects the relevance a user would associate with answer tuples and a density function which reflects each tuple’s distribution of missing data. Adoption of this model foregrounds three general challenges: (i) how to assess the relevance and density functions automatically, (ii) how to support efficient query processing to retrieve relevant tuples, and (iii) how to make users trust the recommended answers. We present a general framework for addressing these challenges, describe a preliminary implementation of the QUIC system and discuss the results of our preliminary empirical evaluation.
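The ranking model in the abstract above combines a relevance function with a density over missing values. A minimal sketch of that combination follows, under assumptions that are not taken from the paper: relevance is the fraction of query predicates a completed tuple satisfies, the density is a fixed lookup table, and missing attributes are completed independently.

```python
from itertools import product


def expected_relevance(row, query, relevance, density):
    """Average relevance over the possible completions of missing values."""
    missing = [attr for attr in query if row.get(attr) is None]
    if not missing:
        return relevance(query, row)
    distributions = [list(density(row, attr).items()) for attr in missing]
    total = 0.0
    for combo in product(*distributions):
        completed = dict(row)
        weight = 1.0
        for attr, (value, p) in zip(missing, combo):
            completed[attr] = value
            weight *= p
        total += weight * relevance(query, completed)
    return total


def fraction_matched(query, row):
    """Hypothetical relevance: share of query predicates the row satisfies."""
    return sum(row.get(a) == v for a, v in query.items()) / len(query)


def lookup_density(row, attr):
    """Hypothetical density: body-style distribution conditioned on model."""
    table = {"Z4": {"convertible": 0.75, "coupe": 0.25}}
    return table[row["model"]]


QUERY = {"model": "Z4", "body_style": "convertible"}
ROWS = [
    {"id": 1, "model": "Z4", "body_style": None},
    {"id": 2, "model": "Z4", "body_style": "coupe"},
    {"id": 3, "model": "Z4", "body_style": "convertible"},
]

ranked = sorted(
    ROWS,
    key=lambda r: expected_relevance(r, QUERY, fraction_matched, lookup_density),
    reverse=True,
)
print([r["id"] for r in ranked])  # [3, 1, 2]
```

The incomplete tuple (id 1) scores 0.875 and ranks above the complete but non-matching tuple (id 2, score 0.5), which is the kind of answer a strictly sound query processor would never return.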

