Evaluating Probabilistic Queries over Imprecise Data. In SIGMOD, 2003.
"... Sensors are often employed to monitor continuously changing entities like locations of moving ob-jects and temperature. The sensor readings are reported to a database system, and are subsequently used to answer queries. Due to continuous changes in these values and limited resources (e.g., net-work ..."
Cited by 278 (45 self).
Abstract:
Sensors are often employed to monitor continuously changing entities like locations of moving objects and temperature. The sensor readings are reported to a database system, and are subsequently used to answer queries. Due to continuous changes in these values and limited resources (e.g., network bandwidth and battery power), the database may not be able to keep track of the actual values of the entities. Queries that use these old values may produce incorrect answers. However, if the degree of uncertainty between the actual data value and the database value is limited, one can place more confidence in the answers to the queries. More generally, query answers can be augmented with probabilistic guarantees of the validity of the answers. In this paper, we study probabilistic query evaluation based on uncertain data. A classification of queries is made based upon the nature of the result set. For each class, we develop algorithms for computing probabilistic answers, and provide efficient indexing and numeric solutions. We address the important issue of measuring the quality of the answers to these queries, and provide algorithms for efficiently pulling data from relevant sensors or moving objects in order to improve the quality of the executing queries. Extensive experiments ...
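To make the flavor of such probabilistic answers concrete, here is a minimal sketch (not the paper's algorithm) that assumes each stale reading lies in a known uncertainty interval with a uniform pdf; the probability that the true value satisfies a range query is then just the overlap fraction of the two intervals:

```python
# A minimal sketch, not the paper's algorithm: each reading is assumed to
# lie in a known uncertainty interval [lo, hi] with a uniform pdf.

def prob_in_range(lo, hi, q_lo, q_hi):
    """Probability that a value uniform on [lo, hi] falls in [q_lo, q_hi]."""
    if hi <= lo:  # degenerate interval: the value is known exactly
        return 1.0 if q_lo <= lo <= q_hi else 0.0
    overlap = max(0.0, min(hi, q_hi) - max(lo, q_lo))
    return overlap / (hi - lo)

# Hypothetical sensor readings: sensor id -> uncertainty interval.
readings = {"s1": (10.0, 14.0), "s2": (13.0, 21.0), "s3": (25.0, 27.0)}
query = (12.0, 20.0)

# Probabilistic answer: every object is returned with its probability.
print({sid: prob_in_range(lo, hi, *query) for sid, (lo, hi) in readings.items()})
# {'s1': 0.5, 's2': 0.875, 's3': 0.0}
```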
Top-k query processing in uncertain databases. In ICDE, 2007.
"... Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on ..."
Cited by 125 (9 self).
Abstract:
Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty makes traditional techniques inapplicable. We introduce new probabilistic formulations for top-k queries. Our formulations are based on a “marriage” of traditional top-k semantics and possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a state space model and efficient query processing techniques to tackle the challenges of uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples and materialized search states. Our experiments show the efficiency of our techniques under different data distributions, with orders of magnitude improvement over naïve materialization of possible worlds.
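For intuition, here is a brute-force illustration of the possible-worlds view these formulations build on; the paper's framework exists precisely to avoid this exponential enumeration. The tuples, probabilities, and scores are made up, and tuples are assumed independent:

```python
# Brute-force possible-worlds enumeration, for illustration only.
from itertools import product
from collections import defaultdict

tuples = [("a", 0.6, 90), ("b", 0.9, 80), ("c", 0.5, 70)]  # (id, prob, score)
k = 2

world_prob = defaultdict(float)
for present in product([True, False], repeat=len(tuples)):
    p = 1.0
    alive = []
    for (tid, prob, score), here in zip(tuples, present):
        p *= prob if here else 1.0 - prob
        if here:
            alive.append((score, tid))
    # The top-k answer of this world, as a score-ordered tuple of ids.
    topk = tuple(tid for _, tid in sorted(alive, reverse=True)[:k])
    world_prob[topk] += p

# Most probable top-k answer across all possible worlds.
best = max(world_prob, key=world_prob.get)
print(best, round(world_prob[best], 3))   # ('a', 'b') 0.54
```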
Efficient Indexing Methods for Probabilistic Threshold Queries over Uncertain Data. In Proc. 30th Int’l Conf. Very Large Data Bases (VLDB), 2004.
"... It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of un ..."
Cited by 123 (22 self).
Abstract:
It is infeasible for a sensor database to contain the exact value of each sensor at all points in time. This uncertainty is inherent in these systems due to measurement and sampling errors, and resource limitations. In order to avoid drawing erroneous conclusions based upon stale data, the use of uncertainty intervals that model each data item as a range and associated probability density function (pdf) rather than a single value has recently been proposed. Querying these uncertain data introduces imprecision into answers, in the form of probability values that specify the likelihood that the answer satisfies the query. These queries are more expensive to evaluate than their traditional counterparts but are guaranteed to be correct and more informative due to the probabilities accompanying the answers. Although the answer probabilities are useful, for many applications it is only necessary to know whether the probability exceeds a given threshold; we term these Probabilistic Threshold Queries (PTQ). In this paper we address the efficient computation of these types of queries. In particular, we develop two index structures and associated algorithms to efficiently answer PTQs. The first index scheme is based on the idea of augmenting uncertainty information to an R-tree. We establish the difficulty ...
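As a rough sketch of the threshold idea (not the paper's index structures): with 1-D uniform pdfs, interval geometry alone settles the extreme cases, and only the remaining objects need an exact probability computation:

```python
# A hedged sketch of probabilistic threshold filtering; the uniform pdf on
# [lo, hi] and the object set are assumptions for illustration.

def ptq(objects, q_lo, q_hi, tau):
    """Return (id, probability) for objects whose probability of lying in
    [q_lo, q_hi] is at least tau; each object is uniform on [lo, hi]."""
    results = []
    for oid, (lo, hi) in objects.items():
        if hi < q_lo or lo > q_hi:        # disjoint: probability 0, prune
            continue
        if q_lo <= lo and hi <= q_hi:     # contained: probability 1, accept
            results.append((oid, 1.0))
            continue
        p = (min(hi, q_hi) - max(lo, q_lo)) / (hi - lo)  # overlap fraction
        if p >= tau:
            results.append((oid, p))
    return results

objs = {"o1": (0.0, 10.0), "o2": (4.0, 6.0), "o3": (20.0, 30.0)}
print(ptq(objs, 3.0, 8.0, 0.4))   # [('o1', 0.5), ('o2', 1.0)]
```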
Indexing multi-dimensional uncertain data with arbitrary probability density functions. In Proc. VLDB, 2005.
"... In an “uncertain database”, an object o is associated with a multi-dimensional probability density function (pdf), which describes the likelihood that o appears at each position in the data space. A fundamental operation is the “probabilistic range search ” which, given a value pq and a rectangular ..."
Cited by 116 (15 self).
Abstract:
In an “uncertain database”, an object o is associated with a multi-dimensional probability density function (pdf), which describes the likelihood that o appears at each position in the data space. A fundamental operation is the “probabilistic range search” which, given a value p_q and a rectangular area r_q, retrieves the objects that appear in r_q with probability at least p_q. In this paper, we propose the U-tree, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data. The new structure is fully dynamic (i.e., objects can be incrementally inserted/deleted in any order), and does not place any constraints on the data pdfs. We verify the query and update efficiency of U-trees with extensive experiments.
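The query predicate itself can be sketched independently of the index. Assuming we can sample from an object's pdf, a Monte Carlo estimate of its appearance probability in r_q illustrates what the U-tree must bound and prune; the tree itself is not reproduced here:

```python
# Monte Carlo estimate of the probabilistic range search predicate; the
# Gaussian pdf and the threshold are illustrative assumptions.
import random

def appearance_prob(sampler, rect, n=50_000):
    """Estimate Pr[object in rect], where sampler() draws (x, y) from its pdf."""
    x1, y1, x2, y2 = rect
    hits = 0
    for _ in range(n):
        x, y = sampler()
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits += 1
    return hits / n

# Hypothetical object: a 2-D Gaussian pdf centred at (5, 5), unit variance.
sampler = lambda: (random.gauss(5, 1), random.gauss(5, 1))
p_q, r_q = 0.4, (4.0, 4.0, 6.0, 6.0)
p = appearance_prob(sampler, r_q)
print(p, p >= p_q)   # ~0.47 (0.6827**2), so this object qualifies
```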
Data integration with uncertainty. In Proc. of VLDB, 2007.
"... Abstract This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may b ..."
Cited by 109 (6 self).
Abstract:
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate, because there may be too many of them to be created and maintained, or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield erroneous data. As a first step toward building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don't know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting.
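A small made-up example of by-table semantics: each candidate mapping is assigned a probability of being the correct one for the whole table, and an answer's probability is the total probability of the mappings that produce it. Attribute names, data, and probabilities below are illustrative:

```python
# By-table semantics on a toy source; names and probabilities are invented.
from collections import defaultdict

source_rows = [{"a1": "Alice", "a2": "alice@x.org"},
               {"a1": "Bob",   "a2": "bob@x.org"}]

# Candidate mappings of the mediated attribute "email" to a source column.
mappings = [({"email": "a2"}, 0.8),
            ({"email": "a1"}, 0.2)]

def query_emails(rows, mapping):
    """SELECT email FROM source, evaluated under one candidate mapping."""
    return {row[mapping["email"]] for row in rows}

answer_prob = defaultdict(float)
for mapping, p in mappings:
    for ans in query_emails(source_rows, mapping):
        answer_prob[ans] += p

print(dict(answer_prob))
# {'alice@x.org': 0.8, 'bob@x.org': 0.8, 'Alice': 0.2, 'Bob': 0.2}
```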
Exploiting relationships for domain-independent data cleaning, 2005.
"... In this paper we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each ..."
Cited by 81 (24 self).
Abstract:
In this paper we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RelDC) and traditional techniques is that RelDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets, as well as synthetic datasets, show that analysis of relationships significantly improves the quality of the results.
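A toy version of the relationship-analysis intuition, far simpler than RelDC's model: among candidates matching a description, prefer the one most strongly connected to the reference's context, measured here crudely as the number of shared neighbours in a hypothetical entity graph:

```python
# Toy relationship-based disambiguation; the graph and names are invented.
graph = {
    "paper1":          {"J. Smith (MIT)", "A. Jones"},
    "J. Smith (MIT)":  {"paper1", "MIT", "A. Jones"},
    "J. Smith (UCLA)": {"UCLA"},
    "A. Jones":        {"paper1", "MIT", "J. Smith (MIT)"},
}

def connection_strength(a, b):
    """Crude connection strength: number of shared neighbours."""
    return len(graph.get(a, set()) & graph.get(b, set()))

def disambiguate(context_entity, candidates):
    return max(candidates, key=lambda c: connection_strength(context_entity, c))

# Which "J. Smith" wrote paper1 with A. Jones?
print(disambiguate("A. Jones", ["J. Smith (MIT)", "J. Smith (UCLA)"]))
# -> J. Smith (MIT): shares paper1 and MIT with the co-author
```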
A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 2009.
"... In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such databases are much more complex because of the additional challenges of representing the probabilistic information. In this paper, we provide a survey of uncertain data mining a ..."
Cited by 68 (13 self).
Abstract:
In recent years, a number of indirect data collection methodologies have led to the proliferation of uncertain data. Such databases are much more complex because of the additional challenges of representing probabilistic information. In this paper, we provide a survey of uncertain data mining and management applications. We explore the various models used for uncertain data representation. In the field of uncertain data management, we examine traditional database management methods such as join processing, query processing, selectivity estimation, OLAP queries, and indexing. In the field of uncertain data mining, we examine traditional mining problems such as frequent pattern mining, outlier detection, classification, and clustering. We discuss different methodologies to process and mine uncertain data in a variety of forms.
Domain-independent data cleaning via analysis of entity-relationship graph. ACM Transactions on Database Systems (TODS), 2006.
"... In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which e ..."
Cited by 64 (23 self).
Abstract:
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RELDC) and traditional techniques is that RELDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets and over synthetic datasets show that analysis of relationships significantly improves the quality of the results.
Main Memory Evaluation of Monitoring Queries over Moving Objects. Distributed and Parallel Databases, 2004.
"... In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects or fraction of o ..."
Cited by 59 (6 self).
Abstract:
In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of the moving objects, or on the fraction of objects that move at any moment in time. We present a detailed analysis of a grid approach, which shows the best results for both skewed and uniform data. A sorting-based optimization is developed that significantly improves the cache hit rate. Experimental evaluation establishes that indexing queries using the grid index yields orders of magnitude better performance than other index structures such as R*-trees.
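A compact sketch of query indexing with a uniform grid, in the spirit of the approach evaluated here; the cell size and API are illustrative, not the paper's. Queries are registered in every cell they overlap, so an object's position update probes a single cell instead of updating an index over the objects:

```python
# Query indexing with a uniform grid; constants and API are assumptions.
CELL = 10.0  # grid cell size (assumption)

grid = {}    # (i, j) -> list of (query id, rectangle)

def cells(x1, y1, x2, y2):
    """All grid cells a rectangle overlaps."""
    for i in range(int(x1 // CELL), int(x2 // CELL) + 1):
        for j in range(int(y1 // CELL), int(y2 // CELL) + 1):
            yield (i, j)

def register_query(qid, rect):
    for cell in cells(*rect):
        grid.setdefault(cell, []).append((qid, rect))

def on_object_update(x, y):
    """Return the queries whose range contains the object's new position."""
    cell = (int(x // CELL), int(y // CELL))
    return [qid for qid, (x1, y1, x2, y2) in grid.get(cell, [])
            if x1 <= x <= x2 and y1 <= y <= y2]

register_query("q1", (5.0, 5.0, 25.0, 15.0))
print(on_object_update(12.0, 9.0))   # ['q1']
print(on_object_update(3.0, 3.0))    # []
```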
Monochromatic and Bichromatic Reverse Skyline Search over Uncertain Databases, 2008.
"... Reverse skyline queries over uncertain databases have many important applications such as sensor data monitoring and business planning. Due to the existence of uncertainty in many real-world data, answering reverse skyline queries accurately and efficiently over uncertain data has become increasingl ..."
Cited by 56 (2 self).
Abstract:
Reverse skyline queries over uncertain databases have many important applications, such as sensor data monitoring and business planning. Because uncertainty is inherent in much real-world data, answering reverse skyline queries accurately and efficiently over uncertain data has become increasingly important. In this paper, we model the probabilistic reverse skyline query on uncertain data, in both the monochromatic and bichromatic cases, and propose effective pruning methods to reduce the search space of query processing. Moreover, we present efficient query procedures that seamlessly integrate the proposed pruning methods. Extensive experiments demonstrate the efficiency and effectiveness of our proposed approach under various experimental settings.
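For orientation, here is a brute-force sketch of the certain-data monochromatic reverse skyline that the paper generalizes to uncertain data: point p belongs to the reverse skyline of query q iff q lies in the dynamic skyline of p, i.e. no other point dynamically dominates q with respect to p. The points below are made up:

```python
# Brute-force certain-data monochromatic reverse skyline, for orientation.
def dominates_wrt(p, p2, q):
    """True if p2 dynamically dominates q with respect to p."""
    d2 = [abs(p2[i] - p[i]) for i in range(len(p))]
    dq = [abs(q[i] - p[i]) for i in range(len(p))]
    return (all(a <= b for a, b in zip(d2, dq))
            and any(a < b for a, b in zip(d2, dq)))

def reverse_skyline(points, q):
    return [p for p in points
            if not any(dominates_wrt(p, p2, q) for p2 in points if p2 != p)]

pts = [(1, 1), (4, 4), (9, 9)]
print(reverse_skyline(pts, (5, 5)))   # [(4, 4), (9, 9)]; (1, 1) is pruned by (4, 4)
```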