Results 1-10 of 27
Semantics of ranking queries for probabilistic data and expected ranks
In Proc. of ICDE’09, 2009
Cited by 62 (1 self)
When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N), no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
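The expected-rank idea in this abstract can be illustrated with a small sketch. It assumes an attribute-level model in which every tuple always exists and carries an independent discrete score distribution, and it takes the rank of a tuple in a possible world to be the number of other tuples with a strictly higher score. Both are assumptions made for illustration, and the function names and sample data are hypothetical. The sketch uses a simple pairwise computation checked against brute-force enumeration of possible worlds; it is not the O(N log N) algorithm the abstract refers to.

```python
from itertools import product

def expected_ranks(tuples):
    """Expected rank of each tuple under attribute-level uncertainty.

    tuples: list of score distributions, each a list of (score, prob) pairs
    whose probabilities sum to 1 (assumed independent across tuples).
    By linearity of expectation, the expected rank of tuple i is the sum
    over j != i of Pr[score_j > score_i].
    """
    ranks = []
    for i, dist_i in enumerate(tuples):
        total = 0.0
        for j, dist_j in enumerate(tuples):
            if i == j:
                continue
            total += sum(pi * pj
                         for si, pi in dist_i
                         for sj, pj in dist_j
                         if sj > si)
        ranks.append(total)
    return ranks

def expected_ranks_bruteforce(tuples):
    """Same quantity, computed by enumerating every possible world."""
    n = len(tuples)
    ranks = [0.0] * n
    for world in product(*tuples):
        prob = 1.0
        for _, p in world:
            prob *= p
        scores = [s for s, _ in world]
        for i in range(n):
            higher = sum(1 for j in range(n) if j != i and scores[j] > scores[i])
            ranks[i] += prob * higher
    return ranks

def top_k_by_expected_rank(tuples, k):
    """Indices of the k tuples with the smallest expected rank."""
    r = expected_ranks(tuples)
    return sorted(range(len(tuples)), key=lambda i: r[i])[:k]

# Hypothetical data: three tuples with uncertain scores.
data = [
    [(100, 0.4), (70, 0.6)],
    [(92, 0.6), (80, 0.4)],
    [(85, 1.0)],
]
ranks = expected_ranks(data)
top2 = top_k_by_expected_rank(data, 2)
```

Because every tuple receives exactly one expected rank and the result is the k smallest of them, a ranking built this way always returns exactly k tuples, which is the kind of property the abstract lists.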
Computing all skyline probabilities for uncertain data
In PODS, 2009
Cited by 23 (2 self)
Skyline computation is widely used in multi-criteria decision making. As research in uncertain databases draws increasing attention, skyline queries with uncertain data have also been studied, e.g. probabilistic skylines. The previous work requires “thresholding” for its efficiency – the efficiency relies on the assumption that points with skyline probabilities below a certain threshold can be ignored. But there are situations where “thresholding” is not desirable – low probability events cannot be ignored when their consequences are significant. In such cases it is necessary to compute skyline probabilities of all data items. We provide the first algorithm for this problem whose worst-case time complexity is subquadratic. The techniques we use are interesting in their own right, as they rely on a space partitioning technique combined with using the existing dominance counting algorithm. The effectiveness of our algorithm is experimentally verified. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems, Query processing.
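To make "skyline probability of every data item" concrete, here is a minimal sketch of the naive quadratic baseline under a simplified model that is assumed here and may differ from the paper's: each point exists independently with a given probability, smaller coordinates are better, and a point is in the skyline of a possible world if it exists and no existing point dominates it. The function names and sample points are hypothetical. This is not the subquadratic algorithm the abstract describes.

```python
def dominates(a, b):
    """True if point a dominates point b (smaller is better in every dimension)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def skyline_probabilities(points):
    """Skyline probability of every point, no thresholding.

    points: list of (coords, existence_prob) pairs, assumed independent.
    A point is a skyline point exactly when it exists and every point
    that dominates it is absent, so the probability is
    p_i * product over dominating j of (1 - p_j).
    """
    result = []
    for i, (ci, pi) in enumerate(points):
        prob = pi
        for j, (cj, pj) in enumerate(points):
            if j != i and dominates(cj, ci):
                prob *= 1.0 - pj
        result.append(prob)
    return result

# Hypothetical data: (1, 4) dominates (2, 5); (3, 1) is incomparable to both.
pts = [((1, 4), 0.5), ((2, 5), 0.8), ((3, 1), 1.0)]
sky = skyline_probabilities(pts)
```

The double loop compares every pair of points, which is the quadratic cost that a subquadratic method would need to avoid.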
Query answering techniques on uncertain and probabilistic data
In SIGMOD 2008
Cited by 16 (4 self)
Uncertain data are inherent in some important applications, such as environmental surveillance, market analysis, and quantitative economics research. Due to the importance of those applications and the rapidly increasing amount of uncertain data collected and accumulated, analyzing large collections of uncertain data has become an important task and has attracted more and more interest from the database community. Recently, uncertain data management has become an emerging hot area in database research and development. In this tutorial, we systematically review some representative studies on answering various queries on uncertain and probabilistic data.
Top-k dominating queries in uncertain databases
In EDBT, 2009
Cited by 12 (0 self)
Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate and tackle an important query, namely probabilistic top-k dominating (PTD) query, in the uncertain database. In particular, a PTD query retrieves k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects. We propose an effective pruning approach to reduce the PTD search space, and present an efficient query procedure to answer PTD queries. Furthermore, approximate PTD query processing and the case where the PTD query is issued from an uncertain query object are also discussed. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTD query processing approaches.
Probabilistic Similarity Search for Uncertain Time Series
2009
Cited by 10 (2 self)
A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types by upper and lower bounding the Euclidean distance.
Scalable Probabilistic Similarity Ranking in Uncertain Databases
Cited by 8 (5 self)
This paper introduces a scalable approach for probabilistic top-k similarity ranking on uncertain vector data. Each uncertain object is represented by a set of vector instances that are assumed to be mutually exclusive. The objective is to rank the uncertain data according to their distance to a reference object. We propose a framework that incrementally computes, for each object instance and ranking position, the probability of the object falling at that ranking position. The resulting rank probability distribution can serve as input for several state-of-the-art probabilistic ranking models. Existing approaches compute this probability distribution by applying the Poisson binomial recurrence technique of quadratic complexity. In this paper we theoretically as well as experimentally show that our framework reduces this to a linear-time complexity while having the same memory requirements, facilitated by incremental accessing of the uncertain vector instances in increasing order of their distance to the reference object. Furthermore, we show how the output of our method can be used to apply probabilistic top-k ranking for the objects, according to different state-of-the-art definitions. We conduct an experimental evaluation on synthetic and real data, which demonstrates the efficiency of our approach.
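The Poisson binomial recurrence mentioned in this abstract as the quadratic baseline can be sketched as follows. The sketch assumes that, for a fixed target instance, each other object is closer to the reference object with a known, independent probability; the function name and the sample probabilities are hypothetical. It shows only the baseline recurrence, not the paper's linear-time framework.

```python
def rank_distribution(closer_probs):
    """Rank probability distribution of one target instance.

    closer_probs[j] is the probability that object j is closer to the
    reference object than the target (assumed independent across objects).
    Returns dist where dist[k] is the probability that exactly k objects
    are closer, i.e. that the target falls at 0-based rank position k.
    Runs in O(n^2) time for n other objects.
    """
    dist = [1.0]  # with no other objects, the target is at rank 0
    for p in closer_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # this object is not closer
            nxt[k + 1] += q * p       # this object is closer: rank shifts by one
        dist = nxt
    return dist

# Hypothetical probabilities for two competing objects.
dist = rank_distribution([0.5, 0.2])
```

Each new object extends the distribution by one position, so processing n objects touches on the order of n squared entries, which is where the quadratic cost comes from.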
Ranking Distributed Probabilistic Data
2009
Cited by 6 (1 self)
Ranking queries are essential tools to process large amounts of probabilistic data that encode exponentially many possible deterministic instances. In many applications where uncertainty and fuzzy information arise, data are collected from multiple sources in distributed, networked locations, e.g., distributed sensor fields with imprecise measurements, or multiple scientific institutes with inconsistencies in their scientific data. Due to the network delay and the economic cost associated with communicating large amounts of data over a network, a fundamental problem in these scenarios is to retrieve the global top-k tuples from all distributed sites with minimum communication cost. Using the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of ranking, this work designs both communication- and computation-efficient algorithms for retrieving the top-k tuples with the smallest ranks from distributed sites. Extensive experiments using both synthetic and real data sets confirm the efficiency and superiority of our algorithms over the straightforward approach of forwarding all data to the server.
Semantics of Ranking Queries for Probabilistic Data
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
Cited by 5 (0 self)
Recently, there have been several attempts to propose definitions and algorithms for ranking queries on probabilistic data. However, these lack many intuitive properties of a top-k over deterministic data. We define numerous fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are satisfied by ranking queries on certain data. We argue that these properties should also be carefully studied when defining ranking queries on probabilistic data, and should be fulfilled by any definition for ranking uncertain data in most applications. We propose an intuitive new ranking definition based on the observation that the ranks of a tuple across all possible worlds represent a well-founded rank distribution. We study ranking definitions based on the expectation, the median, and other statistics of this rank distribution for a tuple, and derive the expected rank, median rank, and quantile rank correspondingly. We are able to prove that the expected rank, median rank, and quantile rank satisfy all these properties for a ranking query. We provide efficient solutions to compute such rankings across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. Finally, a comprehensive experimental study confirms the effectiveness of our approach.
Indexing Probabilistic Nearest-Neighbor Threshold Queries
Cited by 5 (2 self)
Data uncertainty is inherent in many applications, including sensor networks, scientific data management, data integration, location-based applications, etc. One common query for uncertain data is the probabilistic nearest neighbor (PNN) query, which returns all uncertain objects with a non-zero probability of being the NN. In this paper we study the PNN query with a probability threshold (PNNT), which returns all objects with an NN probability greater than the threshold. Our PNNT query removes the assumption in all previous papers that the probability of an uncertain object always adds up to 1, i.e., we consider missing probabilities. We propose an augmented R-tree index with additional probabilistic information to facilitate pruning, as well as global data structures for maintaining the current pruning status. We present our algorithm for efficiently answering PNNT queries and perform experiments to show that our algorithm significantly reduces the number of objects that need to be further evaluated as NN candidates.
Continuously monitoring top-k uncertain data streams: a probabilistic threshold method
In ICDCS, 2007