Results 1 - 10 of 68
Probabilistic skylines on uncertain data
- In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB’07), Vienna, 2007
"... Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains an open problem at large. In this pap ..."
Abstract
-
Cited by 103 (19 self)
- Add to MetaCart
Uncertain data are inherent in some important applications. Although a considerable amount of research has been dedicated to modeling uncertain data and answering some types of queries on uncertain data, how to conduct advanced analysis on uncertain data remains largely an open problem. In this paper, we tackle the problem of skyline analysis on uncertain data. We propose a novel probabilistic skyline model in which an uncertain object has a probability of being in the skyline, and a p-skyline contains all the objects whose skyline probabilities are at least p. Computing probabilistic skylines on large uncertain data sets is challenging. We develop two efficient algorithms. The bottom-up algorithm computes the skyline probabilities of selected instances of uncertain objects, and uses those instances to prune other instances and uncertain objects effectively. The top-down algorithm recursively partitions the instances of uncertain objects into subsets, and prunes subsets and objects aggressively. Our experimental results on both the real NBA player data set and benchmark synthetic data sets show that probabilistic skylines are interesting and useful, and that our two algorithms are efficient on large data sets and complementary to each other in performance.
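The p-skyline definition above admits a direct computation when each uncertain object is given as a finite set of equally likely instances and objects are independent: an instance is in the skyline of a possible world exactly when no instance drawn for any other object dominates it. A minimal Python sketch under those assumptions (the function names are illustrative, not the paper's API; smaller is assumed better in every dimension):

```python
def dominates(u, v):
    """u dominates v if u is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def skyline_probability(obj, others):
    """Skyline probability of an uncertain object (a list of equally likely instances).

    An instance u is in the skyline of a possible world iff no instance drawn
    for another object dominates it; independence lets per-object terms multiply.
    """
    total = 0.0
    for u in obj:
        p_u = 1.0
        for other in others:
            # Probability that the instance drawn for `other` does NOT dominate u.
            frac = sum(dominates(v, u) for v in other) / len(other)
            p_u *= 1.0 - frac
        total += p_u / len(obj)  # each instance of `obj` is equally likely
    return total

# Toy data: two uncertain objects in 2D, smaller is better in both dimensions.
A = [(1, 4), (3, 1)]
B = [(2, 2), (5, 5)]
print(skyline_probability(A, [B]))  # P(A is in the skyline)
print(skyline_probability(B, [A]))  # a 0.5-skyline keeps objects scoring >= 0.5
```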
Sliding-window top-k queries on uncertain streams
- In VLDB 2008
"... Query processing on uncertain data streams has attracted a lot of attentions lately, due to the imprecise nature in the data generated from a variety of streaming applications, such as readings from a sensor network. However, all of the existing works on uncertain data streams study unbounded stream ..."
Abstract
-
Cited by 30 (5 self)
- Add to MetaCart
(Show Context)
Query processing on uncertain data streams has attracted a lot of attention lately, due to the imprecise nature of the data generated by a variety of streaming applications, such as readings from a sensor network. However, all of the existing work on uncertain data streams studies unbounded streams. This paper takes the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on arguably one of the most important types of queries: top-k queries. The challenge of answering sliding-window top-k queries on uncertain data streams stems from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, combined with the difficulty of coping with the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size, while also being highly efficient in terms of processing time. In addition to the theoretical space and time bounds that we prove for these synopses, we also present a thorough experimental report verifying their practical efficiency on both synthetic and real data.
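For intuition about the possible-world blowup the paper's synopses are designed to avoid, here is a brute-force baseline under one common top-k definition (the probability that a tuple ranks among the top-k of a random possible world), assuming tuples exist independently. It is exponential in the window size, so it serves only as a reference point, not as the paper's method:

```python
from collections import deque

def topk_probabilities(window, k):
    """Brute-force P(tuple is among the top-k) over all 2^n possible worlds.

    `window` holds (score, prob) tuples that exist independently. Exponential
    in the window size -- exactly the blowup that compact synopses avoid.
    """
    n = len(window)
    result = [0.0] * n
    for mask in range(1 << n):
        weight = 1.0
        present = []
        for i, (score, p) in enumerate(window):
            if mask >> i & 1:
                weight *= p
                present.append(i)
            else:
                weight *= 1.0 - p
        present.sort(key=lambda i: -window[i][0])  # higher score = better
        for i in present[:k]:
            result[i] += weight
    return result

# Sliding window of size 3: appending a new tuple expires the oldest one.
window = deque([(10, 0.9), (7, 0.5), (12, 0.3)], maxlen=3)
print(topk_probabilities(list(window), k=2))
window.append((8, 0.8))  # (10, 0.9) falls out of the window
print(topk_probabilities(list(window), k=2))
```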
Mining Uncertain Data with Probabilistic Guarantees
"... Data uncertainty is inherent in applications such as sensor monitoring systems, location-based services, and biological databases. To manage this vast amount of imprecise information, probabilistic databases have been recently developed. In this paper, we study the discovery of frequent patterns and ..."
Abstract
-
Cited by 25 (0 self)
- Add to MetaCart
(Show Context)
Data uncertainty is inherent in applications such as sensor monitoring systems, location-based services, and biological databases. To manage this vast amount of imprecise information, probabilistic databases have recently been developed. In this paper, we study the discovery of frequent patterns and association rules from probabilistic data under the Possible World Semantics. This is technically challenging, since a probabilistic database can have an exponential number of possible worlds. We propose two efficient algorithms, which discover frequent patterns in bottom-up and top-down manners respectively. Both algorithms can be easily extended to discover maximal frequent patterns. We also explain how to use these patterns to generate association rules. Extensive experiments, using real and synthetic datasets, were conducted to validate the performance of our methods. Source codes and data are available at:
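One standard formalization under Possible World Semantics (used here as an illustrative assumption, not necessarily the paper's exact definitions): a pattern's support over independent transactions follows a Poisson-binomial distribution, and the pattern is probabilistically frequent when P(support ≥ minsup) ≥ τ. The dynamic program below evaluates this without enumerating the 2^n worlds:

```python
def support_distribution(probs):
    """Poisson-binomial DP: dist[s] = P(support = s).

    probs[i] is the probability that transaction i contains the pattern;
    transactions are assumed independent. Runs in O(n^2) instead of O(2^n).
    """
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1.0 - p)   # world where transaction i lacks the pattern
            nxt[s + 1] += q * p       # world where it contains the pattern
        dist = nxt
    return dist

def is_probabilistically_frequent(probs, minsup, tau):
    """Frequent under possible world semantics iff P(support >= minsup) >= tau."""
    return sum(support_distribution(probs)[minsup:]) >= tau

# The pattern appears in five transactions with these probabilities:
print(is_probabilistically_frequent([0.9, 0.8, 0.7, 0.4, 0.2], minsup=3, tau=0.5))
```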
Static Analysis for Probabilistic Programs: Inferring Whole Program Properties from Finitely Many Paths.
"... We propose an approach for the static analysis of probabilistic programs that sense, manipulate, and control based on uncertain data. Examples include programs used in risk analysis, medical decision making and cyber-physical systems. Correctness properties of such programs take the form of queries ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
(Show Context)
We propose an approach for the static analysis of probabilistic programs that sense, manipulate, and control based on uncertain data. Examples include programs used in risk analysis, medical decision making, and cyber-physical systems. Correctness properties of such programs take the form of queries that seek the probabilities of assertions over program variables. We present a static analysis approach that provides guaranteed interval bounds on the values (assertion probabilities) of such queries. First, we observe that for probabilistic programs, it is possible to conclude facts about the behavior of the entire program by choosing a finite, adequate set of its paths. We provide strategies for choosing such a set of paths and verifying its adequacy. The queries are evaluated over each path by a combination of symbolic execution and probabilistic volume-bound computations. Each path yields interval bounds that can be summed up with a “coverage” bound to yield an interval that encloses the probability of assertion for the program as a whole. We demonstrate promising results on a suite of benchmarks from many different sources, including robotic manipulators and medical decision making programs.
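A toy rendition of the interval-plus-coverage idea: each explored path contributes its probability mass and the mass on which the assertion also holds, while unexplored mass can only widen the upper bound. The per-path volumes below are worked out by hand for a two-path program with a uniform input; the paper computes them via symbolic execution and volume bounds:

```python
def assertion_interval(explored_paths):
    """Interval bound on P(assertion) from finitely many explored paths.

    Each path is (path_mass, assert_mass): the probability of taking the path,
    and the probability of taking it AND satisfying the assertion. Unexplored
    mass (the "coverage" gap) can only add to the upper bound.
    """
    lower = sum(assert_mass for _, assert_mass in explored_paths)
    covered = sum(path_mass for path_mass, _ in explored_paths)
    return lower, lower + (1.0 - covered)

# Toy program with x ~ Uniform(0, 1), query P(y > 0.6):
#   if x < 0.5: y = x + 0.2   -> holds iff x in (0.4, 0.5): assert mass 0.1
#   else:       y = x - 0.1   -> holds iff x in (0.7, 1.0): assert mass 0.3
path1, path2 = (0.5, 0.1), (0.5, 0.3)
print(assertion_interval([path1]))         # (0.1, 0.6): half the input space unexplored
print(assertion_interval([path1, path2]))  # (0.4, 0.4): full coverage, exact answer
```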
Mining Probabilistically Frequent Sequential Patterns in Uncertain Databases
"... Data uncertainty is inherent in many real-world applications such as environmental surveillance and mobile tracking. As a result, mining sequential patterns from inaccurate data, such as sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications. Previou ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
Data uncertainty is inherent in many real-world applications such as environmental surveillance and mobile tracking. As a result, mining sequential patterns from inaccurate data, such as sensor readings and GPS trajectories, is important for discovering hidden knowledge in such applications. Previous work uses expected support as the measure of pattern frequentness, which has inherent weaknesses with respect to the underlying probability model and is therefore ineffective for mining high-quality sequential patterns from uncertain sequence databases. In this paper, we propose to measure pattern frequentness based on the possible world semantics. We establish two uncertain sequence data models abstracted from many real-life applications involving uncertain sequence data, and formulate the problem of mining probabilistically frequent sequential patterns (or p-FSPs) from data that conform to our models. Based on the prefix-projection strategy of the well-known PrefixSpan algorithm, we develop two new algorithms, collectively called U-PrefixSpan, for p-FSP mining. U-PrefixSpan effectively avoids the problem of “possible world explosion”, and when combined with our three pruning techniques and one validating technique, achieves good performance. The efficiency and effectiveness of U-PrefixSpan are verified through extensive experiments on both real and synthetic datasets.
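To make the possible-world measure concrete, here is a sketch for one plausible element-level uncertainty model (each position of a sequence independently realizes at most one item; this model and the helper names are assumptions for illustration, not U-PrefixSpan itself). Because greedy left-to-right matching decides subsequence containment, a small dynamic program gives the exact probability that one uncertain sequence contains a pattern; feeding these per-sequence probabilities into a support-distribution DP, as for itemsets, then yields P(the pattern is frequent) over the whole database:

```python
def containment_probability(uncertain_seq, pattern):
    """P(pattern occurs as a subsequence of the realized sequence).

    Each position of `uncertain_seq` is a dict {item: prob} and independently
    realizes at most one item. Greedy left-to-right matching decides
    subsequence containment, so a DP over the matched prefix length is exact.
    """
    m = len(pattern)
    state = [0.0] * (m + 1)  # state[j] = P(exactly j pattern items matched so far)
    state[0] = 1.0
    for dist in uncertain_seq:
        nxt = [0.0] * (m + 1)
        for j, q in enumerate(state):
            if q == 0.0:
                continue
            if j < m:
                p_hit = dist.get(pattern[j], 0.0)  # next needed item realized here
                nxt[j + 1] += q * p_hit
                nxt[j] += q * (1.0 - p_hit)
            else:
                nxt[m] += q  # pattern already fully matched
        state = nxt
    return state[m]

# One uncertain sequence with three positions; pattern <a, b>:
seq = [{'a': 0.8, 'b': 0.2}, {'b': 0.5}, {'a': 0.3, 'b': 0.6}]
print(containment_probability(seq, ['a', 'b']))  # 0.64
```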
Mining Frequent Itemsets over Uncertain Databases
"... In recent years, due to the wide applications of uncertain data, miningfrequentitemsetsoveruncertaindatabaseshasattracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence counting of this itemset. Thus, unlike the corresponding prob ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
In recent years, due to the wide applications of uncertain data, mining frequent itemsets over uncertain databases has attracted much attention. In uncertain databases, the support of an itemset is a random variable instead of a fixed occurrence count. Thus, unlike the corresponding problem in deterministic databases, where the frequent itemset has a unique definition, the frequent itemset under uncertain environments has two different definitions so far. The first definition, referred to as the expected support-based frequent itemset, employs the expectation of the support of an itemset to measure whether this itemset is frequent. The second definition, referred to as the probabilistic frequent itemset, uses the probability of the support of an itemset to measure its frequency. Thus, existing work on mining frequent itemsets over uncertain databases is divided into two different groups, and no study has been conducted to comprehensively compare the two definitions. In addition, since no uniform experimental platform exists, current solutions for the same definition even generate inconsistent results. In this paper, we first aim to clarify the relationship between the two definitions. Through extensive experiments, we verify that the two definitions have a tight connection and can be unified when the size of the data is large enough. Second, we provide baseline implementations of eight existing representative algorithms and fairly test their performance with uniform measures. Finally, according to the fair tests over many different benchmark data sets, we clarify several existing inconsistent conclusions and discuss some new findings.
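A toy comparison of the two definitions, assuming independent transactions: expected support is just the sum of the per-transaction probabilities (by linearity of expectation), while the probabilistic definition needs the full support distribution. On small data the two can disagree, which is consistent with the paper's observation that they converge as the data grows:

```python
def expected_support(probs):
    """Expected support: the sum of per-transaction probabilities, by linearity."""
    return sum(probs)

def frequentness_probability(probs, minsup):
    """P(support >= minsup) via a Poisson-binomial DP over independent transactions."""
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for s, q in enumerate(dist):
            nxt[s] += q * (1.0 - p)
            nxt[s + 1] += q * p
        dist = nxt
    return sum(dist[minsup:])

# Ten transactions, each containing the itemset with probability 0.5:
probs = [0.5] * 10
print(expected_support(probs))                    # 5.0: expected-frequent at minsup = 5
print(frequentness_probability(probs, minsup=5))  # ~0.62: not frequent at tau = 0.7
```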
Do you know your IQ? A research agenda for information quality in systems
"... Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the correct information is being used to make a decision or take an action. Failure to understand whether infor ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
(Show Context)
Information quality (IQ) is a measure of how fit information is for a purpose. Sometimes called Quality of Information (QoI) by analogy with Quality of Service (QoS), it quantifies whether the correct information is being used to make a decision or take an action. Failure to understand whether information is of adequate quality can lead to bad decisions and catastrophic effects. The results can include system outages, increased costs, lost revenue – and worse. Quantifying information quality can help improve decision making, but the ultimate goal should be to select or construct information sources that have the appropriate balance between information quality and the cost of providing it. In this paper, we provide a brief introduction to the field, argue the case for applying information quality metrics in the systems domain, and propose a research agenda to explore this space.
DUST: a generalized notion of similarity between uncertain time series
- In SIGKDD, 2010
"... Large-scale sensor deployments and an increased use of pri-vacy-preserving transformations have led to an increasing in-terest in mining uncertain time series data. Traditional dis-tance measures such as Euclidean distance or dynamic time warping are not always effective for analyzing uncertain time ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
Large-scale sensor deployments and an increased use of privacy-preserving transformations have led to an increasing interest in mining uncertain time series data. Traditional distance measures such as Euclidean distance or dynamic time warping are not always effective for analyzing uncertain time series data. Recently, some measures have been proposed to account for uncertainty in time series data. However, we show in this paper that their applicability is limited. In particular, these approaches do not provide an intuitive way to compare two uncertain time series and do not easily accommodate multiple error functions. In this paper, we first provide a theoretical framework that generalizes the notion of similarity between uncertain time series. Second, we propose DUST, a novel distance measure that accommodates uncertainty and degenerates to the Euclidean distance when the distance is large compared to the error. We provide an extensive experimental validation of our approach for the following applications: classification, top-k motif search, and top-k nearest-neighbor queries.
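As a baseline for what "accommodating multiple error functions" can mean, the sketch below estimates the expected Euclidean distance between two uncertain series by Monte Carlo, with an independent per-position error model supplied as a function. This is a generic uncertainty-aware measure for illustration only; it is not the paper's DUST formula, which is defined analytically:

```python
import random

def expected_euclidean(x, y, x_err, y_err, samples=10_000, seed=0):
    """Monte Carlo estimate of the expected Euclidean distance between two
    uncertain time series.

    x_err(i, rng) / y_err(i, rng) draw the measurement error at position i,
    so each position (and each series) may follow a different error model.
    NOTE: a generic uncertainty-aware baseline, not the paper's DUST measure.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        sq = sum((x[i] + x_err(i, rng) - y[i] - y_err(i, rng)) ** 2
                 for i in range(len(x)))
        total += sq ** 0.5
    return total / samples

# Gaussian error on one series, uniform error on the other:
gauss = lambda i, rng: rng.gauss(0.0, 0.1)
unif = lambda i, rng: rng.uniform(-0.2, 0.2)
print(expected_euclidean([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], gauss, unif))
```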
On the Most Likely Convex Hull of Uncertain Points
"... Abstract. Consider a set of d-dimensional points where the existence or the location of each point is determined by a probability distribution. The convex hull of this set is a random variable distributed over exponen-tially many choices. We are interested in finding the most likely convex hull, nam ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Consider a set of d-dimensional points where the existence or the location of each point is determined by a probability distribution. The convex hull of this set is a random variable distributed over exponentially many choices. We are interested in finding the most likely convex hull, namely, the one with the maximum probability of occurrence. We investigate this problem under two natural models of uncertainty: the point (also called the tuple) model, where each point (site) has a fixed position s_i but only exists with some probability π_i, for 0 < π_i ≤ 1, and the multipoint model, where each point has multiple possible locations or may not appear at all. We show that the most likely hull under the point model can be computed in O(n^3) time for n points in d = 2 dimensions, but it is NP-hard for d ≥ 3 dimensions. On the other hand, we show that the problem is NP-hard under the multipoint model even for d = 2 dimensions. We also present hardness results for approximating the probability of the most likely hull. While we focus on the most likely hull for concreteness, our results hold for other natural definitions of a probabilistic hull.
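The "random variable over exponentially many choices" view suggests a brute force under the point model in 2D: enumerate all worlds, compute each world's hull, and tally probabilities per distinct hull. A sketch using Andrew's monotone-chain hull (the toy probabilities are made up; the paper's O(n^3) algorithm avoids this enumeration entirely):

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return tuple(pts)  # degenerate worlds: 0, 1, or 2 points
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return tuple(lower[:-1] + upper[:-1])

def most_likely_hull(sites):
    """Brute force over all 2^n worlds under the point (tuple) model.

    `sites` is a list of ((x, y), pi): a fixed position that exists with
    probability pi. Tally each world's probability under its hull, return
    the argmax. Exponential -- only for tiny inputs.
    """
    hull_prob = {}
    for mask in range(1 << len(sites)):
        weight, world = 1.0, []
        for i, (pt, pi) in enumerate(sites):
            if mask >> i & 1:
                weight *= pi
                world.append(pt)
            else:
                weight *= 1.0 - pi
        h = convex_hull(world)
        hull_prob[h] = hull_prob.get(h, 0.0) + weight
    return max(hull_prob.items(), key=lambda kv: kv[1])

sites = [((0, 0), 0.9), ((4, 0), 0.8), ((2, 3), 0.7), ((2, 1), 0.4)]
print(most_likely_hull(sites))  # (hull vertices, probability of that hull)
```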
The Pursuit of a Good Possible World: Extracting Representative Instances of Uncertain Graphs
"... Data in several applications can be represented as an uncertain graph, whose edges are labeled with a probability of existence. Exact query processing on uncertain graphs is prohibitive for most applications, as it involves evaluation over an exponential number of instantiations. Even approximate pr ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Data in several applications can be represented as an uncertain graph, whose edges are labeled with a probability of existence. Exact query processing on uncertain graphs is prohibitive for most applications, as it involves evaluation over an exponential number of instantiations. Even approximate processing based on sampling is usually extremely expensive, since it requires a vast number of samples to achieve reasonable quality guarantees. To overcome these problems, we propose algorithms for creating deterministic representative instances of uncertain graphs that maintain the underlying graph properties. Specifically, our algorithms aim at preserving the expected vertex degrees, because they capture the graph topology well. Conventional processing techniques can then be applied on these instances to closely approximate the result on the uncertain graph. We experimentally demonstrate, with real and synthetic uncertain graphs, that the representative instances can indeed be used to answer, efficiently and accurately, queries based on several properties such as shortest path distance, clustering coefficient, and betweenness centrality.
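In the spirit of preserving expected vertex degrees, here is one simple greedy heuristic (an illustration of the idea, not the paper's algorithms): give each vertex a budget equal to its rounded expected degree, then scan edges in decreasing probability and keep an edge while both endpoints have remaining budget:

```python
from collections import defaultdict

def representative_instance(edges):
    """Greedy sketch of a deterministic instance tracking expected degrees.

    `edges` is a list of (u, v, p) with independent existence probabilities.
    Each vertex gets a degree budget equal to its rounded expected degree;
    edges are scanned in decreasing probability and kept while both endpoints
    still have slack. A heuristic illustration, not the paper's algorithms.
    """
    expected = defaultdict(float)
    for u, v, p in edges:
        expected[u] += p
        expected[v] += p
    budget = {v: round(d) for v, d in expected.items()}
    degree = defaultdict(int)
    chosen = []
    for u, v, p in sorted(edges, key=lambda e: -e[2]):
        if degree[u] < budget[u] and degree[v] < budget[v]:
            chosen.append((u, v))
            degree[u] += 1
            degree[v] += 1
    return chosen

uncertain_graph = [('a', 'b', 0.9), ('b', 'c', 0.8), ('a', 'c', 0.3), ('c', 'd', 0.6)]
print(representative_instance(uncertain_graph))  # [('a','b'), ('b','c'), ('c','d')]
```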