A Unified Approach to Ranking in Probabilistic Databases
Cited by 62 (3 self)
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF^ω and PRF^e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating-functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF^e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
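As a rough illustration of the generating-function idea, the sketch below assumes independent tuples with distinct scores and assumes that PRF^e weights rank i by α^i, so that a tuple's value is the sum over i of α^i · Pr(rank = i). The abstract states neither that formula nor the names `prf_e_scores`, `rank_by_prf_e` and `alpha`; all of them are assumptions made for this example. Under those assumptions the sum collapses to a running product over the higher-scored tuples, which a single pass can evaluate.

```python
def prf_e_scores(tuples, alpha):
    """Return {name: value} for tuples given as (name, score, probability).

    Assumes independent tuples with distinct scores (an assumption of this sketch).
    """
    ordered = sorted(tuples, key=lambda t: t[1], reverse=True)
    prefix = 1.0  # product over higher-scored tuples of (1 - p_j + p_j * alpha)
    values = {}
    for name, _score, p in ordered:
        # The tuple itself must exist (p) and occupies one rank position (alpha).
        values[name] = prefix * p * alpha
        prefix *= 1.0 - p + p * alpha
    return values


def rank_by_prf_e(tuples, alpha, k):
    """Return the names of the k tuples with the largest values."""
    values = prf_e_scores(tuples, alpha)
    return sorted(values, key=values.get, reverse=True)[:k]
```

With `alpha = 0.5`, a certain tuple with a lower score can outrank an uncertain tuple with a higher score, which is the kind of trade-off a parameterized ranking function is meant to expose.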
k-Nearest Neighbors in Uncertain Graphs
Cited by 33 (4 self)
Complex networks, such as biological, social, and communication networks, often entail uncertainty, and thus can be modeled as probabilistic graphs. Similar to the problem of similarity search in standard graphs, a fundamental problem for probabilistic graphs is to efficiently answer k-nearest neighbor (kNN) queries, that is, to compute the k closest nodes to some specific node. In this paper we introduce a framework for processing kNN queries in probabilistic graphs. We propose novel distance functions that extend well-known graph concepts, such as shortest paths. In order to compute them in probabilistic graphs, we design algorithms based on sampling. During kNN query processing we efficiently prune the search space using novel techniques. Our experiments indicate that our distance functions outperform previously used alternatives in identifying true neighbors in real-world biological data. We also demonstrate that our algorithms scale for graphs with tens of millions of edges.
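A minimal sketch of a sampling-based kNN query on a probabilistic graph follows. It assumes independent edge probabilities and uses the median of the shortest-path hop counts over sampled worlds as the distance. That particular distance, and the names `sample_world`, `bfs_distances` and `median_distance_knn`, are assumptions chosen for illustration and are not taken from the abstract. The sketch also omits the pruning the abstract mentions.

```python
import random
from collections import deque
from statistics import median


def sample_world(edges, rng):
    """Keep each undirected edge (u, v, p) independently with probability p."""
    adj = {}
    for u, v, p in edges:
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    return adj


def bfs_distances(adj, source):
    """Hop distances from source in one sampled (deterministic) world."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def median_distance_knn(edges, source, k, samples=200, seed=0):
    """Return the k nodes with the smallest median sampled distance to source."""
    rng = random.Random(seed)
    nodes = {n for u, v, _ in edges for n in (u, v)} - {source}
    observed = {n: [] for n in nodes}
    for _ in range(samples):
        dist = bfs_distances(sample_world(edges, rng), source)
        for n in nodes:
            # Unreachable nodes get an infinite distance in this world.
            observed[n].append(dist.get(n, float("inf")))
    return sorted(nodes, key=lambda n: (median(observed[n]), str(n)))[:k]
```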
Consensus answers for queries over probabilistic databases
in PODS, 2009
Cited by 19 (3 self)
We address the problem of finding a “best” deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer), which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called the and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.
Shape Fitting on Point Sets with Probability Distributions
Cited by 14 (6 self)
We consider problems on data sets where each data point has uncertainty described by an individual probability distribution. We develop several frameworks and algorithms for calculating statistics on these uncertain data sets. Our examples focus on geometric shape fitting problems. We prove approximation guarantees for the algorithms with respect to the full probability distributions. We then empirically demonstrate that our algorithms are simple and practical, solving for a constant hidden by asymptotic analysis so that a user can reliably trade speed and size for accuracy.
Mining Sequential Patterns from Probabilistic Databases by Pattern-Growth
Cited by 12 (2 self)
We propose a pattern-growth approach for mining sequential patterns from probabilistic databases. Our model of uncertainty covers situations where there is uncertainty in associating an event with a source, and we consider the problem of enumerating all sequences whose expected support satisfies a user-defined threshold θ. In an earlier work [Muzammal and Raman, PAKDD’11], the authors adapted representative candidate generate-and-test approaches, GSP (breadth-first sequence lattice traversal) and SPADE/SPAM (depth-first sequence lattice traversal), to the probabilistic case. The authors also noted the difficulties in generalizing PrefixSpan to the probabilistic case (PrefixSpan is a pattern-growth algorithm, considered to be the best performer for deterministic sequential pattern mining). We overcome these difficulties in this note and adapt PrefixSpan to work under probabilistic settings. We then report on an experimental evaluation of the candidate generate-and-test approaches against the pattern-growth approach.
Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
2010
Cited by 12 (3 self)
Probabilistic data is arriving as a new deluge along with technical advances in geographical tracking, multimedia processing, sensor networks and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics supported by existing database systems. On the other hand, some complicated distance operators have proven their value through better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover’s Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of the Earth Mover’s Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+ tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
Querying uncertain spatiotemporal data
In Proc. ICDE, 2012
Cited by 11 (6 self)
The problem of modeling and managing uncertain data has received a great deal of interest, due to its manifold applications in spatial, temporal, multimedia and sensor databases. There exists a wide range of work covering spatial uncertainty in the static (snapshot) case, where only one point of time is considered. In contrast, the problem of modeling and querying uncertain spatiotemporal data has only been treated as a simple extension of the spatial case, disregarding time dependencies between consecutive timestamps. We present a framework for efficiently modeling and querying uncertain spatiotemporal data. The key idea of our approach is to model possible object trajectories by stochastic processes. This approach has three major advantages over previous work. First, it allows answering queries in accordance with the possible worlds model. Second, dependencies between object locations at consecutive points in time are taken into account. Third, it is possible to reduce all queries on this model to simple matrix multiplications. Based on these concepts we propose efficient solutions for different probabilistic spatiotemporal queries for a particular stochastic process, the Markov chain. In an experimental evaluation we show that our approaches are several orders of magnitude faster than state-of-the-art competitors.
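To make the reduction to matrix multiplication concrete, here is a minimal sketch that assumes a discretized state space and a time-homogeneous Markov chain. The names `location_distribution` and `prob_in_region` are hypothetical, and the sketch covers only the simplest query, the probability that an object is inside a region at one given timestamp.

```python
def vec_mat(vec, mat):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [
        sum(vec[i] * mat[i][j] for i in range(len(vec)))
        for j in range(len(mat[0]))
    ]


def location_distribution(initial, transition, steps):
    """Distribution over states after the given number of transitions."""
    dist = list(initial)
    for _ in range(steps):
        dist = vec_mat(dist, transition)
    return dist


def prob_in_region(dist, region):
    """Probability mass on the states that make up the query region."""
    return sum(dist[state] for state in region)
```

For example, with `initial = [1.0, 0.0]` and `transition = [[0.5, 0.5], [0.0, 1.0]]`, two steps give the distribution `[0.25, 0.75]`.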
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
2010
Cited by 11 (8 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard) and the propagation score (which is related to PageRank and is in PTIME): when restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
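The upper-bound relationship between the two graph scores can be checked on a small example. The sketch below assumes a directed acyclic graph with independent edge probabilities. It computes reliability by brute-force enumeration of all possible worlds. For the propagation score it assumes the recurrence r(x) = 1 − ∏ (1 − r(y) · p(y, x)) over incoming edges, with r(s) = 1. That recurrence and the function names are assumptions for illustration and are not quoted from the abstract.

```python
from functools import lru_cache
from itertools import product


def reliability(edges, s, t):
    """Probability that t is reachable from s; exponential in the edge count."""
    total = 0.0
    for keep in product([False, True], repeat=len(edges)):
        prob, adj = 1.0, {}
        for (u, v, p), present in zip(edges, keep):
            prob *= p if present else 1.0 - p
            if present:
                adj.setdefault(u, []).append(v)
        seen, stack = {s}, [s]
        while stack:  # depth-first reachability in this world
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if t in seen:
            total += prob
    return total


def propagation_score(edges, s, t):
    """Propagate scores along incoming edges, treating them as independent."""
    incoming = {}
    for u, v, p in edges:
        incoming.setdefault(v, []).append((u, p))

    @lru_cache(maxsize=None)
    def score(x):
        if x == s:
            return 1.0
        miss = 1.0
        for y, p in incoming.get(x, []):
            miss *= 1.0 - score(y) * p
        return 1.0 - miss

    return score(t)
```

On a graph where two paths share their first edge, the propagation score counts that edge once per path, so it can only overestimate the reliability.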
Probabilistic Similarity Search for Uncertain Time Series
2009
Cited by 10 (2 self)
A probabilistic similarity query over uncertain data assigns to each uncertain database object o a probability indicating the likelihood that o meets the query predicate. In this paper, we formalize the notion of uncertain time series and introduce two novel and important types of probabilistic range queries over uncertain time series. Furthermore, we propose an original approximate representation of uncertain time series that can be used to efficiently support both new query types by upper and lower bounding the Euclidean distance.
Quantile-Based KNN Over Multi-Valued Objects
Cited by 9 (5 self)
K Nearest Neighbor search has many applications including data mining, multimedia, image processing, and monitoring moving objects. In this paper, we study the problem of KNN over multi-valued objects. We aim to provide effective and efficient techniques to identify KNN sensitive to relative distributions of objects. We propose to use quantiles to summarize relative-distribution-sensitive K nearest neighbors. Given a query Q and a quantile φ ∈ (0, 1], we first study the problem of efficiently computing the K nearest objects based on a φ-quantile distance (e.g. median distance) from each object to Q. The second problem is to retrieve the K nearest objects to Q based on overall distances in the “best population” (with a given size specified by the φ-quantile) for each object. While the first problem can be solved in polynomial time, we show that the second problem is NP-hard. A set of efficient, novel algorithms is proposed to give an exact solution for the first problem and an approximate solution for the second problem with approximation ratio 2. Extensive experiments demonstrate that our techniques are very efficient and effective.
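A minimal sketch of the first problem follows. It assumes that each object is a list of equally weighted points, that Q is a single point, and that the φ-quantile distance is the smallest distance within which at least a φ fraction of the object's instances lies. These modeling choices and the names `quantile_distance` and `quantile_knn` are assumptions for illustration. The sketch is a direct scan and does not reproduce the paper's algorithms.

```python
import math


def quantile_distance(instances, query, phi):
    """Smallest d such that at least a phi fraction of instances is within d."""
    if not 0.0 < phi <= 1.0:
        raise ValueError("phi must be in (0, 1]")
    dists = sorted(math.dist(point, query) for point in instances)
    # Small tolerance guards against floating-point error in phi * n.
    needed = max(1, math.ceil(phi * len(dists) - 1e-9))
    return dists[needed - 1]


def quantile_knn(objects, query, phi, k):
    """objects maps a name to its list of points; returns the k nearest names."""
    return sorted(
        objects,
        key=lambda name: (quantile_distance(objects[name], query, phi), name),
    )[:k]
```

Changing φ can change the answer. An object with most instances near Q and one far outlier ranks well at φ = 0.5 but poorly at φ = 1.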