Results 1-10 of 62
k-Nearest Neighbors in Uncertain Graphs
Cited by 33 (4 self)
Complex networks, such as biological, social, and communication networks, often entail uncertainty and can thus be modeled as probabilistic graphs. Similar to the problem of similarity search in standard graphs, a fundamental problem for probabilistic graphs is to efficiently answer k-nearest neighbor (kNN) queries, that is, to compute the k closest nodes to some specific node. In this paper we introduce a framework for processing kNN queries in probabilistic graphs. We propose novel distance functions that extend well-known graph concepts, such as shortest paths. In order to compute them in probabilistic graphs, we design algorithms based on sampling. During kNN query processing we efficiently prune the search space using novel techniques. Our experiments indicate that our distance functions outperform previously used alternatives in identifying true neighbors in real-world biological data. We also demonstrate that our algorithms scale for graphs with tens of millions of edges.
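The sampling idea in this abstract can be illustrated with a minimal sketch. It assumes independent edge probabilities, unweighted edges, and the median of the sampled shortest-path distances as the ranking function; the abstract does not name its distance functions, so the median is an illustrative choice, and the function names are hypothetical. The pruning techniques mentioned in the abstract are not modeled.

```python
import random
from collections import deque


def sample_world(edges, rng):
    """Draw one possible world: keep each undirected edge independently with its probability."""
    adj = {}
    for u, v, p in edges:
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    return adj


def bfs_distances(adj, source):
    """Hop distances from source in one sampled (deterministic) world."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def knn_by_median_distance(edges, source, k, num_samples=1000, seed=0):
    """Rank nodes by the median of their sampled distances from source (assumed distance function)."""
    rng = random.Random(seed)
    nodes = {n for u, v, _ in edges for n in (u, v)} - {source}
    samples = {n: [] for n in nodes}
    for _ in range(num_samples):
        dist = bfs_distances(sample_world(edges, rng), source)
        for n in nodes:
            # Unreachable nodes get infinite distance in this world.
            samples[n].append(dist.get(n, float("inf")))

    def median(values):
        ordered = sorted(values)
        return ordered[len(ordered) // 2]

    # Ties are broken by node name so the result is deterministic.
    ranked = sorted(nodes, key=lambda n: (median(samples[n]), str(n)))
    return ranked[:k]
```

For example, `knn_by_median_distance([("a", "b", 0.9), ("b", "c", 0.8)], "a", k=1)` estimates the single nearest neighbor of node `a`. The number of samples trades accuracy for running time.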
Consensus answers for queries over probabilistic databases
In PODS, 2009
Cited by 19 (3 self)
We address the problem of finding a “best” deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer), which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, top-k ranking queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called and/xor tree model, which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.
Relevance and ranking in online dating systems
In SIGIR, 2010
Cited by 16 (1 self)
Matchmaking systems refer to systems where users want to meet other individuals to satisfy some underlying need. Examples of matchmaking systems include dating services, resume/job bulletin boards, community-based question answering, and consumer-to-consumer marketplaces. One fundamental component of a matchmaking system is the retrieval and ranking of candidate matches for a given user. We present the first in-depth study of information retrieval approaches applied to matchmaking systems. Specifically, we focus on retrieval for a dating service. This domain offers several unique problems not found in traditional information retrieval tasks. These include two-sided relevance, very subjective relevance, extremely few relevant matches, and structured queries. We propose a machine-learned ranking function that makes use of features extracted from the uniquely rich user profiles that consist of both structured and unstructured attributes. An extensive evaluation carried out using data gathered from a real online dating service shows the benefits of our proposed methodology with respect to traditional matchmaking baseline systems. Our analysis also provides deep insights into the aspects of matchmaking that are particularly important for producing highly relevant matches.
Efficient and Effective Similarity Search over Probabilistic Data based on Earth Mover’s Distance
2010
Cited by 12 (3 self)
Probabilistic data is arriving as a new deluge along with technical advances in geographical tracking, multimedia processing, sensor networks, and RFID. While similarity search is an important functionality supporting the manipulation of probabilistic data, it raises new challenges for traditional relational databases. The problem stems from the limited effectiveness of the distance metrics supported by existing database systems. On the other hand, some complicated distance operators have proven their value through better distinguishing ability in the probabilistic domain. In this paper, we discuss the similarity search problem with the Earth Mover’s Distance, which is the most successful distance metric on probabilistic histograms and an expensive operator with cubic complexity. We present a new database approach to answer range queries and k-nearest neighbor queries on probabilistic data, on the basis of Earth Mover’s Distance. Our solution utilizes the primal-dual theory in linear programming and deploys B+-tree index structures for effective candidate pruning. Extensive experiments show that our proposal dramatically improves the scalability of probabilistic databases.
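To make the distance itself concrete, here is a minimal sketch for a special case only: one-dimensional histograms with equal total mass and unit ground distance between adjacent bins, where the Earth Mover’s Distance reduces to the sum of absolute differences between the cumulative sums. The general case needs a transportation (linear programming) solver, and this sketch does not reproduce the primal-dual, B+-tree-based pruning described in the abstract. The function name `emd_1d` is hypothetical.

```python
from itertools import accumulate


def emd_1d(p, q, tol=1e-9):
    """Earth Mover's Distance between two 1-D histograms of equal total mass.

    Assumes bins lie on a line with unit distance between neighbors, so the
    minimal transport cost equals the L1 distance between the cumulative sums.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > tol:
        raise ValueError("histograms must have the same total mass")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))
```

A range query with threshold `r` over a small collection could then be written as `[h for h in histograms if emd_1d(query, h) <= r]`, which is the brute-force scan that index-based pruning aims to avoid.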
Dissociation and Propagation for Efficient Query Evaluation over Probabilistic Databases
2010
Cited by 11 (8 self)
Queries over probabilistic databases are either safe, in which case they can be evaluated entirely in a relational database engine, or unsafe, in which case they need to be evaluated with a general-purpose inference engine at a high cost. This paper proposes a new approach by which every query is evaluated like a safe query inside the database engine, by using a new method called dissociation. A dissociated query is obtained by adding extraneous variables to some atoms until the query becomes safe. We show that the probability of the original query and that of the dissociated query correspond to two well-known scoring functions on graphs, namely graph reliability (which is #P-hard), and the propagation score (which is related to PageRank and is in PTIME): When restricted to graphs, standard query probability is graph reliability, while the dissociated probability is the propagation score. We define a propagation score for conjunctive queries without self-joins and prove (i) that it is always an upper bound for query reliability, and (ii) that both scores coincide for all safe queries. Given the widespread and successful use of graph propagation methods in practice, we argue for the dissociation method as a good and efficient way to rank probabilistic query results, especially for those queries which are highly intractable for exact probabilistic inference.
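The relationship between the two graph scores can be illustrated on a tiny directed acyclic graph. The sketch below is an assumption-laden illustration, not the paper's definitions: reliability is computed by brute-force enumeration of all possible worlds, and the propagation score uses one common formulation in which a node's score is 1 minus the product of (1 − p·score(parent)) over its incoming edges, treating incoming paths as independent. Function names and the requirement of a caller-supplied topological order are hypothetical.

```python
from itertools import product


def reliability(edges, source, target):
    """Exact probability that target is reachable from source (enumerates 2^m worlds)."""
    total = 0.0
    for keep in product([False, True], repeat=len(edges)):
        prob = 1.0
        adj = {}
        for (u, v, p), present in zip(edges, keep):
            prob *= p if present else 1.0 - p
            if present:
                adj.setdefault(u, []).append(v)
        # Depth-first search in this deterministic world.
        seen, stack = {source}, [source]
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if target in seen:
            total += prob
    return total


def propagation_score(edges, source, target, topo_order):
    """Propagate scores along a DAG, treating all incoming edges as independent (assumed formulation)."""
    score = {node: 0.0 for node in topo_order}
    score[source] = 1.0
    for node in topo_order:
        if node == source:
            continue
        miss = 1.0
        for u, v, p in edges:
            if v == node:
                miss *= 1.0 - p * score[u]
        score[node] = 1.0 - miss
    return score[target]
```

On the graph `s -> a` (probability 0.5), `a -> b`, `a -> c`, `b -> t`, `c -> t` (all probability 1), reliability is 0.5 because every path depends on the single uncertain edge, while the propagation score counts the two paths as independent and yields 0.75, consistent with the upper-bound claim in the abstract.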
Database Foundations for Scalable RDF Processing
In Reasoning Web
Cited by 9 (2 self)
Abstract. As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scal
(Approximate) uncertain skylines
In ICDT, 2011
Cited by 8 (1 self)
Given a set of points with uncertain locations, we consider the problem of computing the probability of each point lying on the skyline, that is, the probability that it is not dominated by any other input point. If each point’s uncertainty is described as a probability distribution over a discrete set of locations, we improve the best known exact solution. We also suggest why we believe our solution might be optimal. Next, we describe simple, near-linear time approximation algorithms for computing the probability of each point lying on the skyline. In addition, some of our methods can be adapted to construct data structures that can efficiently determine the probability of a query point lying on the skyline.
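The quantity being computed can be shown with a straightforward baseline, which is not the improved exact algorithm or the approximation schemes from the paper. The sketch assumes that the points are mutually independent, that each point has a discrete distribution given as (location, probability) pairs, and that smaller coordinates are better; the dominance convention and function names are assumptions.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every coordinate and strictly better in one (smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline_probabilities(points):
    """Probability that each uncertain point lies on the skyline.

    points: list of distributions, each a list of (location, probability) pairs.
    Naive baseline: for each candidate location, multiply the probabilities
    that every other (independent) point fails to dominate it.
    """
    result = []
    for i, dist_i in enumerate(points):
        total = 0.0
        for loc, p in dist_i:
            not_dominated = 1.0
            for j, dist_j in enumerate(points):
                if j == i:
                    continue
                p_dom = sum(q for other, q in dist_j if dominates(other, loc))
                not_dominated *= 1.0 - p_dom
            total += p * not_dominated
        result.append(total)
    return result
```

With a certain point at (1, 1) and a second point that is at (2, 2) or (0, 3) with probability 0.5 each, the first point is always on the skyline and the second is on it only when it appears at (0, 3).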
Efficient Probabilistic Reverse Nearest Neighbor Query Processing on Uncertain Data
Cited by 8 (1 self)
Given a query object q, a reverse nearest neighbor (RNN) query in a common certain database returns the objects having q as their nearest neighbor. A new challenge for databases is dealing with uncertain objects. In this paper we consider probabilistic reverse nearest neighbor (PRNN) queries, which return the uncertain objects having the query object as their nearest neighbor with a sufficiently high probability. We propose an algorithm for efficiently answering PRNN queries using new pruning mechanisms that take distance dependencies into account. We compare our algorithm to recently proposed state-of-the-art approaches. Our experimental evaluation shows that our approach is able to significantly outperform previous approaches. In addition, we show how our approach can easily be extended to PRkNN (where k > 1) query processing, for which there is currently no efficient solution.
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
2012
Cited by 8 (4 self)
In this paper, we investigate a novel approach of computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike prior approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules, where also the rules themselves may be uncertain, i.e., be valid with some degree of confidence. To our knowledge, this work is the first to address integrated data and confidence computations in the context of probabilistic databases by considering confidence bounds over partially evaluated query answers with first-order lineage formulas. We further extend our query processing techniques by a tool suite of scheduling strategies based on selectivity estimation and the expected impact of subgoals on the final confidence of answer candidates. Experiments with large datasets demonstrate drastic runtime improvements over both sampling and decomposition-based methods, even