CiteSeerX
10^(10^6) worlds and beyond: Efficient representation and processing of incomplete information (2007)

by L. Antova, C. Koch, D. Olteanu
Venue: In ICDE
Results 1 - 10 of 64

Semantics of ranking queries for probabilistic data and expected ranks

by Graham Cormode, Feifei Li, Ke Yi - In Proc. of ICDE’09, 2009
Cited by 63 (1 self)
Abstract — When dealing with massive quantities of data, top-k queries are a powerful technique for returning only the k most relevant tuples for inspection, based on a scoring function. The problem of efficiently answering such ranking queries has been studied and analyzed extensively within traditional database settings. The importance of the top-k is perhaps even greater in probabilistic databases, where a relation can encode exponentially many possible worlds. There have been several recent attempts to propose definitions and algorithms for ranking queries over probabilistic data. However, these all lack many of the intuitive properties of a top-k over deterministic data. Specifically, we define a number of fundamental properties, including exact-k, containment, unique-rank, value-invariance, and stability, which are all satisfied by ranking queries on certain data. We argue that all these conditions should also be fulfilled by any reasonable definition for ranking uncertain data. Unfortunately, none of the existing definitions is able to achieve this. To remedy this shortcoming, this work proposes an intuitive new approach of expected rank. This uses the well-founded notion of the expected rank of each tuple across all possible worlds as the basis of the ranking. We are able to prove that, in contrast to all existing approaches, the expected rank satisfies all the required properties for a ranking query. We provide efficient solutions to compute this ranking across the major models of uncertain data, such as attribute-level and tuple-level uncertainty. For an uncertain relation of N tuples, the processing cost is O(N log N)—no worse than simply sorting the relation. In settings where there is a high cost for generating each tuple in turn, we provide pruning techniques based on probabilistic tail bounds that can terminate the search early and guarantee that the top-k has been found.
Finally, a comprehensive experimental study confirms the effectiveness of our approach.
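The expected-rank semantics summarized above can be illustrated with a small brute-force sketch. The relation, names, and the convention of ranking an absent tuple at position |W| are illustrative assumptions for the tuple-level model with independent tuples; this exhaustive enumeration is exponential, unlike the paper's O(N log N) algorithms:

```python
# Hypothetical uncertain relation with tuple-level uncertainty:
# (name, score, existence probability), tuples assumed independent.
tuples = [("a", 100, 0.9), ("b", 80, 0.5), ("c", 60, 0.8)]

def expected_ranks(rel):
    n = len(rel)
    exp = {name: 0.0 for name, _, _ in rel}
    # Enumerate all 2^n possible worlds (for illustration only; the
    # paper computes the same quantity without enumeration).
    for mask in range(2 ** n):
        present = [rel[i] for i in range(n) if mask >> i & 1]
        prob = 1.0
        for i in range(n):
            prob *= rel[i][2] if mask >> i & 1 else 1.0 - rel[i][2]
        names_in = {name for name, _, _ in present}
        for name, score, _ in rel:
            if name in names_in:
                # Rank = number of tuples in this world scoring higher.
                rank = sum(1 for _, s, _ in present if s > score)
            else:
                # Assumed convention: an absent tuple is ranked after
                # every present one, i.e. at position |W|.
                rank = len(present)
            exp[name] += prob * rank
    return exp

ranks = expected_ranks(tuples)
top_2 = sorted(ranks, key=ranks.get)[:2]  # smaller expected rank is better
```

Sorting by expected rank then yields the top-k; here the high-score, high-probability tuple unsurprisingly comes first.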

Efficient processing of top-k queries on uncertain databases

by Ke Yi, Feifei Li, George Kollios, Divesh Srivastava, 2007
Cited by 62 (7 self)
Abstract — This work introduces novel polynomial-time algorithms for processing top-k queries in uncertain databases, under the generally adopted model of x-relations. An x-relation consists of a number of x-tuples, and each x-tuple randomly instantiates into one tuple from one or more alternatives. Our results significantly improve the best known algorithms for top-k query processing in uncertain databases, in terms of both running time and memory usage. Focusing on the single-alternative case, the new algorithms are orders of magnitude faster.
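The x-relation model described above can be sketched by brute-force enumeration (hypothetical data; the paper's algorithms are polynomial-time and do not enumerate worlds). Each x-tuple contributes at most one of its alternatives to a possible world:

```python
from itertools import product

# Hypothetical x-relation: each x-tuple is a list of mutually exclusive
# alternatives (value, probability). Alternatives may sum to < 1, the
# remainder being the chance that the x-tuple yields no tuple at all.
x_relation = [
    [("t1", 0.6), ("t2", 0.4)],  # x-tuple with two alternatives
    [("t3", 0.7)],               # single alternative, may be absent
]

def possible_worlds(xrel):
    """Enumerate (world, probability) pairs; exponential, illustration only."""
    # Pad each x-tuple with an explicit "absent" choice for leftover mass.
    padded = []
    for xt in xrel:
        rest = 1.0 - sum(p for _, p in xt)
        padded.append(xt + ([(None, rest)] if rest > 1e-12 else []))
    worlds = []
    for choice in product(*padded):
        world = tuple(v for v, _ in choice if v is not None)
        prob = 1.0
        for _, p in choice:
            prob *= p
        worlds.append((world, prob))
    return worlds

worlds = possible_worlds(x_relation)
```

With these numbers the two x-tuples generate four worlds whose probabilities sum to 1, e.g. {t1, t3} with probability 0.6 × 0.7 = 0.42.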

BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

by Daisy Zhe Wang, Eirinaios Michelakis, Minos Garofalakis, Joseph M. Hellerstein
Cited by 60 (1 self)
Several real-world applications need to effectively manage and reason about large amounts of data that are inherently uncertain. For instance, pervasive computing applications must constantly reason about volumes of noisy sensory readings for a variety of reasons, including motion prediction and human behavior modeling. Such probabilistic data analyses require sophisticated machine-learning tools that can effectively model the complex spatio-temporal correlation patterns present in uncertain sensory data. Unfortunately, to date, most existing approaches to probabilistic database systems have relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures: Probabilistic information is typically associated with individual data tuples, with only limited or no support for effectively capturing and reasoning about complex data correlations. In this paper, we introduce BAYESSTORE, a novel probabilistic data management architecture built on the principle of handling statistical models and probabilistic inference tools as first-class citizens of the database system. Adopting a machine-learning view, BAYESSTORE employs concise statistical relational models to effectively encode the correlation patterns between uncertain data, and promotes probabilistic inference and statistical model manipulation as part of the standard DBMS operator repertoire to support efficient and sound query processing. We present BAYESSTORE’s uncertainty model based on a novel, first-order statistical model, and we redefine traditional query processing operators, to manipulate the data and the probabilistic models of the database in an efficient manner. Finally, we validate our approach, by demonstrating the value of exploiting data correlations during query processing, and by evaluating a number of optimizations which significantly accelerate query processing.

Citation Context

...gement, such tools are not targeted at the declarative management and processing of large-scale data sets. Since the early 80’s, a number of PDBSs have been proposed in an effort to address this issue [12, 5, 3, 10, 7, 4, 17, 2, 1]. Moving away from statistical approaches, this work extends the relational model with probabilistic information captured at the level of individual tuple existence (i.e., a tuple may or may not exis...
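The abstract's contrast between simplistic tuple-independent models and explicitly captured correlations can be made concrete with a toy calculation (all probabilities hypothetical):

```python
# Two sensor readings with the same marginal existence probability.
p_a, p_b = 0.6, 0.6

# Simplistic tuple-independent model: the joint is just a product.
p_both_independent = p_a * p_b

# Correlated model: an explicit joint distribution over
# (a present, b present) with the same marginals but positive correlation.
joint = {(True, True): 0.5, (True, False): 0.1,
         (False, True): 0.1, (False, False): 0.3}
p_both_correlated = joint[(True, True)]

# Sanity check: the marginal of a under the joint still equals p_a.
marginal_a = joint[(True, True)] + joint[(True, False)]
```

Both models agree on the marginals, yet the probability that both readings are present is 0.36 under independence versus 0.5 under the correlated joint, which is the kind of distinction a tuple-level-only model cannot express.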

Finding Frequent Items in Probabilistic Data

by Qin Zhang, Feifei Li, Ke Yi, 2008
Cited by 42 (5 self)
Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well. Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data that run in either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of the developed algorithms using both real and synthetic data sets with extensive experimental evaluations.

Citation Context

...). 6. RELATED WORK Many efforts have been devoted to modeling and processing uncertain data and a complete survey of this area is beyond the scope of this paper. Nevertheless, TRIO [2, 7, 36], MayBMS [5, 6] and Probabilistic Databases [13] are promising systems that are currently under development. General query processing techniques have been extensively studied under the possible worlds semantics [9, 1...
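The "straightforward" expected-frequency definition that the abstract above argues is too coarse can be sketched as follows (data set, threshold, and the independence assumption are all hypothetical; the paper's possible-world definition is more involved):

```python
# Toy probabilistic data set of (item, existence probability) pairs,
# tuples assumed independent.
data = [("x", 0.9), ("x", 0.8), ("y", 0.5), ("y", 0.3), ("z", 0.2)]
phi = 0.3  # frequency threshold, as a fraction of the expected set size

expected_total = sum(p for _, p in data)  # expected size of the data set

expected_count = {}
for item, p in data:
    expected_count[item] = expected_count.get(item, 0.0) + p

# An item is "expected-frequent" if its expected count reaches
# phi * (expected size). This ignores how frequencies are distributed
# across possible worlds, which is exactly the abstract's criticism.
expected_frequent = sorted(
    item for item, c in expected_count.items() if c >= phi * expected_total
)
```

Here only "x" qualifies; the possible-world definition would instead ask for the probability that an item is frequent in a randomly drawn world.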

World-set decompositions: Expressiveness and efficient algorithms

by Lyublena Antova, Christoph Koch, Dan Olteanu - In Proc. ICDT, 2007
Cited by 38 (12 self)
Abstract. Uncertain information is commonplace in real-world data management scenarios. The ability to represent large sets of possible instances (worlds) while supporting efficient storage and processing is an important challenge in this context. The recent formalism of world-set decompositions (WSDs) provides a space-efficient representation for uncertain data that also supports scalable processing. WSDs are complete for finite world-sets in that they can represent any finite set of possible worlds. For possibly infinite world-sets, we show that a natural generalization of WSDs precisely captures the expressive power of c-tables. We then show that several important problems are efficiently solvable on WSDs while they are NP-hard on c-tables. Finally, we give a polynomial-time algorithm for factorizing WSDs, i.e., an efficient algorithm for minimizing such representations.

Citation Context

...systems for finite sets of possible worlds. The approach of the Trio x-relations [8] relies on a form of intensional information (“lineage”) only in combination with which the formalism is strong. In [5] large sets of possible worlds are managed using world-set decompositions (WSDs). The approach is based on relational product decomposition to permit space-efficient representation. [5] describes a pr...
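The space savings of product decomposition can be sketched with a toy world-set (the components, fields, and values are hypothetical, not the WSD formalism's exact syntax): independent components multiply out to many worlds while storing only a few rows.

```python
from itertools import product

# Hypothetical decomposition: each component independently fixes the
# values of some fields; the represented world-set is the product of
# the components. Three components with 2, 3 and 2 alternatives encode
# 2 * 3 * 2 = 12 worlds while storing only 7 rows.
components = [
    [{"name": "Ann"}, {"name": "Bob"}],
    [{"age": 25}, {"age": 31}, {"age": 42}],
    [{"city": "Oxford"}, {"city": "Saarbruecken"}],
]

def worlds(decomposition):
    """Expand the decomposition into the full list of possible worlds."""
    out = []
    for choice in product(*decomposition):
        world = {}
        for part in choice:
            world.update(part)
        out.append(world)
    return out

all_worlds = worlds(components)
stored_rows = sum(len(c) for c in components)
```

With more components the gap widens multiplicatively, which is how a decomposition-based representation can cover astronomically many worlds in small space.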

Provenance for aggregate queries

by Yael Amsterdamer, Daniel Deutch, Val Tannen, 2011
Cited by 37 (11 self)
Abstract not found

From Complete to Incomplete Information and Back

by Lyublena Antova, Christoph Koch, Dan Olteanu - In Proc. SIGMOD
Cited by 37 (11 self)
Incomplete information arises naturally in numerous data management applications. Recently, several researchers have studied query processing in the context of incomplete information. Most work has combined the syntax of a traditional query language like relational algebra with a nonstandard semantics such as certain or ranked possible answers. There are now also languages with special features to deal with uncertainty. However, to the standards of the data management community, to date no language proposal has been made that can be considered a natural analog to SQL or relational algebra for the case of incomplete information. In this paper we propose such a language, World-set Algebra, which satisfies the robustness criteria and analogies to relational algebra that we expect. The language supports contemplating alternatives and can thus map from a complete database to an incomplete one comprising several possible worlds. We show that World-set Algebra is conservative over relational algebra in the sense that any query that maps from a complete database to a complete database (a complete-to-complete query) is equivalent to a relational algebra query. Moreover, we give an efficient algorithm for effecting this translation. We then study algebraic query optimization of such queries. We argue that query languages with explicit constructs for handling uncertainty allow for the more natural and simple expression of many real-world decision support queries. The results of this paper not only suggest a language for specifying queries in this way, but also allow for their efficient evaluation in any relational database management system.

Citation Context

...[18], data cleaning [3], or data exchange [11]. In the last decades the research community has shown a vivid interest in efficiently managing incomplete information viewed as a set of possible worlds [16, 12, 17, 2, 13, 19, 6, 7, 10, 11, 14, 8, 4]. When it comes to expressing queries on incomplete information, these contributions mostly consider standard languages for complete data such as relational algebra or SQL. While [16] uses a compositi...

The ORCHESTRA collaborative data sharing system

by Zachary G. Ives, Todd J. Green, Grigoris Karvounarakis, Nicholas E. Taylor, Val Tannen, Partha Pratim Talukdar, Marie Jacob, Fernando Pereira - SIGMOD Record
Cited by 34 (6 self)
Sharing structured data today requires standardizing upon a single schema, then mapping and cleaning all of the data. This results in a single queriable mediated data instance. However, for settings in which structured data is being collaboratively authored by a large community, e.g., in the sciences, there is often a lack of consensus about how it should be represented, what is correct, and which sources are authoritative. Moreover, such data is seldom static: it is frequently updated, cleaned, and annotated. The ORCHESTRA collaborative data sharing system develops a new architecture and consistency model for such settings, based on the needs of data sharing in the life sciences. In this paper we describe the basic architecture and implementation of the ORCHESTRA system, and summarize some of the open challenges that arise in this setting.

Citation Context

... supporting multiple schemas, are not flexible enough to meet life scientists’ needs for managing data importation, updates, and inconsistent data. Recent proposals for probabilistic database systems [2, 4, 11, 34] manage uncertainty within a single database instance, but do not help with integration across multiple databases or management of consistency and reconciliation of conflicts. In order to provide coll...

Database Support for Probabilistic Attributes and Tuples

by Sarvjeet Singh, Chris Mayfield, Rahul Shah, Sunil Prabhakar, Susanne Hambrusch, Jennifer Neville, Reynold Cheng - In IEEE 24th Intl. Conference on Data Engineering, 2008
Cited by 33 (6 self)
Abstract — The inherent uncertainty of data present in numerous applications such as sensor databases, text annotations, and information retrieval motivates the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for handling arbitrary probabilistic uncertain data (both discrete and continuous) natively at the database level. Our approach leads to a natural and efficient representation for probabilistic data. We develop a model that is consistent with possible worlds semantics and closed under basic relational operators. This is the first model that accurately and efficiently handles both continuous and discrete uncertainty. The model is implemented in a real database system (PostgreSQL) and the effectiveness and efficiency of our approach is validated experimentally.

Citation Context

...epresent their dependencies. Antova et al. developed a compact representation called world-set decompositions which captures the correlations in the database by representing the finite sets of worlds [17]. Dalvi et al. introduced safe plans [18], [10] in an attempt to avoid probabilistic dependencies in queries. An important area of uncertain reasoning and modeling deals with fuzzy sets [1]. The work ...

Query Processing over Incomplete Autonomous Databases

by Garrett Wolf, Hemal Khatri, Bhaumik Chokshi, Jianchun Fan, Yi Chen, Subbarao Kambhampati
Cited by 25 (5 self)
Incompleteness due to missing attribute values (aka “null values”) is very common in autonomous web databases, on which user accesses are usually supported through mediators. Traditional query processing techniques that focus on the strict soundness of answer tuples often ignore tuples with critical missing attributes, even if they wind up being relevant to a user query. Ideally we would like the mediator to retrieve such possible answers and gauge their relevance by assessing their likelihood of being pertinent answers to the query. The autonomous nature of web databases poses several challenges in realizing this objective. Such challenges include the restricted access privileges imposed on the data, the limited support for query patterns, and the bounded pool of database and network resources in the web environment. We introduce a novel query rewriting and optimization framework QPIAD that tackles these challenges. Our technique involves reformulating the user query based on mined correlations among the database attributes. The reformulated queries are aimed at retrieving the relevant possible answers in addition to the certain answers. QPIAD is able to gauge the relevance of such queries allowing tradeoffs in reducing the costs of database query processing and answer transmission. To support this framework, we develop methods for mining attribute correlations (in terms of Approximate Functional Dependencies), value distributions (in the form of Naïve Bayes Classifiers), and selectivity estimates. We present empirical studies to demonstrate that our approach is able to effectively retrieve relevant possible answers with high precision, high recall, and manageable cost.

Citation Context

...bases may contain missing values. Examples of this include imperfections in web page segmentation (as described in [11]) or imperfections in scanning and converting handwritten forms (as described in [2]). Heterogeneous Schemas: Global schemas provided by mediator systems may often contain attributes that do not appear in all of the local schemas. For example, a global schema for the used car trading ...
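The idea of gauging a likely value for a missing attribute from mined value distributions (the abstract mentions Naïve Bayes classifiers, among other tools) can be sketched as follows. The car data, attribute names, and the exact smoothing are illustrative assumptions, not QPIAD's implementation:

```python
# Toy relation of complete tuples used as training data (hypothetical).
complete_tuples = [
    {"make": "Honda", "model": "Accord", "body": "sedan"},
    {"make": "Honda", "model": "Accord", "body": "sedan"},
    {"make": "Honda", "model": "Civic", "body": "coupe"},
]

def predict_missing(rows, evidence, target):
    """Return the target value with the highest Naive Bayes score."""
    values = {r[target] for r in rows}
    best, best_score = None, -1.0
    for v in sorted(values):
        sub = [r for r in rows if r[target] == v]
        score = len(sub) / len(rows)  # prior P(target = v)
        for attr, observed in evidence.items():
            match = sum(1 for r in sub if r[attr] == observed)
            # Laplace smoothing keeps unseen combinations at nonzero mass.
            score *= (match + 1) / (len(sub) + len(values))
        if score > best_score:
            best, best_score = v, score
    return best

# A tuple with a missing "body" value but "model" observed:
guess = predict_missing(complete_tuples, {"model": "Accord"}, "body")
```

Such a score lets a mediator rank possible answers with missing attributes by how likely they are to be relevant, rather than discarding them outright.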

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University