Results 1 - 10
of
13
Indexing Correlated Probabilistic Databases
"... With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection etc., effectively managing and querying them has become an important research direction. While there is an exhaustive body of ..."
Abstract
-
Cited by 11 (1 self)
- Add to MetaCart
With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection etc., effectively managing and querying them has become an important research direction. While there is an exhaustive body of literature on querying independent probabilistic data, supporting efficient queries over large-scale, correlated databases remains a challenge. In this paper, we develop efficient data structures and indexes for supporting inference and decision support queries over such databases. Our proposed hierarchical data structure is suitable both for inmemory and disk-resident databases. We represent the correlations in the probabilistic database using a junction tree over the tuple-existence or attribute-value random variables, and use tree partitioning techniques to build an index structure over it. We show how to efficiently answer inference and aggregation queries using such an index, resulting in orders of magnitude performance benefits in most cases. In addition, we develop novel algorithms for efficiently keeping the index structure up-to-date as changes (inserts, updates) are made to the probabilistic database. We present a comprehensive experimental study illustrating the benefits of our approach to query processing in probabilistic databases.
Lineage processing over correlated probabilistic databases
- In SIGMOD Conference
, 2010
"... In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabili ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute lineages of the output tuples, and then compute the probabilities of the lineage formulas. However unlike previous work, we allow for arbitrary and complex correlations to be present in the data, captured via a forest of junction trees. We observe that evaluating even read-once (tree structured) lineages (e.g., those generated by hierarchical conjunctive queries), polynomially computable over tuple independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like Markov sequences. We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called lwidth (analogous to the notion of treewidth). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
Longitudinal Study of a Building-Scale RFID Ecosystem
- In MobiSys 2009
, 2009
"... Radio Frequency IDentification (RFID) deployments are becoming increasingly popular in both industrial and consumer-oriented settings. To effectively exploit and operate such deployments, important challenges must be addressed, from managing RFID data streams to handling limitations in reader accura ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
Radio Frequency IDentification (RFID) deployments are becoming increasingly popular in both industrial and consumer-oriented settings. To effectively exploit and operate such deployments, important challenges must be addressed, from managing RFID data streams to handling limitations in reader accuracy and coverage. Furthermore, deployments that support pervasive computing raise additional issues related to user acceptance and system utility. To better understand these challenges, we conducted a four-week study of a building-scale EPC Class-1 Generation-2 RFID deployment, the “RFID Ecosystem”, with 47 readers (160 antennas) installed throughout an 8,000 square meter building. During the study, 67 participants having over 300 tags accessed the collected RFID data through applications including an object finder and a friend tracker and several tools for managing personal data. We found that our RFID deployment produces a very manageable amount of data overall, but with orders of magnitude difference among various participants and objects. We also find that the tag detection rates tend to be low with high variance across the type of tag, participant and object. Users need expert guidance to effectively mount their tags and are encouraged by compelling applications to wear tags more frequently. Finally, probabilistic modeling and inference techniques promise to enable more complex applications by smoothing over gaps and errors in the data, but must be applied with care as they add significant computational and storage overhead.
Histograms and wavelets on probabilistic data
- In ICDE
, 2009
"... Abstract — There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurat ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
Abstract — There is a growing realization that uncertain information is a first-class citizen in modern database management. As such, we need techniques to correctly and efficiently process uncertain data in database systems. In particular, data reduction techniques that can produce concise, accurate synopses of large probabilistic relations are crucial. Similar to their deterministic relation counterparts, such compact probabilistic data synopses can form the foundation for human understanding and interactive data exploration, probabilistic query planning and optimization, and fast approximate query processing in probabilistic database systems. In this paper, we introduce definitions and algorithms for building histogram- and Haar wavelet-based synopses on probabilistic data. The core problem is to choose a set of histogram bucket boundaries or wavelet coefficients to optimize the accuracy of the approximate representation of a collection of probabilistic tuples under a given error metric. For a variety of different error metrics, we devise efficient algorithms that construct optimal or near optimal size B histogram and wavelet synopses. This requires careful analysis of the structure of the probability distributions, and novel extensions of known dynamicprogramming-based techniques for the deterministic domain. Our experiments show that this approach clearly outperforms simple ideas, such as building summaries for samples drawn from the data distribution, while taking equal or less time. I.
Lahar Demonstration: Warehousing Markovian Streams
, 2009
"... Lahar is a warehousing system for Markovian streams—a common class of uncertain data streams produced via inference on probabilistic models. Example Markovian streams include text inferred from speech, location streams inferred from GPS or RFID readings, and human activity streams inferred from sens ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
Lahar is a warehousing system for Markovian streams—a common class of uncertain data streams produced via inference on probabilistic models. Example Markovian streams include text inferred from speech, location streams inferred from GPS or RFID readings, and human activity streams inferred from sensor data. Lahar supports OLAP-style queries on Markovian stream archives by leveraging novel approximation and indexing techniques that efficiently manipulate stream probabilities. This demonstration allows users to interactively query a warehouse of imprecise text streams inferred automatically from audio podcasts. Through this interaction, the demo introduces users to the challenges of Markovian stream processing as well as technical contributions developed to address these challenges.
Approximation Trade-Offs in Markovian Stream Processing: An Empirical Study
"... Abstract — A large amount of the world’s data is both sequential and imprecise. Such data is commonly modeled as Markovian streams; examples include words/sentences inferred from raw audio signals, or discrete location sequences inferred from RFID or GPS data. The rich semantics and large volumes of ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Abstract — A large amount of the world’s data is both sequential and imprecise. Such data is commonly modeled as Markovian streams; examples include words/sentences inferred from raw audio signals, or discrete location sequences inferred from RFID or GPS data. The rich semantics and large volumes of these streams make them difficult to query efficiently. In this paper, we study the effects—on both efficiency and accuracy—of two common stream approximations. Through experiments on a realworld RFID data set, we identify conditions under which these approximations can improve performance by several orders of magnitude, with only minimal effects on query results. We also identify cases when the full rich semantics are necessary. I.
A Generic Framework for Handling Uncertain Data with Local Correlations
"... Data uncertainty is ubiquitous in many real-world applications such as sensor/RFID data analysis. In this paper, we investigate uncertain data that exhibit local correlations, that is, each uncertain object is only locally correlated with a small subset of data, while being independent of others. We ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Data uncertainty is ubiquitous in many real-world applications such as sensor/RFID data analysis. In this paper, we investigate uncertain data that exhibit local correlations, that is, each uncertain object is only locally correlated with a small subset of data, while being independent of others. We propose a generic framework for dealing with this kind of uncertain and locally correlated data, in which we investigate a classical spatial query, nearest neighbor query, on uncertain data with local correlations (namely LC-PNN). Most importantly, to enable fast LC-PNN query processing, we propose a novel filtering technique via offline pre-computations to reduce the query search space. We demonstrate through extensive experiments the efficiency and effectiveness of our approaches. 1.
Creating Probabilistic Databases from Imprecise Time-Series Data
"... Abstract—Although efficient processing of probabilistic databases is a well-established field, a wide range of applications are still unable to benefit from these techniques due to the lack of means for creating probabilistic databases. In fact, it is a challenging problem to associate concrete prob ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract—Although efficient processing of probabilistic databases is a well-established field, a wide range of applications are still unable to benefit from these techniques due to the lack of means for creating probabilistic databases. In fact, it is a challenging problem to associate concrete probability values with given time-series data for forming a probabilistic database, since the probability distributions used for deriving such probability values vary over time. In this paper, we propose a novel approach to create tuple-level probabilistic databases from (imprecise) time-series data. To the best of our knowledge, this is the first work that introduces a generic solution for creating probabilistic databases from arbitrary time series, which can work in online as well as offline fashion. Our approach consists of two key components. First, the dynamic density metrics that infer time-dependent probability distributions for time series, based on various mathematical models. Our main metric, called the GARCH metric, can robustly capture such evolving probability distributions regardless of the presence of erroneous values in a given time series. Second, the Ω–View builder that creates probabilistic databases from the probability distributions inferred by the dynamic density metrics. For efficient processing, we introduce the σ–cache that reuses the information derived from probability values generated at previous times. Extensive experiments over real datasets demonstrate the effectiveness of our approach. I.
Transducing Markov Sequences Extended Abstract
"... A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a ..."
Abstract
- Add to MetaCart
A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.

