Aggregate Queries for Discrete and Continuous Probabilistic XML
Cited by 6 (5 self)
Sources of data uncertainty and imprecision are numerous. A way to handle this uncertainty is to associate probabilistic annotations with data. Many such probabilistic database models have been proposed, both in the relational and in the semi-structured setting. The latter is particularly well adapted to the management of uncertain data coming from a variety of automatic processes. An important problem, in the context of probabilistic XML databases, is that of answering aggregate queries (count, sum, avg, etc.), which has received limited attention so far. In a model unifying the various (discrete) semi-structured probabilistic models studied up to now, we present algorithms to compute the distribution of the aggregation values (exploiting some regularity properties of the aggregate functions) and probabilistic moments (especially, expectation and variance) of this distribution. We also prove the intractability of some of these problems and investigate approximation techniques. We finally extend the discrete model to a continuous one, in order to take into account continuous data values, such as measurements from sensor networks, and present algorithms to compute distribution functions and moments for various classes of continuous distributions of data values.
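To make the notion of an aggregate distribution concrete, here is a minimal Python sketch. It is not the algorithm from the paper: it assumes the simplest possible setting, in which each numeric leaf is present independently with a known probability, and it computes the distribution of sum by repeated convolution. The names sum_distribution, moments and leaves are hypothetical.

```python
from collections import defaultdict

def sum_distribution(items):
    """items: list of (probability, value) pairs, one per independent optional leaf.
    Returns a dict mapping each possible sum to its probability."""
    dist = {0: 1.0}
    for p, v in items:
        nxt = defaultdict(float)
        for s, q in dist.items():
            nxt[s] += q * (1 - p)   # leaf absent from the possible world
            nxt[s + v] += q * p     # leaf present in the possible world
        dist = dict(nxt)
    return dist

def moments(dist):
    """Expectation and variance of a finite distribution given as {value: prob}."""
    mean = sum(s * q for s, q in dist.items())
    var = sum((s - mean) ** 2 * q for s, q in dist.items())
    return mean, var

# Assumed toy data: two leaves with values 10 and 20, each present with probability 0.5.
leaves = [(0.5, 10), (0.5, 20)]
dist = sum_distribution(leaves)
mean, var = moments(dist)
print(dist, mean, var)
```

The distribution can have exponentially many support points in general, which is one reason to compute moments directly when only expectation and variance are needed.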
Generating, sampling and counting subclasses of regular tree languages
Cited by 2 (0 self)
To experimentally validate learning and approximation algorithms for XML Schema Definitions (XSDs), we need algorithms to generate uniformly at random a corpus of XSDs as well as a similarity measure to compare how closely the generated XSD resembles the target schema. In this paper, we provide the formal foundation for such a testbed. We adopt similarity measures based on counting the number of common and different trees in the two languages, and we develop the necessary machinery for computing them. We use the formalism of extended DTDs (EDTDs) to represent the unranked regular tree languages. In particular, we obtain an efficient algorithm to count the number of trees up to a certain size in an unambiguous EDTD. The latter class of unambiguous EDTDs encompasses the more familiar classes of single-type, restrained competition and bottom-up deterministic EDTDs. The single-type EDTDs correspond precisely to the core of XML Schema, while the others are strictly more expressive. We also show how constraints on the shape of allowed trees can be incorporated. As we make use of a translation into a well-known formalism for combinatorial specifications, we get for free a sampling procedure to draw members of any unambiguous EDTD. When dropping the restriction to unambiguous EDTDs, i.e. taking the full class of EDTDs into account, we show that the counting problem becomes #P-complete and provide an approximation algorithm. Finally, we discuss uniform generation of ...
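The counting idea can be illustrated with a small dynamic program. This is a sketch under stated assumptions, not the paper's construction: the schema encoding (one deterministic automaton over child types per type) and the names SCHEMA, make_counter, trees and forests are hypothetical. The program counts typed trees, which coincides with counting trees only when every tree has a single valid typing, as in the unambiguous case.

```python
from functools import lru_cache

# Hypothetical encoding: type -> (start_state, accepting_states, transitions),
# where transitions maps (state, child_type) to the next state.
SCHEMA = {
    "a": (0, {0}, {(0, "a"): 0}),  # content model a*: any number of "a" children
}

def make_counter(schema):
    @lru_cache(maxsize=None)
    def trees(t, n):
        """Number of trees with exactly n nodes whose root has type t."""
        if n < 1:
            return 0
        start, _, _ = schema[t]
        return forests(t, start, n - 1)

    @lru_cache(maxsize=None)
    def forests(t, state, m):
        """Number of child sequences with m nodes in total that take the
        automaton of type t from `state` to an accepting state."""
        _, accepting, delta = schema[t]
        total = 1 if (m == 0 and state in accepting) else 0
        for (q, child), nxt in delta.items():
            if q != state:
                continue
            for k in range(1, m + 1):  # size of the first child subtree
                total += trees(child, k) * forests(t, nxt, m - k)
        return total

    return trees

trees = make_counter(SCHEMA)
counts = [trees("a", n) for n in range(1, 6)]   # trees of size 1..5
up_to_five = sum(counts)                        # trees of size at most 5
print(counts, up_to_five)
```

For the single-type schema above, the counts are those of unranked ordered trees, so they follow the Catalan numbers.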
Agrégation de documents XML probabilistes
In Proc. BDA, 2009
Cited by 2 (1 self)
Sources of data uncertainty and imprecision are numerous. One way to handle this uncertainty is to associate probabilistic annotations with the data. Many probabilistic database models have thus been proposed, in both the relational and the semi-structured settings. The latter is particularly well suited to managing uncertain data produced by automatic processes. An important problem in the context of probabilistic XML databases is that of aggregate queries (count, sum, avg, etc.), which has not been studied so far. In a model unifying the various semi-structured probabilistic models studied to date, we present algorithms to compute the distribution of the aggregation results (exploiting certain regularity properties of the aggregate functions), as well as moments (in particular, expectation and variance) of this distribution. We also prove the intractability of some of these problems. Keywords: XML, probabilistic data, aggregation, complexity, algorithms
Probabilistic XML via Markov Chains, 2009
Cited by 1 (1 self)
We show how Recursive Markov Chains (RMCs) and their restrictions can define probabilistic distributions over XML documents, and study tractability
Probabilistic XML via Markov Chains
We show how Recursive Markov Chains (RMCs) and their restrictions can define probabilistic distributions over XML documents, and study tractability of querying over such models. We show that RMCs subsume several existing probabilistic XML models. In contrast to the latter, RMC models (i) capture probabilistic versions of XML schema languages such as DTDs, (ii) can be exponentially more succinct, and (iii) do not restrict the domain of probability distributions to be finite. We investigate RMC models for which tractability can be achieved, and identify several tractable fragments that subsume known tractable probabilistic XML models. We then look at the space of models between existing probabilistic XML formalisms and RMCs, giving results on the expressiveness and succinctness of RMC subclasses, relative both to each other and to prior formalisms.
On the expressiveness of probabilistic XML models (special issue paper)
Various known models of probabilistic XML can be represented as instantiations of the abstract notion of p-documents. In addition to ordinary nodes, p-documents have distributional nodes that specify the possible worlds and their probabilistic distribution. Particular families of p-documents are determined by the types of distributional nodes that can be used as well as by the structural constraints on the placement of those nodes in a p-document. Some of the resulting families provide natural extensions and combinations of previously studied probabilistic XML models. The focus of the paper is on the expressive power of families of p-documents. In particular, two main issues are studied. Some of the results described in this paper were reported in [1,2]. The work of Abiteboul and Senellart was supported by the Agence ...
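The way distributional nodes induce possible worlds can be sketched in a few lines of Python. The abstract does not name the node types, so the two kinds used here are assumptions for illustration: "ind" keeps each child independently with a given probability, and "mux" keeps at most one child. The tuple encoding and the names worlds, merge, distribution and doc are hypothetical.

```python
from itertools import product
from collections import defaultdict

def merge(options):
    """Combine independent choices; each option is a list of (prob, forest)."""
    result = []
    for combo in product(*options):
        prob, forest = 1.0, ()
        for p, f in combo:
            prob *= p
            forest += f
        result.append((prob, forest))
    return result

def worlds(node):
    """Return a list of (probability, forest); a forest is a tuple of trees,
    and a tree is a pair (label, forest_of_children)."""
    kind = node[0]
    if kind == "node":                      # ordinary node
        _, label, children = node
        sub = merge([worlds(c) for c in children])
        return [(p, ((label, f),)) for p, f in sub]
    if kind == "ind":                       # each child kept independently
        options = [[(1 - p, ())] + [(p * q, f) for q, f in worlds(c)]
                   for p, c in node[1]]
        return merge(options)
    if kind == "mux":                       # at most one child kept
        total = sum(p for p, _ in node[1])
        out = [(1 - total, ())] if total < 1 else []
        for p, c in node[1]:
            out += [(p * q, f) for q, f in worlds(c)]
        return out
    raise ValueError(kind)

def distribution(doc):
    """Sum the probabilities of identical possible worlds."""
    dist = defaultdict(float)
    for p, forest in worlds(doc):
        dist[forest] += p
    return dict(dist)

# Assumed toy p-document: root r, optional child a, and one of b or c (or neither).
leaf = lambda label: ("node", label, [])
doc = ("node", "r", [("ind", [(0.5, leaf("a"))]),
                     ("mux", [(0.25, leaf("b")), (0.5, leaf("c"))])])
dist = distribution(doc)
print(len(dist), sum(dist.values()))
```

Enumeration is exponential in the number of distributional nodes, so this is only useful for inspecting small examples.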
Transducing Markov Sequences (Extended Abstract)
A Markov sequence is a basic statistical model representing uncertain sequential data, and it is used within a plethora of applications, including speech recognition, image processing, computational biology, radio-frequency identification (RFID), and information extraction. The problem of querying a Markov sequence is studied under the conventional semantics of querying a probabilistic database, where queries are formulated as finite-state transducers. Specifically, the complexity of two main problems is analyzed. The first problem is that of computing the confidence (probability) of an answer. The second is the enumeration of the answers in the order of decreasing confidence (with the generation of the top-k answers as a special case), or in an approximate order thereof. In particular, it is shown that enumeration in any sub-exponential-approximate order is generally intractable (even for some fixed transducers), and a matching upper bound is obtained through a proposed heuristic. Due to this hardness, a special consideration is given to restricted (yet common) classes of transducers that extract matches of a regular expression (subject to prefix and suffix constraints), and it is shown that these classes are, indeed, significantly more tractable.
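As an illustration of confidence computation, the following Python sketch handles a simplified case and is not the paper's algorithm: the query is assumed to be a deterministic finite automaton that only accepts or rejects, rather than a transducer that produces output. It runs a forward pass over pairs of (last symbol, automaton state). The function name acceptance_probability and the example data are hypothetical.

```python
from collections import defaultdict

def acceptance_probability(initial, transitions, start, delta, accepting):
    """initial: {symbol: prob} for the first position.
    transitions: one dict per later position, {prev_symbol: {symbol: prob}}.
    delta: {(state, symbol): state} for a deterministic, complete automaton."""
    # weight[(symbol, state)]: total probability of prefixes that end in
    # `symbol` and leave the automaton in `state`.
    weight = defaultdict(float)
    for sym, p in initial.items():
        weight[(sym, delta[(start, sym)])] += p
    for step in transitions:
        nxt = defaultdict(float)
        for (prev, state), w in weight.items():
            for sym, p in step[prev].items():
                nxt[(sym, delta[(state, sym)])] += w * p
        weight = nxt
    return sum(w for (_, state), w in weight.items() if state in accepting)

# Assumed toy query: strings over {a, b} that contain at least one "a".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1}
initial = {"a": 0.5, "b": 0.5}
step = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.25, "b": 0.75}}
prob = acceptance_probability(initial, [step, step], 0, delta, {1})
print(prob)
```

The pass costs time proportional to the sequence length times the number of (symbol, state) pairs, which is the kind of polynomial behavior that makes restricted query classes attractive.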
Modeling, Querying, and Mining Uncertain XML Data
This chapter deals with data mining in uncertain XML data models, where the uncertainty typically comes from imprecise automatic processes. We first review the literature on modeling uncertain data, starting with well-studied relational models and then moving to their semistructured counterparts. We focus on a specific probabilistic XML model that can represent arbitrary finite distributions of XML documents and has been extended to also allow continuous distributions of data values. We summarize previous work on querying this uncertain data model and show how to apply the corresponding techniques to several data mining tasks, exemplified through use cases on two running examples.
Aggregating Probabilistic XML
Sources of data uncertainty and imprecision are numerous. One way to handle this uncertainty is to associate probabilistic annotations with the data. Many probabilistic database models have thus been proposed, in both the relational and the semi-structured settings. The latter is particularly well suited to managing uncertain data produced by automatic processes. An important problem in the context of probabilistic XML databases is that of aggregate queries (count, sum, avg, etc.), which has not been studied so far. In a model unifying the various semi-structured probabilistic models studied to date, we present algorithms to compute the distribution of the aggregation results (exploiting certain regularity properties of the aggregate functions), as well as moments (in particular, expectation and variance) of this distribution. We also prove the intractability of some of these problems.

