Results 1–10 of 59
Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion
In submission, 2014
Cited by 49 (6 self)
Abstract
Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods.
Keywords: knowledge bases; information extraction; probabilistic models; machine learning
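The abstract describes fusing evidence from several extractors into one fact probability. As a hedged illustration only (the paper trains supervised classifiers over extractor features; this simpler noisy-or combination is a stand-in, and the confidences below are invented):

```python
def noisy_or(confidences):
    """Combine independent extractor confidences into one fact probability.

    Each confidence is the chance that this extractor's evidence alone makes
    the fact true; the fact is false only if every source fails independently.
    """
    p_false = 1.0
    for p in confidences:
        p_false *= 1.0 - p
    return 1.0 - p_false

# Three extractors (text, tables, annotations) each weakly support a fact.
print(round(noisy_or([0.6, 0.5, 0.3]), 2))  # 1 - 0.4*0.5*0.7 = 0.86
```

A fact with no supporting evidence gets probability 0, and adding any extractor can only raise the combined score, which matches the intuition of accumulating independent evidence.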
The Complexity of Causality and Responsibility for Query Answers and non-Answers
Cited by 43 (5 self)
Abstract
An answer to a query has a well-defined lineage expression (alternatively called how-provenance) that explains how the answer was derived. Recent work has also shown how to compute the lineage of a non-answer to a query. However, the cause of an answer or non-answer is a more subtle notion and consists, in general, of only a fragment of the lineage. In this paper, we adapt Halpern, Pearl, and Chockler’s recent definitions of causality and responsibility to define the causes of answers and non-answers to queries, and their degree of responsibility. Responsibility captures the notion of degree of causality and serves to rank potentially many causes by their relative contributions to the effect. Then, we study the complexity of computing causes and responsibilities for conjunctive queries. It is known that computing causes is NP-complete in general. Our first main result shows that all causes to conjunctive queries can be computed by a relational query which may involve negation. Thus, causality can be computed in PTIME, and very efficiently so. Next, we study computing responsibility. Here, we prove that the complexity depends on the conjunctive query and demonstrate a dichotomy between PTIME and NP-complete cases. For the PTIME cases, we give a nontrivial algorithm, consisting of a reduction to the max-flow computation problem. Finally, we prove that, even when it is in PTIME, responsibility is complete for LOGSPACE, implying that, unlike causality, it cannot be computed by a relational query.
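The responsibility notion in this abstract can be made concrete with a brute-force check (a hypothetical sketch for tiny instances, not the authors’ polynomial-time reduction to max-flow): a tuple t is a counterfactual cause under contingency G if the answer survives deleting G but not additionally deleting t, and responsibility is 1/(1+|G|) for the smallest such G.

```python
from itertools import combinations

def responsibility(tuples, holds, t):
    """Brute-force responsibility of tuple t for a query answer.

    `holds(S)` reports whether the answer is derivable from tuple set S.
    Try contingency sets G in increasing size; return 1/(1+|G|) for the
    first G such that the answer holds without G but fails without G and t.
    Returns 0.0 if t is not a cause at all.
    """
    others = [x for x in tuples if x != t]
    for k in range(len(others) + 1):
        for g in combinations(others, k):
            rest = set(tuples) - set(g)
            if holds(rest) and not holds(rest - {t}):
                return 1.0 / (1 + k)
    return 0.0

# Toy lineage: the answer holds iff (a AND b) OR c is satisfied.
holds = lambda s: ({'a', 'b'} <= s) or ('c' in s)
print(responsibility(['a', 'b', 'c'], holds, 'a'))  # contingency {c}: 0.5
```

Here 'a' alone is not counterfactual (deleting it leaves 'c' to derive the answer), but with contingency {'c'} it is, giving responsibility 1/2.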
Incomplete information and certain answers in general data models
 In PODS’11
Cited by 15 (9 self)
Abstract
While incomplete information is ubiquitous in all data models – especially in applications involving data translation or integration – our understanding of it is still not completely satisfactory. For example, even such a basic notion as certain answers for XML queries was only introduced recently, and in a way seemingly rather different from relational certain answers. The goal of this paper is to introduce a general approach to handling incompleteness, and to test its applicability in known data models such as relations and documents. The approach is based on representing degrees of incompleteness via semantics-based orderings on database objects. We use it to both obtain new results on incompleteness and to explain some previously observed phenomena. Specifically we show that certain answers for relational and XML queries are two instances of the same general concept; we describe structural properties behind the naïve evaluation of queries; answer open questions on the existence of certain answers in the XML setting; and show that previously studied ordering-based approaches were only adequate for SQL’s primitive view of nulls. We define a general setting that subsumes relations and documents to help us explain in a uniform way how to compute certain answers, and when good solutions can be found in data exchange. We also look at the complexity of common problems related to incompleteness, and generalize several results from relational and XML contexts.
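Certain answers, the central notion here, are the answers true in every possible world of an incomplete database. A hedged, purely illustrative sketch for relational tables with nulls (brute force over a finite domain; real systems use naïve evaluation instead of enumerating worlds):

```python
from itertools import product

def certain_answers(table, domain, query):
    """Certain answers to `query` over a table with nulls (None cells).

    Enumerate every valuation of the nulls over `domain` (one possible
    world each), run the query, and keep only answers true in all worlds.
    """
    nulls = [(i, j) for i, row in enumerate(table)
             for j, v in enumerate(row) if v is None]
    certain = None
    for vals in product(domain, repeat=len(nulls)):
        world = [list(row) for row in table]
        for (i, j), v in zip(nulls, vals):
            world[i][j] = v
        answers = query([tuple(r) for r in world])
        certain = answers if certain is None else certain & answers
    return certain or set()

# R = {(1, NULL), (2, 3)}; query projects the first column.
q = lambda rows: {r[0] for r in rows}
print(sorted(certain_answers([(1, None), (2, 3)], [3, 4], q)))  # [1, 2]
```

A selection involving the null column shows the difference: asking for first columns of rows whose second column equals 3 certainly returns only 2, since the null might be 4.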
On the Complexity of Query Answering over Incomplete XML Documents
Cited by 8 (6 self)
Abstract
Previous studies of incomplete XML documents have identified three main sources of incompleteness – in structural information, data values, and labeling – and addressed data complexity of answering analogs of unions of conjunctive queries under the open world assumption. It is known that structural incompleteness leads to intractability, while incompleteness in data values and labeling still permits efficient computation of certain answers. The goal of this paper is to provide a complete picture of the complexity of query answering over incomplete XML documents. We look at more expressive languages, at other semantic assumptions, and at both data and combined complexity of query answering, to see whether some well-behaved tractable classes have been missed. To incorporate non-positive features into query languages, we look at gentle ways of introducing negation via inequalities and/or Boolean combinations of positive queries, as well as the analog of relational calculus. We also look at the closed world assumption which, due to the hierarchical structure of XML, has two variations. For all combinations of languages and semantics of incompleteness we determine data and combined complexity of computing certain answers. We show that structural incompleteness leads to intractability under all assumptions, while by dropping it we can recover efficient evaluation algorithms for some queries that go beyond those previously studied.
Aggregation in Probabilistic Databases via Knowledge Compilation
Cited by 8 (3 self)
Abstract
This paper presents a query evaluation technique for positive relational algebra queries with aggregates on a representation system for probabilistic data based on the algebraic structures of semiring and semimodule. The core of our evaluation technique is a procedure that compiles semimodule and semiring expressions into so-called decomposition trees, for which the computation of the probability distribution can be done in polynomial time in the size of the tree and of the distributions represented by its nodes. We give syntactic characterisations of tractable queries with aggregates by exploiting the connection between query tractability and polynomial-time decomposition trees. A prototype of the technique is incorporated in the probabilistic database engine SPROUT. We report on performance experiments with custom datasets and TPC-H data.
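The payoff of a decomposition tree is that probability computation becomes a linear-time bottom-up pass. A hedged miniature (the paper’s trees additionally carry semimodule expressions for aggregates; this sketch handles only independent-and/or nodes over Boolean events):

```python
def prob(node):
    """Probability of a tiny decomposition tree, bottom-up.

    Leaves are ('leaf', p). Inner nodes combine statistically independent
    children: ('and', ...) requires all children true, ('or', ...) at
    least one, so each node costs time linear in its number of children.
    """
    kind, *rest = node
    if kind == 'leaf':
        return rest[0]
    ps = [prob(c) for c in rest]
    if kind == 'and':
        out = 1.0
        for p in ps:
            out *= p
        return out
    # 'or': complement of "every independent child is false"
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

tree = ('or', ('and', ('leaf', 0.5), ('leaf', 0.4)), ('leaf', 0.1))
print(round(prob(tree), 3))  # 1 - (1 - 0.5*0.4) * (1 - 0.1) = 0.28
```

The hard part, which the paper addresses, is compiling a query’s lineage into a tree whose siblings really are independent; once that holds, evaluation is this cheap.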
A Temporal-Probabilistic Database Model for Information Extraction
Cited by 7 (2 self)
Abstract
Temporal annotations of facts are a key component both for building a high-accuracy knowledge base and for answering queries over the resulting temporal knowledge base with high precision and recall. In this paper, we present a temporal-probabilistic database model for cleaning uncertain temporal facts obtained from information extraction methods. Specifically, we consider a combination of temporal deduction rules, temporal consistency constraints and probabilistic inference based on the common possible-worlds semantics with data lineage, and we study the theoretical properties of this data model. We further develop a query engine which is capable of scaling to very large temporal knowledge bases, with nearly interactive query response times over millions of uncertain facts and hundreds of thousands of grounded rules. Our experiments over two real-world datasets demonstrate the increased robustness of our approach compared to related techniques based on constraint solving via Integer Linear Programming (ILP) and probabilistic inference via Markov Logic Networks (MLNs). We are also able to show that our runtime performance is more than competitive with current ILP solvers and the fastest available probabilistic (but non-temporal) database engines.
Incremental knowledge base construction using DeepDive
In Proceedings of the VLDB Endowment (PVLDB), 2015
Cited by 7 (3 self)
Abstract
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the trade-off space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
A dichotomy in the complexity of deletion propagation with functional dependencies (extended version)
Accessible from the author’s home page, 2012
Cited by 6 (1 self)
Abstract
A classical variant of the view-update problem is deletion propagation, where tuples from the database are deleted in order to realize a desired deletion of a tuple from the view. This operation may cause a (sometimes necessary) side effect—deletion of additional tuples from the view, besides the intentionally deleted one. The goal is to propagate deletion so as to maximize the number of tuples that remain in the view. In this paper, a view is defined by a self-join-free conjunctive query (sjfCQ) over a schema with functional dependencies. A condition is formulated on the schema and view definition at hand, and the following dichotomy in complexity is established. If the condition is met, then deletion propagation is solvable in polynomial time by an extremely simple algorithm (very similar to the one observed by Buneman et al.). If the condition is violated, then the problem is NP-hard, and it is even hard to realize an approximation ratio that is better than some constant; moreover, deciding whether there is a side-effect-free solution is NP-complete. This result generalizes a recent result by Kimelfeld et al., who ignore functional dependencies. For the class of sjfCQs, it also generalizes a result by Cong et al., stating that deletion propagation is in polynomial time if keys are preserved by the view.
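The optimization problem in this abstract is easy to state by brute force (a hypothetical exponential-time sketch for a toy join view, nothing like the paper’s simple polynomial-time algorithm for the tractable cases):

```python
from itertools import chain, combinations

def best_propagation(source, view, target):
    """Brute-force deletion propagation with minimal side effects.

    Choose source tuples to delete so `target` disappears from
    view(remaining source) while as many other view tuples as possible
    survive. Tries every nonempty deletion set and keeps the best.
    """
    subsets = chain.from_iterable(
        combinations(source, k) for k in range(1, len(source) + 1))
    best = None
    for dels in subsets:
        remaining = view([t for t in source if t not in dels])
        if target not in remaining and (best is None
                                        or len(remaining) > len(best)):
            best = remaining
    return best

# View: join of R(a, b) and S(b, c), projected to (a, c).
R = [('r', 1), ('r', 2), ('s', 2)]
S = [(1, 'x'), (2, 'y')]
source = [('R', t) for t in R] + [('S', t) for t in S]

def view(src):
    r = [t for rel, t in src if rel == 'R']
    s = [t for rel, t in src if rel == 'S']
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

print(best_propagation(source, view, ('r', 'y')))
```

Deleting R('r', 2) removes the target ('r', 'y') while keeping two view tuples, whereas deleting S(2, 'y') would also wipe out ('s', 'y'); the brute force finds the former.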
Ranking Query Answers in Probabilistic Databases: Complexity and Efficient Algorithms
Cited by 6 (1 self)
Abstract
In many applications of probabilistic databases, the probabilities are mere degrees of uncertainty in the data and are not otherwise meaningful to the user. Often, users care only about the ranking of answers in decreasing order of their probabilities or about a few most likely answers. In this paper, we investigate the problem of ranking query answers in probabilistic databases. We give a dichotomy for ranking in case of conjunctive queries without repeating relation symbols: it is either in polynomial time or #P-hard. Surprisingly, our syntactic characterisation of tractable queries is not the same as for probability computation. The key observation is that there are queries for which probability computation is #P-hard, yet ranking can be computed in polynomial time. This is possible whenever probability computation for distinct answers has a common factor that is hard to compute but irrelevant for ranking. We complement this tractability analysis with an effective ranking technique for conjunctive queries. Given a query, we construct a share plan, which exposes subqueries whose probability computation can be shared or ignored across query answers. Our technique combines share plans with incremental approximate probability computation of subqueries. We implemented our technique in the SPROUT query engine and report on performance gains of orders of magnitude over Monte Carlo simulation using FPRAS and exact probability computation based on knowledge compilation.
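The key observation above reduces to an order-preservation fact. A hedged toy illustration (the answer names and factors are invented; the paper’s share plans discover such factorings syntactically): if every answer’s probability factors as P(common) × p_a with the same common event, ranking needs only the cheap per-answer factors.

```python
def rank_by_cheap_factor(per_answer):
    """Rank answers using only the per-answer factor of their probability.

    If each answer's probability is P(common) * per_answer[a] with a shared
    (possibly #P-hard) common factor, multiplying by that positive constant
    preserves the order, so the ranking never needs to compute it.
    """
    return sorted(per_answer, key=per_answer.get, reverse=True)

factors = {'alice': 0.3, 'bob': 0.9, 'carol': 0.5}
ranking = rank_by_cheap_factor(factors)
print(ranking)  # ['bob', 'carol', 'alice']

# Sanity check: any shared positive factor (say 0.37) keeps the same order.
full = {a: 0.37 * p for a, p in factors.items()}
assert sorted(full, key=full.get, reverse=True) == ranking
```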
Symmetric Weighted First-Order Model Counting
In Proc. of PODS’15, 2015
Cited by 6 (3 self)
Abstract
First-order model counting emerged recently as a novel reasoning task, at the core of efficient algorithms for probabilistic logics. We present a Skolemization algorithm for model counting problems that eliminates existential quantifiers from a first-order logic theory without changing its weighted model count. For certain subsets of first-order logic, lifted model counters were shown to run in time polynomial in the number of objects in the domain of discourse, where propositional model counters require exponential time. However, these guarantees apply only to Skolem normal form theories (i.e., no existential quantifiers), as the presence of existential quantifiers reduces lifted model counters to propositional ones. Since textbook Skolemization is not sound for model counting, these restrictions precluded efficient model counting for directed models, such as probabilistic logic programs, which rely on existential quantification. Our Skolemization procedure extends the applicability of first-order model counters to these representations. Moreover, it simplifies the design of lifted model counting algorithms.
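For readers unfamiliar with the underlying task: weighted model counting sums, over all satisfying assignments, the product of per-literal weights. A hedged propositional brute-force sketch (the paper’s contribution is lifting this to first-order theories with existential quantifiers, which this toy does not attempt):

```python
from itertools import product

def wmc(variables, weight, formula):
    """Brute-force weighted model counting over propositional variables.

    Enumerate every truth assignment; for each model of `formula`, add
    the product of its literal weights weight[(var, value)].
    """
    total = 0.0
    for values in product([False, True], repeat=len(variables)):
        model = dict(zip(variables, values))
        if formula(model):
            w = 1.0
            for v, val in model.items():
                w *= weight[(v, val)]
            total += w
    return total

# Theory: p OR q, with unit weight on false literals.
weight = {('p', True): 2.0, ('p', False): 1.0,
          ('q', True): 3.0, ('q', False): 1.0}
print(wmc(['p', 'q'], weight, lambda m: m['p'] or m['q']))  # 2 + 3 + 6 = 11.0
```

The exponential loop over assignments is exactly what lifted model counters avoid for symmetric first-order theories, and what unsound Skolemization would force them back into.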