Results 1 - 10 of 79
YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia
2010
"... We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy o ..."
Cited by 158 (20 self)
Abstract: We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% for the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple …
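The SPOTL idea is concrete enough to sketch as a data structure: a plain SPO triple extended with a time interval and a geo-location. The Python sketch below is a minimal illustration; the field names, types, and example values are assumptions for demonstration, not YAGO2's actual schema or serialization.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SpotlFact:
    """An SPO triple extended with (T)ime and (L)ocation, in the spirit of
    YAGO2's SPOTL representation. Hypothetical schema for illustration only."""
    subject: str
    predicate: str
    obj: str
    time: Optional[Tuple[str, str]] = None          # validity interval (begin, end), ISO dates
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude), e.g. from GeoNames

# Illustrative example (representation details are invented):
fact = SpotlFact(
    subject="Albert_Einstein",
    predicate="wasBornIn",
    obj="Ulm",
    time=("1879-03-14", "1879-03-14"),
    location=(48.4011, 9.9876),
)
```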
Scalable knowledge harvesting with high precision and high recall
In WSDM, 2011
"... Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. Stateof-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade ..."
Cited by 53 (6 self)
Abstract: Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of n-gram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypothesis space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that parallelizes all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment; we substantially outperform these prior results in terms of recall, at the same precision and with low runtimes.
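To make the n-gram-itemset notion tangible, here is a toy Python sketch that represents the text between two entity mentions as a set of n-grams and tallies occurrence counts of the resulting itemsets; in PROSPERA such statistics would feed clause weights for the MaxSat reasoner. The corpus, tokenization, and counting scheme below are simplified assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_itemset(between_text, max_n=3):
    """Toy version of an n-gram-itemset: the text between two entity mentions,
    viewed as a set of n-grams. Richer than a single surface pattern, more
    robust than exact string matching."""
    tokens = between_text.lower().split()
    items = set()
    for n in range(1, max_n + 1):
        items.update(ngrams(tokens, n))
    return frozenset(items)

# Invented mini-corpus: (entity1, text between mentions, entity2).
sentences = [
    ("Einstein", "graduated from the", "ETH"),
    ("Turing", "graduated from", "Cambridge"),
]
# Occurrence statistics over itemsets, usable both for pruning and as
# clause weights for a MaxSat-style reasoner.
pattern_counts = Counter(ngram_itemset(mid) for _, mid, _ in sentences)
```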
Unsupervised ontological induction from text
In Proc. of ACL, 2010
"... Extracting knowledge from unstructured text is a long-standing goal of NLP. Although learning approaches to many of its subtasks have been developed (e.g., parsing, taxonomy induction, information extraction), all end-to-end solutions to date require heavy supervision and/or manual engineering, limi ..."
Cited by 43 (2 self)
Abstract: Extracting knowledge from unstructured text is a long-standing goal of NLP. Although learning approaches to many of its subtasks have been developed (e.g., parsing, taxonomy induction, information extraction), all end-to-end solutions to date require heavy supervision and/or manual engineering, limiting their scope and scalability. We present OntoUSP, a system that induces and populates a probabilistic ontology using only dependency-parsed text as input. OntoUSP builds on the USP unsupervised semantic parser by jointly forming ISA and IS-PART hierarchies of lambda-form clusters. The ISA hierarchy allows more general knowledge to be learned and enables smoothing for parameter estimation. We evaluate OntoUSP by using it to extract a knowledge base from biomedical abstracts and answer questions. OntoUSP improves on the recall of USP by 47% and greatly outperforms previous state-of-the-art approaches.
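One way to picture the "smoothing over the ISA hierarchy" point: when a lambda-form cluster has sparse counts, back off to its parent cluster. The recursion below is a minimal Python sketch under that reading; the cluster/parent maps, interpolation weight, and uniform floor are all invented for illustration, and OntoUSP's actual model is a jointly learned probabilistic ontology rather than this interpolation.

```python
def smoothed_prob(cluster, event, counts, totals, parent, alpha=0.5):
    """Hierarchical-backoff sketch: estimate P(event | cluster) by interpolating
    the cluster's own MLE with its ISA parent's smoothed estimate. Toy stand-in,
    not OntoUSP's actual estimator."""
    if cluster is None:
        return 1e-9  # uniform floor at the (virtual) root
    mle = counts.get((cluster, event), 0) / max(totals.get(cluster, 0), 1)
    return alpha * mle + (1 - alpha) * smoothed_prob(
        parent.get(cluster), event, counts, totals, parent, alpha)

# Invented toy data: "eat" ISA "ingest".
counts = {("eat", "obj:pizza"): 3}
totals = {"eat": 10, "ingest": 50}
parent = {"eat": "ingest", "ingest": None}
p = smoothed_prob("eat", "obj:pizza", counts, totals, parent)
```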
From Information to Knowledge: Harvesting Entities and Relationships from Web Sources
"... There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and the progress in automatically extracting entities and relationships from semistructured as well as natural-l ..."
Cited by 26 (7 self)
Abstract: There are major trends to advance the functionality of search engines to a more expressive semantic level. This is enabled by the advent of knowledge-sharing communities such as Wikipedia and by progress in automatically extracting entities and relationships from semistructured as well as natural-language Web sources. Recent endeavors of this kind include DBpedia, EntityCube, KnowItAll, ReadTheWeb, and our own YAGO-NAGA project, among others. The goal is to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations, as well as temporal contexts, with high precision and high recall. This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting.
ClausIE: Clause-Based Open Information Extraction
"... We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text. ClausIE fundamentally differs from previous approaches in that it separates the detection of “useful ” pieces of information expressed in a sent ..."
Cited by 18 (1 self)
Abstract: We propose ClausIE, a novel, clause-based approach to open information extraction, which extracts relations and their arguments from natural language text. ClausIE fundamentally differs from previous approaches in that it separates the detection of “useful” pieces of information expressed in a sentence from their representation in terms of extractions. In more detail, ClausIE exploits linguistic knowledge about the grammar of the English language to first detect clauses in an input sentence and to subsequently identify the type of each clause according to the grammatical function of its constituents. Based on this information, ClausIE is able to generate high-precision extractions; the representation of these extractions can be flexibly customized to the underlying application. ClausIE is based on dependency parsing and a small set of domain-independent lexica, operates sentence by sentence without any post-processing, and requires no training data (whether labeled or unlabeled). Our experimental study on various real-world datasets suggests that ClausIE obtains higher recall and higher precision than existing approaches, both on high-quality text and on noisy text as found on the Web.
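The clause-typing step maps each detected clause onto one of the seven basic English clause types (SV, SVC, SVO, SVOO, SVOC, SVA, SVOA). The decision procedure below is a deliberately simplified Python sketch of that idea: in ClausIE itself the boolean flags would be derived from a dependency parse and small verb lexica, and its actual decision tree handles many more distinctions.

```python
def clause_type(has_object, has_indirect_object, has_complement,
                has_adverbial, verb_is_copular):
    """Toy clause typer over the seven basic English clause types.
    The flags are assumed to come from a dependency parse."""
    if verb_is_copular and has_complement:
        return "SVC"              # "AE remained a scientist"
    if has_object and has_indirect_object:
        return "SVOO"             # "AE gave Bohr a book"
    if has_object and has_complement:
        return "SVOC"             # "They elected him president"
    if has_object and has_adverbial:
        return "SVOA"             # "AE put the book on the shelf"
    if has_object:
        return "SVO"              # "AE admired Bohr"
    if has_adverbial:
        return "SVA"              # "AE lived in Princeton"
    return "SV"                   # "AE laughed"
```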
Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference
"... Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge ..."
Cited by 17 (5 self)
Abstract: Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge bases. Using Elementary, we have implemented a solution to the TAC-KBP challenge with quality comparable to the state of the art, as well as an end-to-end online demonstration that automatically and continuously enriches Wikipedia with structured data by reading millions of webpages on a daily basis. We describe several challenges and our solutions in designing, implementing, and deploying Elementary. In particular, we first describe the conceptual framework and architecture of Elementary, and then discuss how we address scalability challenges to enable Web-scale deployment. First, to take advantage of diverse data resources and proven techniques, Elementary employs Markov logic, a succinct yet expressive language to specify probabilistic graphical models. Elementary accepts both domain-knowledge rules and classical machine-learning models such as conditional random fields, thereby integrating different data resources and KBC techniques in a principled manner. Second, to support large-scale KBC with terabytes of data and millions of entities, Elementary …
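Markov logic scores a possible world by the total weight of the satisfied soft rules, and MAP inference picks the highest-scoring world. The toy Python sketch below illustrates that idea on two candidate facts with invented weights; Elementary itself grounds millions of atoms with a real Markov logic engine rather than enumerating worlds like this.

```python
from itertools import product

# Two candidate ground atoms (invented example).
atoms = ["BornIn(AE,Ulm)", "BornIn(AE,Bern)"]

def score(world):
    """Weighted sum of satisfied soft rules; weights are made up."""
    s = 0.0
    s += 1.5 * world["BornIn(AE,Ulm)"]   # evidence: strong extraction pattern
    s += 0.4 * world["BornIn(AE,Bern)"]  # weaker evidence for a conflicting fact
    if world["BornIn(AE,Ulm)"] and world["BornIn(AE,Bern)"]:
        s -= 3.0                         # soft rule: bornIn should be functional
    return s

# Brute-force MAP inference over all 2^n worlds (fine only for a toy).
best = max((dict(zip(atoms, vals)) for vals in product([0, 1], repeat=len(atoms))),
           key=score)
print(best)  # keeps BornIn(AE,Ulm) and rejects the conflicting fact
```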
Linking Data from RESTful Services
In Third Workshop on Linked Data on the Web (LDOW), 2010
"... One of the main goals of the Semantic Web is to extend current human-readable Web resources with semantic information encoded in a machine-processable form. One of its most successful approaches is the Web of Data which by following the principles of Linked Data have made available several data sour ..."
Cited by 15 (2 self)
Abstract: One of the main goals of the Semantic Web is to extend current human-readable Web resources with semantic information encoded in a machine-processable form. One of its most successful approaches is the Web of Data, which, by following the principles of Linked Data, has made available several data sources compliant with Semantic Web technologies, such as RDF triple stores and SPARQL endpoints. On the other hand, the set of architectural principles that underlies the human-readable Web has been conceptualized as the Representational State Transfer (REST) architectural style. In this paper, we distill REST concepts in order to provide a mechanism for describing REST (i.e., human-readable Web) resources and transforming them into semantic resources. This strategy allowed us to harvest already existing Web resources without requiring changes to the original sources or ad-hoc interfaces. It aims to contribute to the availability of more semantic datasets and to take a further step toward lowering the entry barrier to publishing semantic resources.
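As a minimal illustration of lifting a RESTful resource into RDF, the Python sketch below fetches a JSON document and maps its scalar fields to triples with rdflib. The endpoint, the example vocabulary namespace, and the key-to-property mapping are all invented assumptions; the paper's contribution is a generic description mechanism for REST resources, not a hand-written per-API script like this.

```python
import requests
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical vocabulary namespace for the lifted properties.
EX = Namespace("http://example.org/vocab#")

def lift(resource_url: str) -> Graph:
    """Fetch a (hypothetical) RESTful JSON resource and emit RDF triples,
    using the resource URL itself as the RDF subject."""
    doc = requests.get(resource_url, timeout=10).json()
    g = Graph()
    subject = URIRef(resource_url)
    for key, value in doc.items():
        if isinstance(value, (str, int, float)):  # keep the toy mapping flat
            g.add((subject, EX[key], Literal(value)))
    return g

# Usage (against a made-up endpoint):
#   print(lift("http://api.example.org/people/42").serialize(format="turtle"))
```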
Query relaxation for entity-relationship search
In ESWC, 2011
"... Abstract. Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples ..."
Cited by 12 (3 self)
Abstract: Entity-relationship-structured data is becoming more important on the Web. For example, large knowledge bases have been automatically constructed by information extraction from Wikipedia and other Web sources. Entities and relationships can be represented by subject-property-object triples in the RDF model, and can then be precisely searched by structured query languages like SPARQL. Because of their Boolean-match semantics, such queries often return too few or even no results. To improve recall, it is thus desirable to support users by automatically relaxing or reformulating queries in such a way that the intention of the original user query is preserved while returning a sufficient number of ranked results. In this paper we describe comprehensive methods to relax SPARQL-like triple-pattern queries in a fully automated manner. Our framework produces a set of relaxations by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified result list, with ranking based on any ranking function for structured queries over RDF data. Our experimental evaluation, with two different datasets about movies and books, shows the effectiveness of the automatically generated relaxations and the improved quality of query results, based on assessments collected on the Amazon Mechanical Turk platform.
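The simplest family of relaxations replaces constants in a triple pattern with variables, discounting results that come from heavier relaxations. The Python sketch below enumerates exactly that family with an invented geometric discount; the paper's framework instead weights relaxations with statistical language models over the RDF data and can also substitute semantically related constants rather than just variables.

```python
from itertools import combinations

def relaxations(pattern):
    """Yield (relaxed_pattern, weight) pairs for one triple pattern, turning
    constants into fresh variables, mildest relaxations first. Toy sketch;
    the 0.5**k discount is an arbitrary stand-in for a learned score."""
    positions = [i for i, t in enumerate(pattern) if not t.startswith("?")]
    for k in range(1, len(positions) + 1):
        for combo in combinations(positions, k):
            relaxed = list(pattern)
            for i in combo:
                relaxed[i] = f"?v{i}"
            yield tuple(relaxed), 0.5 ** k

# Example triple pattern from an (invented) movie query:
query = ("?movie", "directedBy", "James_Cameron")
for relaxed, weight in relaxations(query):
    print(weight, relaxed)
# Results of all relaxations would then be merged into one ranked list.
```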
Fine-grained Semantic Typing of Emerging Entities
In ACL, 2013
"... Methods for information extraction (IE) and knowledge base (KB) construction have been intensively studied. However, a largely under-explored case is tapping into highly dynamic sources like news streams and social media, where new entities are continuously emerging. In this paper, we present a meth ..."
Cited by 11 (5 self)
Abstract: Methods for information extraction (IE) and knowledge base (KB) construction have been intensively studied. However, a largely under-explored case is tapping into highly dynamic sources like news streams and social media, where new entities are continuously emerging. In this paper, we present a method for discovering and semantically typing newly emerging out-of-KB entities, thus improving the freshness and recall of ontology-based IE and the precision and semantic rigor of open IE. Our method is based on a probabilistic model that feeds weights into integer linear programs that leverage type signatures of relational phrases and type correlation or disjointness constraints. Our experimental evaluation, based on crowdsourced user studies, shows that our method performs significantly better than prior work.
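The core selection problem can be pictured as: choose the set of types with the highest total weight such that no two chosen types violate a disjointness constraint. The Python sketch below solves a toy instance by brute force; the type names, weights, and constraints are invented, and the paper solves the real problem with integer linear programming over weights from its probabilistic model.

```python
from itertools import combinations

# Invented candidate types with weights (would come from the probabilistic model).
weights = {"person": 2.1, "politician": 1.4, "organization": 0.9, "city": 0.2}
# Invented pairwise disjointness constraints.
disjoint = {frozenset(["person", "organization"]),
            frozenset(["person", "city"]),
            frozenset(["organization", "city"])}

def consistent(types):
    """True iff no two chosen types are declared disjoint."""
    return not any(frozenset(pair) in disjoint for pair in combinations(types, 2))

# Brute force over all non-empty subsets (an ILP solver would do this at scale).
candidates = [set(c) for r in range(1, len(weights) + 1)
              for c in combinations(weights, r)]
best = max((c for c in candidates if consistent(c)),
           key=lambda c: sum(weights[t] for t in c))
print(best)  # {'person', 'politician'}
```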
Database Foundations for Scalable RDF Processing
In Reasoning Web
"... Abstract. As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and query-ing RDF with relatio ..."
Cited by 9 (2 self)
Abstract: As more and more data is provided in RDF format, storing huge amounts of RDF data and efficiently processing queries on such data is becoming increasingly important. The first part of the lecture will introduce state-of-the-art techniques for scalably storing and querying RDF with relational systems, including alternatives for storing RDF, efficient index structures, and query optimization techniques. As centralized RDF repositories have limitations in scalability and failure tolerance, decentralized architectures have been proposed. The second part of the lecture will highlight system architectures and strategies for distributed RDF processing. We cover search engines as well as federated query processing, highlight differences to classic federated database systems, and discuss efficient techniques for distributed query processing in general and for RDF data in particular. Moreover, for the last part of this chapter, we argue that extracting knowledge from the Web is an excellent showcase – and potentially one of the biggest challenges – for the scal…
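A staple of the "index structures" part is keeping the triple set in several index permutations so that any partially bound triple pattern becomes a lookup. The Python sketch below shows the idea with three nested-dict permutations; real engines in the RDF-3X tradition maintain compressed clustered B+-tree indexes over all six permutations, which this toy deliberately omits.

```python
from collections import defaultdict

class TinyTripleStore:
    """Minimal sketch of permutation indexing for triples: the same data is
    keyed as SPO, POS, and OSP so lookups with different bound components
    are all dictionary accesses. Illustration only, not a real RDF engine."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """Answer the pattern (s, p, ?o) by direct lookup in the SPO index."""
        return self.spo[s][p]

store = TinyTripleStore()
store.add("Ulm", "locatedIn", "Germany")
print(store.objects("Ulm", "locatedIn"))  # {'Germany'}
```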