From Information to Knowledge: Harvesting Entities and Relationships from Web Sources

by Gerhard Weikum, Martin Theobald
Results 1 - 10 of 26

YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia

by Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum , 2010
"... We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy o ..."
Abstract - Cited by 158 (20 self) - Add to MetaCart
We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space.
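The SPOTL idea can be made concrete with a minimal sketch: extend a plain subject-predicate-object triple with optional temporal and spatial anchors. The following Python snippet is illustrative only; the class name, fields, and example fact are assumptions for this listing, not YAGO2's actual data model or API.

    # Minimal sketch (not YAGO2's real schema): an SPO triple extended to a
    # SPOTL tuple with a time interval and a location anchor.
    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass(frozen=True)
    class SpotlFact:
        subject: str                       # entity identifier
        predicate: str                     # relation name
        obj: str                           # entity or literal
        valid_from: Optional[date] = None  # temporal anchor (begin)
        valid_until: Optional[date] = None # temporal anchor (end)
        location: Optional[str] = None     # spatial anchor (e.g., a GeoNames entity)

    # Hypothetical example fact carrying both a temporal and a spatial anchor.
    fact = SpotlFact(
        subject="Angela_Merkel",
        predicate="holdsPosition",
        obj="Chancellor_of_Germany",
        valid_from=date(2005, 11, 22),
        location="Berlin",
    )
    print(fact)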

Scalable knowledge harvesting with high precision and high recall

by Ula Nakashole, Martin Theobald, Gerhard Weikum - In WSDM , 2011
"... Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. Stateof-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade ..."
Abstract - Cited by 53 (6 self) - Add to MetaCart
Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of n-gram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
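The n-gram-itemset notion mentioned above can be illustrated with a small toy: represent the text between two entity mentions as the set of its word-level n-grams, so that two noisy surface patterns can still be compared by set overlap. This is a hedged sketch, not PROSPERA's implementation; the function names and example phrases are invented.

    def ngram_itemset(tokens, max_n=3):
        """Return the set of all word-level 1..max_n-grams of a token sequence."""
        grams = set()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                grams.add(" ".join(tokens[i:i + n]))
        return grams

    def itemset_overlap(a, b):
        """Jaccard overlap between two n-gram itemsets, a simple noisy-match score."""
        return len(a & b) / (len(a | b) or 1)

    # Two hypothetical surface patterns observed between entity pairs.
    p1 = ngram_itemset("received his doctorate in physics from".split())
    p2 = ngram_itemset("obtained a doctorate in physics from".split())
    print(round(itemset_overlap(p1, p2), 2))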

Citation Context

...ion Major advances in information extraction [25, 14] and the success and high quality of knowledge-sharing communities like Wikipedia have enabled the automated construction of large knowledge bases [1, 32]. Notable efforts along these lines include ground-breaking academic projects such as opencyc.org, dbpedia.org [4], knowitall ...

Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion

by Xin Luna Dong, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang - In submission , 2014
"... Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Mi-crosoft’s Satori, and Google’s Knowledge Graph. To in-crease the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous ap-proaches have pr ..."
Abstract - Cited by 49 (6 self) - Add to MetaCart
Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods. Keywords: knowledge bases; information extraction; probabilistic models; machine learning
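As a toy illustration of fusing extractor signals with prior knowledge through supervised learning, the sketch below trains a logistic regression over per-extractor confidence scores plus a KB prior and outputs a probability of fact correctness. The feature layout and the tiny labeled set are invented; the real Knowledge Vault pipeline is far larger and uses additional models and calibration steps.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: [text extractor score, table extractor score, page-structure score, KB prior]
    X_train = np.array([
        [0.9, 0.8, 0.0, 0.7],   # labeled correct, strong agreement
        [0.2, 0.0, 0.1, 0.1],   # labeled wrong, weak evidence
        [0.7, 0.0, 0.6, 0.9],   # correct
        [0.4, 0.3, 0.0, 0.0],   # wrong, no prior support
        [0.8, 0.9, 0.7, 0.6],   # correct
        [0.1, 0.2, 0.0, 0.2],   # wrong
    ])
    y_train = np.array([1, 0, 1, 0, 1, 0])  # human-labeled fact correctness

    fuser = LogisticRegression().fit(X_train, y_train)

    # Probability that a new candidate fact is true, given its fused evidence.
    candidate = np.array([[0.6, 0.0, 0.5, 0.8]])
    print(fuser.predict_proba(candidate)[0, 1])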

Citation Context

...an approach should automatically extract facts from the whole Web, to augment the knowledge we collect from human input and structured data sources. Unfortunately, standard methods for this task (cf. [44]) often produce very noisy, unreliable facts. To alleviate the amount of noise in the automatically extracted data, the new approach should automatically leverage already-cataloged knowledge to build ...

Temporal Information Retrieval: Challenges and Opportunities

by Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates, Michael Gertz - In: 1st Temporal Web Analytics Workshop at WWW , 2011
"... Time is an important dimension of any information space. It can be very useful for a wide range of information retrieval tasks such as document exploration, similarity search, summarization, and clustering. Traditionally, information retrieval applications do not take full advantage of all the tempo ..."
Abstract - Cited by 30 (5 self) - Add to MetaCart
Time is an important dimension of any information space. It can be very useful for a wide range of information retrieval tasks such as document exploration, similarity search, summarization, and clustering. Traditionally, information retrieval applications do not take full advantage of all the temporal information embedded in documents to provide alternative search features and user experience. However, in the last few years there has been exciting work on analyzing and exploiting temporal information for the presentation, organization, and in particular the exploration of search results. In this paper, we review the current research trends and present a number of interesting applications along with open problems. The goal is to discuss interesting areas and future work for this exciting field of information management.
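One simple way to exploit temporal information embedded in documents, of the kind surveyed in this paper, is to extract explicit year mentions and bucket retrieved documents by year to drive a timeline-style presentation of results. The sketch below is a generic illustration under that assumption, not a method from the paper; real temporal taggers handle far richer expressions than a year regex.

    import re
    from collections import defaultdict

    YEAR = re.compile(r"\b(1[89]\d{2}|20\d{2})\b")  # crude matcher for explicit years

    docs = {  # hypothetical snippets of retrieved documents
        "d1": "The treaty was signed in 1992 and revised in 2007.",
        "d2": "Early experiments date back to 1965.",
        "d3": "The 2007 financial crisis reshaped the market.",
    }

    timeline = defaultdict(set)
    for doc_id, text in docs.items():
        for year in YEAR.findall(text):
            timeline[int(year)].add(doc_id)

    # Year-bucketed view of the result set, e.g., for a timeline widget.
    for year in sorted(timeline):
        print(year, sorted(timeline[year]))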

DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference

by Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik
"... We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this app ..."
Abstract - Cited by 17 (1 self) - Add to MetaCart
We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this approach is scalability, i.e., how to deal with terabytes of imperfect data efficiently. We describe how we address the scalability challenges to achieve web-scale KBC and the lessons we have learned from building DeepDive.

Citation Context

...lity, these systems leverage a wide variety of data resources and KBC techniques. A crucial challenge that these systems face is coping with imperfect or conflicting information from multiple sources [3, 13]. To address this challenge, we present an end-to-end KBC system called DeepDive. DeepDive went live in January 2012 after processing the 500M English web pages in the ClueWeb09 corpus, and since...

Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference

by Feng Niu, Ce Zhang, Christopher Ré, Jude Shavlik
"... Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge ..."
Abstract - Cited by 17 (5 self) - Add to MetaCart
Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge bases. Using Elementary, we have implemented a solution to the TAC-KBP challenge with quality comparable to the state of the art, as well as an end-to-end online demonstration that automatically and continuously enriches Wikipedia with structured data by reading millions of webpages on a daily basis. We describe several challenges and our solutions in designing, implementing, and deploying Elementary. In particular, we first describe the conceptual framework and architecture of Elementary, and then discuss how we address scalability challenges to enable Web-scale deployment. First, to take advantage of diverse data resources and proven techniques, Elementary employs Markov logic, a succinct yet expressive language to specify probabilistic graphical models. Elementary accepts both domain-knowledge rules and classical machine-learning models such as conditional random fields, thereby integrating different data resources and KBC techniques in a principled manner. Second, to support large-scale KBC with terabytes of data and millions of entities, Elementary
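To give a flavor of the Markov logic approach described here, the toy below scores one candidate "possible world" by summing the weights of satisfied weighted rules (a domain-knowledge implication plus a soft extractor signal) and exponentiating the sum. The predicates, rules, and weights are invented for illustration; Elementary grounds full first-order programs and performs actual statistical inference.

    import math

    # A candidate "possible world": which ground facts are assumed true.
    world = {
        ("bornIn", "Einstein", "Ulm"): True,
        ("cityIn", "Ulm", "Germany"): True,
        ("bornInCountry", "Einstein", "Germany"): True,
    }

    def holds(fact):
        return world.get(fact, False)

    # Each rule: (weight, predicate over the world that is True when the rule is satisfied).
    rules = [
        # domain-knowledge rule: bornIn(x, c) and cityIn(c, y) implies bornInCountry(x, y)
        (2.0, lambda: (not (holds(("bornIn", "Einstein", "Ulm"))
                            and holds(("cityIn", "Ulm", "Germany"))))
                      or holds(("bornInCountry", "Einstein", "Germany"))),
        # soft evidence from a statistical extractor (e.g., a CRF mention classifier)
        (1.3, lambda: holds(("bornIn", "Einstein", "Ulm"))),
    ]

    score = sum(weight for weight, satisfied in rules if satisfied())
    print("unnormalized world weight:", math.exp(score))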

Citation Context

...uction Knowledge-base construction (KBC) is the process of populating a knowledge base (KB) with facts (or assertions) extracted from text. It has recently received tremendous interest from academia (Weikum & Theobald, 2010), e.g., CMU’s NELL (Carlson et al., 2010; Lao, Mitchell, & Cohen, 2011), MPI’s YAGO (Kasneci, Ramanath, Suchanek, & Weikum, 2008; Nakashole, Theobald, & Weikum, 2011), and from industry (Fang, Sarma,...

Incremental knowledge base construction using DeepDive

by Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, Christopher Ré - Proceedings of the VLDB Endowment (PVLDB) , 2015
"... Populating a database with unstructured information is a long-standing problem in industry and research that encom-passes problems of extraction, cleaning, and integration. Re-cent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we desc ..."
Abstract - Cited by 7 (3 self) - Add to MetaCart
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
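The "simple rule-based optimizer" mentioned in the abstract chooses between the two incremental-inference strategies; the sketch below only illustrates the general shape of such a chooser, picking a strategy based on how much of the factor graph an update touches. The statistic and the threshold are assumptions for this listing, not DeepDive's actual rules.

    def choose_incremental_strategy(num_changed_factors: int,
                                    num_total_factors: int,
                                    touched_fraction_threshold: float = 0.1) -> str:
        """Return which incremental inference method a KBC engine might prefer."""
        touched = num_changed_factors / max(num_total_factors, 1)
        if touched <= touched_fraction_threshold:
            return "sampling-based (reuse materialized samples, resample locally)"
        return "variational (rebuild an approximate model over the changed region)"

    print(choose_incremental_strategy(num_changed_factors=5_000,
                                      num_total_factors=2_000_000))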

Citation Context

...ceived renewed interest in the database community through high-profile start-up companies (e.g., Tamr and Trifacta), established companies like IBM’s Watson [7, 16], and a variety of research efforts [11, 25, 28, 36, 40]. At the same time, communities such as natural language processing and machine learning are attacking similar problems under the name knowledge base construction (KBC) [5, 14, 23]. While different co...

An Efficient Publish/Subscribe Index for E-Commerce Databases

by Dongxiang Zhang, Chee-Yong Chan, Kian-Lee Tan
"... Many of today’s publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to ..."
Abstract - Cited by 5 (2 self) - Add to MetaCart
Many of today’s publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse database that existing pub/sub systems can no longer support effectively. In this paper, we propose an efficient in-memory index that is scalable to the volume and update of subscriptions, the arrival rate of events and the variety of subscribable attributes. The index is also extensible to support complex scenarios such as prefix/suffix filtering and regular expression matching. We conduct extensive experiments on synthetic datasets and two real datasets (AOL query log and eBay products). The results demonstrate the superiority of our index over state-of-the-art methods: our index incurs orders of magnitude less index construction time, consumes a small amount of memory and performs event matching efficiently.
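The core matching technique behind many in-memory pub/sub indexes can be sketched compactly: an inverted index maps (attribute, value) pairs to subscription ids, and an event matches a conjunctive subscription once all of its predicates are hit. The toy below illustrates that general counting-based scheme, not the specific index proposed in this paper; the subscriptions and event are invented.

    from collections import defaultdict

    subscriptions = {  # hypothetical conjunctive subscriptions over item attributes
        "s1": {"brand": "apple", "condition": "new"},
        "s2": {"brand": "apple"},
        "s3": {"brand": "lenovo", "ram": "16gb"},
    }

    # Inverted index: (attribute, value) -> set of subscription ids.
    index = defaultdict(set)
    for sid, predicates in subscriptions.items():
        for pair in predicates.items():
            index[pair].add(sid)

    def match(event):
        """Ids of subscriptions whose (attribute, value) predicates are all met by the event."""
        hits = defaultdict(int)
        for pair in event.items():
            for sid in index.get(pair, ()):
                hits[sid] += 1
        return sorted(sid for sid, n in hits.items() if n == len(subscriptions[sid]))

    print(match({"brand": "apple", "condition": "new", "ram": "8gb"}))  # ['s1', 's2']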

Citation Context

...ould be of utmost importance to business dealers, e.g., to monitor the potential competitors within an area. • Web Tables and Semantic RDF Database. In recent years, harvesting knowledge from the web [11, 24, 25, 28] has attracted more and more attention. For example, Google’s Freebase [1] has collected and published more than 39 million real world entities, with more than 140,000 attributes. These structured or ...

Learning relatedness measures for entity linking

by Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Salvatore Trani - In Proceedings of CIKM , 2013
"... Entity Linking is the task of detecting, in text documents, relevant mentions to entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowl-edge base. The most important of such features is entity relate ..."
Abstract - Cited by 4 (2 self) - Add to MetaCart
Entity Linking is the task of detecting, in text documents, relevant mentions to entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowledge base. The most important of such features is entity relatedness. Indeed, we argue that these algorithms benefit from maximizing the relatedness among the relevant entities selected for annotation, since this minimizes errors in disambiguating entity-linking. The definition of an effective relatedness function is thus a crucial point in any entity-linking algorithm. In this paper we address the problem of learning high-quality entity relatedness functions. First, we formalize the problem of learning entity relatedness as a learning-to-rank problem. We propose a methodology to create reference datasets on the basis of manually annotated data. Finally, we show that our machine-learned entity relatedness function performs better than other relatedness functions previously proposed, and, more importantly, improves the overall performance of different state-of-the-art entity-linking algorithms.
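Casting relatedness learning as learning-to-rank can be illustrated with a pairwise transform: each judgment that one candidate entity is more related than another yields a training example whose features are the difference of the two candidates' feature vectors. The features, data, and linear learner below are invented for illustration and are not the paper's feature set or ranking model.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical per-candidate features: [in-link overlap, category overlap, co-occurrence]
    more_related = np.array([[0.8, 0.6, 0.9],
                             [0.7, 0.5, 0.4],
                             [0.9, 0.8, 0.7]])
    less_related = np.array([[0.2, 0.1, 0.3],
                             [0.3, 0.4, 0.1],
                             [0.1, 0.2, 0.2]])

    # Pairwise transform: each preference becomes two signed difference examples.
    X = np.vstack([more_related - less_related, less_related - more_related])
    y = np.array([1] * len(more_related) + [0] * len(less_related))

    ranker = LogisticRegression().fit(X, y)

    def relatedness_score(features):
        """Learned linear relatedness score; higher means 'more related'."""
        return float(ranker.decision_function([features])[0])

    print(relatedness_score([0.85, 0.7, 0.8]) > relatedness_score([0.2, 0.3, 0.1]))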

Citation Context

... 1. INTRODUCTION Document enriching is today a fundamental technique to improve the quality of several text analysis tasks, including Web search [21, 24]. In this work we specifically address the Entity Linking Problem: given a plain text, the entity linking task aims at identifying the small fragments of text (in the following interchangeably called ...

Automatic Extraction of Facts, Relations, and Entities for Web-Scale Knowledge Base Population

by Ndapandula T. Nakashole , 2012 (doctoral dissertation; dean: Prof. Mark Groves; Prof. Gerhard Weikum; second reviewer: Dr. Fabian Suchanek; third reviewer: Prof. Tom M. Mitchell; Dr. Rainer Gemulla; chairman: Prof. Jens Dittrich)
"... I hereby solemnly declare that this work was created on my own, using only the resources and tools mentioned. Information taken from other sources or indirectly adopted data and concepts are explicitly acknowl-edged with references to the respective sources. This work has not been submitted in a pro ..."
Abstract - Cited by 3 (0 self) - Add to MetaCart
I hereby solemnly declare that this work was created on my own, using only the resources and tools mentioned. Information taken from other sources or indirectly adopted data and concepts are explicitly acknowledged with references to the respective sources. This work has not been submitted in a process for obtaining an academic degree elsewhere in the same or in similar form. Affidavit (Eidesstattliche Versicherung): I hereby declare in lieu of an oath that I have produced the present work independently and without the use of aids other than those indicated. Content taken from other sources or indirectly adopted ...