Results 1 - 10 of 26
YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia
2010
"... We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy o ..."
Abstract - Cited by 158 (20 self)
We present YAGO2, an extension of the YAGO knowledge base, in which entities, facts, and events are anchored in both time and space. YAGO2 is built automatically from Wikipedia, GeoNames, and WordNet. It contains 80 million facts about 9.8 million entities. Human evaluation confirmed an accuracy of 95% of the facts in YAGO2. In this paper, we present the extraction methodology, the integration of the spatio-temporal dimension, and our knowledge representation SPOTL, an extension of the original SPO-triple model to time and space.
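The SPOTL representation can be illustrated with a minimal sketch in Python: a plain subject-predicate-object triple extended with an optional time interval and an optional location. The field names and the sample facts below are illustrative assumptions, not the actual YAGO2 schema or data.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SpotlFact:
    """A subject-predicate-object triple anchored in time and space (illustrative only)."""
    subject: str
    predicate: str
    obj: str
    time: Optional[Tuple[str, str]] = None          # (begin, end) as ISO dates, if known
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude), if known

# Hypothetical facts in the spirit of SPOTL; the values are illustrative.
facts = [
    SpotlFact("Albert_Einstein", "wasBornIn", "Ulm",
              time=("1879-03-14", "1879-03-14"), location=(48.40, 9.99)),
    SpotlFact("Albert_Einstein", "hasWonPrize", "Nobel_Prize_in_Physics",
              time=("1921-01-01", "1921-12-31")),
]

def facts_in_year(fs, year):
    """Simple temporal filter: facts whose validity interval covers the given year."""
    return [f for f in fs if f.time and f.time[0][:4] <= str(year) <= f.time[1][:4]]

print(facts_in_year(facts, 1879))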
Scalable knowledge harvesting with high precision and high recall
In WSDM, 2011
"... Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. Stateof-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade ..."
Abstract - Cited by 53 (6 self)
Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data. This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of n-gram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
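As a rough illustration of the n-gram-itemset idea (not PROSPERA's implementation), the connecting phrase between two entity mentions can be represented as the set of word n-grams it contains, so that differently worded but related surface patterns still overlap; the overlap and occurrence statistics of such grams are the kind of quantities that could weight clauses for a constraint reasoner. The phrases below are invented examples.

def ngram_itemset(phrase, max_n=2):
    """Represent the phrase connecting two entity mentions as a set of word n-grams."""
    tokens = phrase.lower().split()
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams

# Two surface patterns that express the same relation with different wording.
p1 = ngram_itemset("graduated with honors from")
p2 = ngram_itemset("graduated from")

shared = p1 & p2
print(shared)                        # e.g. {'graduated', 'from'}
print(len(shared) / len(p1 | p2))    # Jaccard overlap as a crude pattern-similarity signal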
Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion
In submission, 2014
"... Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Mi-crosoft’s Satori, and Google’s Knowledge Graph. To in-crease the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous ap-proaches have pr ..."
Abstract - Cited by 49 (6 self)
Recent years have witnessed a proliferation of large-scale knowledge bases, including Wikipedia, Freebase, YAGO, Microsoft’s Satori, and Google’s Knowledge Graph. To increase the scale even further, we need to explore automatic methods for constructing knowledge bases. Previous approaches have primarily focused on text-based extraction, which can be very noisy. Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. We employ supervised machine learning methods for fusing these distinct information sources. The Knowledge Vault is substantially bigger than any previously published structured knowledge repository, and features a probabilistic inference system that computes calibrated probabilities of fact correctness. We report the results of multiple studies that explore the relative utility of the different information sources and extraction methods. Keywords: Knowledge bases; information extraction; probabilistic models; machine learning.
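A minimal sketch of the fusion step, under the assumption that each candidate triple comes with confidence scores from several extractors plus a prior score derived from an existing repository: a supervised classifier can combine these signals into a single probability of correctness. The feature layout and the toy training data are invented for illustration and use scikit-learn rather than the models described in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [text_extractor_conf, table_extractor_conf, dom_extractor_conf, prior_from_kb]
# Labels: 1 if the candidate triple was judged correct, 0 otherwise (toy data, invented).
X = np.array([
    [0.9, 0.0, 0.8, 0.7],
    [0.2, 0.1, 0.0, 0.1],
    [0.6, 0.7, 0.0, 0.9],
    [0.1, 0.0, 0.2, 0.0],
    [0.8, 0.5, 0.6, 0.3],
    [0.3, 0.0, 0.1, 0.2],
])
y = np.array([1, 0, 1, 0, 1, 0])

# Logistic regression outputs probabilities whose calibration can then be checked
# (for example with reliability diagrams) against held-out human judgments.
fusion = LogisticRegression().fit(X, y)

candidate = np.array([[0.7, 0.4, 0.0, 0.8]])   # signals for a new candidate triple
print(fusion.predict_proba(candidate)[0, 1])   # estimated probability the fact is true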
Temporal Information Retrieval: Challenges and Opportunities
In: 1st Temporal Web Analytics Workshop at WWW, 2011
"... Time is an important dimension of any information space. It can be very useful for a wide range of information retrieval tasks such as document exploration, similarity search, summarization, and clustering. Traditionally, information retrieval applications do not take full advantage of all the tempo ..."
Abstract - Cited by 30 (5 self)
Time is an important dimension of any information space. It can be very useful for a wide range of information retrieval tasks such as document exploration, similarity search, summarization, and clustering. Traditionally, information retrieval applications do not take full advantage of all the temporal information embedded in documents to provide alternative search features and user experience. However, in the last few years there has been exciting work on analyzing and exploiting temporal information for the presentation, organization, and in particular the exploration of search results. In this paper, we review the current research trends and present a number of interesting applications along with open problems. The goal is to discuss interesting areas and future work for this exciting field of information management.
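One concrete way to exploit document time for the exploration of search results, in the spirit of the work surveyed here, is to bucket a ranked result list by publication year and expose the buckets as a timeline facet. The sketch below assumes each hit already carries a timestamp; it illustrates the general idea rather than any specific system from the paper.

from collections import defaultdict
from datetime import date

# Hypothetical search results: (doc_id, relevance_score, publication_date).
results = [
    ("d1", 0.92, date(2009, 5, 1)),
    ("d2", 0.88, date(2011, 3, 12)),
    ("d3", 0.75, date(2009, 11, 30)),
    ("d4", 0.71, date(2010, 7, 4)),
]

# Group hits by year to drive a timeline facet alongside the ranked list.
timeline = defaultdict(list)
for doc_id, score, pub_date in results:
    timeline[pub_date.year].append((doc_id, score))

for year in sorted(timeline):
    print(year, [doc for doc, _ in sorted(timeline[year], key=lambda x: -x[1])])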
DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference
"... We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this app ..."
Abstract - Cited by 17 (1 self)
We present an end-to-end (live) demonstration system called DeepDive that performs knowledge-base construction (KBC) from hundreds of millions of web pages. DeepDive employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms. A key challenge of this approach is scalability, i.e., how to deal with terabytes of imperfect data efficiently. We describe how we address the scalability challenges to achieve web-scale KBC and the lessons we have learned from building DeepDive.
Elementary: Large-scale Knowledge-base Construction via Machine Learning and Statistical Inference
"... Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge ..."
Abstract - Cited by 17 (5 self)
Researchers have approached knowledge-base construction (KBC) with a wide range of data resources and techniques. We present Elementary, a prototype KBC system that is able to combine diverse resources and different KBC techniques via machine learning and statistical inference to construct knowledge bases. Using Elementary, we have implemented a solution to the TAC-KBP challenge with quality comparable to the state of the art, as well as an end-to-end online demonstration that automatically and continuously enriches Wikipedia with structured data by reading millions of webpages on a daily basis. We describe several challenges and our solutions in designing, implementing, and deploying Elementary. In particular, we first describe the conceptual framework and architecture of Elementary, and then discuss how we address scalability challenges to enable Web-scale deployment. First, to take advantage of diverse data resources and proven techniques, Elementary employs Markov logic, a succinct yet expressive language to specify probabilistic graphical models. Elementary accepts both domain-knowledge rules and classical machine-learning models such as conditional random fields, thereby integrating different data resources and KBC techniques in a principled manner. Second, to support large-scale KBC with terabytes of data and millions of entities, Elementary
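To make the Markov logic ingredient concrete: a program is a set of weighted rules, grounding them over the evidence yields weighted clauses over boolean fact variables, and inference picks the most probable assignment. The two clauses, their weights, and the brute-force search below are a hedged toy sketch of that formalism, not Elementary's actual rules or its grounding and inference machinery.

from itertools import product

# Ground atoms for a toy domain: does person P work for organization O?
atoms = ["WorksFor(Ann,Acme)", "WorksFor(Ann,Base)"]

# Weighted ground clauses in the spirit of Markov logic (weights are invented):
#   3.0  an extraction pattern supports WorksFor(Ann,Acme)
#   1.5  a person works for at most one organization (soft mutual exclusion)
def world_score(assign):
    score = 0.0
    if assign["WorksFor(Ann,Acme)"]:
        score += 3.0
    if not (assign["WorksFor(Ann,Acme)"] and assign["WorksFor(Ann,Base)"]):
        score += 1.5
    return score

# Brute-force MAP inference over all 2^2 possible worlds (feasible only at toy size).
best = max((dict(zip(atoms, vals)) for vals in product([False, True], repeat=len(atoms))),
           key=world_score)
print(best, world_score(best))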
Incremental knowledge base construction using DeepDive
Proceedings of the VLDB Endowment (PVLDB), 2015
"... Populating a database with unstructured information is a long-standing problem in industry and research that encom-passes problems of extraction, cleaning, and integration. Re-cent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we desc ..."
Abstract - Cited by 7 (3 self)
Populating a database with unstructured information is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. Recent names used for this problem include dealing with dark data and knowledge base construction (KBC). In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based respectively on sampling and variational techniques. We also study the tradeoff space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.
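The incremental idea can be sketched under simplifying assumptions: when new evidence adds a few factors to an existing factor graph, only the variables touched by those factors (plus, here, a one-hop neighborhood) need to be re-inferred, while previous results are reused for the rest. This toy sketch illustrates sampling-based incremental maintenance in general, not DeepDive's actual algorithms or its rule-based optimizer.

# Existing factor graph: factor id -> set of variable ids it touches (toy data).
factors = {
    "f1": {"v1", "v2"},
    "f2": {"v2", "v3"},
    "f3": {"v4"},
}
# Marginals from the previous inference run (reused for untouched variables).
old_marginals = {"v1": 0.9, "v2": 0.7, "v3": 0.4, "v4": 0.2}

# A new document introduces one factor touching v3 and a brand-new variable v5.
delta_factors = {"f4": {"v3", "v5"}}

# Variables directly touched by the delta must be re-inferred; here we also expand
# to variables that share an existing factor with them (one-hop neighborhood).
touched = set().union(*delta_factors.values())
affected = set(touched)
for f_vars in factors.values():
    if f_vars & touched:
        affected |= f_vars

print("re-run inference on:", sorted(affected))                       # v2, v3, v5
print("reuse old marginals for:", sorted(set(old_marginals) - affected))  # v1, v4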
An Efficient Publish/Subscribe Index for E-Commerce Databases
"... Many of today’s publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to ..."
Abstract - Cited by 5 (2 self)
Many of today’s publish/subscribe (pub/sub) systems have been designed to cope with a large volume of subscriptions and high event arrival rate (velocity). However, in many novel applications (such as e-commerce), there is an increasing variety of items, each with different attributes. This leads to a very high-dimensional and sparse database that existing pub/sub systems can no longer support effectively. In this paper, we propose an efficient in-memory index that is scalable to the volume and update of subscriptions, the arrival rate of events and the variety of subscribable attributes. The index is also extensible to support complex scenarios such as prefix/suffix filtering and regular expression matching. We conduct extensive experiments on synthetic datasets and two real datasets (AOL query log and Ebay products). The results demonstrate the superiority of our index over state-of-the-art methods: our index incurs orders of magnitude less index construction time, consumes a small amount of memory and performs event matching efficiently.
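A standard baseline for this setting is counting-based matching over an inverted index that maps attribute-value pairs to subscriptions: an event matches a subscription once it has satisfied as many predicates as that subscription contains. The sketch below handles equality predicates only and is a simplified illustration, not the index proposed in the paper.

from collections import defaultdict

# Subscriptions as conjunctions of equality predicates over sparse attributes (toy data).
subs = {
    "s1": {"brand": "apple", "color": "silver"},
    "s2": {"brand": "apple"},
    "s3": {"category": "laptop", "color": "silver"},
}

# Inverted index: (attribute, value) -> subscription ids, plus per-subscription predicate counts.
index = defaultdict(set)
need = {}
for sid, preds in subs.items():
    need[sid] = len(preds)
    for attr, val in preds.items():
        index[(attr, val)].add(sid)

def match(event):
    """Return subscriptions whose predicates are all satisfied by the event."""
    hits = defaultdict(int)
    for attr, val in event.items():
        for sid in index.get((attr, val), ()):
            hits[sid] += 1
    return [sid for sid, c in hits.items() if c == need[sid]]

print(match({"brand": "apple", "color": "silver", "category": "laptop"}))  # s1, s2, s3 (order may vary)
print(match({"brand": "apple", "color": "black"}))                         # s2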
Learning relatedness measures for entity linking
In Proceedings of CIKM, 2013
"... Entity Linking is the task of detecting, in text documents, relevant mentions to entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowl-edge base. The most important of such features is entity relate ..."
Abstract - Cited by 4 (2 self)
Entity Linking is the task of detecting, in text documents, relevant mentions to entities of a given knowledge base. To this end, entity-linking algorithms use several signals and features extracted from the input text or from the knowledge base. The most important of such features is entity relatedness. Indeed, we argue that these algorithms benefit from maximizing the relatedness among the relevant entities selected for annotation, since this minimizes errors in disambiguating entity-linking. The definition of an effective relatedness function is thus a crucial point in any entity-linking algorithm. In this paper we address the problem of learning high-quality entity relatedness functions. First, we formalize the problem of learning entity relatedness as a learning-to-rank problem. We propose a methodology to create reference datasets on the basis of manually annotated data. Finally, we show that our machine-learned entity relatedness function performs better than other relatedness functions previously proposed, and, more importantly, improves the overall performance of different state-of-the-art entity-linking algorithms.
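The learning setup can be roughly sketched as follows: candidate entity pairs are described by one or more relatedness signals (for example a Milne-Witten-style in-link overlap), labels say whether the pair is truly related, and a model is trained so that related entities rank above unrelated ones. The features, toy data, use of scikit-learn, and the pointwise simplification below are assumptions for illustration; the paper's own feature set and learning-to-rank method may differ.

import math
import numpy as np
from sklearn.linear_model import LogisticRegression

def milne_witten(in_a, in_b, n_entities):
    """Link-based relatedness from shared in-links (a common baseline signal)."""
    inter = len(in_a & in_b)
    if inter == 0:
        return 0.0
    num = math.log(max(len(in_a), len(in_b))) - math.log(inter)
    den = math.log(n_entities) - math.log(min(len(in_a), len(in_b)))
    return max(0.0, 1.0 - num / den)

# Toy in-link sets for a handful of entities (invented).
inlinks = {
    "Apple_Inc.": {"p1", "p2", "p3", "p4"},
    "Steve_Jobs": {"p2", "p3", "p4", "p5"},
    "Banana": {"p6"},
}
N = 1000  # pretend size of the knowledge base

def features(a, b):
    """Feature vector for an entity pair: link relatedness plus raw in-link overlap."""
    return [milne_witten(inlinks[a], inlinks[b], N), len(inlinks[a] & inlinks[b])]

# Pointwise approximation of the ranking problem: binary "related" labels,
# ranking by the trained model's decision score.
X = np.array([features("Apple_Inc.", "Steve_Jobs"), features("Apple_Inc.", "Banana")])
y = np.array([1, 0])
ranker = LogisticRegression().fit(X, y)

cands = ["Steve_Jobs", "Banana"]
scores = ranker.decision_function(np.array([features("Apple_Inc.", c) for c in cands]))
print(sorted(zip(cands, scores), key=lambda t: -t[1]))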
Automatic Extraction of Facts, Relations, and Entities for Web-Scale Knowledge Base Population
2012
"... I hereby solemnly declare that this work was created on my own, using only the resources and tools mentioned. Information taken from other sources or indirectly adopted data and concepts are explicitly acknowl-edged with references to the respective sources. This work has not been submitted in a pro ..."
Abstract - Cited by 3 (0 self)
I hereby solemnly declare that this work was created on my own, using only the resources and tools mentioned. Information taken from other sources or indirectly adopted data and concepts are explicitly acknowledged with references to the respective sources. This work has not been submitted in a process for obtaining an academic degree elsewhere in the same or in similar form. Statutory declaration: I hereby affirm in lieu of an oath that I have produced the present work independently and without the use of any aids other than those indicated. The content taken from other sources or adopted indirectly