Results 1 - 10 of 65
BabelNet: The automatic construction, evaluation and application of a . . .
ARTIFICIAL INTELLIGENCE, 2012
A Framework for Benchmarking Entity-Annotation Systems
Cited by 30 (1 self)
In this paper we design and implement a benchmarking framework for fair and exhaustive comparison of entity-annotation systems. The framework is based upon the definition of a set of problems related to the entity-annotation task, a set of measures to evaluate system performance, and a systematic comparative evaluation involving all publicly available datasets, containing texts of various types such as news, tweets and Web pages. Our framework is easily extensible with novel entity annotators, datasets and evaluation measures for comparing systems, and it has been released to the public as open source. We use this framework to perform the first extensive comparison among all available entity annotators over all available datasets, and draw many interesting conclusions about their efficiency and effectiveness. We also compare academic and commercial annotators.
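The evaluation-measure side of such a framework can be illustrated with a small sketch. The scorer below is a hypothetical simplification invented for this example (it is not the released framework's API): it micro-averages precision, recall and F1 over exact-match annotations across documents.

```python
# Hypothetical sketch: scoring an entity annotator against gold annotations.
# An annotation is (start, end, entity) over the input text; it counts as
# correct only if all three components match a gold annotation exactly.

def score(predicted, gold):
    """Micro-averaged precision, recall and F1 over a set of documents.

    predicted, gold: dicts mapping doc_id -> set of (start, end, entity).
    """
    tp = fp = fn = 0
    for doc_id in gold:
        p = predicted.get(doc_id, set())
        g = gold[doc_id]
        tp += len(p & g)   # annotations the system got exactly right
        fp += len(p - g)   # spurious annotations
        fn += len(g - p)   # missed gold annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Stricter or looser matching policies (e.g. overlap instead of exact spans) would slot in by changing only the comparison step, which is the kind of pluggable "problem definition" the abstract describes.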
Text relatedness based on a word thesaurus, 2010
Cited by 27 (8 self)
The computation of relatedness between two fragments of text in an automated manner requires taking into account a wide range of factors pertaining to the meaning the two fragments convey, and the pairwise relations between their words. Without doubt, a measure of relatedness between text segments must take into account both the lexical and the semantic relatedness between words. Such a measure that captures both aspects of text relatedness well may help in many tasks, such as text retrieval, classification and clustering. In this paper we present a new approach for measuring the semantic relatedness between words based on their implicit semantic links. The approach exploits only a word thesaurus in order to devise implicit semantic links between words. Based on this approach, we introduce Omiotis, a new measure of semantic relatedness between texts which capitalizes on the word-to-word semantic relatedness measure (SR) and extends it to measure the relatedness between texts. We gradually validate our method: we first evaluate the performance of the semantic relatedness measure between individual words, covering word-to-word similarity and relatedness, synonym identification and word analogy; then, we proceed with evaluating the performance of our method in measuring text-to-text semantic relatedness in two tasks, namely sentence-to-sentence similarity and paraphrase recognition. Experimental evaluation shows that the proposed method outperforms every lexicon-based method of semantic relatedness in the selected tasks and the data sets used, and competes well against corpus-based and hybrid approaches.
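The thesaurus-only idea can be sketched as a toy: the mini-thesaurus, function names and scoring below are all invented for illustration and are much simpler than the actual SR and Omiotis formulas. Word relatedness comes from overlapping thesaurus links, and text relatedness pairs each word with its best match on the other side.

```python
# Illustrative sketch only: word-to-word relatedness from a thesaurus's
# implicit links, lifted to text-to-text relatedness by best-match pairing.

# Hypothetical mini-thesaurus: word -> set of semantically linked words.
THESAURUS = {
    "car":   {"vehicle", "automobile", "wheel"},
    "auto":  {"vehicle", "automobile", "engine"},
    "bank":  {"money", "finance", "river"},
    "money": {"finance", "bank", "cash"},
}

def word_relatedness(w1, w2):
    """Jaccard overlap of the words' thesaurus neighbourhoods (each word
    included in its own neighbourhood), as a stand-in for link-based SR."""
    if w1 == w2:
        return 1.0
    n1 = THESAURUS.get(w1, set()) | {w1}
    n2 = THESAURUS.get(w2, set()) | {w2}
    return len(n1 & n2) / len(n1 | n2)

def text_relatedness(t1, t2):
    """Average, over both directions, of each word's best match on the
    other side (t1, t2 are lists of words)."""
    def directed(a, b):
        return sum(max(word_relatedness(w, v) for v in b) for w in a) / len(a)
    return (directed(t1, t2) + directed(t2, t1)) / 2
```

For instance, "car" and "auto" share the linked words "vehicle" and "automobile", so they score well above zero even though the strings differ, which is exactly the effect a purely lexical measure misses.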
Fast and accurate annotation of short texts with Wikipedia pages. arXiv preprint arXiv:1006.3498, 2010
Cited by 25 (2 self)
We address the problem of cross-referencing text fragments with Wikipedia pages, in a way that synonymy and polysemy issues are resolved accurately and efficiently. We take inspiration from a recent flow of work [3, 10, 12, 14], and extend their scenario from the annotation of long documents to the annotation of short texts, such as snippets of search-engine results, tweets, news, blogs, etc. These short and poorly composed texts pose new challenges in terms of efficiency and effectiveness of the annotation process, which we address by designing and engineering Tagme, the first system that performs an accurate and on-the-fly annotation of these short textual fragments. A large set of experiments shows that Tagme outperforms state-of-the-art algorithms when they are adapted to work on short texts, and is fast and competitive on long texts.
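A heavily simplified sketch of this style of annotation follows; the anchor dictionary, relatedness table and voting score are all hypothetical stand-ins for the Wikipedia anchor statistics and link-based page relatedness the real system derives from the encyclopedia. Each recognized anchor is resolved to the candidate page whose commonness-weighted votes from the other anchors' candidates are highest.

```python
# Simplified, hypothetical sketch of collective annotation of a short text.

# Hypothetical anchor dictionary: surface form -> {candidate page: commonness}.
ANCHORS = {
    "jaguar": {"Jaguar_(animal)": 0.6, "Jaguar_Cars": 0.4},
    "speed":  {"Speed": 1.0},
    "engine": {"Engine": 1.0},
}

# Hypothetical page-page relatedness scores in [0, 1].
REL = {
    frozenset({"Jaguar_Cars", "Engine"}): 0.9,
    frozenset({"Jaguar_(animal)", "Engine"}): 0.1,
    frozenset({"Jaguar_Cars", "Speed"}): 0.7,
    frozenset({"Jaguar_(animal)", "Speed"}): 0.3,
}

def relatedness(p, q):
    return 1.0 if p == q else REL.get(frozenset({p, q}), 0.0)

def annotate(tokens):
    """Resolve every token that is a known anchor by commonness times the
    average relatedness vote it receives from the other anchors' candidates."""
    spots = [t for t in tokens if t in ANCHORS]
    result = {}
    for spot in spots:
        best, best_score = None, -1.0
        for page, commonness in ANCHORS[spot].items():
            votes = [
                sum(c * relatedness(page, q) for q, c in ANCHORS[other].items())
                for other in spots if other != spot
            ]
            score = commonness * (sum(votes) / len(votes) if votes else 1.0)
            if score > best_score:
                best, best_score = page, score
        result[spot] = best
    return result
```

On the toy text "jaguar engine speed", the context anchors pull "jaguar" toward the car maker despite the animal sense's higher prior, which is the disambiguation behavior the abstract describes for short, sparse inputs.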
Clustering and Diversifying Web Search Results with Graph-Based Word Sense Induction
Cited by 21 (8 self)
Web search result clustering aims to facilitate information search on the Web. Rather than presenting the results of a query as a flat list, these are grouped on the basis of their similarity and subsequently shown to the user as a list of possibly labeled clusters. Each cluster is supposed to represent a different meaning of the input query, thus taking language ambiguity, i.e. polysemy, into account. However, Web clustering methods typically rely on some shallow notion of textual similarity of search result snippets. As a result, text snippets with no word in common tend to be clustered separately, even if they share the same meaning, whereas snippets with words in common may be grouped together even if they refer to different meanings of the input query. In this paper, we present a novel approach to Web search result clustering based on the automatic discovery of word senses from raw text, a task referred to as Word Sense Induction (WSI). Key to our approach is to first acquire the senses (i.e., meanings) of an ambiguous query and then cluster the search results based on their semantic similarity to the word senses induced. Our experiments, conducted on datasets of ambiguous queries, show that our approach outperforms both Web clustering and search engines.
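The sense-then-cluster step can be sketched as follows, assuming the query's senses have already been induced as bags of words; the sense inventory and overlap scoring below are a toy example, not the paper's graph-based WSI algorithm.

```python
# Illustrative sketch: cluster snippets for the ambiguous query "python" by
# assigning each snippet to the induced sense it shares the most words with.

SENSES = {  # hypothetical induced senses, each a bag of words
    "snake":    {"snake", "reptile", "species", "habitat"},
    "language": {"programming", "code", "language", "library"},
}

def cluster_snippets(snippets):
    """Assign each snippet to its best-overlapping sense; snippets that
    match no sense at all go to an 'other' cluster."""
    clusters = {name: [] for name in SENSES}
    clusters["other"] = []
    for snippet in snippets:
        words = set(snippet.lower().split())
        best, overlap = "other", 0
        for name, sense_words in SENSES.items():
            n = len(words & sense_words)
            if n > overlap:
                best, overlap = name, n
        clusters[best].append(snippet)
    return clusters
```

Because assignment is driven by the induced sense vocabularies rather than snippet-to-snippet word overlap, two snippets with no word in common can still land in the same cluster when they match the same sense.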
The Tower of Babel meets Web 2.0: user-generated content and its applications in a multilingual context.
In Conference on Human Factors in Computing Systems, 2010
Cited by 20 (2 self)
This study explores language's fragmenting effect on user-generated content by examining the diversity of knowledge representations across 25 different Wikipedia language editions. This diversity is measured at two levels: the concepts that are included in each edition and the ways in which these concepts are described. We demonstrate that the diversity present is greater than has been presumed in the literature and has a significant influence on applications that use Wikipedia as a source of world knowledge. We close by explicating how knowledge diversity can be beneficially leveraged to create "culturally-aware applications" and "hyperlingual applications".
Large-Scale Learning of Word Relatedness with Constraints
Cited by 8 (0 self)
Prior work on computing semantic relatedness of words focused on representing their meaning in isolation, effectively disregarding inter-word affinities. We propose a large-scale data mining approach to learning word-word relatedness, where known pairs of related words impose constraints on the learning process. Our method, called CLEAR, is shown to significantly outperform previously published approaches. The proposed method is based on first principles, and is generic enough to exploit diverse types of text corpora, while having the flexibility to impose constraints on the derived word similarities. We also make publicly available a new labeled dataset for evaluating word relatedness algorithms, which we believe to be the largest such dataset to date.
Learning a Concept-based Document Similarity Measure
Cited by 7 (0 self)
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document classification, and document clustering. Conventional measures are brittle: they estimate the surface overlap between documents based on the words they mention and ignore deeper semantic connections. We propose a new measure that assesses similarity at both the lexical and semantic levels, and learns from human judgments how to combine them by using machine learning techniques. Experiments show that the new measure produces values for documents that are more consistent with people’s judgments than people are with each other. We also use it to classify and cluster large document sets covering different genres and topics, and find that it improves both classification and clustering performance.
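The idea of learning, from human judgments, how to weight a lexical score and a semantic score into one measure can be sketched as a two-feature least-squares fit. This is a hypothetical, much simpler stand-in for the paper's model; the feature values and function names are invented for the example.

```python
# Minimal sketch: fit human ~ a * lex + b * sem by solving the 2x2 normal
# equations in closed form (no intercept, two features).

def fit_weights(lex, sem, human):
    """lex, sem, human: equal-length lists of per-document-pair scores.
    Returns the least-squares weights (a, b)."""
    s_ll = sum(l * l for l in lex)
    s_ss = sum(s * s for s in sem)
    s_ls = sum(l * s for l, s in zip(lex, sem))
    s_lh = sum(l * h for l, h in zip(lex, human))
    s_sh = sum(s * h for s, h in zip(sem, human))
    det = s_ll * s_ss - s_ls * s_ls  # assumed nonzero for this sketch
    a = (s_lh * s_ss - s_sh * s_ls) / det
    b = (s_sh * s_ll - s_lh * s_ls) / det
    return a, b

def similarity(lex_score, sem_score, weights):
    """Combined document similarity under the learned weights."""
    a, b = weights
    return a * lex_score + b * sem_score
```

A regularized or nonlinear learner would replace `fit_weights` without changing the rest, which is the sense in which the combination is "learned from human judgments" rather than hand-tuned.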
Music Retagging Using Label Propagation and Robust Principal Component Analysis
Cited by 5 (1 self)
The emergence of social tagging websites such as Last.fm has provided new opportunities for learning computational models that automatically tag music. Researchers typically obtain music tags from the Internet and use them to construct machine learning models. Nevertheless, such tags are usually noisy and sparse. In this paper, we present a preliminary study that aims at refining (retagging) social tags by exploiting the content similarity between tracks and the semantic redundancy of the track-tag matrix. The evaluated algorithms include a graph-based label propagation method that is often used in semi-supervised learning and a robust principal component analysis (PCA) algorithm that has led to state-of-the-art results in matrix completion. The results indicate that robust PCA with a content similarity constraint is particularly effective; it improves the robustness of tagging against three types of synthetic errors and boosts the recall rate of music auto-tagging by 7% in a real-world setting.
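The graph-based label propagation baseline can be sketched as the standard iteration F <- alpha * S @ F + (1 - alpha) * Y over a content-similarity graph; the data below is a toy example, and the robust PCA variant the paper favors is not shown.

```python
# Toy sketch of label propagation for tag refinement: each track's tag scores
# are repeatedly mixed with its neighbours' scores on the similarity graph,
# while staying anchored to the original (noisy, sparse) tags Y.

def propagate(S, Y, alpha=0.5, iters=50):
    """S: row-normalized track-track similarity matrix (list of lists).
    Y: initial track-tag score matrix. Returns the refined score matrix."""
    n, m = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [
            [
                alpha * sum(S[i][k] * F[k][j] for k in range(n))
                + (1 - alpha) * Y[i][j]
                for j in range(m)
            ]
            for i in range(n)
        ]
    return F
```

With two acoustically similar tracks where only one carries a "rock" tag, propagation assigns the untagged track a substantial score for that tag, which is how the method fills in sparse social tags.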
Automated Query Learning with Wikipedia and Genetic Programming
Artificial Intelligence, 2013