Results 1 - 10 of 37
DISCO: A multilingual database of distributionally similar words
- In Proceedings of KONVENS
"... Abstract. This paper 1 presents DISCO, a tool for retrieving the distributional similarity between two given words, and for retrieving the distributionally most similar words for a given word. Pre-computed word spaces are freely available for a number of languages including English, German, French a ..."
Abstract
-
Cited by 13 (0 self)
- Add to MetaCart
This paper presents DISCO, a tool for retrieving the distributional similarity between two given words, and for retrieving the distributionally most similar words for a given word. Pre-computed word spaces are freely available for a number of languages including English, German, French and Italian, so DISCO can be used off the shelf. The tool is implemented in Java, provides a Java API, and can also be called from the command line. The performance of DISCO is evaluated by measuring the correlation with WordNet-based semantic similarities and with human relatedness judgements. The evaluations show that DISCO has a higher correlation with semantic similarities derived from WordNet than latent semantic analysis (LSA) and the web-based PMI-IR.
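The abstract does not show DISCO's actual Java API, so the following is a minimal Python sketch of the two operations it describes, pairwise distributional similarity and most-similar-word lookup, over a toy word space; the names and the cosine measure are illustrative assumptions, not DISCO's implementation.

```python
# Toy word space: word -> co-occurrence counts over context words.
# A real word space would be precomputed from a large corpus.
import numpy as np

word_space = {
    "car":   np.array([4.0, 1.0, 0.0, 2.0]),
    "auto":  np.array([3.0, 2.0, 0.0, 2.0]),
    "apple": np.array([0.0, 1.0, 5.0, 0.0]),
}

def similarity(w1, w2):
    """Cosine similarity between the two words' context vectors."""
    v1, v2 = word_space[w1], word_space[w2]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def most_similar(w, n=10):
    """The n distributionally most similar words to w."""
    others = [(u, similarity(w, u)) for u in word_space if u != w]
    return sorted(others, key=lambda p: p[1], reverse=True)[:n]

print(similarity("car", "auto"))  # high: similar contexts
print(most_similar("car", n=2))
```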
Supervised Latent Semantic Indexing using Adaptive Sprinkling
- In Proc. of IJCAI
, 2007
"... Latent Semantic Indexing (LSI) has been shown to be effective in recovering from synonymy and pol-ysemy in text retrieval applications. However, since LSI ignores class labels of training documents, LSI generated representations are not as effective in classification tasks. To address this limitatio ..."
Abstract
-
Cited by 12 (5 self)
- Add to MetaCart
Latent Semantic Indexing (LSI) has been shown to be effective in recovering from synonymy and polysemy in text retrieval applications. However, since LSI ignores class labels of training documents, LSI-generated representations are not as effective in classification tasks. To address this limitation, a process called ‘sprinkling’ is presented. Sprinkling is a simple extension of LSI based on augmenting the set of features using additional terms that encode class knowledge. However, a limitation of sprinkling is that it treats all classes (and classifiers) in the same way. To overcome this, we propose a more principled extension called Adaptive Sprinkling (AS). AS leverages confusion matrices to emphasise the differences between those classes which are hard to separate. The method is tested on diverse classification tasks, including those where classes share ordinal or hierarchical relationships. These experiments reveal that AS can significantly enhance the performance of instance-based techniques (kNN) to make them competitive with the state-of-the-art SVM classifier. The revised representations generated by AS also have a favourable impact on SVM performance.
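As a rough illustration of plain sprinkling (not the paper's Adaptive Sprinkling), the sketch below appends class-indicator columns to the training documents before the SVD, so the reduced space is pulled apart along class boundaries. The sprinkle count `s`, the joint SVD over train and test rows (a simplification of fold-in), and all names are my assumptions.

```python
import numpy as np

def sprinkle(X_train, y_train, X_test, n_classes, s=4, k=2):
    """Append s class-indicator columns per class to the training
    rows (test rows get zeros), then take a rank-k SVD."""
    n_tr = X_train.shape[0]
    Z_tr = np.zeros((n_tr, n_classes * s))
    for i, c in enumerate(y_train):
        Z_tr[i, c * s:(c + 1) * s] = 1.0   # sprinkled terms for class c
    pad = np.zeros((X_test.shape[0], n_classes * s))
    A = np.vstack([np.hstack([X_train, Z_tr]),
                   np.hstack([X_test, pad])])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    docs_k = U[:, :k] * S[:k]              # reduced doc coordinates
    return docs_k[:n_tr], docs_k[n_tr:]    # feed these to kNN / SVM

# toy usage
X_tr = np.array([[2., 0., 1.], [1., 0., 2.], [0., 3., 0.]])
X_te = np.array([[1., 1., 1.]])
tr_k, te_k = sprinkle(X_tr, [0, 0, 1], X_te, n_classes=2)
```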
LATENT SEMANTIC INDEXING USING EIGENVALUE ANALYSIS FOR EFFICIENT INFORMATION RETRIEVAL
"... Text retrieval using Latent Semantic Indexing (LSI) with truncated Singular Value Decomposition (SVD) has been intensively studied in recent years. However, the expensive complexity involved in computing truncated SVD constitutes a major drawback of the LSI method. In this paper, we demonstrate how ..."
Abstract
-
Cited by 7 (0 self)
- Add to MetaCart
Text retrieval using Latent Semantic Indexing (LSI) with truncated Singular Value Decomposition (SVD) has been intensively studied in recent years. However, the high computational cost of the truncated SVD is a major drawback of the LSI method. In this paper, we demonstrate how matrix rank approximation can influence the effectiveness of information retrieval systems. In addition, we present an implementation of the LSI method based on an eigenvalue analysis for rank approximation without computing the truncated SVD, along with its computational details. Significant improvements in computational time are observed over the tested document collections while retrieval accuracy is maintained.
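A minimal sketch of the general idea, not necessarily the paper's exact algorithm: the rank-k LSI representations can be recovered from an eigendecomposition of the document Gram matrix A^T A instead of a truncated SVD of A, which is cheap when the collection is small relative to the vocabulary.

```python
import numpy as np

def lsi_by_eig(A, k):
    """A: term-by-document matrix (m x n). Returns rank-k document
    coordinates and a query fold-in function."""
    G = A.T @ A                           # n x n document Gram matrix
    evals, evecs = np.linalg.eigh(G)      # ascending eigenvalues
    idx = np.argsort(evals)[::-1][:k]     # top-k eigenpairs
    V_k = evecs[:, idx]                   # right singular vectors of A
    sigma = np.sqrt(np.maximum(evals[idx], 1e-12))  # singular values
    docs_k = V_k * sigma                  # rows: documents in LSI space
    def fold_in(q):
        # U_k^T q with U_k = A V_k / sigma, so no SVD of A is needed
        return (V_k.T @ (A.T @ q)) / sigma
    return docs_k, fold_in

# cosine(docs_k[j], fold_in(q)) then ranks document j for query q
```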
Text Categorization Using Word Similarities Based on Higher Order Co-occurrences
"... Abstract 12 In this paper, we propose an extension of the χ-Sim coclustering algorithm to deal with the text categorization task. The idea behind χ-Sim method [1] is to iteratively learn the similarity matrix between documents using similarity matrix between words and vice-versa. Thus, two documents ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some “a priori” knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to the supervised task. The proposed approach is tested on different classical textual datasets, and our experiments show that the proposed algorithm compares favorably with, or surpasses, both traditional and state-of-the-art algorithms such as k-NN, supervised LSI and SVM. Keywords: text categorization, clustering, higher-order co-occurrences, supervised learning.
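A minimal sketch of the co-similarity iteration that χ-Sim is built on, under simplifying assumptions of mine (identity initialization, crude max-normalisation): the document and word similarity matrices are updated through each other. The supervised extension described above would instead seed the initialization with class information.

```python
import numpy as np

def cosim(A, n_iter=4):
    """A: document-by-word count matrix. Returns (SR, SC): doc-doc
    and word-word similarity matrices learned through each other."""
    n_docs, n_words = A.shape
    SR = np.eye(n_docs)          # doc-doc similarity, identity start
    SC = np.eye(n_words)         # word-word similarity, identity start
    for _ in range(n_iter):
        SR = A @ SC @ A.T        # docs similar via similar words
        SR /= SR.max()           # crude normalisation
        SC = A.T @ SR @ A        # words similar via similar docs
        SC /= SC.max()
    return SR, SC

# supervised variant: initialise SR with a block structure built
# from the known training labels instead of the identity
```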
Identification of Critical Values in Latent Semantic Indexing
- Foundations of Data Mining and Knowledge Discovery
, 2005
"... This paper reports the results of a study to determine the most critical elements of the T k and S k D k matrices, which are input to LSI. We are interested in the impact, both in terms of retrieval quality and query run time performance, of the removal (zeroing) of a large portion of the entries in ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
This paper reports the results of a study to determine the most critical elements of the T_k and S_k D_k matrices, which are input to LSI. We are interested in the impact, both in terms of retrieval quality and query run-time performance, of the removal (zeroing) of a large portion of the entries in these matrices.
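A hedged sketch of the kind of experiment described: compute the rank-k factors, then zero all entries of T_k and S_k D_k whose magnitude falls below a quantile cutoff. The keep fraction here is an arbitrary illustrative parameter, not the study's.

```python
import numpy as np

def sparsify_lsi(A, k, keep=0.3):
    """Zero all but the largest-magnitude `keep` fraction of entries
    in the T_k and S_k D_k factors of a rank-k LSI model of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    T_k = U[:, :k]                    # term matrix
    SD_k = S[:k, None] * Vt[:k]       # S_k D_k: scaled document matrix
    def zero_small(M):
        cutoff = np.quantile(np.abs(M), 1.0 - keep)
        return np.where(np.abs(M) >= cutoff, M, 0.0)
    return zero_small(T_k), zero_small(SD_k)

# retrieval against the sparsified model: scores = (q @ T_k) @ SD_k
```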
An analysis of latent semantic term self-correlation
, 2008
"... Latent semantic analysis (LSA) is a generalised vector space method (GVSM) that uses dimension reduction to generate term correlations for use during the information retrieval process. We hypothesised that even though the dimension reduction establishes correlations between terms, the reduction is c ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
Latent semantic analysis (LSA) is a generalised vector space method (GVSM) that uses dimension reduction to generate term correlations for use during the information retrieval process. We hypothesised that even though the dimension reduction establishes correlations between terms, it also degrades the correlation of a term with itself (self-correlation). In this article, we prove that there is a direct relationship between the size of the LSA dimension reduction and the LSA self-correlation. We also show that by altering the LSA term self-correlations we gain a significant increase in precision during the information retrieval process.
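To make the self-correlation effect concrete, the sketch below computes the diagonal of the reduced term-term correlation matrix: with unit-length term vectors the diagonal is 1 at full rank, and it drops below 1 once dimensions are removed. The construction is illustrative, not the article's exact analysis.

```python
import numpy as np

def term_self_correlation(A, k):
    """A: term-by-document matrix with unit-length rows."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    T = U[:, :k] * S[:k]     # reduced term representations
    return np.diag(T @ T.T)  # term self-correlations

A = np.random.rand(6, 8)
A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-length rows
print(term_self_correlation(A, k=6))  # full rank: all 1.0
print(term_self_correlation(A, k=3))  # reduced: degraded (< 1)
```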
Edge Weight Regularization Over Multiple Graphs For Similarity Learning
"... Abstract—The growth of the web has directly influenced the increase in the availability of relational data. One of the key problems in mining such data is computing the similarity between objects with heterogeneous feature types. For example, publications have many heterogeneous features like text, ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
The growth of the web has directly influenced the increase in the availability of relational data. One of the key problems in mining such data is computing the similarity between objects with heterogeneous feature types. For example, publications have many heterogeneous features such as text, citations, authorship information and venue information. In most approaches, similarity is estimated using each feature type in isolation, and the estimates are then combined linearly. However, this approach does not take advantage of the dependencies between the different feature spaces. In this paper, we propose a novel approach to combining the different sources of similarity using a regularization framework over edges in multiple graphs. We show that the objective function induced by the framework is convex. We also propose an efficient algorithm using coordinate descent [1] to solve the optimization problem. We extrinsically evaluate the performance of the proposed unified similarity measure on two different tasks, clustering and classification. The proposed similarity measure outperforms three baselines and a state-of-the-art classification algorithm on a variety of standard, large data sets.
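A heavily simplified sketch of the flavour of approach described, not the paper's objective: each edge gets a unified weight that stays close to its weight in every source graph while neighbouring edges (those sharing a node) are regularized toward similar values. The resulting quadratic is convex, and each coordinate update below is its closed-form minimiser, as in coordinate descent.

```python
import numpy as np

def unify_edge_weights(S_list, lam=0.5, n_sweeps=20):
    """S_list: symmetric similarity matrices over the same n nodes,
    one per feature graph. Returns unified per-edge weights."""
    n = S_list[0].shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    nbrs = {e: [f for f in edges if f != e and set(e) & set(f)]
            for e in edges}                    # edges sharing a node
    w = {e: np.mean([S[e] for S in S_list]) for e in edges}
    G = len(S_list)
    for _ in range(n_sweeps):                  # coordinate descent
        for e in edges:
            # closed-form minimiser of
            #   sum_g (w_e - S_g[e])^2 + lam * sum_f (w_e - w_f)^2
            num = sum(S[e] for S in S_list) + lam * sum(w[f] for f in nbrs[e])
            w[e] = num / (G + lam * len(nbrs[e]))
    return w
```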
Leveraging higher order dependencies between features for text classification
, 2009
"... Abstract. Traditional machine learning methods only consider relationships between feature values within individual data instances while disregarding the dependencies that link features across instances. In this work, we develop a general approach to supervised learning by leveraging higher-order de ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Traditional machine learning methods only consider relationships between feature values within individual data instances, disregarding the dependencies that link features across instances. In this work, we develop a general approach to supervised learning by leveraging higher-order dependencies between features. We introduce a novel Bayesian framework for classification named Higher Order Naive Bayes (HONB). Unlike approaches that assume data instances are independent, HONB leverages co-occurrence relations between feature values across different instances. Additionally, we generalize our framework by developing a novel data-driven space transformation that allows any classifier operating in vector spaces to take advantage of these higher-order co-occurrence relations. Results obtained on several benchmark text corpora demonstrate that higher-order approaches achieve significant improvements in classification accuracy over the baseline (first-order) methods. Key words: machine learning, text classification, higher order learning, statistical relational learning, higher order naive bayes, higher order support vector machine.
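A hedged sketch of one plausible form of such a space transformation: each document vector is enriched with features that co-occur, via other documents, with the features it actually contains, so second-order relations become visible to any vector-space classifier. The exact transform in the paper may differ.

```python
import numpy as np

def higher_order_transform(X):
    """X: document-by-term matrix. Adds second-order evidence."""
    B = (X > 0).astype(float)
    C = B.T @ B                  # term-term co-occurrence counts
    np.fill_diagonal(C, 0.0)     # drop trivial self co-occurrence
    C /= max(C.max(), 1.0)       # scale to [0, 1]
    return X + X @ C             # first- plus second-order features

X = np.array([[1., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 1., 1.]])
print(higher_order_transform(X))  # doc 0 gains weight on term 2,
                                  # reached via term 1 in doc 1
```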
Experiments on the difference between semantic similarity and relatedness
- In 17th Nordic Conference of Computational Linguistics (NODALIDA), Northern European Association for Language Technology
, 2009
"... Recent work has pointed out the differ-ence between the concepts of semantic similarity and semantic relatedness. Im-portantly, some NLP applications depend on measures of semantic similarity, while others work better with measures of se-mantic relatedness. It has also been ob-served that methods of ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
Recent work has pointed out the difference between the concepts of semantic similarity and semantic relatedness. Importantly, some NLP applications depend on measures of semantic similarity, while others work better with measures of semantic relatedness. It has also been observed that methods of computing similarity measures from text corpora produce word spaces that are biased towards either semantic similarity or relatedness. Despite these findings, there has been little work that evaluates the effect of various techniques and parameter settings in the word space construction from corpora. The present paper experimentally investigates how the choice of context, corpus preprocessing and size, and dimension reduction techniques like singular value decomposition and frequency cutoffs influence the semantic properties of the resulting word spaces.
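To illustrate the central experimental variable, the choice of context, the sketch below builds two word spaces from the same toy corpus: one from narrow word windows (which tend to favour similarity-like neighbours) and one from whole-document co-occurrence (which tends to favour relatedness). The corpus and helper names are illustrative.

```python
from collections import Counter, defaultdict

def window_space(docs, width=2):
    """Context = words within +/- width positions."""
    space = defaultdict(Counter)
    for doc in docs:
        for i, w in enumerate(doc):
            for c in doc[max(0, i - width):i] + doc[i + 1:i + 1 + width]:
                space[w][c] += 1
    return space

def document_space(docs):
    """Context = all other words in the same document."""
    space = defaultdict(Counter)
    for doc in docs:
        for w in set(doc):
            for c in set(doc) - {w}:
                space[w][c] += 1
    return space

docs = [["the", "car", "drove", "fast"],
        ["the", "auto", "drove", "fast"],
        ["car", "insurance", "premium"]]
print(window_space(docs)["car"])    # narrow contexts
print(document_space(docs)["car"])  # document contexts
```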
Query Expansion and Dimensionality Reduction: Notions of Optimality in Rocchio Relevance Feedback and Latent Semantic Indexing
, 2006
"... Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noti ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI’s and Rocchio’s notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI’s motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
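For reference, a sketch of the standard Rocchio update whose least-squares, centroid-based character the analysis turns on: the query moves toward the centroid of the relevant documents and away from the centroid of the non-relevant ones. The alpha/beta/gamma values are conventional textbook defaults, not the paper's.

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.15):
    """q: query vector; rel/nonrel: lists of document row vectors."""
    q_new = alpha * np.asarray(q, dtype=float)
    if len(rel):
        q_new += beta * np.mean(rel, axis=0)      # toward relevant centroid
    if len(nonrel):
        q_new -= gamma * np.mean(nonrel, axis=0)  # away from non-relevant
    return q_new
```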