Indexing by latent semantic analysis
- Journal of the American Society for Information Science, 1990
Cited by 2168 (30 self)
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
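The pipeline this abstract describes (decompose a term by document matrix, fold a query in as a pseudo-document, return documents above a cosine threshold) can be sketched on a toy matrix. This is only an illustration under stated assumptions, not the authors' implementation: the five terms, four documents, the choice of 2 factors instead of ca. 100, the 0.9 threshold, binary query weights, and all function names are hypothetical, and the truncated decomposition is obtained by plain power iteration, which is adequate only for tiny matrices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_eigenpairs(sym, k, iters=500):
    """Top-k eigenpairs of a small symmetric PSD matrix (power iteration + deflation)."""
    n = len(sym)
    g = [row[:] for row in sym]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # arbitrary fixed start vector (assumption)
        for _ in range(iters):
            w = [dot(row, v) for row in g]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in g])
        pairs.append((lam, v))
        # Deflate: remove the factor just found so the next pass finds the next one.
        g = [[g[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs

def lsi_factors(term_doc, k):
    """Return k (sigma, u) pairs: singular values and term-space singular vectors."""
    n_docs = len(term_doc[0])
    cols = [[row[j] for row in term_doc] for j in range(n_docs)]
    gram = [[dot(ci, cj) for cj in cols] for ci in cols]  # A^T A, docs x docs
    factors = []
    for lam, v in top_eigenpairs(gram, k):
        sigma = math.sqrt(lam)
        u = [dot(row, v) / sigma for row in term_doc]  # u = A v / sigma
        factors.append((sigma, u))
    return factors

def fold_in(term_vector, factors):
    """Map a term-count vector (document or query) to its k factor weights."""
    return [dot(term_vector, u) / sigma for sigma, u in factors]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def retrieve(term_doc, terms, query_terms, k=2, threshold=0.9):
    """Indices of documents whose cosine with the folded-in query exceeds threshold."""
    factors = lsi_factors(term_doc, k)
    n_docs = len(term_doc[0])
    doc_vecs = [fold_in([row[j] for row in term_doc], factors) for j in range(n_docs)]
    q = fold_in([1.0 if t in query_terms else 0.0 for t in terms], factors)
    return [j for j, d in enumerate(doc_vecs) if cosine(q, d) > threshold]

# Hypothetical toy collection: rows are terms, columns are documents.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
TERM_DOC = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
]

hits = retrieve(TERM_DOC, TERMS, {"car"})
```

In this toy collection the query "car" also retrieves document 1, which contains "automobile" and "engine" but never "car". Matching through shared factors rather than shared words is the behavior the abstract attributes to the method.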
Machine Learning in Automated Text Categorization
- ACM Computing Surveys, 2002
Cited by 839 (13 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
A Comparison of Two Learning Algorithms for Text Categorization
- In Third Annual Symposium on Document Analysis and Information Retrieval, 1994
Cited by 239 (1 self)
This paper examines the use of inductive learning to categorize natural language documents into predefined content categories. Categorization of text is of increasing importance in information retrieval and natural language processing systems. Previous research on automated text categorization has mixed machine learning and knowledge engineering methods, making it difficult to draw conclusions about the performance of particular methods. In this paper we present empirical results on the performance of a Bayesian classifier and a decision tree learning algorithm on two text categorization data sets. We find that both algorithms achieve reasonable performance and allow controlled tradeoffs between false positives and false negatives. The stepwise feature selection in the decision tree algorithm is particularly effective in dealing with the large feature sets common in text categorization. However, even this algorithm is aided by an initial prefiltering of features, confirming the results...
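The "controlled tradeoffs between false positives and false negatives" mentioned in this abstract can be illustrated with a small Bayesian text classifier whose decision threshold is adjustable. The sketch below is a generic multinomial naive Bayes with add-one smoothing. That model, the four training texts, and the function names are assumptions made for illustration, and it is not necessarily the Bayesian formulation evaluated in the paper.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Collect per-class word counts from tokenized documents with True/False labels."""
    vocab = {w for d in docs for w in d}
    counts = {True: Counter(), False: Counter()}
    n_docs = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    return vocab, counts, n_docs

def log_odds(model, doc):
    """Log of P(category | doc) / P(not category | doc) under naive Bayes."""
    vocab, counts, n_docs = model
    score = math.log(n_docs[True] / n_docs[False])
    # Add-one (Laplace) smoothing: every vocabulary word gets one extra count.
    totals = {y: sum(counts[y].values()) + len(vocab) for y in (True, False)}
    for w in doc:
        if w in vocab:  # words never seen in training are ignored
            score += math.log((counts[True][w] + 1) / totals[True])
            score -= math.log((counts[False][w] + 1) / totals[False])
    return score

def classify(model, doc, threshold=0.0):
    """Raising threshold trades false positives for false negatives, and vice versa."""
    return log_odds(model, doc) > threshold

# Hypothetical training set for a single category ("wheat").
train_docs = [
    ["wheat", "price", "rise"],
    ["wheat", "harvest"],
    ["bank", "rate", "rise"],
    ["bank", "merger"],
]
train_labels = [True, True, False, False]
model = train_nb(train_docs, train_labels)

accepted_default = classify(model, ["wheat", "rise"])               # threshold 0.0
accepted_strict = classify(model, ["wheat", "rise"], threshold=2.0)  # stricter cutoff
```

With the default threshold the text is assigned to the category, and with the stricter one it is not. Sweeping the threshold over a held-out set is one way to pick the operating point between the two error types.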
Information Extraction as a Basis for High-Precision Text Classification
- ACM Transactions on Information Systems, 1994
Cited by 102 (5 self)
this article. For the purpose of text classification, the answer keys serve only as a set of correct classifications for each text. If a text has instantiated key templates associated with it in the corpus, then it should be classified as a relevant text. If a text has no instantiated key templates associated with it (i.e., only a dummy template) then it should be classified as an irrelevant text. This is a binary classification problem: a text is either relevant to the terrorism domain or irrelevant. The texts were selected by keyword search from a database of newswire articles because they contained words associated with terrorism. However, many of them did not mention any relevant terrorist incidents. Of the 1700 texts in the MUC-4 corpus, only 53% described a relevant terrorist event. Because many of the texts in the corpus were irrelevant, the MUC-4 systems had to distinguish the relevant from the irrelevant texts. Although the MUC-4 task was information extraction, information detection (i.e., text classification) was an implicit subtask. To be successful in MUC-4, the information extraction systems also had to be good at detection. Our MUC-4 system did not use a separate text classification module. Instead, we extracted information from every text and relied on a discourse analysis module to discard irrelevant templates. This strategy was very effective, but it was expensive. A reliable text classification module could have filtered out irrele...
Footnote 1: MUC-3 was the Third Message Understanding Conference, held in 1991 [MUC-3 Proceedings 1991].
Using Latent Semantic Analysis To Improve Access To Textual Information
- SIGCHI Conference on Human Factors in Computing Systems, 1988
Cited by 84 (1 self)
This paper describes a new approach for dealing with the vocabulary problem in human-computer interaction. Most approaches to retrieving textual materials depend on a lexical match between words in users' requests and those in or assigned to database objects. Because of the tremendous diversity in the words people use to describe the same object, lexical matching methods are necessarily incomplete and imprecise [5]. The latent semantic indexing approach tries to overcome these problems by automatically organizing text objects into a semantic structure more appropriate for matching user requests. This is done by taking advantage of implicit higher-order structure in the association of terms with text objects. The particular technique used is singular-value decomposition, in which a large term by text-object matrix is decomposed into a set of about 50 to 150 orthogonal factors from which the original matrix can be approximated by linear combination. Terms and objects are represented by 50 to 150 dimensional vectors and matched against user queries in this “semantic” space. Initial tests find this completely automatic method widely applicable and a promising way to improve users' access to many kinds of textual materials, or to objects and services for which textual descriptions are available.
A Multilevel Approach to Intelligent Information Filtering: Model, System, and Evaluation
- ACM Transactions on Information Systems, 1997
Cited by 45 (5 self)
this article, a filtering model is proposed that decomposes the overall task into subsystem functionalities and highlights the need for multiple adaptation techniques to cope with uncertainties. A filtering system, SIFTER, has been implemented based on the model, using established techniques in information retrieval and artificial intelligence. These techniques include document representation by a vector-space model, document classification by unsupervised learning, and user modeling by reinforcement learning. The system can filter information based on content and a user's specific interests. The user's interests are automatically learned with only limited user intervention in the form of optional relevance feedback for documents. We also describe experimental studies conducted with SIFTER to filter computer and information science documents collected from the Internet and commercial database services. The experimental results demonstrate that the system performs very well in filtering documents in a realistic problem setting.
Large-Scale Information Retrieval with Latent Semantic Indexing
- 1997
Cited by 40 (4 self)
As the amount of electronic information increases, traditional lexical (or Boolean) information retrieval techniques will become less useful. Large, heterogeneous collections will be difficult to search since the sheer volume of unranked documents returned in response to a query will overwhelm the user. Vector-space approaches to information retrieval, on the other hand, allow the user to search for concepts rather than specific words and rank the results of the search according to their relative similarity to the query. One vector-space approach, Latent Semantic Indexing (LSI), has achieved up to 30% better retrieval performance than lexical searching techniques by employing a reduced-rank model of the term-document space. However, the original implementation of LSI lacked the execution efficiency required to make LSI useful for large data sets. A new implementation of LSI, LSI++, seeks to make LSI efficient, extensible, portable, and maintainable. The LSI++ Application Programming ...
Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval
- 1992
Cited by 37 (0 self)
We have previously described an extension of the vector retrieval method called "Latent Semantic Indexing" (LSI) (Deerwester, et al., 1990; Dumais, et al., 1988; Furnas, et al., 1988). The LSI approach partially overcomes the problem of variability in human word choice by automatically organizing objects into a "semantic" structure more appropriate for information retrieval. This is done by modeling the implicit higher-order structure in the association of terms with objects. Initial tests find this completely automatic method to be a promising way to improve users' access to many kinds of textual materials or to objects for which textual descriptions are available. This paper describes some enhancements to the basic LSI method, including differential term weighting and relevance feedback. Appropriate term weighting improves performance by an average of 40%, and feedback based on 3 relevant documents improves performance by an average of 67%.
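The two enhancements named in this abstract, differential term weighting and relevance feedback, can be sketched as follows. The abstract does not say which weighting scheme or feedback rule was used, so both choices here are assumptions. The weighting shown is log-entropy, one scheme commonly paired with LSI, and the feedback rule replaces the query with the mean of the vectors of documents judged relevant. Function names and sample counts are hypothetical.

```python
import math

def log_entropy_weights(term_doc):
    """Weight raw term counts: local log(1 + tf) times a global entropy-based factor.

    term_doc has one row per term and one column per document (more than one
    document is required). The global factor is 1 for a term concentrated in a
    single document and 0 for a term spread evenly over all documents.
    """
    n_docs = len(term_doc[0])
    weighted = []
    for row in term_doc:
        total = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / total
                entropy += p * math.log(p)
        global_w = 1.0 + entropy / math.log(n_docs)
        weighted.append([global_w * math.log(1.0 + tf) for tf in row])
    return weighted

def feedback_query(relevant_doc_vectors):
    """New query vector: component-wise mean of the relevant documents' vectors."""
    n = len(relevant_doc_vectors)
    return [sum(col) / n for col in zip(*relevant_doc_vectors)]

# Hypothetical counts: the first term occurs once everywhere, the second in one document.
counts = [
    [1, 1, 1, 1],
    [3, 0, 0, 0],
]
weights = log_entropy_weights(counts)
new_query = feedback_query([[1.0, 0.0], [0.0, 1.0]])
```

The evenly spread term is weighted down to zero because it cannot discriminate between documents, while the concentrated term keeps its full local weight. The feedback query would then be matched against document vectors by cosine, exactly as an ordinary query would be.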
An Automatic Hierarchical Image Classification Scheme
- 1998
Cited by 31 (0 self)
Organizing images into semantic categories can be extremely useful for searching and browsing through large collections of images. Not much work has been done on automatic image classification, however. In this paper, we propose a method for hierarchical classification of images via supervised learning. This scheme relies on using a good low-level feature and subsequently performing feature-space reconfiguration using singular value decomposition to reduce noise and dimensionality. We use the training data to obtain a hierarchical classification tree that can be used to categorize new images. Our experimental results suggest that this scheme not only performs better than standard nearest-neighbor techniques, but also has both storage and computational advantages.
1. Introduction. The proliferation of the world-wide web has given easy access to an explosively growing volume of visual data. Unfortunately, this data on the web is both scattered and unorganized, making search and retrieval...

