Results 1 - 10 of 1,977
Indexing by latent semantic analysis
- JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 1990
Cited by 2168 (30 self)
A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
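As a rough illustration of the procedure this abstract describes, the sketch below runs a truncated SVD on a toy term-by-document matrix with NumPy. The matrix, the vocabulary, the choice of k = 2 (the abstract speaks of about 100 factors), the fold-in formula for the query, and the 0.9 cosine threshold are all assumptions made for the example; none of them is taken from the paper.

```python
import numpy as np

# Hypothetical toy term-by-document counts (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def lsi_fit(A, k):
    """Keep the k largest singular factors of the term-by-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Term factors, singular values, and one k-dimensional vector per document.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q, U_k, s_k):
    """Map a term-space query to a pseudo-document (assumed form: q^T U_k S_k^-1)."""
    return (q @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def retrieve(q, U_k, s_k, docs_k, threshold):
    """Return (document index, cosine) pairs whose cosine meets the threshold."""
    q_k = fold_in(q, U_k, s_k)
    scores = [cosine(q_k, d) for d in docs_k]
    return [(i, sc) for i, sc in enumerate(scores) if sc >= threshold]

U_k, s_k, docs_k = lsi_fit(A, k=2)
query = np.array([1, 0, 0, 0, 0], dtype=float)   # the single term "car"
hits = retrieve(query, U_k, s_k, docs_k, threshold=0.9)
```

In this toy data, document 1 does not contain "car" but shares a factor with the documents that do, so it can clear the threshold; that is the kind of higher-order association the abstract refers to.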
Term-weighting approaches in automatic text retrieval
- INFORMATION PROCESSING AND MANAGEMENT, 1988
Cited by 1216 (9 self)
The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term-weighting systems. This article summarizes the insights gained in automatic term weighting, and provides baseline single-term-indexing models with which other more elaborate content analysis procedures can be compared.
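To make "appropriately weighted single terms" concrete, here is a minimal sketch of one common weighting scheme, raw term frequency times log inverse document frequency. The abstract does not say which schemes the article recommends, so the formula, the function name `tf_idf`, and the three sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed scheme: weight = term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted

# Hypothetical sample documents, already tokenized on whitespace.
docs = [
    "text retrieval with weighted terms".split(),
    "text indexing systems".split(),
    "weighted single terms for text".split(),
]
weights = tf_idf(docs)
```

A term that occurs in every document (here "text") receives weight zero under this scheme, while a term confined to one document receives the largest weight per occurrence.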
Empirical Analysis of Predictive Algorithms for Collaborative Filtering
- Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, 1998
Probabilistic Latent Semantic Indexing
- 1999
Cited by 545 (7 self)
Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. Fitted from a training corpus of text documents by a generalization of the Expectation Maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. In contrast to standard Latent Semantic Indexing (LSI) by Singular Value Decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. Retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over LSI. In particular, the combination of models with different dimensionalities has proven to be advantageous.
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality
- 1998
Cited by 533 (28 self)
The nearest neighbor problem is the following: Given a set of n points $P = \{p_1, \ldots, p_n\}$ in some metric space $X$, preprocess $P$ so as to efficiently answer queries which require finding the point in $P$ closest to a query point $q \in X$. We focus on the particularly interesting case of the d-dimensional Euclidean space where $X = \mathbb{R}^d$ under some $l_p$ norm. Despite decades of effort, the current solutions are far from satisfactory; in fact, for large d, in theory or in practice, they provide little improvement over the brute-force algorithm which compares the query point to each data point. Of late, there has been some interest in the approximate nearest neighbors problem, which is: Find a point $p \in P$ that is an $\epsilon$-approximate nearest neighbor of the query $q$ in that for all $p' \in P$, $d(p, q) \le (1 + \epsilon)\, d(p', q)$. We present two algorithmic results for the approximate version that significantly improve the known bounds: (a) preprocessing cost polynomial in n and d, and a trul...
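The definition in this abstract can be checked directly on small inputs. The sketch below implements only the brute-force baseline the abstract mentions and a test of the approximation condition, that a candidate p is acceptable when d(p, q) is at most (1 + eps) times the distance from q to its true nearest neighbor. It does not implement the paper's algorithms; the function names and sample points are hypothetical.

```python
def lp_distance(a, b, p=2.0):
    """Distance between two equal-length points under the l_p norm."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def nearest_neighbor(points, q, p=2.0):
    """Brute force: compare the query point to every data point."""
    return min(points, key=lambda pt: lp_distance(pt, q, p))

def is_eps_approximate(candidate, points, q, eps, p=2.0):
    """True if d(candidate, q) <= (1 + eps) * d(p', q) for every p' in points."""
    best = min(lp_distance(pt, q, p) for pt in points)
    return lp_distance(candidate, q, p) <= (1.0 + eps) * best

# Hypothetical sample data in the plane.
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0), (1.2, 1.0)]
q = (1.0, 0.9)
exact = nearest_neighbor(points, q)
loose_ok = is_eps_approximate((1.2, 1.0), points, q, eps=1.5)
tight_ok = is_eps_approximate((1.2, 1.0), points, q, eps=1.0)
```

The check itself scans all n points, so it is useful only for validating the output of a faster approximate method on test data, not as a retrieval procedure.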
Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections
- 1992
Cited by 519 (12 self)
Document clustering has not been well received as an information retrieval tool. Objections to its use fall into two main categories: first, that clustering is too slow for large corpora (with running time often quadratic in the number of documents); and second, that clustering does not appreciably improve retrieval. We argue that these problems arise only when clustering is used in an attempt to improve conventional search techniques. However, looking at clustering as an information access tool in its own right obviates these objections, and provides a powerful new access paradigm. We present a document browsing technique that employs document clustering as its primary operation. We also present fast (linear time) clustering algorithms which support this interactive browsing paradigm. 1 Introduction Document clustering has been extensively investigated as a methodology for improving document search and retrieval (see [15] for an excellent review). The general assumption is that mutua...
Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval
- 1998
Cited by 422 (33 self)
Content-Based Image Retrieval (CBIR) has become one of the most active research areas in the past few years. Many visual feature representations have been explored and many systems built. While these research efforts establish the basis of CBIR, the usefulness of the proposed approaches is limited. Specifically, these efforts have relatively ignored two distinct characteristics of CBIR systems: (1) the gap between high level concepts and low level features; (2) subjectivity of human perception of visual content. This paper proposes a relevance feedback based interactive retrieval approach, which effectively takes into account the above two characteristics in CBIR. During the retrieval process, the user's high level query and perception subjectivity are captured by dynamically updated weights based on the user's feedback. The experimental results over more than 70,000 images show that the proposed approach greatly reduces the user's effort of composing a query and captures the user's i...
Inductive Learning Algorithms and Representations for Text Categorization
- 1998
Cited by 419 (9 self)
Text categorization – the assignment of natural language texts to one or more predefined categories based on their content – is an important component in many information organization and management tasks. We compare the effectiveness of five different automatic learning algorithms for text categorization in terms of learning speed, real-time classification speed, and classification accuracy. We also examine training set size, and alternative document representations. Very accurate text classifiers can be learned automatically from training examples. Linear Support Vector Machines (SVMs) are particularly promising because they are very accurate, quick to train, and quick to evaluate.
Keywords: text categorization, classification, support vector machines, machine learning, information management.
Efficient and Effective Querying by Image Content
- Journal of Intelligent Information Systems, 1994
Cited by 393 (11 self)
In the QBIC (Query By Image Content) project we are studying methods to query large on-line image databases using the images' content as the basis of the queries. Examples of the content we use include color, texture, and shape of image objects and regions. Potential applications include medical ("Give me other images that contain a tumor with a texture like this one"), photo-journalism ("Give me images that have blue at the top and red at the bottom"), and many others in art, fashion, cataloging, retailing, and industry. We describe a set of novel features and similarity measures allowing query by color, texture, and shape of image object. We demonstrate the effectiveness of the QBIC system with normalized precision and recall experiments on test databases containing over 1000 images and 1000 objects populated from commercially available photo clip art images, and of images of airplane silhouettes. We also consider the efficient indexing of these features, specifically addre...
An Efficient Boosting Algorithm for Combining Preferences
- 1999
Cited by 383 (13 self)
The problem of combining preferences arises in several applications, such as combining the results of different search engines. This work describes an efficient algorithm for combining multiple preferences. We first give a formal framework for the problem. We then describe and analyze a new boosting algorithm for combining preferences called RankBoost. We also describe an efficient implementation of the algorithm for certain natural cases. We discuss two experiments we carried out to assess the performance of RankBoost. In the first experiment, we used the algorithm to combine different WWW search strategies, each of which is a query expansion for a given domain. For this task, we compare the performance of RankBoost to the individual search strategies. The second experiment is a collaborative-filtering task for making movie recommendations. Here, we present results comparing RankBoost to nearest-neighbor and regression algorithms.

