Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM/SIGIR Conference, pages 282--290.
....was trained, the filtering system was used to filter documents from the unknown incoming stream without any further adaptation. Methods used for learning classifiers range from TF-IDF (Term Frequency-Inverse Document Frequency) vectors [Salton 1988] and latent semantic indexing [Foltz 1990, Hull 1994, Foltz 1996] to probability theory [Lewis and Gale 1994, Lewis et al. 1996]. The filtering queries may be constructed or refined based on a method known as relevance feedback [Rocchio 1971], which incorporates relevant documents retrieved by the system into subsequent versions of the query to ....
Hull, D. Improving Text Retrieval for the Routing Problem using Latent Semantic Indexing. Proceedings of the 17th International ACM-SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, 1994.
.... make them very suitable for industrial applications: linear classification complexity, a good degree of performance, and the possibility of being used within standard retrieval engines [21] 6 Rocchio is an algorithm typically used in IR for relevance feedback [19]. It was first adapted to TC in [9], and since then it has been an important reference in ATC. The Rocchio algorithm produces a new weight vector wc k from an existing one wci k and a collection of training documents. The component i of the vector wc k is computed by the formula: wc ik = # wci ik # l#C k l ....
D.A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W.B. Croft and C.J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval. Springer Verlag, Heidelberg, DE, 1994.
....to pictures based on their captions. Among other information, the users of the system can specify some Yahoo Spain categories to define their interests, which are later taken into account in the advanced picture search functions. The text categorization module is based on linear classifiers [8, 7, 14] and a program that mines Yahoo Spain web pages. This work is organized as follows. In the next section, we describe the main functionalities of the system, especially those related to user-adapted searching. In Section 3, we show the system architecture and how we have implemented its ....
....probabilistic classifiers like Naive Bayes [9, 11, 12], neural networks [6, 20], instance-based classifiers like kNN [9, 25], etc. See [21] for other approaches. An important subclass of learning approaches are those which learn linear classifiers, like the Rocchio, Widrow-Hoff, or Winnow algorithms [6, 7, 8, 14]. These approaches examine training instances a finite number of times, and construct a prototype instance (a term weight vector) for each category, which is later compared to the instances to be classified. Linear classifiers show interesting properties that make them ideal for industrial ....
[Article contains additional citation context not shown here]
Hull, D.A. (1994) Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pp. 282-289, Springer Verlag, Heidelberg, DE.
....degrades the efficiency of high dimensional query processing. A well studied solution to this problem is the dimensionality reduction approach which is a typical example of retrieved information reduction. Several researchers have used dimensionality reduction for scalable query performance [27, 33, 37, 48, 30]. For example, the dimensions of the feature vectors can be reduced to a desired value such that the underlying indexing technique performs more effectively. There is a trade-off between the accuracy obtained from the information stored in the index structure and the efficiency. The most common ....
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proc. of the 17th ACM-SIGIR Conference, pages 282--291, 1994.
....(S1, S2) similar to what is done using the common approach. Another promotion of our research is to facilitate dimension reduction. Since the real data sets tend to be very large, much similarity measure related work focuses on retrieving partial information that centralizes the most energy [1][6][10][13] for similarity measuring, indexing, querying, etc. However, there isn't any detailed and accurate evaluation of similarity measurements using different portions of information. The evaluation is of crucial importance when the goal is to measure the similarity based on subjective feeling, ....
....between time series by using the whole information. In [1][13] the Discrete Fourier transform (DFT) was applied on the time series, and then only the first few Fourier coefficients were preserved for dimension reduction. For the same purpose, discrete wavelet transform (DWT) was used in [6] and singular value decomposition was used in [10]. These approaches use partial information, but the goal is to approximate the whole information and to reveal the prominent behaviors of the original time series. In our approach, which portion of information will be used is not determined by ....
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. Proc. of the 17th ACM-SIGIR Conference, 1994.
....growing and pruning rather than the Minimum ESC principle. We refer to this algorithm as DL-SC. An algorithm for learning stochastic decision lists on the basis of SC (equivalently MDL) was first proposed in [29]. 4 Related Work. Many methods have been proposed for text classification (e.g. [22, 21, 15, 6, 11, 12, 5, 31, 25, 28, 13, 3, 9, 14, 16, 24, 10, 27, 32, 17, 23]). We describe here two typical non-rule-based methods and two typical rule-based methods. We also experimentally compare these methods with our own. Naive Bayes: In this method (e.g. [8]) it is assumed that each category retains a multinomial distribution over a set of words, and that any text ....
David Hull. Improving text retrieval for the routing problem using latent semantic indexing. Proc. of SIGIR'94, 1994.
....collections. We emphasize the query expansion interpretation of LSI and propose a LSI term normalization that achieves better performance on larger collections (TREC and NPL). 1 Introduction. The use of Latent Semantic Indexing (LSI) has been proposed for text retrieval in several recent works [4, 6, 10, 2]. This technique uses the Singular Value Decomposition (SVD) [9] to project very high dimensional document and query vectors into a low dimensional space. In this new space it is reasoned that the underlying structure of the collection is revealed thus enhancing retrieval performance. Furthermore, ....
....of the collection is revealed thus enhancing retrieval performance. Furthermore, LSI can be alternatively viewed as a query expansion method (see section 2.2 and 5) so that recall is generally improved. Experiments indicate both improved retrieval precision and recall when LSI is adopted [4, 6, 10, 2, 1, 21]. LSI also improves text categorization [7, 20] and word sense disambiguation [17]. Theoretical results [1, 14, 5, 21] have also provided some understanding on the effectiveness of LSI. These LSI studies have, however, mostly used relatively small text collections and simplified document models. ....
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM/SIGIR Conference, pages 282-290, 1994.
....was addressed in many ways, but only a few of these have proven to be powerful tools. The techniques applied can be divided into some subclasses: probabilistic, decision trees, decision rules, regression models, neural networks, support vector machines, etc. [19]. Examples in the literature are [7] using Rocchio Algorithm, [13] using C4.5 decision tree induction, [20] using k-nearest neighbor algorithm, [8] using support vector machines, [22] using neural networks, etc. Generally, text categorization systems use a vector model representation of the documents. The vector that represents ....
.... introduced in 1995 by Vapnik, text categorization using support vector machines has also been proposed in the literature [8]. Support vector machines proved to build very powerful systems, which is illustrated by the results presented in [8] and in [19]. Another categorizer, the Rocchio classifier [7], often used as a reference for comparison, uses Rocchio's formula for relevance feedback in the document vector space model. Since it was proposed in 1994 many variations have been implemented. Comparing text classifiers is a difficult and often subjective task. Designers tend to show the bright ....
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR-94), pages 282-289, 1994.
....problems of deciding whether $d_j$ belongs or not to $c_i$, for i: In our experiments we have used three different learning methods, which we have chosen with the aim of assembling a fairly representative sample of methods that allow weighted (non-binary) input. The first is a standard Rocchio method [5] for learning linear classifiers. A classifier for category $c_i$ consists of a vector of weights $w_{ki} = \beta \cdot \sum_{d_j \in POS_i} \frac{w_{kj}}{|POS_i|} - \gamma \cdot \sum_{d_j \in NEG_i} \frac{w_{kj}}{|NEG_i|}$ (5), where $w_{kj}$ is the weight of $t_k$ in document $d_j$, $POS_i = \{d_j \in Tr \mid \Phi(d_j, c_i) = 1\}$ and $NEG_i = \{d_j \in Tr \mid \Phi(d_j, c_i) = 0\}$. Conforming to common practice ....
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W. B. Croft and C. J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282-289, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
....how LDA can be made feasible with high dimensional data. 6 Linear Discriminant Analysis for document-term data. To our knowledge, LDA feature transforms have not been applied earlier to document classification tasks, although LDA has been used before in the sense of designing a linear classifier [15, 27]. In contrast, we suggest LDA as a means of deriving efficient features, which can be classified by any, possibly nonlinear, classifier. If LDA is such a well known and well behaved method, why is it not in wider use in the document analysis community? LDA is, of course, only applicable when ....
David Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proc. SIGIR'94, pages 282--291, Dublin, Ireland, July 3-6 1994.
....labels to text items, is one of the most prominent text analysis and access tasks nowadays [19]. We have implemented a text categorization module that automatically assigns Yahoo Spain subject labels to news items based on their text. The text categorization module is based on linear classifiers [6, 12, 14] and a program that mines Yahoo Spain web pages. We also apply information retrieval with the keywords against all the news items. 2 Modeling User's Interests. In this section we focus on the process of modeling user's interests in our application setting. Users of any information access system ....
....[1, 8], probabilistic classifiers like Naive Bayes [13, 14], neural networks [9, 18], instance-based classifiers like kNN [22], etc. See [19] for other approaches. An important subclass of learning approaches is that which learns linear classifiers, like the Rocchio, Widrow-Hoff, or Winnow algorithms [6, 12, 13, 18]. These approaches examine training instances a finite number of times, and construct a prototype instance (a term weight vector) for each category, which is later compared to the instances to be classified. Linear classifiers show interesting properties that make them ideal for industrial ....
[Article contains additional citation context not shown here]
Hull, D.A. (1994) Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pp. 282-289, Springer Verlag, Heidelberg, DE.
....This method brings instance-based learning closer to most other classifier induction methods, in which negative training instances play a fundamental role in the individuation of a best decision surface (i.e. classifier) that separates positive from negative instances. Even methods like Rocchio [3, 4], in which ....
[Table fragment. Column groups, each reporting Re, Pr, F1: microaveraging with proportional thresholding | microaveraging with CSV thresholding | macroaveraging with proportional thresholding | macroaveraging with CSV thresholding]
k = 05: .711 .823 .763 | .682 .419 .519 | .545 .716 .512 | .563 .763 .544
k = 10: .718 .830 .770 | .676 .418 .517 | .557 .721 .524 | ....
....parameter value $\beta = 1$, which places equal emphasis on Pr and Re. 4.2 Feature selection experiments. We have performed our feature selection experiments first with the standard k-NN classifier of Section 3 (with k = 30) and subsequently with a Rocchio classifier we have implemented following [3, 4] (the Rocchio parameters were set to $\beta = 16$ and $\gamma = 4$; see [3, 4, 12] for a full discussion of the Rocchio method). In these experiments we have compared two baseline feature selection functions, i.e. $\#_{avg}(t_k) = \sum_{i=1}^{m} \#(t_k, c_i) P(c_i)$ and $\chi^2_{max}(t_k) = \max_{i=1}^{m} \chi^2(t_k$ ....
[Article contains additional citation context not shown here]
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282--289, Dublin, IE, 1994.
....is analysed which provides an explanation for poor retrieval accuracy on large collections. We propose term normalization for LSI that achieves better performance. 1 Introduction The use of the Singular Value Decomposition (SVD) has been proposed for text retrieval in several recent works [2, 8]. This technique uses the SVD to project very high dimensional document and query vectors into a low dimensional space. In this new space it is hoped that the underlying structure of the collection is revealed thus enhancing retrieval performance. Theoretical results [14, 4] have provided some ....
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM/SIGIR Conference, pages 282--290, 1994.
....newly crawled pages. A positively classified page is viewed as a good page for the topic. In this sense this measure may be viewed as similar to precision where content-based relevance is decided by the classifier. We use Widrow-Hoff (WH), Exponentiated Gradient (EG), and Rocchio classifiers [23, 12, 11] with feature selection using Correlation Coefficient [16] to select the best 50 features for each topic. The optimal threshold is set by maximizing the F1 score [22] on the training set. Due to limited space we refer the reader to [14, 25] for details on the classifiers. It may be observed that ....
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W. B. Croft and C. J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282--289, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
.... databases [6], medical imaging [16], and multimedia information systems [18]. The general approach is to represent the data objects as multidimensional points in Euclidean space, and to measure the similarity between objects by the distance between the corresponding multidimensional points [13, 6]. It is assumed that the closer the points, the more similar the data objects. Since the dimensionality and the amount of data that need to be processed increase very rapidly, it becomes important to support efficient high dimensional similarity searching in large scale systems. This support depends ....
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proc. of the 17th ACM-SIGIR Conference, pp. 282-291, 1994.
....spaces). 5.7 Related Work. A number of other researchers are using related linear algebra methods for information retrieval and classification work. Schutze [27] and Gallant [14] have used SVD and related dimension reduction ideas for word sense disambiguation and information retrieval work. Hull [17] and Yang and Chute [29] have used LSI/SVD as the first step in conjunction with statistical classification (e.g. discriminant analysis). Using the LSI-derived dimensions effectively reduces the number of predictor variables for classification. Wu et al. in [28] also used LSI/SVD to reduce the ....
D. HULL, Improving text retrieval for the routing problem using Latent Semantic Indexing, in Proceedings of the Seventeenth Annual International ACM-SIGIR Conference, 1994, pp. 282--291.
No context found.
Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th ACM/SIGIR Conference, pages 282--290.
No context found.
Hull, D. (1994). Improving text retrieval for the routing problem using latent semantic indexing. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Dublin, Ireland, pages 282--291, New York. ACM Press.
No context found.
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W. B. Croft and C. J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282--289, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
No context found.
David A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W. Bruce Croft and Cornelis J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282--289, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
No context found.
D. Hull, "Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing," Proc. 17th ACM-SIGIR Conf., pp. 282-291, 1994.
No context found.
D. Hull, "Improving Text Retrieval for the Routing Problem Using Latent Semantic Indexing," Proc. 17th ACM-SIGIR Conf., pp. 282-291, 1994.
No context found.
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In W. B. Croft and C. J. van Rijsbergen, editors, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 282-289, Dublin, IE, 1994. Springer Verlag, Heidelberg, DE.
No context found.
D. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In Proc. of the 17th ACM-SIGIR Conference, pages 282--291, 1994.
No context found.
D. A. Hull. Improving text retrieval for the routing problem using latent semantic indexing. In 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR-94), pages 282--289, 1994.
CiteSeer.IST - Copyright Penn State and NEC