| N. Fuhr, "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25(1), pages 55-72, 1989. |
....have been developed that more or less gracefully integrate term frequency and document length information into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal binary indexing of the document, for which the observed index term occurrences provide evidence [7, 13]. Retrieval or classification is based on computing (or approximating) the expected value of the posterior log odds. The expectation is taken with respect to the probabilities of various ideal indexings. While this is a plausible approach, in practice the probabilities of the ideal indexings are ....
Norbert Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55-72, 1989.
....data to estimate the probability that a binary document or query term is represented correctly, and if that seemed improbable, then that bit would be flipped from zero to one or vice versa. This is the corollary to VSM document and query modifications in response to relevance feedback data. Fuhr [7] points exactly in this direction, by explicitly modelling $P(x \mid d_m)$, the probability that representation $x$ is correct for document $d_m$, separately from the probability of relevance $P(R \mid x, f_k)$. That model contributed to the development of our approach. We went one step further: instead of ....
Fuhr, N., Models for Retrieval with Probabilistic Indexing. Information Processing and Management, 1989. 25(1): p. 55-72.
....its document vector has the highest cosine. The Probabilistic TFIDF classifier (Joachims, 1997) is a probabilistic version of the TFIDF classifier, based on estimation of the probability of a category $C$ given document $d$, $\Pr(C \mid d)$, using the retrieval with probabilistic indexing method proposed in (Fuhr, 1989). To classify a new document $d$, $\Pr(C_j \mid d)$ is estimated for each class $C_j$, as described in more detail by Joachims (Joachims, 1997); $d$ is assigned to the class whose probability is the highest. The Maximum Entropy (ME) classifier for text classification estimates the conditional ....
Fuhr, N. (1989). Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1), 55--72.
....notion, Cooper and Maron [33] proposed the idea of probabilistic indexing of documents in which index terms are given probabilistic weights based on the relevance of these index terms to queries likely to be given to the retrieval system. Much more recent work in probabilistic indexing by Fuhr [56, 57] also treats the retrieval problem as one of making probabilistic inferences about the relevance of documents to a query, and examines this problem from the different viewpoints of the query and the document. The work along these lines closest in spirit to our own is that of van Rijsbergen [166] ....
....notions of document overlap, based on equivalence classes of words (e.g. synonyms), phrases, or, in general, any function on groups of words in the corpus. In this way, our score can capture the full generality of probabilistic indexing [56] techniques used in other tasks (such as document retrieval). This extension can be performed by computing the expected document overlap as a sum over multiple multinomial distributions (one for each set of mutually exclusive functional events). For example, say we wished to consider both single ....
Fuhr, N. Models for retrieval with probabilistic indexing. Information Processing and Management 25, 1 (1989), 55--72.
....Calculation of Similarity between Two Documents. A document filtering system based on probabilistic models calculates a posterior probability $P(c \mid d)$, the probability that a user's profile $d$ is classified into a cluster $c$. Many methods of calculating posterior probability have been proposed [2][3][4][5]. Our patent retrieval system adopts Iwayama's formulation because it has the following advantages over other calculation methods: 1) it considers within-document term frequencies; 2) it considers term weighting for incoming documents; 3) it is less affected by having an insufficient ....
N. Fuhr: "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25(1), pp. 55-72, 1989.
....have been developed that more or less gracefully integrate term frequency and document length information into the BIM itself. The widely used probabilistic indexing approach assumes there is an ideal binary indexing of the document, for which the observed index term occurrences provide evidence [7, 13]. Retrieval or classification is based on computing (or approximating) the expected value of the posterior log odds. The expectation is taken with respect to the probabilities of various ideal indexings. While this is a plausible approach, in practice the probabilities of the ideal indexings are ....
Norbert Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55--72, 1989.
....for data on the Web. 1.1 Relation to Other Work. Our PRT algorithm is similar to the algorithm used by Maron [28]. The main difference is that Maron used a small number of features, manually selected, while we use the full document vocabulary. Other variants of this method were used in [12, 21, 24, 33]. The main SE algorithm we examine was recently introduced by Cohen and Singer [6]. In addition, we introduce a novel context sensitive variant of the algorithm and a feature reduction mechanism. The datasets we used represent generic text classification problems. Web pages filtering is relatively ....
.... of these distributions, the PRT algorithm is an optimal classifier [9]. Although this independence assumption is obviously violated in natural language text (see discussions in Cooper [8] and Lewis [23]), variants of this algorithm have been applied successfully in a variety of IR tasks (see e.g. [28, 12, 21, 24, 33]). In addition to the basic algorithm (pure PRT), we considered its following variations. Sequential application. In the classical PRT hypothesis testing procedure [35, 9], one specifies a significance level parameter which prescribes two thresholds U( log 1 and L( log 1 ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing & Management, 25(1):55-72, 1989.
....classification. That is because we somewhat directly adapt the Naive Bayes Classifier described in [17] into the document classification in this work with little variation on text representation. The evaluation of performance of the variations of this learning algorithm has been described in [4, 13, 17]. In this context, however, it would be apparent that the identifying passage can fail or be misled when the incorrect text category knowledge is given, and this method is suspect or powerless if the knowledge is incorrect or unavailable. Therefore, identifying a passage is based on such a simple ....
N. Fuhr. Models for Retrieval with Probabilistic Indexing. Information Processing and Management. 25(1), p55-72, 1989.
....its document vector has the highest cosine. The Probabilistic TFIDF classifier [Joachims, 1997] is a probabilistic version of the TFIDF classifier, based on estimation of the probability of a category $C$ given document $d$, $\Pr(C \mid d)$, using the retrieval with probabilistic indexing method proposed in [Fuhr, 1989]. To classify a new document $d$, $\Pr(C_j \mid d)$ is estimated for each class $C_j$, as described in more detail by Joachims [1997]; $d$ is assigned to the class whose probability is the highest. Ripper [Cohen, 1995; 1996] is a learning method that forms sets of simple rules for data described by ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55--72, 1989.
....training data instead of non structured textual data. This motivated many approaches to document classification use corpus to characterize documents and develops new algorithms to learn classification knowledge. These algorithms include Bayesian independence classifier [21], k-nearest neighbor [22, 32], rule based induction algorithm [10] and mixed approaches such as INQUERY [33]. Those systems concentrate on the document categorization and the learning algorithm, but they omit the diversity of the semantics of terms (or features) in the document. In machine learning, the feature is usually an ....
N. Fuhr, "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, Vol. 25, No. 1, 1989, pp. 55-72.
....with similar content have similar vectors. The Probabilistic TFIDF classifier [7] is a probabilistic version of the TFIDF classifier, based on estimation of the probability of a category $C_j$ given document $d$, $\Pr(C_j \mid d)$, using the retrieval with probabilistic indexing method proposed in [6]. Ripper [4, 5] is a learning method that forms sets of simple rules for data described by sets of attribute value pairs. Each rule tests a conjunction of conditions on attribute values. Rules are returned as an ordered list, and the first successful rule provides the prediction for the class ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55--72, 1989.
....between $C_i$ and $d$ given $t$, that is $P(C_i \mid t, d) = P(C_i \mid t)$, we obtain Eq. (9): $P(C_i \mid d) = \sum_t P(C_i \mid t) P(t \mid d)$ (9). Using Bayes' rule, we finally obtain Eq. (10): $P(C_i \mid d) = P(C_i) \sum_t \frac{P(t \mid C_i) P(t \mid d)}{P(t)}$ (10). This formulation is different from the one proposed in [3, 15]. The details of this formulation are discussed elsewhere [16]. Here $P(t \mid C_i)$ is the probability that a randomly selected term in the category $C_i$ is the term $t$. $P(t \mid d)$ is the probability that a randomly selected term in the text $d$ is the term $t$. $P(t)$ and $P(C_i)$ are the prior probabilities of ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing & Management, Vol. 25, No. 1, pp. 55--72, 1989.
....consistently and efficiently to large numbers of daily incoming documents. The purpose of this paper is to propose a new probabilistic model for automatic text categorization. While many text categorization models have been proposed so far, in this paper, we concentrate on the probabilistic models [12, 8, 6, 9, 3, 17, 18] because these models have solid formal grounding in probability theory. Section 2 quickly reviews the probabilistic models and lists their individual problems. In section 3, we propose a new probabilistic model based on a Single random Variable with Multiple Values (SVMV). Our model is very ....
....i has, the more probably it will be categorized into category c. This is called the Probabilistic Ranking Principle (PRP) [11]. Several strategies can be used to assign categories to a document based on PRP [9]. There are several ways to calculate $P(c \mid d)$. Three representatives are [12], [8] and [6]. 2.1 Probabilistic Relevance Weighting (PRW). Robertson and Sparck Jones [12] make use of the well known logistic (or log odds) transformation of the probability $P(c \mid d)$: $g(c \mid d) = \log \frac{P(c \mid d)}{P(\bar{c} \mid d)}$ (2), where $\bar{c}$ means "not $c$", that is, a document is not categorized into $c$. Since this is a ....
[Article contains additional citation context not shown here]
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing & Management, Vol. 25, No. 1, pp. 55--72, 1989.
.... is a general form of the well known Maximum Likelihood estimation, and we call the algorithm Hierarchical Bayesian Clustering (HBC). Probabilistic models are becoming popular in the field of text retrieval categorization owing to their solid formal grounding in probability theory [Croft, 1981, Fuhr, 1989, Kwok, 1990, Lewis, 1992]. They retrieve those texts that have larger posterior probabilities of being relevant to a request. When these models are extended to cluster based text retrieval categorization, however, the algorithm used for text clustering has still been a non probabilistic one [ ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing & Management, 25(1):55--72, 1989.
....optimizes both the cost and the effectiveness of the retrieval system. The importance of the probability ranking principle comes from the fact that it can be proven mathematically. Two well known probabilistic retrieval methods are the Binary Independence Retrieval (BIR) model [RSJ76], [Fuh89] and the Binary Independence Indexing (BII) model [FB91]. The BIR model assigns probabilistic weights to query features whereas the BII model assigns probabilistic weights to document features. The probabilistic parameters are computed by means of a test collection. Both the BII and the BIR ....
N. Fuhr. Models for Retrieval with Probabilistic Indexing. Information Processing & Management, 25(1):55--72, 1989.
....of the classifier by examining the structure of the page's URL. Below we describe these two steps in turn. 4.1 Using Word Vectors to Classify Web Pages. Word vector based methods represent a document as a vector, with one entry for each word in the vocabulary. The Probabilistic Indexing approach [4] used in this paper classifies a new document d by summing, over all words in d, the probability that the word is representative of both the document and the class. More precisely, it assigns the class C to document d according to the following rule: C = argmax 8C Pr(Cjd ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25:55--72, 1989.
....most NLP tasks. TR, even more than DR, is tolerant with respect to errors in document representations. In addition, ambiguities in NLP system output (for instance, alternative decompositions of a sentence into phrases) can be assigned probabilities of correctness in a probabilistic indexing method [11]. On the other hand, NLP applied to documents must cope with vast amounts of variable quality text from broad domains. User requests present smaller amounts of text, but even more variability in form and content. Each of the three main aspects of our strategy forming text descriptions, providing ....
Fuhr, N. Models for retrieval with probabilistic indexing. Inf. Process. Manage., 25, 1 (1989), 55--72.
....be assigned those terms that are used by queries to which the document is relevant. With this model, the notion of weighted indexing (instead of binary indexing) that is the weighting of the index terms w.r.t. the document, was given a theoretical justification in terms of probabilities. In [13], this approach is generalized to all models of probabilistic indexing by introducing the concept of correctness as the event to which the probabilities relate. The Maron and Kuhns model assumes that the probabilistic indexing weights for a document can be estimated on the basis of relevance ....
....range: in the case of search term weighting from relevance feedback, the relevance information collected for one query is worthless for any other query. In the same way, the probabilistic indexing approach restricts the use of relevance data to a single document. The Darmstadt Indexing Approach [13] [3] overcomes these deficiencies by introducing the concept of relevance descriptions: a relevance description is an abstraction from specific queries, documents and terms. Like in pattern recognition methods, a relevance description contains values of features of the objects under consideration ....
[Article contains additional citation context not shown here]
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55--72, 1989.
....In the following, we will assume that $P(I \mid d_m)$ is the same for all documents; so we only have to estimate the parameters $P(I \mid t_i, d_m)$. A direct estimation of these parameters would suffer from the same problems as described before. Instead, we apply the so-called description-oriented approach [5]. Here the basic idea is the abstraction from specific terms and documents. Instead, we regard feature vectors $x(t_i, d_m)$ of term-document pairs, and we estimate probabilities $P(I \mid x(t_i, d_m))$ referring to these vectors. The differences between the two strategies are illustrated in figure 9. A ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55--72, 1989.
....methods for coping with imprecision in databases [IEEE 89, Motro 90]. As new databases for technical, scientific and office applications are set up, this issue becomes of increasing importance. A first probabilistic model that can handle both vague queries and imprecise data has been presented in [Fuhr 90]. Furthermore, the integration of text and fact retrieval will be a major issue (see e.g. [Rabitti Savino 90]). Finally, it should be mentioned that the models discussed here do scarcely take into account the special requirements of interactive retrieval. Even the feedback methods are more or ....
Fuhr, N. (1989a). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1), pages 55--72.
No context found.
Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1), pages 55--72.
....learning strategy. 3 A new probabilistic model for the Darmstadt Indexing Approach. The Darmstadt Indexing Approach (DIA) is a dictionary based approach for automatic indexing from document titles and abstracts, with index terms (called descriptors here) from a prescribed indexing vocabulary ([8], [11]). This means that a descriptor may be assigned to a document even when it does not occur in the document text. For the task of mapping text content onto the set of descriptors, the approach needs an indexing dictionary containing term descriptor rules for as many terms (i.e. words or ....
N. Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25, 1 (1989), 55--72.
No context found.
N. Fuhr, "Models for Retrieval with Probabilistic Indexing", Information Processing and Management, 25(1), pages 55-72, 1989.
No context found.
Fuhr, Norbert, "Models for Retrieval with Probabilistic Indexing," Information Processing & Management, 25, 1, 1989, pp. 55-72.
No context found.
Fuhr, N. (1989a). Models for Retrieval with probabilistic Indexing. Information Processing and Management 25(1), pages 55-72.