Machine Learning in Automated Text Categorization
ACM Computing Surveys, 2002
Cited by 838 (13 self)
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval
In Proceedings of SIGIR’94, 1994
Cited by 289 (9 self)
The 2–Poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. The variables concerned are within-document term frequency, document length, and within-query term frequency. Simple weighting functions are developed, and tested on the TREC test collection. Considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated.
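As an illustration of how within-document term frequency, document length, and within-query term frequency might enter a single weighting function, here is a minimal sketch. It uses the saturating, length-normalized form that later became widely known as BM25-style weighting; the abstract does not state the paper's formulas, so this should not be read as the exact functions tested there. The function names and the constants `k1`, `b`, and `k3` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def saturate(freq, k):
    """Saturating transform: grows with freq but is bounded above by k + 1."""
    return freq * (k + 1.0) / (freq + k)


def score(query_tokens, doc_tokens, avg_doc_len, n_docs, doc_freq,
          k1=1.2, b=0.75, k3=8.0):
    """Score one document against a query (illustrative sketch only).

    doc_freq maps a term to the number of documents that contain it.
    k1, b and k3 are assumed tuning constants, not values from the paper.
    """
    doc_tf = Counter(doc_tokens)
    query_tf = Counter(query_tokens)
    # Documents longer than average get a larger factor, which damps tf.
    length_factor = (1.0 - b) + b * len(doc_tokens) / avg_doc_len
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf[term]
        if tf == 0:
            continue  # query terms absent from the document contribute nothing
        n = doc_freq[term]
        # Inverse collection frequency weight (can be negative for very common terms).
        icf = math.log((n_docs - n + 0.5) / (n + 0.5))
        doc_part = tf * (k1 + 1.0) / (tf + k1 * length_factor)
        total += icf * doc_part * saturate(qtf, k3)
    return total
```

With `b = 0` the length correction disappears, and as `k1` approaches 0 the document part tends to 1, which leaves plain inverse collection frequency weighting as the baseline.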
Probabilistic Models in Information Retrieval
The Computer Journal, 1992
Cited by 87 (4 self)
This paper gives an introduction to and survey of probabilistic information retrieval (IR). First, the basic concepts of this approach are described: the probability ranking principle shows that optimum retrieval quality can be achieved under certain assumptions; a conceptual model for IR along with the corresponding event space clarifies the interpretation of the probabilistic parameters involved. For the estimation of these parameters, three different learning strategies are distinguished, namely query-related, document-related and description-related learning. As a representative of each of these strategies, a specific model is described. A new approach regards IR as uncertain inference; here, imaging is used as a new technique for estimating the probabilistic parameters, and probabilistic inference networks support more complex forms of inference. Finally, the more general problems of parameter estimation, query expansion and the development of models for advanced document representations are discussed.
Automatic Essay Grading Using Text Categorization Techniques
In Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, 1998
Cited by 64 (3 self)
The commas are the most useful and usable of all the stops. It is highly important to put them in place as you go along. If you try to come back after doing a paragraph and stick them in the various spots that tempt you you will discover that they tend to swarm like minnows into all sorts of crevices whose existence you hadnt realized and before you know it the whole long sentence becomes immobilized and lashed up squirming in commas. Better to use them sparingly, and with affection precisely when the need for one arises, nicely, by itself.
"Is This Document Relevant? ...Probably": A Survey of Probabilistic Models in Information Retrieval
2001
Cited by 55 (12 self)
This article surveys probabilistic approaches to modeling information retrieval. The basic concepts of probabilistic approaches to information retrieval are outlined and the principles and assumptions upon which the approaches are based are presented. The various models proposed in the development of IR are described, classified, and compared using a common formalism. New approaches that constitute the basis of future research are described.
A Risk Minimization Framework for Information Retrieval
In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM, 2003
Cited by 36 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
A New Probabilistic Model of Text Classification and Retrieval
1996
Cited by 31 (0 self)
This paper introduces the multinomial model of text classification and retrieval. One important feature of the model is that the tf statistic, which usually appears in probabilistic IR models as a heuristic, is an integral part of the model. Another is that the variable length of documents is accounted for, without either making a uniform length assumption or using length normalization. The multinomial model employs independence assumptions which are similar to assumptions made in previous probabilistic models, particularly the binary independence model and the 2-Poisson model. The use of simulation to study the model is described. Performance of the model is evaluated on the TREC-3 routing task. Results are compared with the binary independence model and with the simulation studies.
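To make concrete how term frequency can be an integral part of a multinomial model rather than a heuristic, the following sketch scores a document by its log-likelihood under per-class term distributions. Each occurrence of a term contributes once, so document length is handled without separate normalization. This is a generic multinomial classifier, not the paper's own formulation; the add-alpha smoothing, the uniform class prior, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs_by_class, alpha=1.0):
    """Estimate smoothed log term probabilities per class.

    docs_by_class maps a class label to a list of token lists.
    alpha is an assumed add-alpha smoothing constant.
    """
    vocab = {t for docs in docs_by_class.values() for d in docs for t in d}
    model = {}
    for label, docs in docs_by_class.items():
        counts = Counter(t for d in docs for t in d)
        total = sum(counts.values()) + alpha * len(vocab)
        model[label] = {t: math.log((counts[t] + alpha) / total) for t in vocab}
    return model


def log_likelihood(tokens, log_theta):
    """Sum tf * log(theta) over the document's terms.

    The multinomial coefficient is omitted because it is the same for
    every class; terms outside the training vocabulary are ignored.
    """
    tf = Counter(tokens)
    return sum(n * log_theta[t] for t, n in tf.items() if t in log_theta)


def classify(tokens, model):
    """Pick the class with the highest log-likelihood (uniform prior assumed)."""
    return max(model, key=lambda label: log_likelihood(tokens, model[label]))
```

For example, after training on one small document per class, `classify(["ball", "ball"], model)` favours whichever class assigned the higher probability to `ball`, and the repeated occurrence doubles that term's contribution.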
Probabilistic Information Retrieval as Combination of Abstraction, Inductive Learning and Probabilistic Assumptions
1994
Cited by 23 (1 self)
We show that earlier approaches in probabilistic information retrieval are based on one or two of three concepts: abstraction, inductive learning, and probabilistic assumptions. We propose a new approach which combines all three concepts. This approach is illustrated for the case of indexing with a controlled ...
The Effects Of Query Complexity, Expansion And Structure On Retrieval Performance In Probabilistic Text Retrieval
University of Tampere, 1999
Cited by 18 (6 self)
... queries using all search facets identified from requests, low complexity was achieved by formulating queries with major facets only. Query expansion was based on a thesaurus, from which the expansion keys were elicited for queries. There were five expansion types: (1) the first query version was an unexpanded, original query with one search key for each search concept (original search concepts) elicited from the test thesaurus; (2) the synonyms of the original search keys were added to the original query; (3) search keys representing the narrower concepts of the original search concepts were added to the original query; (4) search keys representing the associative concepts of the original search concepts were added to the original query; (5) all previous expansion keys were cumulatively added to the original query. Query structure refers to the syntactic structure of a query expression, marked with query operators and parentheses. The structure of queries was either weak (queries with n
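
The five expansion types listed in this abstract can be sketched as a small function that builds each query version from a thesaurus lookup. The thesaurus layout (a dict with `synonyms`, `narrower`, and `associative` lists per key) and the function name are hypothetical, chosen only to illustrate the scheme; the study's actual thesaurus and query language are not described here.

```python
def expand_queries(original_keys, thesaurus):
    """Build the five query versions from a list of original search keys.

    thesaurus is a hypothetical mapping:
        key -> {"synonyms": [...], "narrower": [...], "associative": [...]}
    Keys missing from the thesaurus simply contribute no expansion keys.
    """
    def related(kind):
        return [k for key in original_keys
                for k in thesaurus.get(key, {}).get(kind, [])]

    original = list(original_keys)
    synonyms = related("synonyms")
    narrower = related("narrower")
    associative = related("associative")
    return {
        1: original,                                      # unexpanded query
        2: original + synonyms,                           # plus synonyms
        3: original + narrower,                           # plus narrower concepts
        4: original + associative,                        # plus associative concepts
        5: original + synonyms + narrower + associative,  # cumulative expansion
    }
```

Query structure (operators and parentheses) is left out of this sketch; each version is returned as a flat list of search keys.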

