Results 1 - 10 of 57
Training Algorithms for Linear Text Classifiers
1996
Cited by 276 (12 self)
Abstract: Systems for text retrieval, routing, categorization and other IR tasks rely heavily on linear classifiers. We propose that two machine learning algorithms, the Widrow-Hoff and EG algorithms, be used in training linear text classifiers. In contrast to most IR methods, theoretical analysis provides performance guarantees and guidance on parameter settings for these algorithms. Experimental data are presented showing Widrow-Hoff and EG to be more effective than the widely used Rocchio algorithm on several categorization and routing tasks.
1 Introduction
Document retrieval, categorization, routing, and filtering systems often are based on classification. That is, the IR system decides for each document which of two or more classes it belongs to, or how strongly it belongs to a class, in order to accomplish the IR task of interest. For instance, the two classes may be the documents relevant to and not relevant to a particular user, and the system may rank documents based on how likely it i...
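The Widrow-Hoff rule mentioned in the abstract is a gradient step on squared error. A minimal sketch, assuming dense term-weight vectors, 0/1 relevance labels, and a single training pass (the learning rate and loop structure here are illustrative, not the paper's exact setup):

```python
def widrow_hoff(examples, dim, eta=0.1):
    """One pass of the Widrow-Hoff (LMS) update for a linear classifier:
    after each example, w <- w - 2*eta*(w.x - y)*x."""
    w = [0.0] * dim
    for x, y in examples:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y  # prediction error
        w = [wj - 2 * eta * err * xj for wj, xj in zip(w, x)]
    return w
```

After training, a document is scored by the dot product of its vector with w.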
Query-Based Sampling of Text Databases
ACM Transactions on Information Systems, 1999
Cited by 218 (14 self)
Abstract: ... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers, nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
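The technique can be sketched as a loop that queries the database, folds retrieved documents into a term-frequency description, and draws later query terms from the vocabulary learned so far. The `search(term, k)` interface and all parameter names are assumptions for illustration:

```python
import random
from collections import Counter

def query_based_sample(search, seed_terms, n_docs=300, per_query=4, seed=0):
    """Build a term-frequency resource description of an uncooperative
    database by issuing single-term queries. `search(term, k)` is an
    assumed interface returning up to k documents, each a list of terms."""
    rng = random.Random(seed)
    description, pool, n_seen = Counter(), list(seed_terms), 0
    while n_seen < n_docs and pool:
        term = pool.pop(rng.randrange(len(pool)))  # pick a learned term
        for doc in search(term, per_query):
            n_seen += 1
            description.update(doc)   # fold document into the description
            pool.extend(doc)          # learned vocabulary feeds later queries
    return description
```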
On-line New Event Detection and Tracking
1998
Cited by 206 (5 self)
Abstract: We define and describe the related problems of new event detection and event tracking within a stream of broadcast news stories. We focus on a strict on-line setting: the system must make decisions about one story before looking at any subsequent stories. Our approach to detection uses a single-pass clustering algorithm and a novel thresholding model that incorporates the properties of events as a major component. Our approach to tracking is similar to typical information filtering methods. We discuss the value of “surprising” features that have unusual occurrence characteristics, and briefly explore on-line adaptive filtering to handle evolving events in the news. New event detection and event tracking are part of the Topic Detection and Tracking (TDT) initiative.
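The single-pass clustering idea can be sketched as follows: each incoming story is compared to existing event centroids, and a below-threshold best similarity flags a new event. Cosine over raw term counts and the fixed threshold are simplifications of the paper's thresholding model:

```python
import math
from collections import Counter

def detect_new_events(stories, threshold=0.3):
    """Single-pass clustering for on-line new event detection.
    Returns one True/False flag per story: True = new event."""
    def cos(a, b):
        num = sum(a[t] * b[t] for t in a.keys() & b.keys())
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    clusters, flags = [], []
    for story in stories:
        vec = Counter(story)
        sims = [cos(vec, c) for c in clusters]
        best = max(sims, default=0.0)
        if best >= threshold:
            clusters[sims.index(best)].update(vec)  # continues a known event
            flags.append(False)
        else:
            clusters.append(vec)                    # first story of a new event
            flags.append(True)
    return flags
```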
Distributed Information Retrieval
In: Advances in Information Retrieval, 2000
Cited by 187 (20 self)
Abstract: A multi-database model of distributed information retrieval is presented, in which people are assumed to have access to many searchable text databases. In such an environment, full-text information retrieval consists of discovering database contents, ranking databases by their expected ability to satisfy the query, searching a small number of databases, and merging results returned by different databases. This paper presents algorithms for each task. It also discusses how to reorganize conventional test collections into multi-database testbeds, and evaluation methodologies for multi-database experiments. A broad and diverse group of experimental results is presented to demonstrate that the algorithms are effective, efficient, robust, and scalable.
1. Introduction
Wide area networks, particularly the Internet, have transformed how people interact with information. Much of the routine information access by the general public is now based on full-text information retrieval, as opposed t...
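The database-ranking step above can be illustrated with a bare-bones scorer over resource descriptions; this is a deliberately simplified stand-in for the CORI-style selection algorithms the chapter actually presents:

```python
def rank_databases(query_terms, descriptions):
    """Rank candidate databases for a query using only their resource
    descriptions (term -> document-frequency maps). Simplified sketch:
    score = sum of per-term document frequencies."""
    scores = {db: sum(df.get(t, 0) for t in query_terms)
              for db, df in descriptions.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

The top few databases would then be searched and their result lists merged.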
Combining Classifiers in Text Categorization
1996
Cited by 163 (7 self)
Abstract: Three different types of classifiers were investigated in the context of a text categorization problem in the medical domain: the automatic assignment of ICD9 codes to dictated inpatient discharge summaries. K-nearest-neighbor, relevance feedback, and Bayesian independence classifiers were applied individually and in combination. A combination of different classifiers produced better results than any single type of classifier. For this specific medical categorization problem, new query formulation and weighting methods used in the k-nearest-neighbor classifier improved performance.
1 Introduction
Past research in information retrieval has shown that one can improve retrieval effectiveness by using multiple representations in indexing and query formulation [27] [19] [3] [11] and by using multiple search strategies [5] [24] [7]. In this work, we investigate whether we can attain similar improvements in the domain of text categorization by combining different representations and classif...
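One common way to combine classifier outputs is a weighted sum of per-category scores after min-max normalization; the fusion scheme and the ICD9-style category labels below are illustrative assumptions, not necessarily the paper's exact method:

```python
def combine_scores(per_classifier, weights=None):
    """Fuse per-category score dicts from several classifiers by a
    weighted sum after min-max normalizing each classifier's scores."""
    weights = weights or [1.0] * len(per_classifier)
    fused = {}
    for scores, w in zip(per_classifier, weights):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against a constant score list
        for cat, s in scores.items():
            fused[cat] = fused.get(cat, 0.0) + w * (s - lo) / span
    return fused
```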
Overview of the Fifth Text REtrieval Conference (TREC-5)
Proceedings of the Fifth Text REtrieval Conference (TREC-5), 1997
Incremental Relevance Feedback for Information Filtering
1996
Cited by 105 (4 self)
Abstract: We use data from the TREC routing experiments to explore how relevance feedback can be applied incrementally, using a few judged documents each time, to achieve results that are as good as if the feedback occurred in one pass. We show that relatively few judgments are needed to get high-quality results. We also demonstrate methods that reduce the amount of information archived from past judged documents without adversely affecting effectiveness. A novel simulation shows that such techniques are useful for handling long-standing queries with drifting notions of relevance.
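Incremental feedback can be sketched with running centroids, so the query is re-derived after each small batch of judgments without archiving every past document. This is a generic incremental Rocchio formulation under assumed default weights, not the paper's exact method:

```python
class IncrementalFeedback:
    """Maintain running relevant/non-relevant centroid sums; the current
    query is q = alpha*q0 + beta*mean(rel) - gamma*mean(nonrel)."""

    def __init__(self, query, alpha=1.0, beta=0.75, gamma=0.25):
        self.q0 = dict(query)
        self.rel_sum, self.n_rel = {}, 0
        self.non_sum, self.n_non = {}, 0
        self.a, self.b, self.g = alpha, beta, gamma

    def judge(self, doc, relevant):
        """Fold one newly judged document (term -> weight) into the sums."""
        target = self.rel_sum if relevant else self.non_sum
        for t, w in doc.items():
            target[t] = target.get(t, 0.0) + w
        if relevant:
            self.n_rel += 1
        else:
            self.n_non += 1

    def query(self):
        """Re-derive the current query vector from the running sums."""
        q = {t: self.a * w for t, w in self.q0.items()}
        for t, w in self.rel_sum.items():
            q[t] = q.get(t, 0.0) + self.b * w / max(self.n_rel, 1)
        for t, w in self.non_sum.items():
            q[t] = q.get(t, 0.0) - self.g * w / max(self.n_non, 1)
        return q
```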
Inferring query performance using pre-retrieval predictors
In: Proc. Symposium on String Processing and Information Retrieval, 2004
Cited by 104 (6 self)
Abstract: The prediction of query performance is an interesting and important issue in Information Retrieval (IR). Current predictors involve the use of relevance scores, which are time-consuming to compute; they are therefore not very suitable for practical applications. In this paper, we study a set of predictors of query performance which can be generated prior to the retrieval process. The linear and non-parametric correlations of the predictors with query performance are thoroughly assessed on the TREC disk4 and disk5 (minus CR) collections. According to the results, some of the proposed predictors have significant correlation with query performance, showing that these predictors can be useful for inferring query performance in practical applications.
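A representative pre-retrieval predictor of this kind is the average inverse document frequency of the query terms, which needs only collection statistics and no relevance scores (shown here as a generic example, not necessarily one of the paper's exact predictors):

```python
import math

def avg_idf(query_terms, doc_freq, n_docs):
    """Average IDF of the query terms; higher values suggest a more
    discriminative (and thus likely better-performing) query."""
    idfs = [math.log(n_docs / doc_freq[t])
            for t in query_terms if doc_freq.get(t)]
    return sum(idfs) / len(idfs) if idfs else 0.0
```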
An Analysis of Statistical and Syntactic Phrases
1997
Cited by 89 (2 self)
Abstract: As the amount of textual information available through the World Wide Web grows, there is a growing need for high-precision IR systems that enable a user to find useful information from the masses of available textual data. Phrases have traditionally been regarded as precision-enhancing devices and have proved useful as content identifiers in representing documents. In this study, we compare the usefulness of phrases recognized using linguistic methods with those recognized by statistical techniques, focusing in particular on high-precision retrieval. We find that once a good basic ranking scheme is in use, phrases do not have a major effect on precision at high ranks; phrases are more useful at lower ranks, where the connection between documents and relevance is more tenuous. We also find that the syntactic and statistical methods for recognizing phrases yield comparable performance.
1 Introduction
The amount of textual information available through the World Wide...
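The simplest statistical alternative to syntactic phrase recognition treats recurring adjacent token pairs as candidate phrases. A minimal sketch (stopword filtering and the frequency cutoff are assumptions; real statistical phrase recognizers add both):

```python
from collections import Counter

def statistical_phrases(docs, min_count=2):
    """Return adjacent two-word pairs that occur at least `min_count`
    times across the tokenized collection."""
    pairs = Counter()
    for tokens in docs:
        pairs.update(zip(tokens, tokens[1:]))  # consecutive token pairs
    return {p for p, c in pairs.items() if c >= min_count}
```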
Term proximity scoring for keyword-based retrieval systems
In: Proc. of the 25th European Conf. on IR Research, 2003
Cited by 71 (3 self)
Abstract: This paper suggests the use of proximity measurement in combination with the Okapi probabilistic model. First, using the Okapi system, our investigation was carried out in a distributed retrieval framework to calculate the same relevance score as that achieved by a single centralized index. Second, by applying a term-proximity scoring heuristic to the top documents returned by a keyword-based system, we aim to enhance retrieval performance. Our experiments were conducted using the TREC8, TREC9 and TREC10 test collections, and show that the suggested approach is stable and generally tends to improve retrieval effectiveness, especially for the top-ranked documents.
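A generic term-proximity heuristic in this spirit rewards pairs of query-term occurrences that fall close together, with the contribution decaying in squared token distance; this is an illustrative formula, not the paper's exact weighting:

```python
def proximity_boost(pos_a, pos_b, max_dist=5):
    """Score co-occurrences of two query terms from their token-position
    lists: each pair within `max_dist` tokens contributes 1/d^2."""
    return sum(1.0 / (abs(i - j) ** 2)
               for i in pos_a for j in pos_b
               if 0 < abs(i - j) <= max_dist)
```

The resulting boost would be added to each top-ranked document's keyword (e.g. Okapi) score before re-ranking.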