Stylistic Experiments For Information Retrieval
, 2000
Cited by 47 (8 self)
Information retrieval systems are built to handle texts as topical items: texts are tabulated by occurrence frequencies of content words in them, under the assumption that text topic is reasonably well modeled by content word occurrence. But texts have several interesting characteristics beyond topic. The experiments described in this text investigate stylistic variation. Roughly put, style is the difference between two ways of saying the same thing -- and systematic stylistic variation can be used to characterize the genre of documents. These experiments investigate whether stylistic information is distinguishable using simple language engineering methods and, if so, whether this type of information can be used to improve information retrieval systems.
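As a rough illustration of what "simple language engineering methods" for style might look like, the sketch below computes a few topic-independent surface statistics for a text. The feature set, the function name `style_features`, and the short function-word list are assumptions made for this example; the abstract does not say which features the experiments actually used.

```python
import re

# Hypothetical, illustrative function-word list; not taken from the paper.
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "it", "that"}


def style_features(text):
    """Return a few simple, topic-independent surface statistics for a text."""
    # Split on runs of sentence-final punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not words or not sentences:
        return {
            "words_per_sentence": 0.0,
            "chars_per_word": 0.0,
            "type_token_ratio": 0.0,
            "function_word_share": 0.0,
        }
    return {
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
        "function_word_share": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }


if __name__ == "__main__":
    print(style_features("The cat sat. The dog ran."))
```

Statistics of this kind ignore what a text is about, so two documents on the same topic can still be told apart by how they are written.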
Effective Use of Natural Language Processing Techniques for Automatic Conflation of Multi-Word Terms: The Role of Derivational Morphology, Part of Speech Tagging, and Shallow Parsing
- In Research and Development in Information Retrieval
Cited by 20 (3 self)
We present a corpus-based system to expand multi-word index terms using a part-of-speech tagger and a full-fledged derivational morphological system, combined with a shallow parser. The system has been applied to French. The unique contribution of the research is in using these linguistically based tools with safety filters in order to avoid the problems of degradation typically associated with derivational analysis and generation. The successful expansion, and thus conflation, of terms increases indexing coverage by up to 30%, with precision of nearly 90% for correct identification of related terms. The fully implemented system is described with particular attention to the role of derivational morphology and phrasal relations. Results and evaluation are presented in terms of precision and recall, with an analysis and discussion of errors. This paper illustrates how natural language processing tools, when combined effectively for tasks to which they are especially suited, indicates the pote...
On the usefulness of extracting syntactic dependencies for text indexing
- of Lecture Notes in Artificial Intelligence
, 2002
Cited by 11 (3 self)
In recent years, there has been a considerable amount of interest in using Natural Language Processing in Information Retrieval research, with specific implementations varying from the word-level morphological analysis to syntactic parsing to conceptual-level semantic analysis. In particular, different degrees of phrase-level syntactic information have been incorporated in information retrieval systems working on English or Germanic languages such as Dutch. In this paper we study the impact of using such information, in the form of syntactic dependency pairs, in the performance of a text retrieval system for a Romance language, Spanish.
Tokenization and proper noun recognition for information retrieval
- In 3rd International Workshop on Natural Language and Information Systems (NLIS 2002), September 2-3, 2002. Aix-en-Provence
, 2002
Cited by 8 (5 self)
In this paper we consider a set of natural language processing techniques that can be used to analyze large amounts of text, focusing on the advanced tokenizer, which accounts for a number of complex linguistic phenomena, as well as for pre-tagging tasks such as proper noun recognition. We also show the results of several experiments performed in order to study the impact of the strategy chosen for the recognition of proper nouns.
Automatic language-specific stemming in information retrieval
- In Cross-language information retrieval and evaluation: Proceedings of the CLEF 2000 workshop
, 2001
Cited by 7 (0 self)
We employ Automorphology, an MDL-based algorithm that determines the suffixes present in a language-sample with no prior knowledge of the language in question, and describe our experiments on the usefulness of this approach for Information Retrieval, employing this stemmer in a SMART-based IR engine.
Probabilistic Term Variant Generator for Biomedical Terms
, 2003
Cited by 7 (0 self)
This paper presents an algorithm to generate possible variants for biomedical terms. The algorithm gives each variant its generation probability representing its plausibility, which is potentially useful for query and dictionary expansions. The probabilistic rules for generating variants are automatically learned from raw texts using an existing abbreviation extraction technique. Our method, therefore, requires no linguistic knowledge or labor-intensive natural language resource. We conducted an experiment using 83,142 MEDLINE abstracts for rule induction and 18,930 abstracts for testing. The results indicate that our method will significantly increase the number of retrieved documents for long biomedical terms.
A Nonparametric Method for Extraction of Candidate Phrasal Terms
- Proceedings of ACL’2005
, 2005
Cited by 4 (0 self)
This paper introduces a new method for identifying candidate phrasal terms (also known as multiword units) which applies a nonparametric, rank-based heuristic measure. Evaluation of this measure, the mutual rank ratio metric, shows that it produces better results than standard statistical measures when applied to this task.
To stem or lemmatize a highly inflectional language in a probabilistic IR environment?
, 1993
Cited by 3 (2 self)
Effects of three different morphological methods (lemmatization, stemming and inflectional stem generation) for Finnish are compared in a probabilistic IR environment (INQUERY). Evaluation is done using a four-point relevance scale which is partitioned differently in different test settings. Results show that inflectional stem generation, which has not been used much in IR, compares well with lemmatization in a best-match IR environment. Differences in performance between inflectional stem generation and lemmatization are small, and they are not statistically significant in most of the tested settings. It is also shown that stemming, a hitherto rather neglected method of morphological processing for Finnish, performs reasonably well, although the stemmer used (a Porter stemmer implementation) is far from optimal for a morphologically complex language like Finnish. In another series of tests, the effects of compound splitting and derivational expansion of queries are tested.
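The contrast between stemming and lemmatization can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the suffix list, the lemma dictionary, and the function names `toy_stem` and `toy_lemmatize` are invented, the examples are English rather than Finnish, and the suffix stripper is not the Porter algorithm or any of the methods evaluated in the paper.

```python
# Toy suffix list and lemma dictionary: both are made-up examples,
# far smaller than any real stemmer or morphological lexicon.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"ran": "run", "running": "run", "mice": "mouse", "houses": "house"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words pass through."""
    return LEMMAS.get(word, word)


if __name__ == "__main__":
    for w in ("running", "ran", "houses", "mice", "walked"):
        print(w, toy_stem(w), toy_lemmatize(w))
```

The stripper needs no lexicon but misses irregular forms such as "ran", while the dictionary lookup handles those forms and leaves any word it does not list unchanged. This is one reason the choice matters more for a language with rich inflection.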
The Contribution of Morphological Knowledge to French MeSH Mapping for Information Retrieval
, 2001
Cited by 3 (1 self)
Introduction: The Internet has become a major source of health information for the health professional and the citizen. However, general directories and search engines do not permit the end-user to obtain a clear and organized range of available useful health information. Therefore, a number of specialized health directories and catalogs have been created, such as CliniWeb (www.ohsu.edu/cliniweb/), HON and CISMeF, some of which (including those three) are additionally indexed with the MeSH Thesaurus and provided with a search interface. This allows complex queries to be stated, which can take advantage of the hierarchical structure of the MeSH (e.g., the `explode' function in Doc'CISMeF). Since not all users know the MeSH terms, the search interface must provide some way of mapping so-called "Natural Language" queries to MeSH terms. Mapping techniques try to reconcile character-level differences (typos, uppercase, ac...
Morphological and Syntactic Processing for Text Retrieval
- of Lecture Notes in Computer Science
, 2004
Cited by 3 (1 self)
This article describes the application of lemmatization and shallow parsing as a linguistically-based alternative to stemming in Text Retrieval, with the aim of managing linguistic variation at both word level and phrase level. Several alternatives for selecting the index terms among the syntactic dependencies detected by the parser are evaluated. Though this article focusses on...

