Results 1 - 10 of 17
A survey of retrieval strategies for OCR text collections
- In Proceedings of the Symposium on Document Image Understanding Technologies, 2003
Cited by 9 (0 self)
The importance of effectively retrieving OCR text has grown significantly in recent years. We provide a brief overview of work done to improve the effectiveness of retrieval of OCR text.
Report on the TREC-11 Experiment: Arabic, Named Page and . . .
- In Proceedings of the Eleventh Text Retrieval Conference TREC-2002 (pp. 765–774). NIST Special Publication, 2003
Cited by 8 (3 self)
... This document representation uses multi-vectors in order to highlight the importance of both link anchor information and document content.
Stemming Arabic Conjunctions and Prepositions
Cited by 5 (0 self)
Arabic is the fourth most widely spoken language in the world, and is characterised by a high rate of inflection. To cater for this, most Arabic information retrieval systems incorporate a stemming stage. Most existing Arabic stemmers are derived from English equivalents; however, unlike English, most affixes in Arabic are difficult to discriminate from the core word. Removing incorrectly identified affixes sometimes results in a valid but incorrect stem, and in most cases reduces retrieval precision. Conjunctions and prepositions form an interesting class of these affixes. In this work, we present novel approaches for dealing with these affixes. Unlike previous approaches, our approaches focus on retaining valid Arabic core words, while maintaining high retrieval performance.
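The idea of stripping a conjunction or preposition only when a valid core word remains can be illustrated with a minimal sketch. The abstract does not describe the paper's actual algorithm, so the function name `strip_proclitic`, the list of single-letter proclitics, and the tiny lexicon are all assumptions made for illustration.

```python
# Hypothetical single-letter proclitics: conjunctions (wa, fa) and
# prepositions (bi, li, ka). This list is an assumption, not the paper's.
PROCLITICS = ("و", "ف", "ب", "ل", "ك")


def strip_proclitic(word, lexicon):
    """Drop one leading proclitic letter only if a known word remains."""
    if word in lexicon:
        # The whole token is already a valid word, so its first letter
        # is treated as part of the core word rather than as an affix.
        return word
    if word[:1] in PROCLITICS and word[1:] in lexicon:
        return word[1:]
    # No safe split was found; leave the token untouched.
    return word


# Toy lexicon (assumed): "kitab" (book) and "waqt" (time).
LEXICON = {"كتاب", "وقت"}
```

With this toy lexicon, `strip_proclitic("وكتاب", LEXICON)` removes the conjunction and returns `"كتاب"`, while `"وقت"` is kept whole because it is itself a lexicon entry even though it begins with the same letter.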
JHU/APL at TREC 2002: Experiments in Filtering and Arabic Retrieval
- NIST Special Publication: SP 500-255, The Twelfth Text Retrieval Conference (TREC 2002)
Cited by 2 (1 self)
Laboratory (JHU/APL) participated in two tracks at this year’s conference. We participated in the filtering track, again addressing the batch and routing subtasks, as well as the adaptive task for the first time. We also continued experiments in Arabic retrieval, emphasizing language-neutral approaches. For ranked retrieval, we relied on a statistical language model to compute query/document similarity values. Hiemstra and de Vries describe such a linguistically motivated probabilistic model and explain how it relates to both the Boolean and vector space models [4]. The model has also been cast as a rudimentary Hidden Markov Model [13]. Although the model does not explicitly incorporate inverse document frequency, it does favor documents
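Ranking with a statistical language model, as mentioned in this abstract, can be sketched as a unigram query-likelihood score that linearly interpolates document and collection term probabilities. This is one common formulation and is not necessarily the exact model used in the paper; the function name `query_log_likelihood` and the interpolation weight `lam=0.5` are assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_counts,
                         collection_len, lam=0.5):
    """Log P(query | document) under a linearly interpolated unigram model.

    lam weights the document model against the collection model; the
    value 0.5 is an arbitrary illustrative choice.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    log_p = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_col = collection_counts[term] / collection_len
        p = lam * p_doc + (1.0 - lam) * p_col
        if p == 0.0:
            # The term never occurs in the collection at all.
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Toy collection (assumed data) used to compare two documents.
DOCS = [
    ["arabic", "retrieval", "stemming"],
    ["ocr", "text", "retrieval"],
]
COLLECTION = Counter(term for doc in DOCS for term in doc)
COLLECTION_LEN = sum(COLLECTION.values())
```

Scoring the query `["arabic", "retrieval"]` against both toy documents ranks the first one higher, because the second matches "arabic" only through the collection model.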
Building a Heterogeneous Information Retrieval Test Collection of Arabic Document Images
This paper describes the development of an Arabic document image collection containing 34,651 documents from 1,378 different books and 25 topics with their relevance judgments. The books from which the collection is obtained are part of a larger collection of 75,000 books being scanned for archival and retrieval at the Bibliotheca Alexandrina (BA). The documents in the collection vary widely in topics, fonts, and degradation levels. Initial baseline experiments were performed to examine the effectiveness of different index terms, with and without blind relevance feedback, on Arabic OCR-degraded text.
Arabic Retrieval Revisited: Morphological Hole Filling
Due to Arabic’s morphological complexity, Arabic retrieval benefits greatly from morphological analysis – particularly stemming. However, the best known stemming does not handle linguistic phenomena such as broken plurals and malformed stems. In this paper we propose a model of character-level morphological transformation that is trained using Wikipedia hypertext to page title links. The use of our model yields statistically significant improvements in Arabic retrieval over the use of the best statistical stemming technique. The technique can potentially be applied to other languages.
Balanced Query Methods for Improving OCR-Based Retrieval
Since many documents are available only in print, improving OCR-based retrieval of scanned documents is an important problem. This paper presents a novel language-independent technique for mapping queries from an error-free space to an OCR-degraded document space using a noisy channel model to produce possible degraded versions of query terms. The new technique yielded statistically significant improvements in retrieval effectiveness of as much as 39% over clean queries when tested on an Arabic document image collection.
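Generating degraded versions of a query term with a noisy channel model can be sketched as follows. The sketch assumes independent single-character substitutions, which is a simplification; the confusion table `CONFUSIONS`, its probabilities, and the function name `degraded_variants` are hypothetical and not taken from the paper.

```python
from itertools import product

# Hypothetical confusion table: P(OCR output character | true character).
# Characters without an entry are assumed to be recognised correctly.
CONFUSIONS = {
    "l": {"l": 0.8, "1": 0.2},
    "o": {"o": 0.9, "0": 0.1},
}


def degraded_variants(term, confusions, top_k=5):
    """Return up to top_k (variant, probability) pairs, most probable first.

    Every combination of per-character outputs is enumerated, which is
    exponential in the term length and only suitable for short terms.
    """
    options = [list(confusions.get(ch, {ch: 1.0}).items()) for ch in term]
    variants = []
    for combo in product(*options):
        text = "".join(ch for ch, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        variants.append((text, prob))
    variants.sort(key=lambda pair: pair[1], reverse=True)
    return variants[:top_k]
```

Calling `degraded_variants("lo", CONFUSIONS)` yields the clean form first, followed by progressively less likely renderings, which could then be added to the query with weights proportional to their probabilities.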
TREC 2002 Cross-lingual Retrieval at BBN
- Alexander Fraser, Jinxi Xu and Ralph Weischedel (BBN Technologies, 50 Moulton Street)
- In Proceedings of the 11th Text REtrieval Conference (TREC), 2002
this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency, the SPAWAR Systems Center, or the United States Government.
JHU/APL at TREC 2002: Experiments in Filtering and Arabic Retrieval
plied SVMs to this year's Filtering tasks; however, some of our routing runs were based on statistical language models instead. Filtering Track. We participated in the routing, batch and adaptive tasks of the filtering track. Filtering Approach. Background. We continued to investigate the application of Support Vector Machines (SVMs) to filtering tasks. SVMs are used to create classifiers from a set of labeled training data, finding a hyperplane (possibly in a transformed space) to separate positive examples from negative examples. This hyperplane is chosen to maximize the margin (or distance) to the training points. The promise of large margin classification is that it does not overfit the training data and generalizes well to test data of similar distribution. See Hearst [3] for a general discussion of SVMs. We used the SVM-light package (version 3.50, by Thorsten Joachims [15]) to create classifiers based on the training data for classification of the test data, and wrote a JNI int
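The large-margin idea described in this abstract can be illustrated with a small linear classifier trained by subgradient descent on the regularised hinge loss. This is not SVM-light and does not reproduce the authors' setup; the function names, learning rate, regularisation weight, and toy data are assumptions chosen for illustration.

```python
def train_linear_svm(points, labels, lam=0.01, eta=0.1, epochs=200):
    """Fit w, b by subgradient descent on lam/2 * |w|^2 + hinge loss.

    Labels must be +1 or -1. lam, eta and epochs are illustrative values.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation step: shrink the weights towards zero.
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                # The point is inside the margin: move the hyperplane.
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy linearly separable data (assumed).
POINTS = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
LABELS = [1, 1, -1, -1]
```

After `w, b = train_linear_svm(POINTS, LABELS)`, calling `predict(w, b, x)` on each training point returns its label; a production system would instead rely on a dedicated solver such as the SVM-light package named in the abstract.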
Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology
This paper explores the use of a character-segment-based character correction model, language modeling, and shallow morphology for Arabic OCR error correction. Experimentation shows that character-segment-based correction is superior to single-character correction and that language modeling boosts correction by improving the ranking of candidate corrections, while shallow morphology had a small adverse effect. Further, given a sufficiently large corpus to extract a dictionary and to train a language model, word-based correction works well for a morphologically rich language such as Arabic.