M. Junker and R. Hoch. Evaluating OCR and non-OCR text representations for learning document classifiers. In Proc. ICDAR 97, 1997.
....the OCR process. No solution to the problem is given, although it is reasonable to assume that the incorporation of a specialized term weighting scheme for OCR documents, such as the ones described above, would help to improve performance. Evidence for this assumption is presented in [Cavn94] and [Junk97], wherein advanced techniques such as n-gram processing and morphological analysis are used to reduce the effect of imperfections introduced by the OCR process on retrieval effectiveness. A survey of common techniques used to enhance effectiveness in text categorization can be found in ....
....information about the font, size, and position of text, that may be important to help discriminate between classes. Moreover, OCR text is noisy, and another direction for improvement is to include more sophisticated feature selection methods, like morphological analysis or the use of n-grams [4, 14]. Another aspect is the granularity of document structure being exploited. Working at the level of pages is straightforward since page boundaries are readily available. However, actual category boundaries may not coincide with page boundaries and some pages contain portions of text related to ....
M. Junker and R. Hoch. Evaluating OCR and non-OCR text representations for learning document classifiers. In Proc. ICDAR 97, 1997.
....the document's icon onto the CM program icon. He or she may then provide a textual description and explanation for the document. These texts can be utilized to automatically assign the new knowledge item to respective nodes of the structuring models applying techniques from text classification [14]. However, the user may also manually classify the knowledge item by clicking the respective nodes in the models. A knowledge item may be assigned to more than one node in a single model. The models always highlight the nodes which correspond to the knowledge item under work. If the actual ....
....documents itself. In this way text analysis can be employed to even categorize photos, videos, tables, presentations, etc. Therefore, a rule-based text classification tool that automatically learns and applies classification rules for the nodes in each model has been integrated into the prototype [14]. The rich possibilities to link knowledge items (references to source, links to nodes in the models, links to other knowledge items) call for the application of hypermedia techniques. Therefore, knowledge items, models, and nodes are consequently defined as hypermedia nodes. For the prototype ....
Junker, M. and Hoch, R. 1997. Evaluating OCR and Non-OCR Text Representations for Document Classification. In Proceedings ICDAR-97, Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, August.