The Organisation and Retrieval of Document Collections: A Machine Learning Approach (2003)
BibTeX
@PHDTHESIS{Vinokourov03theorganisation,
author = {Alexei Vinokourov},
title = {The Organisation and Retrieval of Document Collections: A Machine Learning Approach},
school = {University of Paisley},
year = {2003}
}
Abstract
Alexei Vinokourov. Doctor of Philosophy, School of Information and Communication Technologies, University of Paisley, Paisley, Scotland, 2003.

The enormous growth of (online) text information available in digital form has raised the problem of automatically structuring and processing large document collections. Consequently, the automatic organisation of large text collections has become an important issue in modern text information access systems. This problem is identified as the Information Organisation problem. In this thesis we present a method termed Multinomial ASymmetric Hierarchical Analysis (MASHA) that automates the structuring of a large document collection into a hierarchy of topics. We also explore the use of the resulting structure to improve performance in document retrieval and classification applications. In addition to other similar work, we also present a method for deducing hierarchies from text corpora; in other words, for finding the topic hierarchy most appropriate for a given document collection, one that reflects the structure of the textual data in terms of interrelationships between the hypothetical topics that presumably underlie the collection or are most appropriate for categorising its documents. Unfortunately, the cost of learning probabilistic models is, at best, proportional to the size of the collection, the size of the vocabulary and the size of the derived hierarchy. It appears, however, that for some tasks, particularly cross-lingual information retrieval, the computational cost can be reduced by employing other methods. One such method, kernel Canonical Correlation Analysis (KCCA), learns a semantic represen...







