Using the Generic Document Profile to Cluster Similar Texts.
BibTeX
@MISC{Ellman_usingthe,
author = {Jeremy Ellman},
title = {Using the Generic Document Profile to Cluster Similar Texts.},
year = {}
}
Abstract
The World Wide Web contains a huge quantity of text that is notoriously inefficient to use. This work aims to improve Internet Information Retrieval by applying a text processing technique based on thesaurally derived lexical chains, where a lexical chain is a set of words in a text that are related both by proximity and by relations derived from an external lexical knowledge source such as WordNet, Roget's Thesaurus, or LDOCE. Finding information on the Internet is hard, even when users have a clear focus to their queries. The situation is worse when users have only vague notions about the topics they wish to explore. This could be remedied using Exemplar Texts, where an Exemplar Text is the ideal model result for a Web search. The problem is thereby transformed into one of identifying similar texts. The Generic Document Profile is designed to allow documents to be compared for similarity independently of terminology and document length. It is simply a set of semantic categories derived from Roget's Thesaurus, with associated weights based on lexical chain length and strength. A Generic Document Profile can be compared to another using a Case Based Reasoning approach. Case Based Reasoning (CBR) is a problem solving method that seeks to solve new problems by reference to previous successful solutions. Here our Exemplar Texts count as previous solutions (and in these
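
As a rough illustration of the profile comparison described in the abstract, the sketch below represents a Generic Document Profile as a mapping from semantic category to weight and ranks candidate texts against an Exemplar Text. Everything here is an assumption made for illustration: the category labels and weights are invented, the function names are hypothetical, and cosine similarity is used as a simple stand-in measure. It is not the Case Based Reasoning comparison the abstract refers to, whose details are not given in this excerpt.

```python
from math import sqrt

# A profile maps a semantic category label to a weight.
# Labels and weights below are invented; the abstract says only that
# weights derive from lexical chain length and strength.
Profile = dict


def cosine_similarity(a: Profile, b: Profile) -> float:
    """Stand-in similarity between two profiles (assumed measure).

    Cosine similarity ignores overall scale, which loosely mirrors the
    stated goal of being independent of document length.
    """
    dot = sum(weight * b.get(category, 0.0) for category, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_exemplar(exemplar: Profile, candidates: dict) -> list:
    """Return (name, score) pairs, most similar to the exemplar first."""
    scored = [(name, cosine_similarity(exemplar, profile))
              for name, profile in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data: an Exemplar Text profile and three candidate texts.
exemplar = {"Music": 3.0, "Knowledge": 1.0}
candidates = {
    "longer_text_same_topic": {"Music": 6.0, "Knowledge": 2.0},
    "partly_related": {"Music": 1.0, "Travel": 4.0},
    "unrelated": {"Finance": 5.0},
}

ranking = rank_by_exemplar(exemplar, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the measure depends only on which categories are present and in what proportion, two texts that use different vocabulary but fall into the same thesaurus categories would still score as similar under this sketch.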







