A neighborhood-based approach for clustering of linked document collections
- In CIKM ’06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 2006
Cited by 7 (0 self)
This paper addresses the problem of automatically structuring linked document collections by using clustering. In contrast to traditional clustering, we study the clustering problem in the light of available link structure information for the data set (e.g., hyperlinks among web documents or coauthorship among bibliographic data entries). Our approach is based on iterative relaxation of cluster assignments, and can be built on top of any clustering algorithm. This technique results in higher cluster purity and better overall accuracy, and makes self-organization more robust.
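The abstract does not spell out the relaxation procedure, so the following is only a minimal sketch of what iterative relaxation of cluster assignments over a link graph could look like. The function name `relax_assignments`, the mixing parameter `weight`, and the majority-vote update rule are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter

def relax_assignments(initial, links, weight=0.4, max_iters=10):
    """Re-assign documents using their base cluster and their neighbors' clusters.

    initial: dict mapping document id -> cluster label from any base clusterer.
    links:   dict mapping document id -> iterable of linked document ids.
    weight:  hypothetical share of the vote given to the base assignment;
             the remaining share is split evenly among linked neighbors.
    """
    current = dict(initial)
    for _ in range(max_iters):
        updated = {}
        for doc, own_label in initial.items():
            votes = Counter()
            votes[own_label] += weight  # evidence from the base clustering
            neighbors = [n for n in links.get(doc, ()) if n in current]
            for n in neighbors:
                # evidence from the neighbors' assignments in the previous round
                votes[current[n]] += (1 - weight) / len(neighbors)
            # highest score wins; ties go to the smallest label for determinism
            updated[doc] = min(votes, key=lambda c: (-votes[c], c))
        if updated == current:  # assignments stopped changing
            break
        current = updated
    return current
```

In this sketch a document that is linked mostly to members of another cluster is pulled into that cluster, while a document without links keeps its base assignment.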
Model-Based Classification of Web Documents Represented by Graphs
- In Proc. of WebKDD 2006: KDD Workshop on Web Mining and Web Usage Analysis, in conjunction with the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006), 2006
Cited by 4 (0 self)
Most web content classification methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the mark-up information that can be easily extracted from the web document HTML tags. A recently developed graph-based web document representation model can preserve web document structural information. It was shown to outperform the traditional vector representation, using the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this paper, three new hybrid approaches to web document classification are presented, built upon both graph and vector-space representations, thus preserving the benefits and discarding the limitations of each. The hybrid methods presented here are compared to vector-based models using two model-based classifiers (the C4.5 decision-tree algorithm and probabilistic Naïve Bayes) on two benchmark web document collections. The results demonstrate that the hybrid methods outperform, in most cases, existing approaches in terms of classification accuracy and, in addition, achieve a significant reduction in classification time.
What are Ontologies Good For? Evaluating Terminological Ontologies in the Framework of Text Graph Classification — Extended Abstract —
Cited by 1 (0 self)
This paper develops a graph-theoretical model of text representation based on lexical chaining. Unlike existing approaches to chaining, this model reflects the logical document structure of texts as well as the semantic relations of their lexical constituents in order to compute text similarity values. By varying the terminological ontology used to induce such relations, it becomes possible to systematically evaluate the contribution of these ontologies to text classification. This is exemplified using GermaNet and Wikipedia.
Sobek: a Text Mining Tool for Educational Applications
- E. Reategui
This paper presents a mining tool to extract relevant terms and relationships from texts, and proposes its use in educational applications. A particular text mining technique is employed to analyze texts and build graphs from them, in which nodes represent concepts and edges represent the relationships between them. Some adjustments to the original mining and representation methods are proposed here, in order to provide results that are more suitable for our educational applications. Two experiments exemplifying the extraction of graphs from students’ essays are presented in the paper. Results showed that the mining tool was able to identify a considerable number of relevant terms from the texts analyzed, providing concise representations of documents which can support students’ and teachers’ tasks.
A Composite Graph Model for Web Document and the MCS Technique
It has been accepted that a graph can represent any document with minimum loss of information. In this article we put forward some new standards of graph representation and a graph distance measure for web documents. With the proposed enhanced method of graph representation and distance measure, the graphs can hold more information than usual, and hence web documents can be classified more efficiently.
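The abstract does not give the composite model or its distance measure, so the sketch below only illustrates the general idea of a maximum common subgraph (MCS) distance between document graphs. It assumes that each node is a unique term, in which case the MCS reduces to intersecting the node and edge sets, and that graph size is counted as nodes plus edges. The helper names `build_graph` and `mcs_distance` are hypothetical.

```python
def build_graph(tokens):
    """Build a simple term graph: one node per distinct term,
    one directed edge per pair of adjacent terms (an assumed scheme)."""
    nodes = set(tokens)
    edges = {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}
    return nodes, edges

def mcs_distance(g1, g2):
    """Distance in [0, 1]: one minus the size of the maximum common
    subgraph divided by the size of the larger graph."""
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    # With unique node labels, the MCS is the intersection of both graphs.
    common_nodes = nodes1 & nodes2
    common_edges = edges1 & edges2
    size1 = len(nodes1) + len(edges1)
    size2 = len(nodes2) + len(edges2)
    largest = max(size1, size2)
    if largest == 0:  # two empty graphs are treated as identical
        return 0.0
    return 1.0 - (len(common_nodes) + len(common_edges)) / largest
```

For instance, the graphs built from "web document graph" and "web document model" share two nodes and one edge out of five elements each, so under these assumptions their distance is 1 - 3/5.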

