S Ruger and S Gauch. Feature reduction for document clustering and classification. Technical report, Computing Department, Imperial College London, UK, 2000.


This paper is cited in the following contexts:
Interactive Clustering for Exploration of Genomic Data - Wan, Bridges (2002)

....algorithms and the development of new algorithms [8]. Key problems to deal with in cluster analysis are determining which features to use to generate meaningful clusters and interpreting cluster results. Interactive clustering has received substantial attention within the text mining community [1, 16]. We describe an interactive, iterative clustering approach for exploration of genomic data that allows scientists to visualize cluster results and direct the clustering process. This method allows incremental exploration of clusters of sequential patterns. Gene clustering attempts to partition ....

S. M. Ruger and S. E. Gauch, Feature Reduction for Document Clustering and Classification. DTR


A Visualization Interface for Document Searching and Browsing - Carey, Kriwaczek, Rüger (2000)

....deviation and vector size gets ever smaller, as it scales with 1/√n. This is a generic statistical property of high dimensional spaces with any standard distance measure, and can be traced down to the law of large numbers. For a more detailed discussion about the curse of dimensionality see [19]. Although word histogram document representations are by no means random vectors, each additional dimension tends not only to spread the size of a cluster but also to dilute the distance of two previously well separated clusters. Hence, it seems prohibitive involving all semantic features (eg, ....
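
The concentration effect described in this excerpt can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the cited authors' experiment: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the function name `relative_spread` and its parameters are hypothetical.

```python
import math
import random
import statistics


def relative_spread(n_dims, n_pairs=200, seed=0):
    """Return stdev/mean of Euclidean distances between random point pairs.

    Assumption: points are sampled uniformly from the unit hypercube
    [0, 1]^n_dims; this is an illustrative choice, not taken from the source.
    """
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(n_dims)]
        b = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(a, b))
    return statistics.stdev(distances) / statistics.mean(distances)


if __name__ == "__main__":
    # The ratio shrinks as the dimension grows, so distances look
    # increasingly alike and distance-based clustering loses contrast.
    for n in (2, 10, 100, 1000):
        print(n, round(relative_spread(n), 3))
```

Under these assumptions the ratio falls steadily as the dimension grows, which is the behaviour the excerpt attributes to the law of large numbers.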

....keyword j for document i. In particular, u_ij = 0 if and only if document i does not contain keyword j. The number of features k can be controlled by the experimenter, and our experiments using the TREC data of human relevance judgements have shown that k # 10 yields superior clustering results [19]. Note also that even if only the top ten keywords are used for the clustering and document representation, we might still display more keywords on the screen to assist the user in his or her search. 2.3 Document Clustering Post retrieval document clustering has been well studied in the recent ....

[Article contains additional citation context not shown here]

S M Rüger and S E Gauch. Feature reduction for document clustering and classification. Technical report, Computing Department, Imperial College, London, UK, 2000.


Info Navigator: A Visualization Tool for Document.. - Carey, Heesch, Rüger (2003)   Self-citation (Ruger)

....picked vectors in a high dimensional hypercube tend to have a constant distance from each other, no matter what the measure is. As a consequence, clustering algorithms that are based on document distances become unreliable. For a more detailed discussion about the curse of dimensionality see [19]. Even after applying feature reduction, the number of features remains large. In our clustering experiments with 548,948 documents, a candidate keyword had to appear in at least three documents and in no more than 33% of all documents. This resulted in a vocabulary of around 222,872 ....
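
The document-frequency filter mentioned in this excerpt can be sketched as follows. This is a minimal illustration, not the cited authors' code: the function name `filter_vocabulary`, the tokenised input format, and the reading of the upper bound as a fraction of the collection (0.33) are all assumptions.

```python
from collections import Counter


def filter_vocabulary(documents, min_docs=3, max_fraction=0.33):
    """Keep terms whose document frequency lies within the given bounds.

    documents: iterable of token lists (hypothetical input format).
    min_docs: minimum number of documents a term must appear in.
    max_fraction: assumed upper bound, as a fraction of all documents.
    """
    documents = list(documents)
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document, however often it occurs.
        doc_freq.update(set(tokens))
    return {
        term
        for term, df in doc_freq.items()
        if min_docs <= df <= max_fraction * n_docs
    }
```

For example, in a collection of ten documents a term found in every document is dropped as too common, a term found in one document is dropped as too rare, and a term found in exactly three documents is kept.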

.... for document i. In particular, u_ij = 0 if and only if document i does not contain keyword j. The number of features k can be controlled by the experimenter, and our experiments using the TREC data of human relevance judgements have shown that 32 40 yields superior clustering results [19]. Note that even if only the top ten keywords are used for the clustering and document representation, we might still display more keywords on the screen to assist the user in his or her search. 2.3. Document Clustering Post retrieval document clustering has been well studied, eg [9, 1, 15, 10, ....

[Article contains additional citation context not shown here]

S Ruger and S Gauch. Feature reduction for document clustering and classification. Technical report, Computing Department, Imperial College London, UK, 2000.

