Results 1 - 10 of 11
On Clusterings: Good, Bad and Spectral
, 2003
Cited by 336 (13 self)
We motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. A simple recursive heuristic is shown to have polylogarithmic worst-case guarantees under the new measure. The main result of the paper is the analysis of a popular spectral algorithm. One variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists.
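As a rough illustration of what a bicriteria quality measure can look like, the sketch below computes two quantities for a weighted graph: the conductance of a single cut and the fraction of total edge weight that runs between clusters. The abstract does not spell out the measure, so pairing these two criteria is an assumption here, as are the function names `cut_conductance` and `intercluster_fraction`. The conductance of a whole cluster is the minimum over all cuts of its induced subgraph, which this sketch does not attempt to compute.

```python
def cut_conductance(edges, side):
    """Conductance of the cut (side, rest) in an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples.
    side:  collection of nodes on one side of the cut.
    Returns cut weight / min(volume(side), volume(rest)).
    A degenerate cut with an empty side returns 0.0 (a convention chosen here).
    """
    side = set(side)
    cut = vol_in = vol_out = 0.0
    for u, v, w in edges:
        # Each edge adds its weight to the volume of both endpoints' sides.
        for node in (u, v):
            if node in side:
                vol_in += w
            else:
                vol_out += w
        if (u in side) != (v in side):
            cut += w
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else 0.0


def intercluster_fraction(edges, label):
    """Fraction of total edge weight whose endpoints carry different labels.

    label: dict mapping node -> cluster id (hypothetical representation).
    """
    total = sum(w for _, _, w in edges)
    crossing = sum(w for u, v, w in edges if label[u] != label[v])
    return crossing / total if total > 0 else 0.0
```

Under these assumptions, a clustering would count as good when every cluster has high conductance while the inter-cluster fraction stays low; the two numbers pull in opposite directions, which is why a single criterion is not enough.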
Spectral clustering in telephone call graphs
 In WebKDD/SNAKDD Workshop 2007 in conjunction with KDD
, 2007
Cited by 15 (2 self)
We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low-quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, is not described in the literature. Divide-and-Merge, a recently described postfiltering procedure, may be used to eliminate bad-quality branches in a binary tree hierarchy. We propose an alternative solution that enables k-way cuts in each step by immediately filtering unbalanced or low-quality clusters before splitting them further. Our experiments are performed on graphs built from call detail records with various weighting and normalization schemes. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although Divide-and-Merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster. Furthermore, our clusters are geographically much more homogeneous, and their size distribution resembles that of the settlement structure.
K-Boost: A Scalable Algorithm for High Quality Clustering of Microarray Gene Expression Data
 TR IIT2007015, Istituto di Informatica e Telematica del CNR
, 2007
Cited by 14 (6 self)
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
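The combination described above can be sketched as follows: a greedy furthest-point-first pass for k-center, with a triangle-inequality test that skips distance computations which cannot change a point's assignment. This is a minimal sketch and not the authors' implementation; the specific filter (skip point p when the new center is at least twice as far from p's current center as p is), the function name `fpf_kcenter`, the choice of the first point as initial center, and the returned `skipped` counter are all assumptions made for illustration.

```python
import math


def fpf_kcenter(points, k, dist=math.dist):
    """Greedy furthest-point-first k-center with a triangle-inequality filter.

    points: sequence of coordinate tuples (any metric works via `dist`).
    Returns (centers, assign, skipped):
      centers - indices into `points` of the chosen centers,
      assign  - for each point, the position in `centers` of its nearest center,
      skipped - number of distance computations avoided by the filter.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(points)")

    centers = [0]                       # arbitrary first center (assumption)
    assign = [0] * n
    d = [dist(points[0], p) for p in points]   # distance to nearest center
    skipped = 0

    while len(centers) < k:
        # The next center is the point furthest from all current centers.
        new = max(range(n), key=d.__getitem__)
        # Distances from the new center to every existing center.
        to_old = [dist(points[new], points[c]) for c in centers]
        centers.append(new)
        j = len(centers) - 1

        for i in range(n):
            # Triangle inequality: d(new, i) >= d(new, c_i) - d(i, c_i).
            # If d(new, c_i) >= 2 * d(i, c_i), then d(new, i) >= d(i, c_i),
            # so the new center cannot be strictly closer; skip the computation.
            if to_old[assign[i]] >= 2 * d[i]:
                skipped += 1
                continue
            di = dist(points[new], points[i])
            if di < d[i]:
                d[i], assign[i] = di, j

    return centers, assign, skipped
```

The filter costs one distance per existing center each round and pays off when clusters are well separated, since most points then fail the test without any further distance evaluation.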
Cluster generation and cluster labelling for web snippets: A fast and accurate hierarchical solution
 In Proceedings of the 13th Symposium on String Processing and Information Retrieval (SPIRE 2006)
, 2006
Cited by 14 (2 self)
Abstract. This paper describes Armil, a metasearch engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intracluster and intercluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms.
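To make the labelling idea concrete, the sketch below scores a candidate term by the standard information gain it provides about membership in one cluster. The abstract says only that Armil uses a variant of information gain, so this textbook form is a stand-in and not the authors' measure; the function names and the representation of each snippet as a set of terms are assumptions.

```python
import math


def entropy(pos, total):
    """Binary entropy in bits for `pos` positives among `total` items."""
    if total == 0 or pos == 0 or pos == total:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def information_gain(docs, in_cluster, term):
    """Information gain of `term` about the question 'is the doc in the cluster?'.

    docs:       list of sets of terms, one set per snippet (assumed representation).
    in_cluster: list of booleans, True where the snippet belongs to the cluster.
    """
    n = len(docs)
    if n == 0:
        return 0.0
    # Split the membership flags by presence or absence of the term.
    with_term = [c for d, c in zip(docs, in_cluster) if term in d]
    without_term = [c for d, c in zip(docs, in_cluster) if term not in d]
    h_before = entropy(sum(in_cluster), n)
    h_after = sum(
        len(part) / n * entropy(sum(part), len(part))
        for part in (with_term, without_term)
    )
    return h_before - h_after
```

A term that appears in every snippet of the cluster and nowhere else gets the maximum score, while a term spread evenly across clusters scores zero, which is the behaviour one wants from a candidate label.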
Cluster generation and labeling for web snippets: A fast, accurate hierarchical solution
 Journal of Internet Mathematics
, 2006
Cited by 7 (0 self)
Abstract. This paper describes Armil, a metasearch engine that groups the web snippets returned by auxiliary search engines into disjoint labeled clusters. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to his/her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labeling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and they use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labeling is achieved by combining intracluster and intercluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted “external” metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labeling algorithms. On a standard desktop PC (AMD Athlon, 1 GHz clock, 750 Mbytes of RAM), Armil performs clustering and labeling together for up to 200 snippets in less than one second.
Large-Scale Principal Component Analysis on LiveJournal Friends Network
Cited by 3 (1 self)
Principal Component Analysis (PCA) is a general means of unsupervised exploration that can be used to find the basic motives and organizational themes that guide friends network formation. The applications of PCA include Kleinberg’s ranking algorithm as well as spectral graph partitioning. We extend the applicability of PCA to very large-scale social networks by handling the abundance of small communities that hide the higher-level structure. The strongest communities, which are themselves still small, take over the first principal axes, and the analysis leaves a giant mass in the all-zeroes coordinate. With a combination of heuristics that involve the removal of community cores as well as the contraction of tentacles, we are able to find meaningful high-level components that characterize countries, regions, age or interest in polarized topics. Our experiments are run on a 3.5M user snapshot of the LiveJournal
A Scalable Algorithm for High-Quality Clustering of Web Snippets
We consider the problem of partitioning, in a highly accurate and highly efficient way, a set of n documents lying in a metric space into k non-overlapping clusters. We augment the well-known furthest-point-first algorithm for k-center clustering in metric spaces with a filtering scheme based on the triangular inequality. We apply this algorithm to Web snippet clustering, comparing it against strong baselines consisting of recent, fast variants of the classical k-means iterative algorithm. Our main conclusion is that our method attains solutions of better or comparable accuracy, and does this within a fraction of the time required by the baselines. Our algorithm is thus valuable when, as in Web snippet clustering, either the real-time nature of the task or the large amount of data makes the poorly scalable, traditional clustering methods unsuitable.
An Innovative Approach in Text Mining
R. Santhanalakshmi
Text mining comprises classification and predictive modeling based on bootstrapping techniques that reuse a source data set for a specific application, with the aim of avoiding information overload and redundancy. The resulting classification and prediction outputs are minimal compared with the original data source. Text mining is a common approach to examining text and data in order to draw conclusions about the structure of, and relationships between, sets of information contained in the original set, or to approximate some expected values. In this paper we retrieve information on bovine diseases from the internet using k-means clustering and principal component analysis.