Results 1 - 10 of 479
Similarity Measures for Text Document Clustering, 2008
Cited by 62 (0 self)
Clustering is a useful technique that organizes a large quantity of unordered text documents into a small number of meaningful and coherent clusters, thereby providing a basis for intuitive and informative navigation and browsing mechanisms. Partitional clustering algorithms have been recognized as more suitable than hierarchical clustering schemes for processing large datasets. A wide variety of distance functions and similarity measures have been used for clustering, such as squared Euclidean distance, cosine similarity, and relative entropy. In this paper, we compare and analyze the effectiveness of these measures in partitional clustering for text document datasets. Our experiments use the standard K-means algorithm, and we report results on seven text document datasets and the five distance/similarity measures most commonly used in text clustering.
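The practical difference between two of the compared measures, squared Euclidean distance and cosine similarity, shows up directly on documents that share word proportions but differ in length. A minimal plain-Python sketch; the toy term-frequency vectors are invented for illustration:

```python
import math

def sq_euclidean(a, b):
    # Squared Euclidean distance between two term-frequency vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; insensitive to document length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two documents with identical word proportions but different lengths:
d1 = [2.0, 1.0, 0.0]
d2 = [4.0, 2.0, 0.0]   # d1 repeated twice
print(cosine_similarity(d1, d2))  # ~1.0 -- same direction, so maximally similar
print(sq_euclidean(d1, d2))       # 5.0  -- penalizes the length difference
```

This is why cosine similarity is a common default for text: it clusters by topic proportions rather than by raw document length.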
Fast Approximate Spectral Clustering, 2009
Cited by 58 (1 self)
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets, but which has limited applicability to large-scale problems due to its computational complexity of O(n^3), with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
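The preprocess-then-cluster-then-propagate structure described above can be sketched in plain Python. This is only a structural sketch, not the paper's algorithm: the helper names (`kmeans`, `kasp_sketch`) are invented, initialization is a simple deterministic farthest-point seeding, and ordinary k-means stands in for the spectral clustering that KASP actually runs on the representatives:

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm with deterministic farthest-point initialization.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(
            sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda j: sum(
                (a - b) ** 2 for a, b in zip(p, centers[j])))
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

def kasp_sketch(points, m, k):
    # Stage 1: reduce n points to m representatives (m >> k) with local
    # k-means -- the distortion-minimizing local transformation.
    reps, rep_of = kmeans(points, m)
    # Stage 2: cluster the m representatives into k groups. KASP runs
    # spectral clustering here; plain k-means keeps the sketch self-contained.
    _, rep_labels = kmeans(reps, k)
    # Stage 3: each original point inherits its representative's label.
    return [rep_labels[rep_of[i]] for i in range(len(points))]
```

The speedup comes from stage 2 operating on m points instead of n, so the expensive O(m^3) step no longer depends on the full data size.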
The Planar k-means Problem is NP-hard, 2009
Cited by 45 (0 self)
In the k-means problem, we are given a finite set S of points in ℝ^m and an integer k ≥ 1, and we want to find k points (centers) so as to minimize the sum of the squared Euclidean distance of each point in S to its nearest center. We show that this well-known problem is NP-hard even for instances in the plane, answering an open question posed by Dasgupta [7].
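The objective being proven hard is simple to state in code: evaluating the cost of a candidate solution is easy; the hardness lies in finding the minimizing centers. A direct evaluation on an invented planar instance:

```python
def kmeans_cost(points, centers):
    # Sum over points of the squared Euclidean distance to the nearest center.
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
               for p in points)

# Four points in the plane and k = 2 candidate centers:
pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(kmeans_cost(pts, [(0.0, 1.0), (10.0, 1.0)]))  # 4.0: each point is 1 away
```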
Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel
Cited by 42 (3 self)
Common visual codebook generation methods used in a Bag of Visual Words model, e.g. k-means or a Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks with histogram features. In this paper, we demonstrate that HIK can also be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as k-means. The HIK codebook achieves consistently higher recognition accuracy than k-means codebooks, by 2-4%. In addition, we propose a one-class SVM formulation to create more effective visual code words, which can achieve even higher accuracy. The proposed method has established new state-of-the-art performance numbers for 3 popular benchmark datasets on object and scene recognition. In addition, we show that the standard k-median clustering method can be used for visual codebook generation and can act as a compromise between the HIK and k-means approaches.
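The kernel at the heart of the method is just a bin-wise minimum. A minimal sketch; the two example histograms are invented:

```python
def histogram_intersection(h1, h2):
    # HIK similarity: sum over bins of the minimum of the two bin values.
    # For normalized histograms it lies in [0, 1], with 1 for identical inputs.
    return sum(min(a, b) for a, b in zip(h1, h2))

h1 = [0.5, 0.3, 0.2]   # normalized 3-bin histograms
h2 = [0.4, 0.4, 0.2]
print(histogram_intersection(h1, h2))   # ~0.9
print(histogram_intersection(h1, h1))   # ~1.0
```

In a kernel k-means loop this similarity would replace the Euclidean distance in the assignment step; the paper's fast evaluation machinery for histogram features is not reproduced here.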
Approximation algorithms for clustering uncertain data, in PODS Conference, 2008
Cited by 39 (1 self)
There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and the output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both the traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks O(kε⁻¹ log² n) centers and achieves a (1 + ε) approximation to the best uncertain k-centers. Another picks 2k centers and achieves a constant-factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
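Why the unassigned uncertain k-means case reduces so simply can be seen from the bias-variance identity E||X − c||² = ||E[X] − c||² + Var(X): the variance term does not depend on the center c, so only the expected points matter when choosing centers. A numeric check of the identity on an invented two-point discrete pdf:

```python
def expected_sq_dist(pdf, center):
    # E[||X - c||^2] for a discrete pdf given as (probability, point) pairs.
    return sum(p * sum((a - b) ** 2 for a, b in zip(x, center)) for p, x in pdf)

def mean_point(pdf):
    # E[X], coordinate-wise.
    dims = len(pdf[0][1])
    return tuple(sum(p * x[d] for p, x in pdf) for d in range(dims))

# An uncertain point located at (0, 0) w.p. 0.5 and at (2, 0) w.p. 0.5:
pdf = [(0.5, (0.0, 0.0)), (0.5, (2.0, 0.0))]
mu = mean_point(pdf)               # (1.0, 0.0)
var = expected_sq_dist(pdf, mu)    # 1.0 -- total variance of the pdf
c = (3.0, 4.0)                     # an arbitrary candidate center
lhs = expected_sq_dist(pdf, c)
rhs = sum((a - b) ** 2 for a, b in zip(mu, c)) + var
print(lhs, rhs)                    # 21.0 21.0 -- the identity holds
```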
Sided and symmetrized Bregman centroids, IEEE Transactions on Information Theory, 2009
Cited by 37 (13 self)
We generalize the notions of centroids (and barycenters) to the broad class of information-theoretic distortion measures called Bregman divergences. Bregman divergences form a rich and versatile family of distances that unifies quadratic Euclidean distances with various well-known statistical entropic measures. Since Bregman divergences other than the squared Euclidean distance are asymmetric, we consider the left-sided and right-sided centroids and the symmetrized centroids as minimizers of average Bregman distortions. We prove that all three centroids are unique and give closed-form solutions for the sided centroids, which are generalized means. Furthermore, we design a provably fast and efficient arbitrarily close approximation algorithm for the symmetrized centroid based on its exact geometric characterization. The geometric approximation algorithm requires only walking on a geodesic linking the two left/right-sided centroids. We report on our implementation for computing entropic centers of image histogram clusters and entropic centers of multivariate normal distributions, which are useful operations for processing and retrieving multimedia information. These experiments illustrate that our generic methods compare favorably with former limited ad-hoc methods. Index Terms: centroid, Kullback-Leibler divergence, Bregman divergence, Bregman power divergence, Burbea-Rao divergence.
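As a concrete instance of the sided centroids, take the extended Kullback-Leibler divergence on positive vectors (generator F(x) = Σ x log x − x). The right-sided centroid, minimizing Σᵢ D(xᵢ, c), is the arithmetic mean for every Bregman divergence; the left-sided one, minimizing Σᵢ D(c, xᵢ), is the generalized mean ∇F*((1/n) Σᵢ ∇F(xᵢ)), which for KL (∇F = log, ∇F* = exp) is the coordinate-wise geometric mean. A sketch under those closed forms; the example points are invented:

```python
import math

def kl_divergence(p, q):
    # Extended KL (I-divergence) between positive vectors.
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, q))

def right_centroid(points):
    # argmin_c sum_i D(x_i, c): the arithmetic mean, for any Bregman divergence.
    n = len(points)
    return tuple(sum(x[d] for x in points) / n for d in range(len(points[0])))

def left_centroid_kl(points):
    # argmin_c sum_i D(c, x_i) for KL: exp of the mean log, i.e. the
    # coordinate-wise geometric mean.
    n = len(points)
    return tuple(math.exp(sum(math.log(x[d]) for x in points) / n)
                 for d in range(len(points[0])))

pts = [(1.0, 4.0), (4.0, 1.0)]
print(right_centroid(pts))    # (2.5, 2.5) -- arithmetic mean
print(left_centroid_kl(pts))  # ~(2.0, 2.0) -- geometric mean
```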
deSEO: Combating Search-Result Poisoning, in Proceedings of the 20th USENIX Security Symposium, 2011
Cited by 36 (3 self)
We perform an in-depth study of SEO attacks that spread malware by poisoning search results for popular queries. Such attacks, although recent, appear to be both widespread and effective. They compromise legitimate Web sites and generate a large number of fake pages targeting trendy keywords. We first dissect one example attack that affects over 5,000 Web domains and attracts over 81,000 user visits. Further, we develop deSEO, a system that automatically detects these attacks. Using large datasets with hundreds of billions of URLs, deSEO successfully identifies multiple malicious SEO campaigns. In particular, applying the URL signatures derived from deSEO, we find that 36% of sampled searches to Google and Bing contained at least one malicious link in the top results at the time of our experiment.
SUN: Top-down saliency using natural statistics
Cited by 34 (2 self)
When people try to find particular objects in natural scenes, they make extensive use of knowledge about how and where objects tend to appear in a scene. Although many forms of such “top-down” knowledge have been incorporated into saliency map models of visual search, surprisingly, the role of object appearance has been infrequently investigated. Here we present an appearance-based saliency model derived in a Bayesian framework. We compare our approach with both bottom-up saliency algorithms and the state-of-the-art Contextual Guidance model of Torralba et al. (2006) at predicting human fixations. Although the two top-down approaches use very different types of information, they achieve similar performance, each substantially better than the purely bottom-up models. Our experiments reveal that a simple model of object appearance can predict human fixations quite well, even making the same mistakes as people.
On Centroidal Voronoi Tessellation -- Energy Smoothness and Fast Computation, 2008
Cited by 33 (15 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications in
Streaming k-means approximation
Cited by 32 (3 self)
We provide a clustering algorithm that approximately optimizes the k-means objective in the one-pass streaming setting. We make no assumptions about the data, and our algorithm is very lightweight in terms of memory and computation. This setting is applicable to unsupervised learning on massive data sets or on resource-constrained devices. The two main ingredients of our theoretical work are: a derivation of an extremely simple pseudo-approximation batch algorithm for k-means (based on the recent k-means++), in which the algorithm is allowed to output more than k centers, and a streaming clustering algorithm in which batch clustering algorithms are performed on small inputs (fitting in memory) and combined in a hierarchical manner. Empirical evaluations on real and simulated data reveal the practical utility of our method.
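The k-means++-style seeding underlying the first ingredient is easy to sketch: each new center is drawn with probability proportional to its squared distance from the nearest center already chosen, and the pseudo-approximation variant simply keeps sampling past k centers. A plain-Python sketch of the seeding step only (the function name and example points are invented; this is not the paper's full streaming algorithm):

```python
import random

def kmeans_pp_seeds(points, k, rng):
    # k-means++ seeding: sample each new center with probability proportional
    # to its squared distance to the nearest existing center (D^2 sampling).
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
              for p in points]
        r = rng.uniform(0, sum(d2))
        acc = 0.0
        for p, w in zip(points, d2):   # walk the cumulative distribution
            acc += w
            if acc >= r:
                centers.append(p)
                break
        else:
            centers.append(points[-1])  # float-rounding fallback
    return centers

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.1)]
seeds = kmeans_pp_seeds(pts, 2, random.Random(0))
print(seeds)   # two distinct input points, far-apart ones strongly preferred
```

Far-away points carry large D² weight, which is what makes the seeds spread out and gives the batch algorithm its approximation guarantee.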