Fast Approximate Spectral Clustering
, 2009
Cited by 13 (0 self)
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of $O(n^3)$, with $n$ the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
Approximation algorithms for clustering uncertain data
- in PODS Conference
, 2008
Cited by 12 (1 self)
There is an increasing quantity of data with uncertainty arising from applications such as sensor network measurements, record linkage, and as output of mining algorithms. This uncertainty is typically formalized as probability density functions over tuple values. Beyond storing and processing such data in a DBMS, it is necessary to perform other data analysis tasks such as data mining. We study the core mining problem of clustering on uncertain data, and define appropriate natural generalizations of standard clustering optimization criteria. Two variations arise, depending on whether a point is automatically associated with its optimal center, or whether it must be assigned to a fixed cluster no matter where it is actually located. For uncertain versions of k-means and k-median, we show reductions to their corresponding weighted versions on data with no uncertainties. These are simple in the unassigned case, but require some care for the assigned version. Our most interesting results are for uncertain k-center, which generalizes both traditional k-center and k-median objectives. We show a variety of bicriteria approximation algorithms. One picks $O(k\varepsilon^{-1}\log^2 n)$ centers and achieves a $(1+\varepsilon)$ approximation to the best uncertain k-centers. Another picks $2k$ centers and achieves a constant factor approximation. Collectively, these results are the first known guaranteed approximation algorithms for the problems of clustering uncertain data.
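The "simple in the unassigned case" remark can be illustrated with a small sketch. It assumes discrete uncertainty (each uncertain point is a short list of `(location, probability)` pairs), independent points, and squared Euclidean cost; the function names are hypothetical and this is not the paper's construction. By linearity of expectation, the expected cost over all possible worlds equals a weighted cost over the individual locations, which is the kind of weighted, uncertainty-free instance the abstract mentions.

```python
from itertools import product


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_cost(location, centers):
    """Cost of serving one realized location from its closest center."""
    return min(sq_dist(location, c) for c in centers)


def expected_cost_by_worlds(uncertain_points, centers):
    """Brute force: enumerate every possible world and average its cost."""
    total = 0.0
    for world in product(*uncertain_points):
        prob = 1.0
        cost = 0.0
        for location, p in world:
            prob *= p  # points are assumed independent
            cost += nearest_cost(location, centers)
        total += prob * cost
    return total


def weighted_location_cost(uncertain_points, centers):
    """Same quantity as a weighted cost over certain locations."""
    return sum(
        p * nearest_cost(location, centers)
        for point in uncertain_points
        for location, p in point
    )


# Hypothetical example data: two uncertain points, two centers.
example_points = [
    [((0.0, 0.0), 0.5), ((2.0, 0.0), 0.5)],
    [((5.0, 5.0), 0.25), ((6.0, 5.0), 0.75)],
]
example_centers = [(0.0, 0.0), (5.0, 5.0)]
print(expected_cost_by_worlds(example_points, example_centers))
print(weighted_location_cost(example_points, example_centers))
```

The brute-force version is exponential in the number of uncertain points and is only there to check the identity on tiny inputs; the weighted form is linear in the total number of locations.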
SUN: Top-down saliency using natural statistics
Cited by 11 (1 self)
When people try to find particular objects in natural scenes they make extensive use of knowledge about how and where objects tend to appear in a scene. Although many forms of such “top-down” knowledge have been incorporated into saliency map models of visual search, surprisingly, the role of object appearance has been infrequently investigated. Here we present an appearance-based saliency model derived in a Bayesian framework. We compare our approach with both bottom-up saliency algorithms and the state-of-the-art Contextual Guidance model of Torralba et al. (2006) at predicting human fixations. Although the two top-down approaches use very different types of information, they achieve similar performance, each substantially better than the purely bottom-up models. Our experiments reveal that a simple model of object appearance can predict human fixations quite well, even making the same mistakes as people.
Sided and symmetrized Bregman centroids
- IEEE Transactions on Information Theory
, 2009
Cited by 8 (4 self)
In this paper, we generalize the notions of centroids (and barycenters) to the broad class of information-theoretic distortion measures called Bregman divergences. Bregman divergences form a rich and versatile family of distances that unifies quadratic Euclidean distances with various well-known statistical entropic measures. Since, besides the squared Euclidean distance, Bregman divergences are asymmetric, we consider the left-sided and right-sided centroids and the symmetrized centroids as minimizers of average Bregman distortions. We prove that all three centroids are unique and give closed-form solutions for the sided centroids that are generalized means. Furthermore, we design a provably fast and efficient arbitrarily close approximation algorithm for the symmetrized centroid based on its exact geometric characterization. The geometric approximation algorithm only requires walking on a geodesic linking the two left/right-sided centroids. We report on our implementation for computing entropic centers of image histogram clusters and entropic centers of multivariate normal distributions that are useful operations for processing multimedia information and retrieval. These experiments illustrate that our generic methods compare favorably with former limited ad hoc methods. Index Terms: Bregman divergence, Bregman information, Bregman power divergence, Burbea-Rao divergence, centroid,
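As a rough illustration of the "closed-form sided centroids" statement, the sketch below uses one particular Bregman divergence, the extended Kullback-Leibler divergence on positive vectors, as an assumed example. Under the convention that the right-sided centroid minimizes the average of D(p_i, c) and the left-sided centroid minimizes the average of D(c, p_i), the minimizers for this divergence are the coordinate-wise arithmetic mean and geometric mean respectively. Function names are hypothetical, and the symmetrized centroid is not covered.

```python
import math


def extended_kl(p, q):
    """Extended KL divergence: Bregman divergence of F(x) = sum(x log x - x)."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, q))


def right_centroid(points):
    """Minimizer of mean D(p_i, c): the coordinate-wise arithmetic mean."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def left_centroid(points):
    """Minimizer of mean D(c, p_i) for extended KL: coordinate-wise geometric mean."""
    n = len(points)
    return [math.exp(sum(math.log(x) for x in col) / n) for col in zip(*points)]


def mean_divergence_to(points, c):
    """Average of D(p_i, c), the objective of the right-sided centroid."""
    return sum(extended_kl(p, c) for p in points) / len(points)


def mean_divergence_from(points, c):
    """Average of D(c, p_i), the objective of the left-sided centroid."""
    return sum(extended_kl(c, p) for p in points) / len(points)


# Hypothetical example: three positive histograms with two bins each.
histograms = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
print(right_centroid(histograms))
print(left_centroid(histograms))
```

The geometric mean appears because the gradient of the generator is the logarithm, so the generalized mean averages in log space and maps back with the exponential.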
Coresets and approximate clustering for Bregman divergences
- In Proc. of the 20th ACM-SIAM Symp. on Discrete Algorithms (SODA)
, 2009
Cited by 7 (0 self)
We study the generalized k-median problem with respect to a Bregman divergence $D_\varphi$. Given a finite set $P \subseteq \mathbb{R}^d$ of size $n$, our goal is to find a set $C$ of size $k$ such that the sum of errors $\mathrm{cost}(P, C) = \sum_{p \in P} \min_{c \in C} D_\varphi(p, c)$ is minimized. The Bregman k-median problem plays an important role in many applications, e.g. information theory, statistics, text classification, and speech processing. We give the first coreset construction for this problem for a large subclass of Bregman divergences, including important dissimilarity measures such as the Kullback-Leibler divergence and the Itakura-Saito divergence. Using these coresets, we give a $(1+\varepsilon)$-approximation algorithm for the Bregman k-median problem with running time $O\left(dkn + d^2 2^{(k/\varepsilon)^{\Theta(1)}} \log^{k+2} n\right)$. This result improves over the previously fastest known $(1+\varepsilon)$-approximation algorithm from [1]. Unlike the analysis of most coreset constructions, our analysis does not rely on the construction of $\varepsilon$-nets. Instead, we prove our results by purely combinatorial means.
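The objective in this abstract, the sum over points of the smallest divergence to any center, can be written down directly. The sketch below is a minimal evaluation of that cost for two of the divergences the abstract names, with an optional `weights` argument added as an assumption because coresets are usually weighted point sets. It does not construct a coreset and is not the paper's algorithm; all names are hypothetical.

```python
import math


def generalized_kl(p, c):
    """Extended Kullback-Leibler divergence between positive vectors."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(p, c))


def itakura_saito(p, c):
    """Itakura-Saito divergence between positive vectors."""
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(p, c))


def bregman_kmedian_cost(points, centers, divergence, weights=None):
    """cost(P, C): each point pays the divergence to its cheapest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(
        w * min(divergence(p, c) for c in centers)
        for p, w in zip(points, weights)
    )


# Hypothetical example: three histograms served by a single center.
P = [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]]
C = [[0.5, 0.5]]
print(bregman_kmedian_cost(P, C, generalized_kl))
print(bregman_kmedian_cost(P, C, itakura_saito))
```

Note the argument order: the data point comes first and the center second, matching the cost formula. Because Bregman divergences are generally asymmetric, swapping the arguments would define a different problem.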
On Centroidal Voronoi Tessellation — Energy Smoothness and Fast Computation
Cited by 7 (4 self)
Centroidal Voronoi tessellation (CVT) is a fundamental geometric structure that finds many applications
Beyond the Euclidean distance: Creating effective visual codebooks using the histogram intersection kernel
Cited by 7 (0 self)
Common visual codebook generation methods used in a Bag of Visual words model, e.g. k-means or Gaussian Mixture Model, use the Euclidean distance to cluster features into visual code words. However, most popular visual descriptors are histograms of image measurements. It has been shown that the Histogram Intersection Kernel (HIK) is more effective than the Euclidean distance in supervised learning tasks with histogram features. In this paper, we demonstrate that HIK can also be used in an unsupervised manner to significantly improve the generation of visual codebooks. We propose a histogram kernel k-means algorithm which is easy to implement and runs almost as fast as k-means. The HIK codebook has consistently higher recognition accuracy over k-means codebooks by 2-4%. In addition, we propose a one-class SVM formulation to create more effective visual code words which can achieve even higher accuracy. The proposed method has established new state-of-the-art performance numbers for 3 popular benchmark datasets on object and scene recognition. In addition, we show that the standard k-median clustering method can be used for visual codebook generation and can act as a compromise between HIK and k-means approaches.
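For readers unfamiliar with the kernel, the sketch below shows the histogram intersection kernel and the textbook kernel k-means assignment step that uses it. This is the generic formulation, where the squared feature-space distance to a cluster mean is expanded in terms of kernel values; it is an assumption-laden illustration with hypothetical names, not the faster algorithm the abstract proposes.

```python
def hik(x, y):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))


def kernel_distance_sq(x, cluster, kernel=hik):
    """Squared feature-space distance from x to the mean of a cluster.

    Uses ||phi(x) - m||^2 = K(x, x) - 2 * mean_y K(x, y) + mean_{y,z} K(y, z).
    """
    n = len(cluster)
    cross = sum(kernel(x, y) for y in cluster) / n
    within = sum(kernel(y, z) for y in cluster for z in cluster) / (n * n)
    return kernel(x, x) - 2 * cross + within


def assign(x, clusters, kernel=hik):
    """Index of the cluster whose implicit mean is closest to x."""
    return min(
        range(len(clusters)),
        key=lambda j: kernel_distance_sq(x, clusters[j], kernel),
    )


# Hypothetical example: two clusters of 3-bin histograms.
clusters = [
    [[4, 0, 0], [3, 1, 0]],
    [[0, 0, 4], [0, 1, 3]],
]
print(assign([3, 0, 1], clusters))
```

Evaluating the within-cluster term this way is quadratic in cluster size, which is one reason a naive kernel k-means is slower than ordinary k-means.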
Frame level audio similarity - a codebook approach
- In: Proc. of the 11th International Conference on Digital Audio Effects (DAFx-08) (2008)
Cited by 6 (2 self)
Modeling audio signals by the long-term statistical distribution of their local spectral features, often denoted as the bag of frames (BOF) approach, is a popular and powerful method to describe audio content. While modeling the distribution of local spectral features by semi-parametric distributions (e.g. Gaussian Mixture Models) has been studied intensively, we investigate a non-parametric variant based on vector quantization (VQ) in this paper. The essential advantage of the proposed VQ approach over state-of-the-art similarity measures is that the proposed audio similarity metric forms a normed vector space. This allows for more powerful search strategies, e.g. KD-Trees or Local Sensitive Hashing (LSH), making content-based audio similarity available for even larger music archives. Standard VQ approaches are known to be computationally very expensive; to counter this problem, we propose a multi-level clustering architecture. Additionally, we show that the multi-level vector quantization approach (ML-VQ), in contrast to standard VQ approaches, is comparable to state-of-the-art frame-level similarity measures in terms of quality. Another important finding w.r.t. the ML-VQ approach is that, in contrast to GMM models of songs, our approach does not seem to suffer from the recently discovered hub problem.
On Centroidal Voronoi Tessellation - Energy Smoothness and Fast Computation
- ACM Trans. Graph
, 2009
Cited by 5 (0 self)
Centroidal Voronoi tessellation (CVT) is a particular type of Voronoi tessellation that has many applications in computational sciences and engineering, including computer graphics. The prevailing method for computing CVT is Lloyd’s method, which has linear convergence and is inefficient in practice. We develop new efficient methods for CVT computation and demonstrate the fast convergence of these methods. Specifically, we show that the CVT energy function has 2nd order smoothness for convex domains with smooth density, as well as in most situations encountered in optimization. Due to the 2nd order smoothness, it is possible to minimize the CVT energy functions using Newton-like optimization methods and expect fast convergence. We propose a quasi-Newton method to compute CVT and demonstrate its faster convergence than Lloyd’s method with various numerical examples. It is also significantly faster and more robust than the Lloyd-Newton method, a previous attempt to accelerate CVT. We also demonstrate surface remeshing as a possible application.
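To make the baseline concrete, here is a minimal sketch of Lloyd's method in the simplest setting one could assume: a one-dimensional interval with uniform density. Each step computes the Voronoi cells of the current generators (intervals bounded by midpoints between neighbors) and moves every generator to the centroid of its cell. The function names and the 1D setting are illustrative assumptions; the quasi-Newton method from the abstract is not shown.

```python
def lloyd_step(generators, lo=0.0, hi=1.0):
    """One Lloyd iteration on [lo, hi] with uniform density."""
    g = sorted(generators)
    # Voronoi cell boundaries are the midpoints between adjacent generators.
    bounds = [lo] + [(a + b) / 2 for a, b in zip(g, g[1:])] + [hi]
    # With uniform density, the centroid of an interval is its midpoint.
    return [(left + right) / 2 for left, right in zip(bounds, bounds[1:])]


def lloyd(generators, iterations=200, lo=0.0, hi=1.0):
    """Repeat Lloyd steps a fixed number of times and return the generators."""
    g = list(generators)
    for _ in range(iterations):
        g = lloyd_step(g, lo, hi)
    return g


# Hypothetical starting positions for three generators on [0, 1].
print(lloyd([0.1, 0.2, 0.9]))
```

In this toy setting the fixed point places the generators at the centers of equal-length cells, and the error shrinks by a roughly constant factor per step, which is the linear convergence the abstract contrasts with Newton-like methods.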
deSEO: Combating Search-Result Poisoning
- In Proceedings of the 20th USENIX Security Symposium
, 2011
Cited by 5 (0 self)
We perform an in-depth study of SEO attacks that spread malware by poisoning search results for popular queries. Such attacks, although recent, appear to be both widespread and effective. They compromise legitimate Web sites and generate a large number of fake pages targeting trendy keywords. We first dissect one example attack that affects over 5,000 Web domains and attracts over 81,000 user visits. Further, we develop deSEO, a system that automatically detects these attacks. Using large datasets with hundreds of billions of URLs, deSEO successfully identifies multiple malicious SEO campaigns. In particular, applying the URL signatures derived from deSEO, we find that 36% of sampled searches to Google and Bing contain at least one malicious link in the top results at the time of our experiment.

