Results 1–10 of 71
K-means++: The advantages of careful seeding.
In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, 2007
Cited by 478 (8 self)
The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.
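The D²-weighted seeding step this abstract describes admits a compact sketch. The version below is illustrative only (the function name and interface are ours, not the paper's): the first centre is drawn uniformly at random, and each subsequent centre is drawn with probability proportional to its squared distance from the nearest centre already chosen.

```python
import random

def kmeans_pp_seed(points, k, rng=random.Random(0)):
    """k-means++ style seeding: first centre uniform, each later centre
    drawn with probability proportional to its squared distance to the
    nearest centre already chosen (D^2 weighting)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from every point to its nearest chosen centre
        d2 = [min(sum((x - c) ** 2 for x, c in zip(p, q)) for q in centers)
              for p in points]
        total = sum(d2)
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

The squared-distance weighting is what yields the O(log k) competitiveness: far-away regions of the data are very likely to receive a centre early.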
Creating efficient codebooks for visual recognition
In Proceedings of the IEEE International Conference on Computer Vision, 2005
Cited by 276 (22 self)
Visual codebook based quantization of robust appearance descriptors extracted from local image patches is an effective means of capturing image statistics for texture analysis and scene classification. Codebooks are usually constructed by using a method such as k-means to cluster the descriptor vectors of patches sampled either densely (‘textons’) or sparsely (‘bags of features’ based on keypoints or salience measures) from a set of training images. This works well for texture analysis in homogeneous images, but the images that arise in natural object recognition tasks have far less uniform statistics. We show that for dense sampling, k-means over-adapts to this, clustering centres almost exclusively around the densest few regions in descriptor space and thus failing to code other informative regions. This gives suboptimal codes that are no better than using randomly selected centres. We describe a scalable acceptance-radius based clusterer that generates better codebooks and study its performance on several image classification tasks. We also show that dense representations outperform equivalent keypoint based ones on these tasks and that SVM or Mutual Information based feature selection starting from a dense codebook further improves the performance.
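The paper's acceptance-radius clusterer is not reproduced in this listing; as a hedged illustration of the general idea only, a minimal single-pass, leader-style variant (our own simplification, with an assumed fixed radius r) accepts a descriptor as a new codeword only when it lies outside radius r of every existing codeword, which keeps centres from piling up in dense regions:

```python
def radius_codebook(descriptors, r):
    """Single-pass, leader-style clusterer: a descriptor becomes a new
    codeword only if it lies farther than radius r from every existing
    codeword, so dense regions cannot absorb all the centres."""
    codewords = []
    r2 = r * r
    for d in descriptors:
        # compare squared distances to avoid a sqrt per codeword
        if all(sum((x - c) ** 2 for x, c in zip(d, w)) > r2 for w in codewords):
            codewords.append(d)
    return codewords
```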
Truthful Mechanisms for One-Parameter Agents
Cited by 232 (3 self)
In this paper, we show how to design truthful (dominant strategy) mechanisms for several combinatorial problems where each agent’s secret data is naturally expressed by a single positive real number. The goal of the mechanisms we consider is to allocate loads placed on the agents, and an agent’s secret data is the cost she incurs per unit load. We give an exact characterization of the algorithms that can be used to design truthful mechanisms for such load balancing problems using appropriate side payments. We use our characterization to design polynomial-time truthful mechanisms for several problems in combinatorial optimization to which the celebrated VCG mechanism does not apply. For scheduling related parallel machines (Q||C_max), we give a 3-approximation mechanism based on randomized rounding of the optimal fractional solution. This problem is NP-complete, and the standard approximation algorithms (greedy load balancing or the PTAS) cannot be used in truthful mechanisms. We show our mechanism to be frugal, in that the total payment needed is only a logarithmic factor more than the actual costs incurred by the machines, unless one machine dominates the total processing power. We also give truthful mechanisms for maximum flow, Q||ΣC_j (scheduling related machines to minimize the sum of completion times), optimizing an affine function over a fixed set, and special cases of uncapacitated facility location. In addition, for Q||Σw_jC_j (minimizing the weighted sum of completion times), we prove a lower bound of 2/√3 for the best approximation ratio achievable by a truthful mechanism.
Clustering data streams: Theory and practice
IEEE TKDE, 2003
Cited by 157 (5 self)
The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams.
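The divide-and-conquer pattern common to these streaming clusterers can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: `cluster` stands for any offline k-clustering routine, and the chunking scheme is an assumption for exposition.

```python
def stream_kmedian(stream, k, chunk_size, cluster):
    """Divide-and-conquer streaming sketch: reduce each chunk of the
    stream to k weighted centres with an offline routine, retain only
    those centres, then cluster the retained centres once at the end.
    `cluster(weighted_points, k)` returns up to k (point, weight) pairs."""
    retained, chunk = [], []
    for p in stream:
        chunk.append((p, 1.0))         # each raw point carries unit weight
        if len(chunk) == chunk_size:
            retained.extend(cluster(chunk, k))
            chunk = []
    if chunk:                          # flush the final partial chunk
        retained.extend(cluster(chunk, k))
    return cluster(retained, k)        # re-cluster the weighted centres
```

Only one chunk plus the retained centres are ever held in memory, which is what makes the scheme usable on streams far larger than RAM.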
Streaming-Data Algorithms for High-Quality Clustering.
In ICDE Conference, 2002
Better Streaming Algorithms for Clustering Problems
2003
Cited by 94 (1 self)
We study clustering problems in the streaming model, where the goal is to cluster a set of points by making one pass (or a few passes) over the data using a small amount of storage space. Our main result is a randomized algorithm for the k-Median problem which produces a constant-factor approximation in one pass using storage space O(k polylog n). This is a significant improvement over the previous best algorithm, which yielded a 2^{O(1/ε)}-approximation using O(n^ε) space. Next we give a streaming algorithm for the k-Median problem with an arbitrary distance function. We also study algorithms for clustering problems with outliers in the streaming model. Here, we give bicriterion guarantees, producing constant-factor approximations by increasing the allowed fraction of outliers slightly.
Streaming and sublinear approximation of entropy and information distances
In ACM-SIAM Symposium on Discrete Algorithms, 2006
Cited by 67 (13 self)
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant-factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation trade-off curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
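As a hedged illustration of the F0 (distinct elements) primitive this abstract mentions, and not the paper's entropy algorithm itself, a standard k-minimum-values sketch estimates the number of distinct elements in one pass using O(k) space (the function name and parameter choices below are ours):

```python
import hashlib
import heapq

def f0_estimate(stream, k=64):
    """K-minimum-values sketch for the number of distinct elements (F0):
    hash each item into [0, 1), keep the k smallest distinct hash values,
    and estimate F0 as (k - 1) divided by the k-th smallest hash.
    One pass, O(k) space."""
    def h(x):
        digest = hashlib.sha256(str(x).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64
    kept = []        # max-heap (negated values) of the k smallest hashes
    members = set()  # contents of `kept`, for O(1) duplicate checks
    for item in stream:
        v = h(item)
        if v in members:
            continue
        if len(kept) < k:
            heapq.heappush(kept, -v)
            members.add(v)
        elif v < -kept[0]:
            members.discard(-heapq.heappop(kept))
            heapq.heappush(kept, -v)
            members.add(v)
    if len(kept) < k:
        return len(kept)          # fewer than k distinct items: exact count
    return (k - 1) / -kept[0]     # k-th smallest hash value
```

The relative error shrinks roughly as 1/√k, so k trades space for accuracy.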
An online algorithm for maximizing submodular functions
2007
Cited by 59 (12 self)
We present an algorithm for solving a broad class of online resource allocation problems: jobs arrive one at a time, and one can complete the jobs by investing time in a number of abstract activities, according to some schedule. We assume that the fraction of jobs completed by a schedule is a monotone, submodular function of a set of pairs (v, τ), where τ is the time invested in activity v. Under this assumption, our online algorithm performs near-optimally according to two natural metrics: (i) the fraction of jobs completed within time T, for some fixed deadline T > 0, and (ii) the average time required to complete each job. We evaluate our algorithm experimentally by using it to learn, online, a schedule for allocating CPU time among solvers entered in the 2007 SAT solver competition.
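The online algorithm itself is not reproduced in this abstract; as background only, the classical offline greedy for maximizing a monotone submodular function under a cardinality constraint, the setting the online work builds on, can be sketched as follows (the function name is ours):

```python
def greedy_submodular(ground_set, f, k):
    """Classical greedy for maximizing a monotone submodular set function
    f under |S| <= k: repeatedly add the element with the largest marginal
    gain. Guarantees a (1 - 1/e) fraction of the optimum."""
    chosen, remaining = [], list(ground_set)
    for _ in range(k):
        base = f(chosen)
        best, best_gain = None, 0.0
        for e in remaining:
            gain = f(chosen + [e]) - base   # marginal gain of adding e
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:                    # no element has positive gain
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen
```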
A general approach to online network optimization problems
ACM Transactions on Algorithms, 2004
Algorithms for dynamic geometric problems over data streams
In STOC ’04: Proceedings of the thirty-sixth annual ACM symposium on Theory of Computing, 2004