Results 1–10 of 26
Scalable K-Means++
Abstract

Cited by 24 (2 self)
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well-known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherent sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
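The sequential D² seeding that k-means|| parallelizes can be sketched in a few lines. This is an illustrative rendering of standard k-means++ seeding; the function name, point representation, and RNG handling are choices of this sketch, not the paper's code:

```python
import random

def kmeans_pp_seed(points, k, rng=random.Random(0)):
    """k-means++ seeding: after a uniform first pick, choose each new center
    with probability proportional to its squared distance (D^2) to the
    nearest center chosen so far. Each round is one pass over the data."""
    centers = [rng.choice(points)]
    for _ in range(k - 1):
        # squared distance from every point to its nearest chosen center
        d2 = [min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
              for x in points]
        total = sum(d2)
        if total == 0:                 # all points coincide with a center
            centers.append(points[0])
            continue
        r = rng.random() * total       # D^2-sampling threshold
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc > r:                # strict '>' skips zero-weight points
                centers.append(x)
                break
    return centers
```

The k sequential rounds (one full pass each) are exactly the bottleneck the paper's oversampling scheme removes.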
Fast and accurate k-means for large datasets.
 In NIPS 24
, 2011
Abstract

Cited by 23 (0 self)
Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest neighbor search to compute k-means in o(nk) time (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
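For contrast with the memory-constrained setting described above, the simplest one-pass, low-memory baseline is MacQueen-style online k-means, which keeps only k centers and k counts in memory. This is a generic sketch of that baseline, not the authors' coreset-based algorithm:

```python
def online_kmeans(stream, k):
    """One-pass online k-means: the first k points seed the centers; every
    later point moves its nearest center toward it with a shrinking step
    size, so each center tracks the running mean of its cluster."""
    centers, counts = [], []
    for x in stream:
        if len(centers) < k:           # seed centers from the first k points
            centers.append(list(x))
            counts.append(1)
            continue
        # nearest center by squared Euclidean distance
        j = min(range(k), key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(centers[i], x)))
        counts[j] += 1
        eta = 1.0 / counts[j]          # 1/n step size -> running average
        centers[j] = [c + eta * (a - c) for c, a in zip(centers[j], x)]
    return centers
```

Unlike the paper's algorithm, this baseline carries no approximation guarantee; it only illustrates the sequential-access, small-memory regime.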
k-means has polynomial smoothed complexity
 In Proc. of the 50th FOCS (Atlanta, USA)
, 2009
Abstract

Cited by 21 (3 self)
The k-means method is one of the most widely used clustering algorithms, drawing its popularity from its speed in practice. Recently, however, it was shown to have exponential worst-case running time. In order to close the gap between practical performance and theoretical analysis, the k-means method has been studied in the model of smoothed analysis. But even the smoothed analyses so far are unsatisfactory, as the bounds are still super-polynomial in the number n of data points. In this paper, we settle the smoothed running time of the k-means method. We show that the smoothed number of iterations is bounded by a polynomial in n and 1/σ, where σ is the standard deviation of the Gaussian perturbations. This means that if an arbitrary input data set is randomly perturbed, then the k-means method will run in expected polynomial time on that input set.
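The quantity this analysis bounds is the number of iterations of Lloyd's k-means method before the assignment stabilizes. An illustrative sketch of that method, returning the iteration count:

```python
def lloyd(points, centers, max_iter=100):
    """Lloyd's k-means method: alternate a nearest-center assignment step
    and a cluster-mean step, counting iterations until the assignment stops
    changing. The iteration count is what smoothed analysis bounds."""
    assign = None
    for it in range(1, max_iter + 1):
        # assignment step: nearest center by squared Euclidean distance
        new = [min(range(len(centers)),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(points[i], centers[j])))
               for i in range(len(points))]
        if new == assign:
            return centers, it          # converged: assignment is stable
        assign = new
        # mean step: recenter each non-empty cluster at its average
        for j in range(len(centers)):
            members = [points[i] for i in range(len(points)) if assign[i] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, max_iter
```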
k-MLE: A fast algorithm for learning statistical mixture models
 CoRR
Abstract

Cited by 11 (6 self)
We describe k-MLE, a fast and efficient local search algorithm for learning finite statistical mixtures of exponential families such as Gaussian mixture models. Mixture models are traditionally learned using the expectation-maximization (EM) soft clustering technique that monotonically increases the incomplete (expected complete) likelihood. Given prescribed mixture weights, the hard clustering k-MLE algorithm iteratively assigns data to the most likely weighted component and updates the component models using Maximum Likelihood Estimators (MLEs). Using the duality between exponential families and Bregman divergences, we prove that the local convergence of the complete likelihood of k-MLE follows directly from the convergence of a dual additively weighted Bregman hard clustering. The inner loop of k-MLE can be implemented using any k-means heuristic, like the celebrated Lloyd batched update or Hartigan's greedy swap update. We then show how to update the mixture weights by minimizing a cross-entropy criterion, which amounts to setting each weight to the relative proportion of points in its cluster, and we reiterate the mixture parameter and mixture weight updates until convergence. Hard EM is interpreted as a special case of k-MLE in which the component update and the weight update are performed successively in the inner loop. To initialize k-MLE, we propose k-MLE++, a careful initialization of k-MLE that probabilistically guarantees a global bound on the best possible complete likelihood.
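One round of the hard assign-then-refit scheme described above can be sketched when specialized to unit-variance Gaussians; the function name and the restriction to identity covariance are simplifications of this sketch, while the paper covers general exponential families via Bregman divergences:

```python
import math

def kmle_step(points, means, weights):
    """One inner round of a k-MLE-style hard clustering for a mixture of
    unit-variance Gaussians: assign each point to its most likely weighted
    component, refit each mean by its MLE (the cluster average), then set
    each weight to the relative proportion of points in its cluster."""
    def log_lik(x, mu, w):
        # log w + log N(x; mu, I), dropping the dimension-dependent constant
        return math.log(w) - 0.5 * sum((a - b) ** 2 for a, b in zip(x, mu))
    assign = [max(range(len(means)),
                  key=lambda j: log_lik(x, means[j], weights[j]))
              for x in points]
    for j in range(len(means)):
        cluster = [x for x, a in zip(points, assign) if a == j]
        if cluster:                    # MLE of the mean = sample average
            means[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    # cross-entropy weight update: relative proportion of points per cluster
    weights = [assign.count(j) / len(points) for j in range(len(means))]
    return means, weights, assign
```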
Worst-case and smoothed analysis of k-means clustering with Bregman divergences
 In Proc. of the 20th Int. Symp. on Algorithms and Computation (ISAAC), volume 5878 of Lecture Notes in Computer Science
, 2009
Abstract

Cited by 8 (4 self)
The k-means algorithm is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, k-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that n data points in R^d are perturbed by Gaussian noise with standard deviation σ, it has been shown that the expected running time is bounded by a polynomial in n and 1/σ. This result assumes that squared Euclidean distances are used as the distance measure. In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (a.k.a. relative entropy) that is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the k-means method for Bregman divergences. We first give a smoothed analysis of k-means with (almost) arbitrary Bregman divergences, and we show bounds of poly(n^√k, 1/σ) and k^(kd) · poly(n, 1/σ). The latter yields a polynomial bound if k and d are small compared to n. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
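A k-means step under a Bregman divergence differs from the Euclidean case only in the assignment step, since for every Bregman divergence the cost-minimizing center of a cluster is its plain arithmetic mean. A sketch with the Kullback-Leibler divergence (function names are this sketch's own):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q,
    the Bregman divergence generated by negative entropy."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bregman_kmeans_step(points, centers):
    """One k-means step under KL divergence: assign each point to the center
    of smallest divergence, then recenter each cluster at its arithmetic
    mean, which is optimal for any Bregman divergence."""
    assign = [min(range(len(centers)), key=lambda j: kl(x, centers[j]))
              for x in points]
    for j in range(len(centers)):
        cluster = [x for x, a in zip(points, assign) if a == j]
        if cluster:
            centers[j] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return centers, assign
```

Swapping `kl` for another Bregman divergence (e.g. Itakura-Saito) changes only the assignment; the mean step is unchanged.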
Online Clustering with Experts
Abstract

Cited by 4 (2 self)
We propose an online clustering algorithm that manages the exploration/exploitation tradeoff using an adaptive weighting over batch clustering algorithms. We extend algorithms for online supervised learning, with access to expert predictors, to the unsupervised learning setting. Instead of computing prediction errors in order to reweight the experts, the algorithm computes an approximation to the current value of the k-means objective obtained by each expert. When the experts are batch clustering algorithms with b-approximation guarantees with respect to the k-means objective (for example, the k-means++ or k-means# algorithms), applied to a sliding window of the data stream, our algorithm achieves an approximation guarantee with respect to the k-means objective. The form of this online clustering approximation guarantee is novel, and extends an evaluation framework proposed by Dasgupta as an analog to regret. Our algorithm tracks the best clustering algorithm on real and simulated data sets.
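The expert-reweighting idea carried over from supervised prediction can be sketched as an exponential-weights update driven by approximate k-means cost. This is a generic sketch of the experts framework, not the paper's exact update rule:

```python
import math

def reweight_experts(weights, costs, eta=0.5):
    """Exponential-weights update over clustering 'experts': discount each
    expert's weight by exp(-eta * cost), where cost approximates that
    expert's current k-means objective, then renormalize so the weights
    remain a probability distribution over experts."""
    new = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
    total = sum(new)
    return [w / total for w in new]
```

Experts with lower clustering cost accumulate weight, so the combined prediction tracks the best batch algorithm over time.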
Scalable K-Means by Ranked Retrieval
Abstract

Cited by 2 (0 self)
The k-means clustering algorithm has a long history and a proven practical performance; however, it does not scale to clustering millions of data points into thousands of clusters in high-dimensional spaces. The main computational bottleneck is the need to recompute the nearest centroid for every data point at every iteration, a prohibitive cost when the number of clusters is large. In this paper we show how to reduce the cost of the k-means algorithm by large factors by adapting ranked retrieval techniques. Using a combination of heuristics, on two real-life data sets the wall clock time per iteration is reduced from 445 minutes to less than 4, and from 705 minutes to 1.4, while the clustering quality remains within 0.5% of the k-means quality. The key insight is to invert the process of point-to-centroid
Recent developments in clustering algorithms
 Proceedings of the European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning
, 2012
Abstract

Cited by 1 (0 self)
In this paper, we give a short review of recent developments in clustering. We briefly summarize important clustering paradigms before addressing important topics including metric adaptation in clustering, dealing with non-Euclidean data or large data sets, clustering evaluation, and learning-theoretical foundations.
Content-Based Access Control
, 2015
Abstract
In conventional databases, the most popular access control model specifies policies explicitly for each role of every user against each data object, manually. Nowadays, in large-scale content-centric data sharing, conventional approaches can be impractical due to the exponential explosion of data growth and the sensitivity of data objects. What is more, conventional database access control policies are not functional when the semantic content of data is expected to play a role in access decisions. Users are often over-privileged, and ex post facto auditing is enforced to detect misuse of the privileges. Unfortunately, it is usually difficult to reverse the damage, as (large amounts of) data may have been disclosed already. In this dissertation, we first introduce Content-Based Access Control (CBAC), an innovative access control model for content-centric information sharing. As a complement to conventional access control models, the CBAC model makes access control decisions automatically, based on the content similarity between user credentials and data content. In CBAC, each user is allowed by a meta