Results 1–10 of 34
Conquering the divide: Continuous clustering of distributed data streams
In Intl. Conf. on Data Engineering, 2007
Cited by 35 (4 self)
Abstract
Data is often collected over a distributed network, but in many cases it is so voluminous that it is impractical and undesirable to collect it in a central location. Instead, we must perform distributed computations over the data, guaranteeing high-quality answers even as new data arrives. In this paper, we formalize and study the problem of maintaining a clustering of such distributed data that is continuously evolving. In particular, our goal is to minimize the communication and computational cost while still providing guaranteed accuracy of the clustering. We focus on k-center clustering, and provide a suite of algorithms that vary based on which centralized algorithm they derive from, and whether they maintain a single global clustering or many local clusterings that can be merged together. We show that these algorithms can be designed to give accuracy guarantees that are close to the best possible even in the centralized case. In our experiments, we see clear trends among these algorithms, showing that the choice of algorithm is crucial, and that we can achieve a clustering that is as good as the best centralized clustering with only a small fraction of the communication required to collect all the data in a single location.
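For reference, the centralized baseline that such distributed k-center algorithms are typically derived from can be sketched with the classical farthest-first traversal (Gonzalez's greedy 2-approximation). The function name and toy data below are illustrative, not from the paper:

```python
import math

def kcenter_greedy(points, k):
    """Farthest-first traversal: a classical 2-approximation for k-center.
    Repeatedly picks the point farthest from the current centers."""
    centers = [points[0]]  # start from an arbitrary point
    # distance from each point to its nearest chosen center so far
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=lambda j: dist[j])  # farthest point
        centers.append(points[i])
        dist = [min(d, math.dist(p, points[i])) for p, d in zip(points, dist)]
    return centers, max(dist)  # centers and the resulting clustering radius

points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0)]
centers, radius = kcenter_greedy(points, 3)  # radius 1.0 on this toy input
```

The greedy radius is at most twice the optimal k-center radius, which is the kind of centralized guarantee the paper's distributed variants aim to approach with far less communication.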
A PTAS for k-means clustering based on weak coresets
DELIS – Dynamically Evolving, Large-Scale Information Systems, 2007
Cited by 34 (11 self)
Abstract
Given a point set P ⊆ ℝ^d, the k-means clustering problem is to find a set C = {c_1, …, c_k} of k points and a partition of P into k clusters C_1, …, C_k such that the sum of squared errors ∑_{i=1}^{k} ∑_{p∈C_i} ‖p − c_i‖₂² is minimized. For given centers, this cost function is minimized by assigning each point to its nearest center. The k-means cost function is probably the most widely used cost function in the area of clustering. In this paper we show that every unweighted point set P has a weak (ε, k)-coreset of size poly(k, 1/ε) for the k-means clustering problem, i.e., its size is independent of the cardinality |P| of the point set and of the dimension d of the Euclidean space ℝ^d. A weak coreset is a weighted set S ⊆ P together with a set T such that T contains a (1 + ε)-approximation of the optimal cluster centers for P, and for every set of k centers from T the cost of the centers for S is a (1 ± ε)-approximation of the cost for P. We apply our weak coreset to obtain a PTAS for the k-means clustering problem with running time O(nkd + d · poly(k/ε) +
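The cost function in this abstract translates directly into code; a minimal sketch (the function name and sample points are illustrative):

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest
    center: sum_{i=1..k} sum_{p in C_i} ||p - c_i||^2, with each point
    assigned to the nearest center (the cost-minimizing assignment)."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

P = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
C = [(0.0, 1.0), (10.0, 0.0)]
cost = kmeans_cost(P, C)  # 1 + 1 + 0 = 2.0
```

A (weak) coreset S would let one evaluate essentially this same cost on a much smaller weighted set while staying within a (1 ± ε) factor of the cost on P.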
Summarizing and mining inverse distributions on data streams via dynamic inverse sampling
In VLDB, 2005
Cited by 30 (1 self)
Abstract
Database management systems face the challenge of dealing with massive data distributions that arrive at high speeds while only small storage is available for managing and mining them. Emerging data stream management systems approach this problem by summarizing and mining the distributions using samples or sketches. However, data distributions can be “viewed” in different ways. For example, a data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of occurrences of x in the stream, or as its inverse, f⁻¹(i), the number of items that appear i times. While both such “views” are equivalent in stored-data systems, over data streams that entail approximations they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet many applications, such as IP traffic monitoring, naturally rely on mining inverse distributions. We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution vs. the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams: they rely on a novel technique to maintain a dynamic sample over the stream with provable guarantees, which can be used for a variety of summarization tasks (building quantiles or equi-depth histograms) and mining tasks (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on the quality of approximations and the time/space used by our streaming methods. We also complement our analytical and algorithmic results with an experimental study of the methods over network data streams.
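The forward/inverse distinction is easy to make concrete. The sketch below computes both views exactly for a toy stream (the stream values are made up for illustration; the paper's contribution is approximating the inverse view in small space, which this exact computation does not attempt):

```python
from collections import Counter

stream = [1, 2, 2, 3, 3, 3, 7]

# Forward distribution f(x): how many times value x occurs in the stream.
forward = Counter(stream)            # {3: 3, 2: 2, 1: 1, 7: 1}

# Inverse distribution f^{-1}(i): how many distinct values occur i times.
inverse = Counter(forward.values())  # {1: 2, 2: 1, 3: 1}
```

Here two values (1 and 7) appear once, one value appears twice, and one appears three times; a uniform sample of stream positions concentrates on heavy values like 3, which is exactly why forward-distribution samples can misrepresent the inverse view.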
Private coresets
2009
Cited by 28 (4 self)
Abstract
A coreset of a point set P is a small weighted set of points that captures some geometric properties of P. Coresets have found use in a vast host of geometric settings. We forge a link between coresets and differentially private sanitizations that can answer any number of queries without compromising privacy. We define the notion of private coresets, which are simultaneously coresets and differentially private, and show how they may be constructed. We first show that the existence of a small coreset with low generalized sensitivity (i.e., replacing a single point in the original point set only slightly affects the quality of the coreset) implies (inefficiently) the existence of a private coreset for the same queries. This greatly extends the works of Blum, Ligett, and Roth [STOC 2008] and McSherry and Talwar [FOCS 2007]. We also give an efficient algorithm to compute private coresets for k-median and k-means queries in ℝ^d, immediately implying efficient differentially private sanitizations for such queries. Following McSherry and Talwar, this construction also gives efficient coalition-proof (approximately dominant-strategy) mechanisms for location problems. Unlike ordinary coresets, which have only a multiplicative approximation factor, we prove that private coresets must exhibit additive error. We present a new technique for showing lower bounds on this error.
On k-median clustering in high dimensions
In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA ’06), Society for Industrial and Applied Mathematics, 2006
Cited by 26 (0 self)
Abstract
We study approximation algorithms for k-median clustering. We obtain small coresets for k-median clustering in metric spaces as well as in Euclidean spaces. Specifically, in ℝ^d, those coresets are of size with only polynomial dependency on d. This leads to a (1 + ε)-approximation algorithm for k-median clustering in ℝ^d, with running time O(ndk + 2^{(k/ε)^{O(1)}} d²n^σ), for any σ > 0. This is an improvement over previous results [5, 20, 21]. We also provide fast constant-factor approximation algorithms for k-median clustering in finite metric spaces. We use those coresets to compute a (1 + ε)-approximate k-median clustering in the streaming model of computation, using only O(k²dε⁻² log⁸ n) space, where the points are taken from ℝ^d. This is the first streaming algorithm for this problem whose space complexity has only polynomial dependency on the dimension.
Fast and accurate k-means for large datasets
In NIPS 24, 2011
Cited by 23 (0 self)
Abstract
Clustering is a popular problem with many applications. We consider the k-means problem in the situation where the data is too large to be stored in main memory and must be accessed sequentially, such as from a disk, and where we must use as little memory as possible. Our algorithm is based on recent theoretical results, with significant improvements to make it practical. Our approach greatly simplifies a recently developed algorithm, both in design and in analysis, and eliminates large constant factors in the approximation guarantee, the memory requirements, and the running time. We then incorporate approximate nearest-neighbor search to compute k-means in o(nk) time (where n is the number of data points; note that computing the cost, given a solution, takes Θ(nk) time). We show that our algorithm compares favorably to existing algorithms both theoretically and experimentally, thus providing state-of-the-art performance in both theory and practice.
Sublinear-time algorithms
In Oded Goldreich, editor, Property Testing, volume 6390 of Lecture Notes in Computer Science, 2010
Cited by 20 (2 self)
Abstract
In this paper we survey recent (up to the end of 2009) advances in the area of sublinear-time algorithms.
A Space-Optimal Data-Stream Algorithm for Coresets in the Plane
Cited by 19 (5 self)
Abstract
Given a point set P ⊆ ℝ², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in ℝ² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width and robust kernels, as well as for ε-kernels in higher dimensions.
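The ε-kernel guarantee over slabs is closely tied to directional widths. The sketch below samples directions and tests a necessary width condition, ignoring slab placement, so it is a heuristic sanity check rather than the exact definition; all names and the sample points are illustrative:

```python
import math

def width(points, u):
    """Directional width: extent of the projection of the points onto
    the unit vector u."""
    proj = [p[0] * u[0] + p[1] * u[1] for p in points]
    return max(proj) - min(proj)

def looks_like_eps_kernel(Q, P, eps, n_dirs=360):
    """Sampled check: in every sampled direction, the (1+eps)-scaled
    width of Q must cover P's width. Necessary (up to sampling), but
    not sufficient, for Q being an eps-kernel of P."""
    for t in range(n_dirs):
        angle = math.pi * t / n_dirs          # directions over [0, pi)
        u = (math.cos(angle), math.sin(angle))
        if (1 + eps) * width(Q, u) < width(P, u):
            return False
    return True

P = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # unit square corners
Q = [(0.0, 0.0), (1.0, 1.0)]  # misses the corners (1,0) and (0,1)
```

Here `looks_like_eps_kernel(P, P, 0.1)` trivially passes, while `Q` fails: in the diagonal direction (−1, 1)/√2 its width is 0 but P's width is √2, so no small expansion of a slab through `Q` can cover P.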
Approximate clustering of distributed data streams
In ICDE Conference, 2008
Cited by 11 (0 self)
Abstract
We investigate the problem of clustering distributed data streams. In particular, we consider k-median clustering on stream data arriving at distributed sites that communicate through a routing tree. Distributed clustering on high-speed data streams is a challenging task due to the limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing a (1 + ε)-approximate k-median clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per-node transmission to polylog N (as opposed to Ω(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data points. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results indicate that the algorithms are scalable with respect to the data volume, the approximation factor, and the number of sites.