Results 1 - 10 of 30
Data Streams: Algorithms and Applications, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract - Cited by 533 (22 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or a few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
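As a concrete illustration of the one-pass, sublinear-space regime the survey describes, here is a minimal Python sketch of the classic Misra-Gries frequent-items algorithm (the algorithm is standard and predates the survey; the function and variable names are ours):

```python
# Misra-Gries: one pass over the stream, at most k counters,
# and each item's count is underestimated by at most n/(k+1).
def misra_gries(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Decrement every counter; drop those reaching zero.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

# Any item occurring more than n/(k+1) times survives in the summary.
print(misra_gries(["a", "b", "a", "c", "a", "a", "d"], k=2))
```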
Coresets in Dynamic Geometric Data Streams, 2005
"... A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1,..., ∆} d [26]. We develop streaming (1 + ɛ)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), m ..."
Abstract - Cited by 34 (4 self)
A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1, ..., Δ}^d [26]. We develop streaming (1+ε)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that, with probability 2/3, approximates the current point set with respect to the considered problem throughout the m insert/delete operations of the data stream. They use poly(1/ε, log m, log Δ) space and update time per insert/delete operation for constant k and dimension d. Given a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential-time algorithm is sometimes feasible, as its running time may still be polynomial in n. For example, one can compute in poly(log n, exp(O((1 + log(1/ε)/ε)^{d-1}))) time a solution to k-median and k-means [21], where n is the size of the current point set and k and d are constants. Finding an implicit solution to MaxCut can be done in poly(log n, exp((1/ε)^{O(1)})) time. For MaxST and average distance we require poly(log n, 1/ε) time, and for MaxWM we require O(n³) time.
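To make the coreset idea concrete, the following hedged Python sketch compares the k-means cost of a full point set against the weighted cost on a small weighted subset. The uniform-sampling "coreset" here is purely illustrative (a real coreset guarantees a (1+ε)-approximation for every choice of centers); it is not the paper's construction:

```python
import random

def kmeans_cost(points, centers, weights=None):
    # Weighted sum of squared distances to the nearest center.
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        d2 = min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
        total += w * d2
    return total

random.seed(0)
points = [(random.random(), random.random()) for _ in range(10000)]
centers = [(0.25, 0.25), (0.75, 0.75)]

# Toy stand-in for a coreset: s uniform samples, each weighted n/s.
s = 200
sample = random.sample(points, s)
weights = [len(points) / s] * s

print(kmeans_cost(points, centers))           # exact cost
print(kmeans_cost(sample, centers, weights))  # weighted estimate
```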
Deterministic Sampling and Range Counting in Geometric Data Streams
In Proc. 20th ACM Sympos. Comput. Geom., 2004
"... We present memory-efficient deterministic algorithms for constructing #-nets and #-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used t ..."
Abstract - Cited by 32 (0 self)
We present memory-efficient deterministic algorithms for constructing ε-nets and ε-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries.
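For reference, the standard definitions of the two sample types being constructed (stated here in their usual textbook form; the paper's streaming setting changes how the samples are built, not what they guarantee):

```latex
% For a range space (X, \mathcal{R}) with |X| = n:
% N \subseteq X is an \varepsilon-net if every range R \in \mathcal{R}
% with |R \cap X| \ge \varepsilon n contains a point of N.
% A \subseteq X is an \varepsilon-approximation if for every R \in \mathcal{R}:
\left| \frac{|R \cap X|}{|X|} - \frac{|R \cap A|}{|A|} \right| \le \varepsilon
```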
Optimal tracking of distributed heavy hitters and quantiles
In PODS, 2009
"... We consider the the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, ..."
Abstract - Cited by 24 (9 self)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1, ..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; the φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O((k/ε) · log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
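The two definitions in the abstract are easy to state exactly offline. The following naive Python sketch (ours, purely to fix the definitions; it has none of the paper's communication machinery) computes the φ-heavy hitters and a φ-quantile of a multiset:

```python
from collections import Counter

def heavy_hitters(A, phi):
    # phi-heavy hitters: elements with frequency at least phi * |A|.
    counts = Counter(A)
    return {x for x, c in counts.items() if c >= phi * len(A)}

def quantile(A, phi):
    # phi-quantile: an x with at most phi*|A| elements smaller than x
    # and at most (1-phi)*|A| elements greater than x.
    s = sorted(A)
    return s[min(int(phi * len(s)), len(s) - 1)]

A = [1, 2, 2, 2, 3, 5, 5, 8, 8, 8, 8, 9]
print(heavy_hitters(A, 0.25))  # {2, 8}
print(quantile(A, 0.5))        # a valid median element: 5
```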
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract - Cited by 22 (7 self)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
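To illustrate the heavy-hitter mergeability result, here is a short Python sketch of the merge step for two Misra-Gries (MG) summaries, following the idea in the paper (sum the counters, then subtract the (k+1)-th largest count and prune); details such as tie handling are our simplifications:

```python
def mg_merge(s1, s2, k):
    # Sum counters from both summaries (dicts item -> count).
    merged = dict(s1)
    for x, c in s2.items():
        merged[x] = merged.get(x, 0) + c
    if len(merged) <= k:
        return merged
    # Subtract the (k+1)-th largest counter value from every counter
    # and drop non-positive ones; at most k counters survive, and the
    # combined error stays bounded by (n1 + n2) / (k + 1).
    kth = sorted(merged.values(), reverse=True)[k]
    return {x: c - kth for x, c in merged.items() if c - kth > 0}

# Two MG summaries with at most k = 2 counters each:
print(mg_merge({"a": 5, "b": 2}, {"a": 1, "c": 4}, k=2))  # {'a': 4, 'c': 2}
```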
Adaptive spatial partitioning for multidimensional data streams
In ISAAC, 2004
"... We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, ..."
Abstract - Cited by 21 (5 self)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
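As a definitional aside (not the paper's sketch), an ε-hotspot can be found naively offline by sliding a fixed-size box over the data. This hedged Python sketch checks axis-aligned unit squares anchored at the input points in 2-D; the paper answers the same question from polylogarithmic space instead:

```python
def find_hotspots(points, eps, side=1.0):
    # Naive O(n^2) scan: report boxes of the given side length,
    # anchored at input points, that contain >= eps * n points.
    n = len(points)
    hotspots = []
    for (ax, ay) in points:
        inside = sum(1 for (x, y) in points
                     if ax <= x <= ax + side and ay <= y <= ay + side)
        if inside >= eps * n:
            hotspots.append((ax, ay))
    return hotspots

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (5.0, 5.0)]
print(find_hotspots(pts, eps=0.5))  # anchors of boxes holding >= half
```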
A Space-Optimal Data-Stream Algorithm for Coresets in the Plane
"... Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process ..."
Abstract - Cited by 19 (5 self)
Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width and robust kernels, as well as ε-kernels in higher dimensions.
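For intuition about ε-kernels, here is a hedged Python sketch of the classic offline construction that keeps the extreme point in each of O(1/√ε) evenly spaced directions. It assumes the point set is "fat" (roughly normalized); it is the textbook construction, not the paper's space-optimal streaming algorithm:

```python
import math

def directional_kernel(points, eps):
    # Keep the extreme point in each of ~2/sqrt(eps) directions.
    # For a fat (normalized) point set this yields an eps-kernel;
    # the normalization step is omitted here for brevity.
    m = max(4, math.ceil(2.0 / math.sqrt(eps)))
    kernel = set()
    for i in range(m):
        theta = 2 * math.pi * i / m
        ux, uy = math.cos(theta), math.sin(theta)
        kernel.add(max(points, key=lambda p: p[0] * ux + p[1] * uy))
    return list(kernel)

pts = [(math.cos(t / 7.0), math.sin(t / 7.0)) for t in range(44)]
print(len(directional_kernel(pts, eps=0.1)))  # small subset of extremes
```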
Comparing distributions and shapes using the kernel distance
In ACM SoCG, 2011
"... ..."
(Show Context)
Algorithms for ε-approximation of terrains, 2008
"... Consider a point set D with a measure functionµ: D→R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-parallel rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has anε-approximation P if ..."
Abstract - Cited by 10 (10 self)
Consider a point set D with a measure function µ: D → R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-parallel rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has an ε-approximation P if ...
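The abstract is cut off on the source page. The sentence presumably continues along the lines of the standard definition, stated here as an assumption (in the form where P is an unweighted sample and D carries the measure µ):

```latex
% Assumed standard definition: P is an \varepsilon-approximation of the
% range space (D, \mathcal{A}) with measure \mu if for every R \in \mathcal{A}:
\left| \frac{\mu(R \cap D)}{\mu(D)} - \frac{|R \cap P|}{|P|} \right| \le \varepsilon
```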
ε-samples for kernels
Proceedings 24th Annual ACM-SIAM Symposium on Discrete Algorithms, 2013
"... We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel d ..."
Abstract - Cited by 5 (5 self)
We study the worst-case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimate of the input distribution with that of the subset and bound the worst-case error. If the maximum error is ε, then this subset can be thought of as an ε-sample (aka an ε-approximation) of the range space defined with the input distribution as the ground set and the fixed kernel representing the family of ranges. Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus on kernels with range [0, 1]); these allow for smoother notions of range spaces. It turns out that the use of this smoother family of range spaces has an added benefit of greatly decreasing the size required for ε-samples. For instance, in the plane the size is O((1/ε^{4/3}) log^{2/3}(1/ε)) for disks (based on VC-dimension arguments) but is only O((1/ε) √(log(1/ε))) for Gaussian kernels and for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by studying the discrepancy of these “kernel” range spaces, and here the improvement in bounds is even more pronounced. In the plane, we show the discrepancy is O(√(log n)) for these kernels, whereas for ...
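To make the kernel ε-sample notion concrete, this hedged Python sketch measures the worst-case difference between the Gaussian kernel density estimate of a point set and that of a subsample, evaluated over a grid of query points (a finite stand-in for the supremum over all kernel centers); the bandwidth and sample sizes are arbitrary choices of ours:

```python
import math
import random

def kde(points, q, bandwidth=0.5):
    # Average Gaussian kernel value between the query q and the points.
    return sum(math.exp(-((p - q) ** 2) / (2 * bandwidth ** 2))
               for p in points) / len(points)

def worst_case_error(points, sample, queries):
    # Max |KDE(points, q) - KDE(sample, q)| over the query grid; if this
    # is at most eps for every center q, the sample is an eps-sample
    # for the kernel range space.
    return max(abs(kde(points, q) - kde(sample, q)) for q in queries)

random.seed(1)
pts = [random.gauss(0.0, 1.0) for _ in range(5000)]
sub = random.sample(pts, 200)
queries = [x / 10.0 for x in range(-30, 31)]
print(worst_case_error(pts, sub, queries))
```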