Results 1–10 of 30
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract

Cited by 533 (22 self)
 Add to MetaCart
(Show Context)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or a few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory, and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams, and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking, and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
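The one-pass, small-space regime described above can be illustrated with a classic technique that is generic to the field rather than specific to this survey: reservoir sampling (Vitter's Algorithm R), which maintains a uniform random sample of k items over a stream of unknown length in O(k) memory.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of k items from a stream of
    unknown length using O(k) memory (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)  # item x survives with probability k/(i+1)
            if j < k:
                reservoir[j] = x
    return reservoir
```

The stream is consumed exactly once and never stored in full, matching the one-pass, sublinear-space constraint the survey studies.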
Coresets in Dynamic Geometric Data Streams
, 2005
"... A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1,..., ∆} d [26]. We develop streaming (1 + ɛ)approximation algorithms for kmedian, kmeans, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), m ..."
Abstract

Cited by 34 (4 self)
 Add to MetaCart
A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1, ..., ∆}^d [26]. We develop streaming (1+ε)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that, with probability 2/3, approximates the current point set with respect to the considered problem during the m insert/delete operations of the data stream. They use poly(ε^{-1}, log m, log ∆) space and update time per insert/delete operation for constant k and dimension d. Having a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible, as its running time may still be polynomial in n. For example, one can compute in poly(log n, exp(O((1 + log(1/ε)/ε)^{d-1}))) time a solution to k-median and k-means [21], where n is the size of the current point set and k and d are constants. Finding an implicit solution to MaxCut can be done in poly(log n, exp((1/ε)^{O(1)})) time. For MaxST and average distance we require poly(log n, ε^{-1}) time, and for MaxWM we require O(n^3) time to do this.
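As a toy illustration of the coreset idea (not the paper's actual construction, which handles deletions and uses a more careful decomposition), one can snap points to a coarse grid and keep a single weighted representative per nonempty cell; a weighted objective such as the 1-median cost is then evaluated on the much smaller summary. The function names and the cell-width parameter below are illustrative assumptions.

```python
from collections import Counter

def grid_coreset(points, cell):
    """Snap each 2-D point to the center of its grid cell and keep one
    weighted representative per nonempty cell (a crude coreset sketch)."""
    weights = Counter()
    for (x, y) in points:
        cx = (int(x // cell) + 0.5) * cell
        cy = (int(y // cell) + 0.5) * cell
        weights[(cx, cy)] += 1
    return list(weights.items())  # [((x, y), weight), ...]

def weighted_1median_cost(coreset, center):
    """Evaluate the weighted L1 1-median cost on the coreset only."""
    cx, cy = center
    return sum(w * (abs(x - cx) + abs(y - cy)) for (x, y), w in coreset)
```

The summary size depends on the number of occupied cells rather than on the stream length, which is the essential point of a coreset.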
Deterministic Sampling and Range Counting in Geometric Data Streams
 In Proc. 20th ACM Sympos. Comput. Geom
, 2004
"... We present memoryefficient deterministic algorithms for constructing #nets and #approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used t ..."
Abstract

Cited by 32 (0 self)
 Add to MetaCart
(Show Context)
We present memory-efficient deterministic algorithms for constructing ε-nets and ε-approximations of streams of geometric data. Unlike probabilistic approaches, these deterministic samples provide guaranteed bounds on their approximation factors. We show how our deterministic samples can be used to answer approximate online iceberg geometric queries on data streams. We use these techniques to approximate several robust statistics of geometric data streams, including Tukey depth, simplicial depth, regression depth, the Theil-Sen estimator, and the least median of squares. Our algorithms use only a polylogarithmic amount of memory, provided the desired approximation factors are inverse-polylogarithmic. We also include a lower bound for non-iceberg geometric queries.
Optimal tracking of distributed heavy hitters and quantiles
 In PODS
, 2009
"... We consider the the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, ..."
Abstract

Cited by 24 (9 self)
 Add to MetaCart
(Show Context)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1, ..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; the φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O((k/ε) · log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
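The definitions of φ-heavy hitters and φ-quantiles above can be made concrete with a small brute-force (non-streaming) reference implementation; the function names here are ad hoc, not from the paper:

```python
from collections import Counter

def phi_heavy_hitters(A, phi):
    """Elements whose frequency in multiset A is at least phi * |A|."""
    n = len(A)
    counts = Counter(A)
    return {x for x, c in counts.items() if c >= phi * n}

def phi_quantile(A, phi):
    """Return an x with at most phi*|A| elements of A smaller than x
    and at most (1 - phi)*|A| elements greater (per the definition)."""
    n = len(A)
    for x in sorted(set(A)):
        smaller = sum(1 for y in A if y < x)
        greater = sum(1 for y in A if y > x)
        if smaller <= phi * n and greater <= (1 - phi) * n:
            return x
```

The distributed algorithms in the paper track exactly these quantities approximately, with communication rather than memory as the scarce resource.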
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract

Cited by 22 (7 self)
 Add to MetaCart
(Show Context)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic. Supported by NSF under grants CNS-0540347, IIS-07
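The MG (Misra-Gries) summary mentioned in the abstract illustrates mergeability well: two summaries of at most k counters each can be merged by adding the counters and then subtracting the (k+1)-st largest count from every counter, so at most k counters survive and the errors add as they would for a single stream. The sketch below is a simplified rendition under that description, not the paper's exact pseudocode.

```python
from collections import Counter

def mg_update(summary, x, k):
    """Misra-Gries update: keep at most k counters; each estimate
    undercounts the true frequency by at most n/(k+1)."""
    if x in summary or len(summary) < k:
        summary[x] += 1
    else:
        # no free counter: decrement all, dropping counters that hit zero
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]
    return summary

def mg_merge(s1, s2, k):
    """Merge two MG summaries: add counters, then subtract the
    (k+1)-st largest count so at most k counters remain."""
    merged = Counter(s1) + Counter(s2)
    if len(merged) > k:
        kth = sorted(merged.values(), reverse=True)[k]
        merged = Counter({x: c - kth for x, c in merged.items() if c > kth})
    return merged
```

Because merging is associative in effect, summaries can be combined bottom-up across distributed sites, which is precisely the use case the abstract highlights.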
Adaptive spatial partitioning for multidimensional data streams
 In ISAAC
, 2004
"... We propose a spaceefficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track εhotspots, which are congruent boxes containing at least an ε fraction of the stream, ..."
Abstract

Cited by 21 (5 self)
 Add to MetaCart
(Show Context)
We propose a space-efficient scheme for summarizing multidimensional data streams. Our sketch can be used to solve spatial versions of several classical data stream queries efficiently. For instance, we can track ε-hotspots, which are congruent boxes containing at least an ε fraction of the stream, and maintain hierarchical heavy hitters in d dimensions. Our sketch can also be viewed as a multidimensional generalization of the ε-approximate quantile summary. The space complexity of our scheme is O((1/ε) log R) if the points lie in the domain [0, R]^d, where d is assumed to be a constant. The scheme extends to the sliding window model with a log(εn) factor increase in space, where n is the size of the sliding window. Our sketch can also be used to answer ε-approximate rectangular range queries over a stream of d-dimensional points.
A SpaceOptimal DataStream Algorithm for Coresets in the Plane
"... Given a point set P ⊆ R², a subset Q ⊆ P is an εkernel of P if for every slab W containing Q, the (1+ε)expansion of W also contains P. We present a datastream algorithm for maintaining an εkernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process ..."
Abstract

Cited by 19 (5 self)
 Add to MetaCart
Given a point set P ⊆ R², a subset Q ⊆ P is an ε-kernel of P if for every slab W containing Q, the (1+ε)-expansion of W also contains P. We present a data-stream algorithm for maintaining an ε-kernel of a stream of points in R² that uses O(1/√ε) space and takes O(log(1/ε)) amortized time to process each point. This is the first space-optimal data-stream algorithm for this problem. As a consequence, we obtain improved data-stream approximation algorithms for other extent measures, such as width, robust kernels, as well as ε-kernels in higher dimensions.
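The slab definition above is equivalent to requiring that, in every direction, the width of Q is at least width(P)/(1+ε). A brute-force checker over sampled directions (an illustrative helper, not the paper's streaming algorithm) makes this concrete:

```python
import math

def width(points, theta):
    """Directional width of a planar point set along direction theta."""
    proj = [x * math.cos(theta) + y * math.sin(theta) for x, y in points]
    return max(proj) - min(proj)

def is_eps_kernel(Q, P, eps, directions=360):
    """Approximate check of the eps-kernel property: in every sampled
    direction, width(Q) * (1 + eps) must cover width(P)."""
    for i in range(directions):
        theta = math.pi * i / directions  # [0, pi) suffices by symmetry
        if width(Q, theta) * (1 + eps) < width(P, theta):
            return False
    return True
```

For example, the four corners of a square form an ε-kernel of any point set inside the square, while two opposite corners alone do not, since the width collapses along the other diagonal.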
Comparing distributions and shapes using the kernel distance
 In ACM SoCG
, 2011
"... ..."
(Show Context)
Algorithms for εapproximation of terrains
, 2008
"... Consider a point set D with a measure functionµ: D→R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axisparallel rectangles, half planes, balls, koriented polygons). We say a range space (D, A) has anεapproximation P if ..."
Abstract

Cited by 10 (10 self)
 Add to MetaCart
(Show Context)
Consider a point set D with a measure function µ: D → R. Let A be the set of subsets of D induced by containment in a shape from some geometric family (e.g. axis-parallel rectangles, half planes, balls, k-oriented polygons). We say a range space (D, A) has an ε-approximation P if
ε-samples for kernels
 In Proceedings 24th Annual ACM-SIAM Symposium on Discrete Algorithms
, 2013
"... We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel d ..."
Abstract

Cited by 5 (5 self)
 Add to MetaCart
(Show Context)
We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. a Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimate of the input distribution with that of the subset and bound the worst case error. If the maximum error is ε, then this subset can be thought of as an ε-sample (aka an ε-approximation) of the range space defined with the input distribution as the ground set and the fixed kernel representing the family of ranges. Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus on kernels with range [0, 1]); these allow for smoother notions of range spaces. It turns out the use of this smoother family of range spaces has an added benefit of greatly decreasing the size required for ε-samples. For instance, in the plane the size is O((1/ε^{4/3}) log^{2/3}(1/ε)) for disks (based on VC-dimension arguments) but is only O((1/ε) √(log(1/ε))) for Gaussian kernels and for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by studying the discrepancy of these "kernel" range spaces, and here the improvement in bounds is even more pronounced. In the plane, we show the discrepancy is O(√(log n)) for these kernels, whereas for
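The quantity being bounded, the worst-case difference between the kernel density estimate of the full set and that of the subset, can be evaluated directly in one dimension. This hedged sketch measures the error only at a fixed set of query points rather than over the whole domain, and the function names are illustrative.

```python
import math

def kde(points, q, sigma=1.0):
    """Gaussian kernel density estimate of a 1-D point set, evaluated at q."""
    return sum(math.exp(-((p - q) ** 2) / (2 * sigma ** 2))
               for p in points) / len(points)

def kde_error(P, Q, queries, sigma=1.0):
    """Worst-case |KDE_P(q) - KDE_Q(q)| over the query points; if this
    is at most eps everywhere, Q behaves like an eps-sample for the
    kernel range space described above."""
    return max(abs(kde(P, q, sigma) - kde(Q, q, sigma)) for q in queries)
```

A well-chosen subset half the size of P can already drive this error well below the trivial bound of 1, which is the phenomenon the size bounds in the abstract quantify.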