Results 1–10 of 20
Tight bounds for Lp samplers, finding duplicates in streams, and related problems
In PODS, 2011
Abstract

Cited by 17 (0 self)
In this paper, we present near-optimal space bounds for Lp samplers. Given a stream of updates (additions and subtractions) to the coordinates of an underlying vector x ∈ R^n, a perfect Lp sampler outputs the i-th coordinate with probability |x_i|^p/‖x‖_p^p. In SODA 2010, Monemizadeh and Woodruff showed polylog-space upper bounds for approximate Lp samplers and demonstrated various applications of them. Very recently, Andoni, Krauthgamer and Onak improved the upper bounds and gave an O(ε^(-p) log^3 n) space, ε-relative-error, constant-failure-rate Lp sampler for p ∈ [1, 2]. In this work, we give another such algorithm requiring only O(ε^(-p) log^2 n) space for p ∈ (1, 2). For p ∈ (0, 1), our space bound is O(ε^(-1) log^2 n), while for the p = 1 case we have an O(log(1/ε) ε^(-1) log^2 n) space algorithm. We also give an O(log^2 n)-bit zero-relative-error L0 sampler, improving on the O(log^3 n)-bit algorithm due to Frahling, Indyk and Sohler. As an application of our samplers, we give better upper bounds for the problem of finding duplicates in data streams: when the length of the stream exceeds the alphabet size, L1 sampling gives us an O(log^2 n) space algorithm, improving the previous O(log^3 n) bound due to Gopalan and Radhakrishnan. In the second part of our work, we prove an Ω(log^2 n) lower bound for sampling from {0, ±1} vectors (in this special case, the parameter p is not relevant for Lp sampling). This matches the space of our sampling algorithms for constant ε > 0. We also prove tight space lower bounds for the finding-duplicates and heavy-hitters problems. We obtain these lower bounds using reductions from the communication complexity problem augmented indexing.
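The output distribution of a perfect Lp sampler can be illustrated with a short non-streaming sketch (Python; this enumerates the vector exactly and is not the paper's small-space algorithm):

```python
import random

def lp_sample(x, p, num_samples=1):
    """Draw indices i with probability |x_i|^p / ||x||_p^p by exact
    enumeration (the definition only, not a streaming algorithm)."""
    weights = [abs(v) ** p for v in x]
    if sum(weights) == 0:
        raise ValueError("cannot sample from the all-zero vector")
    return random.choices(range(len(x)), weights=weights, k=num_samples)

# For x = (3, -4) and p = 2, index 0 should appear with probability 9/25 = 0.36.
samples = lp_sample([3, -4], p=2, num_samples=100_000)
frac = samples.count(0) / 100_000  # empirically close to 0.36
```

The streaming algorithms in the paper achieve (approximately) this distribution without ever storing x explicitly.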
Measuring independence of datasets
 CoRR
Abstract

Cited by 14 (2 self)
Approximating pairwise, or k-wise, independence with sublinear memory is of considerable importance in the data stream model. In the streaming model the joint distribution is given by a stream of k-tuples, with the goal of testing correlations among the components measured over the entire stream. Indyk and McGregor (SODA '08) recently gave exciting new results for measuring pairwise independence in this model. Statistical distance is one of the most fundamental metrics for measuring the similarity of two distributions, and it has been the metric of choice in many papers that discuss distribution closeness. For pairwise independence, the Indyk and McGregor methods provide a log n approximation under the statistical distance between the joint and product distributions in the streaming model. Indyk and McGregor leave, as their main open question, the problem of improving their log n approximation for the statistical distance metric. In this paper we solve the main open problem posed by Indyk and McGregor for the statistical distance for pairwise independence and extend this result to any constant k. In particular, we present an algorithm that computes an (ε, δ)-approximation of the statistical distance between the joint and product distributions defined by a stream of m k-tuples over [n]. Our algorithm requires O(((1/ε) log(nm/δ))^((30+k)^k)) memory and a single pass over the data stream.
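The quantity being approximated, the statistical distance between the joint distribution and the product of its marginals, can be computed exactly for small streams of pairs (a non-streaming illustration in Python, not the paper's sublinear-memory algorithm):

```python
from collections import Counter
from itertools import product

def statistical_distance_from_independence(pairs):
    """Half the L1 distance between the empirical joint distribution of a
    stream of pairs and the product of its marginals (exact, linear memory)."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return 0.5 * sum(
        abs(joint[(a, b)] / n - (left[a] / n) * (right[b] / n))
        for a, b in product(left.keys(), right.keys())
    )

# A perfectly correlated stream is far from independent:
print(statistical_distance_from_independence([(0, 0), (1, 1)] * 500))  # 0.5
```

A stream whose components are actually independent yields distance 0; the paper's contribution is estimating this value in a single pass with bounded memory.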
Zero-One Frequency Laws
Abstract

Cited by 13 (4 self)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the "AMS" paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give a precise characterization of the functions G on frequency vectors m_i (1 ≤ i ≤ n) for which ∑_{i∈[n]} G(m_i) can be approximated efficiently, where "efficiently" means by a single pass over the data stream and polylogarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R → R such that G(0) = 0 and G can be computed in polylogarithmic time and space, and ask: for which G in this class is there a (1±ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) for any polylogarithmic ε? We give an algebraic characterization for all such G so that: • For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) with polylogarithmic memory and a single pass over the data stream; while • For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound.
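The sum the AMS question asks about can be defined by a naive linear-memory baseline (Python sketch; the zero-one law concerns when this can instead be approximated in polylog space and one pass):

```python
from collections import Counter

def g_sum(stream, G):
    """Exact sum over i of G(m_i), where m_i is the frequency of item i in
    the stream. Linear-memory baseline that merely defines the quantity."""
    freqs = Counter(stream)
    return sum(G(m) for m in freqs.values())

# G(m) = m^2 recovers the second frequency moment F2:
print(g_sum("abracadabra", lambda m: m * m))  # 5^2 + 2^2 + 2^2 + 1 + 1 = 35
```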
Optimal random sampling from distributed streams revisited
In DISC, 2011
Abstract

Cited by 8 (1 self)
We give an improved algorithm for drawing a random sample from a large data stream when the input elements are distributed across multiple sites which communicate via a central coordinator. At any point in time, the set of elements held by the coordinator represents a uniform random sample from the set of all the elements observed so far. Compared with prior work, our algorithms asymptotically improve the total number of messages sent in the system as well as the computation required of the coordinator. We also present a matching lower bound, showing that our protocol sends the optimal number of messages up to a constant factor with large probability. As a byproduct, we obtain an improved algorithm for finding the heavy hitters across multiple distributed sites.
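The invariant the coordinator maintains, that its sample is uniform over everything seen so far, is the same one kept by classic single-site reservoir sampling (Vitter's Algorithm R). A Python sketch of that baseline, not of the paper's message-efficient distributed protocol:

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream (Algorithm R).
    After n >= k elements, each element is in the reservoir with probability k/n."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            j = random.randrange(n)  # uniform in [0, n)
            if j < k:
                reservoir[j] = item  # replace a uniformly chosen slot
    return reservoir
```

In the distributed setting, running this naively would cost a message per element; the paper's point is achieving the same invariant with asymptotically fewer messages.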
The Continuous Distributed Monitoring Model
2013
Abstract

Cited by 7 (0 self)
In the model of continuous distributed monitoring, a number of observers each see a stream of observations. Their goal is to work together to compute a function of the union of their observations. This can be as simple as counting the total number of observations, or more complex nonlinear functions such as tracking the entropy of the induced distribution. Assuming that it is too costly to simply centralize all the observations, it becomes quite challenging to design solutions which provide a good approximation to the current answer while bounding the communication cost of the observers and their other resources, such as their space usage. This survey introduces the model and describes a selection of results in this setting, from the simple counting problem to a variety of other functions that have been studied.
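For the simple counting problem, a folklore protocol illustrates the model's trade-off: each site reports its local count only when it has grown by more than a (1+eps) factor since its last report. The sketch below is a hypothetical simulation harness, not any specific algorithm from the survey:

```python
def approximate_distributed_count(arrivals, num_sites, eps=0.1):
    """Simulate a (1+eps)-approximate distributed counter. `arrivals` is a
    sequence of site indices, one per observation. A site sends an update
    whenever its local count exceeds (1+eps) times its last report, so the
    coordinator's estimate sum(reported) satisfies
    estimate <= true_count <= (1+eps) * estimate at all times, while each
    site sends only O(log_{1+eps} n) messages."""
    local = [0] * num_sites
    reported = [0] * num_sites
    messages = 0
    for site in arrivals:
        local[site] += 1
        if local[site] > (1 + eps) * reported[site]:
            reported[site] = local[site]  # send an update to the coordinator
            messages += 1
    return sum(reported), messages

est, msgs = approximate_distributed_count([0] * 100 + [1] * 50, num_sites=2, eps=0.5)
```

With 150 observations, the estimate is within a factor 1.5 of the truth while far fewer than 150 messages are sent.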
Element Distinctness, Frequency Moments, and Sliding Windows
Abstract

Cited by 4 (0 self)
We derive new time-space tradeoff lower bounds and algorithms for exactly computing statistics of input data, including frequency moments, element distinctness, and order statistics, that are simple to calculate for sorted data. In particular, we develop a randomized algorithm for the element distinctness problem whose time T and space S satisfy T ∈ Õ(n^(3/2)/S^(1/2)), smaller than previous lower bounds for comparison-based algorithms, showing that element distinctness is strictly easier than sorting for randomized branching programs. This algorithm is based on a new time- and space-efficient algorithm for finding all collisions of a function f from a finite set to itself that are reachable by iterating f from a given set of starting points. We further show that our element distinctness algorithm can be extended at only a polylogarithmic factor cost to solve the element distinctness problem over sliding windows [18], where the task is to take an input of length 2n − 1 and produce an output for each window of length n, giving n outputs in total. In contrast, we show a time-space tradeoff lower bound of T ∈ Ω(n^2/S) for randomized multiway branching programs, and hence standard RAM and word-RAM models, to compute the number of distinct elements, F_0, over sliding windows. The same lower bound holds for computing the low-order bit of F_0 and for computing any frequency moment F_k with k ≠ 1. This shows that the frequency moments F_k, k ≠ 1, and even the decision problem F_0 mod 2, are strictly harder than element distinctness. We provide even stronger separations on average for inputs from [n]. We complement this lower bound with a T ∈ Õ(n^2/S) comparison-based deterministic RAM algorithm for exactly computing F_k over sliding windows, nearly matching both our general lower bound for the sliding-window version and the comparison-based lower bounds for a single instance of the problem. We also consider the computation of order statistics over sliding windows.
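The collision-finding subroutine can be illustrated by a naive linear-space version (Python; the paper's contribution is performing this search with far less time and space):

```python
def reachable_collisions(f, starts, universe_size):
    """Find collisions of f reachable by iterating f from the given starts.
    Naive approach: walk f from each start, remembering the first predecessor
    seen for every node. When a node v is reached again from a different
    predecessor u, the pair (first_pred, u) is a collision: f maps both to v.
    Uses linear space, unlike the paper's space-efficient algorithm."""
    pred = {}          # node -> first predecessor that reached it
    collisions = set()
    for s in starts:
        u = s
        for _ in range(universe_size + 1):  # a walk must repeat within |U| steps
            v = f(u)
            if v in pred:
                if pred[v] != u:
                    collisions.add(frozenset((pred[v], u)))
                break  # everything beyond v was already explored
            pred[v] = u
            u = v
    return collisions

# f(x) = x^2 mod 11 collides on x and -x, e.g. f(1) = f(10) = 1.
cols = reachable_collisions(lambda v: v * v % 11, range(11), 11)
```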
Independent range sampling
In PODS, 2014
Abstract

Cited by 3 (1 self)
This paper studies the independent range sampling problem. The input is a set P of n points in R. Given an interval q = [x, y] and an integer t ≥ 1, a query returns t elements uniformly sampled (with or without replacement) from P ∩ q. The sampling result must be independent of those returned by previous queries. The objective is to store P in a structure that answers all queries efficiently. If P fits in memory, the problem is interesting when P is dynamic (i.e., allowing insertions and deletions). The state of the art is a structure of O(n) space that answers a query in O(t log n) time and supports an update in O(log n) time. We describe a new structure of O(n) space that answers a query in O(log n + t) expected time and supports an update in O(log n) time. If P does not fit in memory, the problem is challenging even when P is static. The best known structure incurs O(log_B n + t) I/Os per query, where B is the block size. We develop a new structure of O(n/B) space that answers a query in O(log*(n/B) + log_B n + (t/B) log_{M/B}(n/B)) amortized expected I/Os, where M is the memory size, and log*(n/B) is the number of iterated log_2(·) operations we need to perform on n/B before going below a constant. We also give a lower bound argument showing that this is nearly optimal; in particular, the multiplicative term log_{M/B}(n/B) is necessary.
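For the static in-memory case, a with-replacement baseline already achieves O(log n + t) per query: two binary searches locate P ∩ [x, y] in a sorted array, then each sample is an O(1) random index. A Python sketch of this baseline (not the paper's dynamic or external-memory structures):

```python
import bisect
import random

def range_sample(sorted_points, x, y, t):
    """Sample t points uniformly with replacement from P ∩ [x, y], where P is
    stored as a sorted list. Fresh randomness on every call keeps the results
    of different queries independent of one another."""
    lo = bisect.bisect_left(sorted_points, x)
    hi = bisect.bisect_right(sorted_points, y)
    if lo == hi:
        return []  # empty intersection
    return [sorted_points[random.randrange(lo, hi)] for _ in range(t)]

pts = sorted([3, 1, 4, 1, 5, 9, 2, 6])
sample = range_sample(pts, 2, 5, 3)  # three values drawn from {2, 3, 4, 5}
```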
Don’t Let The Negatives Bring You Down: Sampling from Streams of Signed Updates
Abstract

Cited by 3 (1 self)
Random sampling has been proven time and time again to be a powerful tool for working with large data. Queries over the full dataset are replaced by approximate queries over the smaller (and hence easier to store and manipulate) sample. The sample constitutes a flexible summary that supports a wide class of queries. But in many applications, datasets are modified over time, and it is desirable to update samples without requiring access to the full underlying datasets. In this paper, we introduce and analyze novel techniques for sampling over dynamic data, modeled as a stream of modifications to weights associated with each key. While sampling schemes designed for stream applications can often readily accommodate positive updates to the dataset, much less is known for the case of negative updates, where weights are reduced or items deleted altogether. We primarily consider the turnstile model of streams and extend classic schemes to incorporate negative updates. Perhaps surprisingly, the modifications needed to handle negative updates turn out to be natural and seamless extensions of the well-known positive-update-only algorithms. We show that they produce unbiased estimators, and we relate their performance to the behavior of corresponding algorithms on insert-only streams with different parameters. A careful analysis is necessitated in order to account for the fact that sampling choices for one key now depend on the choices made for other keys. In practice, our solutions turn out to be efficient and accurate. Compared to recent algorithms for Lp sampling, which can be applied to this problem, they are significantly more reliable and dramatically faster.
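The turnstile model itself can be illustrated by an exact, memory-unbounded baseline: apply signed updates to per-key net weights and sample proportionally. This is a Python sketch of the target sampling behaviour, not the paper's small-space schemes; it assumes net weights are nonnegative at query time:

```python
import random
from collections import defaultdict

def net_weights(updates):
    """Apply a turnstile stream of (key, delta) updates; deltas may be negative."""
    w = defaultdict(float)
    for key, delta in updates:
        w[key] += delta
        if w[key] == 0:
            del w[key]  # fully deleted items drop out of the summary
    return dict(w)

def weighted_sample(weights):
    """Draw one key with probability proportional to its (nonnegative) net weight."""
    keys = list(weights)
    return random.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

stream = [("a", 5.0), ("b", 3.0), ("a", -5.0), ("c", 2.0)]
w = net_weights(stream)  # {'b': 3.0, 'c': 2.0}; key 'a' was fully deleted
```

The difficulty the paper addresses is achieving this behaviour without materializing the full weight table.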
Approximating Sliding Windows by Cyclic Tree-Like Histograms for Efficient Range Queries
Abstract

Cited by 2 (0 self)
The issue of providing fast approximate answers to range queries on sliding windows with small storage consumption is one of the main challenges in the context of data streams. On the one hand, the importance of this class of queries is widely accepted: they are useful for computing aggregate information over the data stream, allowing us to extract from it more abstract knowledge than point queries do. On the other hand, the use of techniques like synopses based on histograms, sketches, sampling, and so on makes feasible those approaches which require multiple scans over the data, which would otherwise be computationally prohibitive. Among these techniques, histogram-based approaches are considered one of the most advantageous solutions, at least in the case of range queries: histograms show a very good capability of summarizing data while preserving quick and accurate answers to range queries. In this paper, we propose a novel histogram-based technique to reduce sliding windows while supporting approximate arbitrary range-sum queries. Our histogram, relying on a tree-based structure, is suitable for directly supporting hierarchical queries and, thus, drill-down and roll-up operations. In addition, the structure supports sliding-window shifting and quick query answering well, since it operates in time logarithmic in the sliding-window size. A bit-saving approach to encoding tree nodes allows us to compress the sliding window at a small price in terms of accuracy. The contribution of this work is thus not only the proposal of a new technique for an important problem but also a deep analysis of the advantages of the hierarchical approach combined with the bit-saving strategy. A careful experimental analysis validates the method, showing its superiority w.r.t. the state of the art.
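As a simplified exact stand-in for the tree-like structure, a Fenwick tree answers range-sum queries and point updates in logarithmic time (Python sketch; the paper's histogram adds bit-saving compression and sliding-window shifting on top of this kind of hierarchical layout):

```python
class PrefixSums:
    """Fenwick tree: point updates and range-sum queries in O(log n).
    An exact baseline, unlike the paper's approximate compressed histogram."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to the element at 0-based index i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def prefix(self, i):
        """Sum of elements in [0, i)."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def range_sum(self, lo, hi):
        """Sum of elements in [lo, hi)."""
        return self.prefix(hi) - self.prefix(lo)
```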
Towards an Algorithmic Theory of Compressed Sensing
Rutgers Univ., Tech. Rep., 2005
Cited by 1 (0 self)