Results 1  10
of
67
Models and issues in data stream systems
 IN PODS
, 2002
"... In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, timevarying data streams. In addition to reviewing past work releva ..."
Abstract

Cited by 770 (19 self)
 Add to MetaCart
(Show Context)
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, timevarying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Finding frequent items in data streams
, 2002
"... Abstract. We present a 1pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves bett ..."
Abstract

Cited by 344 (0 self)
 Add to MetaCart
Abstract. We present a 1pass algorithm for estimating the most frequent items in a data stream using very limited storage space. Our method relies on a novel data structure called a count sketch, which allows us to estimate the frequencies of all the items in the stream. Our algorithm achieves better space bounds than the previous best known algorithms for this problem for many natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this problem has not been previously studied in the literature. 1
An Information Statistics Approach to Data Stream and Communication Complexity
, 2003
"... We present a new method for proving strong lower bounds in communication complexity. ..."
Abstract

Cited by 240 (8 self)
 Add to MetaCart
We present a new method for proving strong lower bounds in communication complexity.
Nearoptimal lower bounds on the multiparty communication complexity of set disjointness
 In IEEE Conference on Computational Complexity
, 2003
"... We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a nearoptimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their ..."
Abstract

Cited by 89 (7 self)
 Add to MetaCart
(Show Context)
We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a nearoptimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their sets are disjoint. In the more restrictive oneway communication model, in which the players are required to speak in a predetermined order, we improve our bound to an optimal Ω(n/t). These results improve upon the earlier bounds of Ω(n/t 2) in the general model, and Ω(ε 2 n/t 1+ε) in the oneway model, due to BarYossef, Jayram, Kumar, and Sivakumar [5]. As in the case of earlier results, our bounds apply to the unique intersection promise problem. This communication problem is known to have connections with the space complexity of approximating frequency moments in the data stream model. Our results lead to an improved space complexity lower bound of Ω(n 1−2/k / log n) for approximating the k th frequency moment with a constant number of passes over the input, and a technical improvement to Ω(n 1−2/k) if only one pass over the input is permitted. Our proofs rely on the information theoretic direct sum decomposition paradigm of BarYossef et al [5]. Our improvements stem from novel analytical tech
Simpler algorithm for estimating frequency moments of data streams
 PROCEEDINGS OF THE SEVENTEENTH ANNUAL ACMSIAM SYMPOSIUM ON DISCRETE ALGORITHM
, 2006
"... The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for e ..."
Abstract

Cited by 45 (4 self)
 Add to MetaCart
The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for estimating Fk, for k > 2, using space Õ(n12/k), matching the space lower bound (up to polylogarithmic factors) for this problem [1, 2, 3, 4, 13] (n is the number of distinct items occurring in the stream.) In this paper, we present a simpler 1pass algorithm for estimating Fk.
Two Applications of Information Complexity
, 2003
"... We show the following new lower bounds in two concrete complexity models: (1) In the twoparty communication complexity model, we show that the tribes function on n inputs [6] has twosided error randomized complexity # n), while its nondeterminstic complexity and conondeterministic complexity are ..."
Abstract

Cited by 40 (1 self)
 Add to MetaCart
We show the following new lower bounds in two concrete complexity models: (1) In the twoparty communication complexity model, we show that the tribes function on n inputs [6] has twosided error randomized complexity # n), while its nondeterminstic complexity and conondeterministic complexity are both #( # n). This separation between randomized and nondeterministic complexity is the best possible and it settles an open problem in Kushilevitz and Nisan [17], which was also posed by Beame and Lawry [5].
Sketching and Streaming Entropy via Approximation Theory
"... We conclude a sequence of work by giving nearoptimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and nearopti ..."
Abstract

Cited by 32 (0 self)
 Add to MetaCart
(Show Context)
We conclude a sequence of work by giving nearoptimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain suboptimal space bounds in the general model, and nearoptimal bounds in the insertiononly model without sketching. Our highlevel approach is simple: we give algorithms to estimate Rényi and Tsallis entropy, and use them to extrapolate an estimate of Shannon entropy. The accuracy of our estimates is proven using approximation theory arguments and extremal properties of Chebyshev polynomials, a technique which may be useful for other problems. Our work also yields the bestknown and nearoptimal additive approximations for entropy, and hence also for conditional entropy and mutual information.
Buffering in query evaluation over XML streams
 In Proceedings of the 24th ACM Symposium on Principles of Database Systems (PODS
, 2005
"... All known algorithms for evaluating advanced XPath queries (e.g., ones with predicates or with closure axes) on XML streams employ buffers to temporarily store fragments of the document stream. In many cases, these buffers grow very large and constitute a major memory bottleneck. In this paper, we i ..."
Abstract

Cited by 31 (4 self)
 Add to MetaCart
(Show Context)
All known algorithms for evaluating advanced XPath queries (e.g., ones with predicates or with closure axes) on XML streams employ buffers to temporarily store fragments of the document stream. In many cases, these buffers grow very large and constitute a major memory bottleneck. In this paper, we identify two broad classes of evaluation problems that independently necessitate the use of large memory buffers in evaluation of queries over XML streams: (1) fullfledged evaluation (as opposed to just filtering) of queries with predicates; (2) evaluation (whether fullfledged or filtering) of queries with “multivariate ” predicates. We prove quantitative lower bounds on the amount of memory required in each of these scenarios. The bounds are stated in terms of novel document properties that we define. We show that these scenarios, in combination with query evaluation over recursive documents, cover the cases in which large buffers are required. Finally, we present algorithms that match the lower bounds for an important fragment of XPath.
Fast moment estimation in data streams in optimal space
 In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC
, 2011
"... We give a spaceoptimal algorithm with update time O(log 2 (1/ε) log log(1/ε)) for (1 ± ε)approximating the pth frequency moment, 0 < p < 2, of a lengthn vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous spaceoptim ..."
Abstract

Cited by 27 (6 self)
 Add to MetaCart
(Show Context)
We give a spaceoptimal algorithm with update time O(log 2 (1/ε) log log(1/ε)) for (1 ± ε)approximating the pth frequency moment, 0 < p < 2, of a lengthn vector updated in a data stream. This provides a nearly exponential improvement in the update time complexity over the previous spaceoptimal algorithm of [KaneNelsonWoodruff, SODA 2010], which had update time Ω(1/ε 2). 1