Results 1–10 of 238
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract

Cited by 538 (22 self)
 Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or a few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory, and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams, and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
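To give a concrete flavor of the constraints the survey describes (one pass over the data, memory far below the input size), here is a minimal sketch of the classic Misra-Gries frequent-items algorithm. This is an illustrative choice of streaming algorithm, not code from the survey itself:

```python
def misra_gries(stream, k):
    """One pass, at most k-1 counters; any item whose frequency exceeds
    len(stream)/k is guaranteed to survive in the counter table."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # decrement every counter; drop those that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, 3))  # → {'a': 3}
```

The surviving counters are candidate heavy hitters; a second pass (or an accepted approximation) confirms exact frequencies. Space is O(k) regardless of stream length, which is exactly the "space less than linear in the input" regime the survey studies.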
Index Coding with Side Information
, 2006
"... Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1,..., Rn. He holds an input x ∈ {0, 1} n and wishes to broadcast a single message so that each receiver Ri c ..."
Abstract

Cited by 105 (0 self)
 Add to MetaCart
(Show Context)
Motivated by a problem of transmitting supplemental data over broadcast channels (Birk and Kol, INFOCOM 1998), we study the following coding problem: a sender communicates with n receivers R1, ..., Rn. He holds an input x ∈ {0, 1}^n and wishes to broadcast a single message so that each receiver Ri can recover the bit xi. Each Ri has prior side information about x, induced by a directed graph G on n nodes; Ri knows the bits of x in the positions {j : (i, j) is an edge of G}. G is known to the sender and to the receivers. We call encoding schemes that achieve this goal INDEX codes for {0, 1}^n with side information graph G. In this paper we identify a measure on graphs, the minrank, which exactly characterizes the minimum length of linear and certain types of nonlinear INDEX codes. We show that for natural classes of side information graphs, including directed acyclic graphs, perfect graphs, odd holes, and odd antiholes, minrank is the optimal length of arbitrary INDEX codes. For arbitrary INDEX codes and arbitrary graphs, we obtain a lower bound in terms of the size of the maximum acyclic induced subgraph. This bound holds even for randomized codes, but is shown not to be tight.
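The savings an INDEX code buys are easiest to see in the extreme case of the complete side-information graph, where every receiver already knows every bit except its own: minrank is 1, so a single parity bit replaces n separate transmissions. A hypothetical sketch (not from the paper) over GF(2):

```python
from functools import reduce
import operator

def encode(x):
    # the sender broadcasts a single bit: the parity (XOR) of all input bits
    return reduce(operator.xor, x)

def decode(broadcast, side_info):
    # a receiver XORs its known bits out of the broadcast to recover its own bit
    return reduce(operator.xor, side_info, broadcast)

x = [1, 0, 1, 1]
b = encode(x)
for i in range(len(x)):
    side = [x[j] for j in range(len(x)) if j != i]  # all bits except x[i]
    assert decode(b, side) == x[i]
print("every receiver recovers its bit from one broadcast bit")
```

For sparser side-information graphs the sender needs more parity bits, and the paper's minrank measure pins down exactly how many a linear code requires.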
Near-optimal lower bounds on the multiparty communication complexity of set disjointness
 In IEEE Conference on Computational Complexity
, 2003
"... We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a nearoptimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their ..."
Abstract

Cited by 89 (7 self)
 Add to MetaCart
(Show Context)
We study the communication complexity of the set disjointness problem in the general multiparty model. For t players, each holding a subset of a universe of size n, we establish a near-optimal lower bound of Ω(n/(t log t)) on the communication complexity of the problem of determining whether their sets are disjoint. In the more restrictive one-way communication model, in which the players are required to speak in a predetermined order, we improve our bound to an optimal Ω(n/t). These results improve upon the earlier bounds of Ω(n/t^2) in the general model, and Ω(ε^2 n/t^{1+ε}) in the one-way model, due to Bar-Yossef, Jayram, Kumar, and Sivakumar [5]. As in the case of earlier results, our bounds apply to the unique intersection promise problem. This communication problem is known to have connections with the space complexity of approximating frequency moments in the data stream model. Our results lead to an improved space complexity lower bound of Ω(n^{1−2/k}/log n) for approximating the k-th frequency moment with a constant number of passes over the input, and a technical improvement to Ω(n^{1−2/k}) if only one pass over the input is permitted. Our proofs rely on the information-theoretic direct sum decomposition paradigm of Bar-Yossef et al. [5]. Our improvements stem from novel analytical techniques.
Optimal space lower bounds for all frequency moments
 In SODA
, 2004
"... Abstract We prove that any onepass streaming algorithm which (ffl, ffi)approximates the kth frequency moment Fk, for any real k 6 = 1 and any ffl = \Omega i 1pm j, must use \Omega \Gamma 1ffl2 \Delta bits of space, where m is the size of the universe. This is optimal in terms of ffl, resolves the ..."
Abstract

Cited by 81 (14 self)
 Add to MetaCart
(Show Context)
Abstract We prove that any one-pass streaming algorithm which (ε, δ)-approximates the k-th frequency moment Fk, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε^2) bits of space, where m is the size of the universe. This is optimal in terms of ε, resolves the open questions of Bar-Yossef et al. in [3, 4], and extends the Ω(1/ε^2) lower bound for F0 in [11] to much smaller ε by applying novel techniques. Along the way we lower bound the one-way communication complexity of approximating the Hamming distance and the number of bipartite graphs with minimum/maximum degree constraints.

1 Introduction
Computing statistics on massive data sets is increasingly important these days. Advances in communication and storage technology enable large bodies of raw data to be generated daily, and consequently, there is a rising demand to process this data efficiently. Since it is impractical for an algorithm to store even a small fraction of the data stream, its performance is typically measured by the amount of space it uses. In many scenarios, such as internet routing, once a stream element is examined it is lost forever unless explicitly saved by the processing algorithm. This, along with the sheer size of the data, makes multiple passes over the data infeasible. In this paper we restrict our attention to one-pass streaming algorithms and we investigate their space complexity. Let a =
Streaming and sublinear approximation of entropy and information distances
 In ACMSIAM Symposium on Discrete Algorithms
, 2006
"... In most algorithmic applications which compare two distributions, information theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear time property testing algorithms for entropy and various information theoretic distances. Batu et al posed the pr ..."
Abstract

Cited by 69 (13 self)
 Add to MetaCart
(Show Context)
In most algorithmic applications which compare two distributions, information-theoretic distances are more natural than standard ℓp norms. In this paper we design streaming and sublinear-time property testing algorithms for entropy and various information-theoretic distances. Batu et al. posed the problem of property testing with respect to the Jensen-Shannon distance. We present optimal algorithms for estimating bounded, symmetric f-divergences (including the Jensen-Shannon divergence and the Hellinger distance) between distributions in various property testing frameworks. Along the way, we close a (log n)/H gap between the upper and lower bounds for estimating entropy H, yielding an optimal algorithm over all values of the entropy. In a data stream setting (sublinear space), we give the first algorithm for estimating the entropy of a distribution. Our algorithm runs in polylogarithmic space and yields an asymptotic constant factor approximation scheme. An integral part of the algorithm is an interesting use of an F0 (the number of distinct elements in a set) estimation algorithm; we also provide other results along the space/time/approximation tradeoff curve. Our results have interesting structural implications that connect sublinear-time and space-constrained algorithms. The mediating model is the random-order streaming model, which assumes the input is a random permutation of a multiset and was first considered by Munro and Paterson in 1980. We show that any property testing algorithm in the combined oracle model for calculating a permutation-invariant function can be simulated in the random-order model in a single pass. This addresses a question raised by Feigenbaum et al. regarding the relationship between property testing and stream algorithms. Further, we give a polylog-space PTAS for estimating the entropy of a one-pass random-order stream. This bound cannot be achieved in the combined oracle (generalized property testing) model.
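The F0 (distinct elements) subroutine the abstract mentions can be any standard small-space sketch. As an illustration of what such a subroutine looks like (not the paper's own construction), here is a k-minimum-values estimator, which keeps only the k smallest hash values seen:

```python
import hashlib

def hash01(item):
    # map an item to a pseudo-uniform value in (0, 1) via SHA-1
    digest = hashlib.sha1(str(item).encode()).hexdigest()
    return int(digest, 16) / 16**40

def kmv_distinct(stream, k=64):
    """K-minimum-values F0 sketch: the k-th smallest of F0 uniform hashes
    is about k/F0, so F0 is estimated as (k - 1) / (k-th smallest hash)."""
    mins = set()
    for item in stream:
        mins.add(hash01(item))          # duplicates collapse automatically
        if len(mins) > k:
            mins.remove(max(mins))      # keep only the k smallest hashes
    if len(mins) < k:
        return len(mins)                # fewer than k distinct items: exact
    return (k - 1) / max(mins)

stream = [i % 500 for i in range(10_000)]        # 500 distinct items
print(round(kmv_distinct(stream, k=64)))         # roughly 500
```

The relative error is about 1/√k, and the space is O(k) hash values regardless of stream length, which is the sublinear-space regime the abstract works in.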
Quantum and Classical Strong Direct Product Theorems and Optimal Time-Space Tradeoffs
 SIAM Journal on Computing
, 2004
"... A strong direct product theorem says that if we want to compute k independent instances of a function, using less than k times the resources needed for one instance, then our overall success probability will be exponentially small in k. We establish such theorems for the classical as well as quantum ..."
Abstract

Cited by 66 (12 self)
 Add to MetaCart
A strong direct product theorem says that if we want to compute k independent instances of a function, using less than k times the resources needed for one instance, then our overall success probability will be exponentially small in k. We establish such theorems for the classical as well as quantum query complexity of the OR function. This implies slightly weaker direct product results for all total functions. We prove a similar result for quantum communication protocols computing k instances of the Disjointness function. Our direct product theorems...
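The "exponentially small in k" rate the theorem targets is the same decay one gets from k independent attempts, each succeeding with some probability p; the theorem's content is that a resource-starved algorithm cannot beat this. A quick numeric illustration:

```python
# With success probability p = 2/3 per instance, k independent instances
# all succeed with probability p**k, which a strong direct product theorem
# shows is (roughly) the best possible given less than k times the resources.
p, k = 2 / 3, 20
print(f"success on all {k} instances: {p**k:.2e}")  # about 3e-4
```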
How to Compress Interactive Communication
, 2009
"... We describe new ways to simulate 2party communication protocols to get protocols with potentially smaller communication. We show that every communication protocol that communicates C bits and reveals I bits of information to the participating parties can be simulated by a new protocol involving at ..."
Abstract

Cited by 53 (8 self)
 Add to MetaCart
We describe new ways to simulate 2-party communication protocols to get protocols with potentially smaller communication. We show that every communication protocol that communicates C bits and reveals I bits of information to the participating parties can be simulated by a new protocol involving at most Õ(√(CI)) bits of communication. In the case that the parties have inputs that are independent of each other, we get much better results, showing how to carry out the simulation with Õ(I) bits of communication. These results lead to a direct sum theorem for randomized communication complexity. Ignoring polylogarithmic factors, we show that for worst-case computation, computing n copies of a function requires √n times the communication required for computing one copy of the function. For average-case complexity, given any distribution µ on inputs, computing n copies of the function on n independent inputs sampled according to µ requires √n times the communication for computing one copy. If µ is a product distribution, computing n copies on n independent inputs sampled according to µ requires n times the communication required for computing the function. We also study the complexity of computing the sum (or parity) of n evaluations of f,
Simpler algorithm for estimating frequency moments of data streams
 In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms
, 2006
"... The problem of estimating the kth frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for e ..."
Abstract

Cited by 45 (4 self)
 Add to MetaCart
The problem of estimating the k-th frequency moment Fk over a data stream by looking at the items exactly once as they arrive was posed in [1, 2]. A succession of algorithms have been proposed for this problem [1, 2, 6, 8, 7]. Recently, Indyk and Woodruff [11] have presented the first algorithm for estimating Fk, for k > 2, using space Õ(n^{1−2/k}), matching the space lower bound (up to polylogarithmic factors) for this problem [1, 2, 3, 4, 13] (n is the number of distinct items occurring in the stream). In this paper, we present a simpler one-pass algorithm for estimating Fk.
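The basic sampling estimator behind this line of work (in the style of Alon, Matias, and Szegedy; our simplified rendering, not the paper's new algorithm) picks a random stream position, counts how often that item recurs from there on, and rescales. The code below stores the stream only for clarity; a true one-pass version counts the recurrences on the fly:

```python
import random

def ams_fk_estimate(stream, k, trials=1000):
    """Sampling estimator for the k-th frequency moment F_k = sum_i f_i^k.
    Each trial: pick a uniform position p, let r = occurrences of stream[p]
    from position p onward, output m * (r**k - (r-1)**k).  A single trial is
    an unbiased estimate of F_k; averaging many trials reduces variance."""
    m = len(stream)
    total = 0.0
    for _ in range(trials):
        p = random.randrange(m)
        r = stream[p:].count(stream[p])   # recurrences from p to the end
        total += m * (r**k - (r - 1)**k)
    return total / trials

stream = [1] * 4 + [2] * 2 + [3]          # F_2 = 4^2 + 2^2 + 1^2 = 21
print(ams_fk_estimate(stream, k=2, trials=20000))  # concentrates near 21
```

Unbiasedness follows because summing m·(r^k − (r−1)^k) over all positions of a given item telescopes to m·f_i^k, and the uniform choice of position divides out the m.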
Distributed verification and hardness of distributed approximation
 CoRR
"... We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some properties, e.g., if it is a tree or if it is connected (every node knows in t ..."
Abstract

Cited by 45 (13 self)
 Add to MetaCart
(Show Context)
We study the verification problem in distributed networks, stated as follows. Let H be a subgraph of a network G where each vertex of G knows which edges incident on it are in H. We would like to verify whether H has some property, e.g., whether it is a tree or whether it is connected (every node knows at the end of the process whether H has the specified property or not). We would like to perform this verification in a decentralized fashion via a distributed algorithm. The time complexity of verification is measured as the number of rounds of distributed communication. In this paper we initiate a systematic study of distributed verification, and give almost tight lower bounds on the running time of distributed verification algorithms for many problems.
A full version of this paper is available as [5] at
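As a toy illustration of the model (not one of the paper's algorithms), a centralized simulation of synchronous flooding shows both what connectivity verification of H means and how round complexity is counted: a marker spreads one hop per round along H's edges, and H spans the network iff every node is eventually reached.

```python
def simulate_flood_rounds(n, h_edges, root=0):
    """Toy synchronous simulation: flood a marker from `root` along the
    edges of subgraph H, one hop per round.  Returns (reached_everyone,
    number_of_rounds); the round count is the root's eccentricity in H."""
    adj = {v: set() for v in range(n)}
    for u, v in h_edges:
        adj[u].add(v)
        adj[v].add(u)
    reached = {root}
    frontier = {root}
    rounds = 0
    while frontier:
        # each round, every marked node forwards the marker to its neighbors
        frontier = {w for v in frontier for w in adj[v]} - reached
        reached |= frontier
        if frontier:
            rounds += 1
    return len(reached) == n, rounds

# path 0-1-2-3 plus an isolated node 4: H does not connect the network
ok, rounds = simulate_flood_rounds(5, [(0, 1), (1, 2), (2, 3)])
print(ok, rounds)  # False, 3
```

Real distributed verification charges for congestion and message size as well as rounds, which is where the paper's lower bounds bite; this sketch only counts rounds.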
Information Equals Amortized Communication
, 2010
"... We show how to efficiently simulate the sending of a message M to a receiver who has partial information about the message, sothat the expected number of bits communicated in the simulationis closeto the amount ofadditionalinformationthatthemessagerevealstothereceiver. Thisisageneralizationandstreng ..."
Abstract

Cited by 38 (6 self)
 Add to MetaCart
(Show Context)
We show how to efficiently simulate the sending of a message M to a receiver who has partial information about the message, sothat the expected number of bits communicated in the simulationis closeto the amount ofadditionalinformationthatthemessagerevealstothereceiver. Thisisageneralizationandstrengtheningof the SlepianWolftheorem, which showshow to carryout such a simulation with low amortized communication in the case that M is a deterministic function of X. A caveat is that our simulation is interactive. As a consequence, we obtain new relationships between the randomized amortized communication complexity of a function, and its information complexity. We prove that for any given distribution on inputs, the internal information cost (namely the information revealed to the parties) involved in computing any relation or function using a two party interactive protocol is exactly equal to the amortized communication complexity of computing independent copies of the same relation or function. Here by amortized communication complexity we mean the average per copy communication in the best protocol for computing multiple copies, with a bound on the error in each copy. This significantly simplifies the relationships between the various measures of complexity for average case communication protocols, and proves that if a function’s information cost is smaller than its communication complexity, then multiple copies of the function can be computed more