### Citations

Then, in iterations polynomial in B, we can find a B-term representation R such that $\|A - R\| \le \sqrt{1 + \frac{2\mu B^2}{(1 - 2\mu B)^2}}\, \|A - R_{\mathrm{opt}}\|$. The concept of coherence has been generalized recently in [208], and made more widely applicable. Further in [108, 110], authors used approximate nearest neighbor algorithms to implement the iterations in Theorem 25 efficiently, and proved that approximate implementation works.

The problem of estimating m is called the missing mass problem. In a classical work by Good (attributed to Turing too) [35], it is shown that m is estimated by s[1]/f, provably with small bias; recall that our rarity fl is closely related to s[1]/f. Hence, our result here on estimating rarity in data streams is of independent interest in the context of estimating missing mass.
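As a rough illustration of the s[1]/f estimate mentioned above, here is a minimal sketch. It assumes that s[1] denotes the number of species observed exactly once and f the total number of observations; the function name `good_turing_missing_mass` is hypothetical and does not come from the cited work. It stores full counts, so it is an offline baseline rather than a streaming algorithm.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Estimate the probability mass of unseen species.

    Assumption: s[1] is the number of species seen exactly once and
    f is the total number of samples, so the estimate is s[1] / f.
    """
    total = len(samples)
    if total == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    # Species that occur exactly once in the sample.
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / total
```

For example, in the sample `["a", "b", "a", "c"]` two of the four observations are singletons, so the estimate is 0.5.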

Yao's "two millionaires" problem [217] is related: Paul and Carole each have a secret number, and the problem is to determine whose secret is larger without revealing either secret. These problems show the challenge in the emerging field of privacy preserving data mining.

A paper that seems to have escaped the attention of approximation theory researchers is [63] which proves the general problem to be NP-Hard. This was reproved in [60]. In addition, [63] contained the following very nice result. Say to obtain a representation with error ε one needs B(ε) terms.

No deterministic algorithms are known for the factoring problem, but there are randomized algorithms that take roughly O(k² log n) bits and time [214]. The elementary symmetric polynomial approach above comes from [169], where the authors solve the set reconciliation problem in the communication complexity model. The subset reconciliation problem is related to our puzzle.

We in the computer science community have traditionally focused on scaling in size: how to efficiently manipulate large disk-bound data via suitable data structures [213], how to scale to databases of petabytes [114], synthesize massive data sets [115], etc. However, far less attention has been given to benchmarking, studying performance of systems under rapid updates with near-real time analyses.

The k-means algorithm on the data stream [100] can be seen as a tree method: building clusters on points, building higher level clusters on their representatives, and so on up the tree. Finally, I will speculate that Yair Bartal's fundamental results on tree embeddings may have applications in data streams.

This technique has been used in one dimensional nearest neighbor problems and facility location [46], maintaining statistics within a window [36], and from a certain perspective, for estimating the number of distinct items [34]. It is a simple and natural strategy which is likely to get used seamlessly in data stream algorithms.

However, hardly any graph problem has been studied in the data stream model where (poly)log space requirement comes with other constraints. In [42], authors studied the problem of counting the number of triangles in the cash register model. Graph G = (V, E) is presented as a series of edges (u, v) ∈ E in no particular order. The problem is to estimate the number of triangles in the graph.
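To make the problem statement concrete, the sketch below counts triangles exactly while reading edges one at a time in arbitrary order. It is only a baseline under stated assumptions: it keeps the whole adjacency structure in memory, so it does not meet the (poly)log space requirement and is not the algorithm of [42]. The function name `count_triangles` is hypothetical.

```python
from collections import defaultdict


def count_triangles(edge_stream):
    """Exact triangle count over a stream of undirected edges (u, v).

    Baseline only: uses space linear in the number of edges, unlike a
    true data stream algorithm.
    """
    adjacency = defaultdict(set)
    triangles = 0
    for u, v in edge_stream:
        # Ignore self-loops and repeated edges.
        if u == v or v in adjacency[u]:
            continue
        # Each common neighbour of u and v closes one new triangle.
        triangles += len(adjacency[u] & adjacency[v])
        adjacency[u].add(v)
        adjacency[v].add(u)
    return triangles
```

Each triangle is counted exactly once, at the moment its last edge arrives.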

In [75], authors proposed efficient approximation algorithms for a variety of two dimensional histograms for a static signal. Some preliminary results were presented in [76] for the streaming case: specifically, the authors proposed a polylog space, 1+ε approximation algorithm using O(B log N) partitions, taking Ω(N²) time. Using the ideas in [75] and robustness arguments, better algorithms are possible.

The elementary symmetric polynomial approach comes from [24], where the authors solve the set reconciliation problem in the communication complexity model. The subset reconciliation problem is related to our puzzle. Readers may have guessed that there may be a direct connection between the two problems.

Building a parse tree atop the Time Series data stream seen as a string [81]. This has applications to estimating string edit distances as well as estimating size of the smallest grammar to encode the string. Here is a problem of similar ilk, but it needs new ideas.

Functional approximation theory has in general focused on characterizing the class of functions for which error has a certain decay as N → ∞. See [62] and [61] for many such problems. But from an algorithmicist's point of view, the nature of the problems I discussed above is clearly more appealing. This is a wonderful area for new algorithmic research; a start has been made.

The problem of estimating the number of inversions in a permutation was studied in [33]. Here is an outline of a simple algorithm to estimate the number of inversions [31]. Let A_t be the indicator array of the items seen before the t-th item, and let I_t be the number of inversions so far.
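The role of the indicator array can be seen in the following exact baseline. It assumes the stream is a permutation of 0..n-1 and keeps the full indicator array, so it uses linear space; it is not the estimator outlined in [31], only a sketch of how the inversion count grows as each item arrives. The name `count_inversions` is hypothetical.

```python
def count_inversions(permutation):
    """Exact inversion count for a permutation of 0..n-1.

    seen[x] is 1 once item x has arrived (the indicator array); when x
    arrives, every previously seen item larger than x forms one new
    inversion with it.
    """
    n = len(permutation)
    seen = [0] * n
    inversions = 0
    for x in permutation:
        # Number of already-seen items greater than x.
        inversions += sum(seen[x + 1:])
        seen[x] = 1
    return inversions
```

The inner sum costs O(n) per item here; a Fenwick tree would reduce that to O(log n), still with linear space.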

This has been explored previously in the networking context for a variety of per-packet processing tasks (see, e.g., [5]), but more needs to be done. There is commercial potential in such hardware machines.

A similar result has been proved in [51] using appropriate sampling for a fixed A, and recent progress on a similar problem using a few passes is in [50], but there are no results in the Turnstile Model. A lot of interesting technical issues arise in this problem.

In IP traffic, few flows send a large fraction of the traffic [209]. That is, of the 2^64 possible (src,dest) IP flows, if one is interested in heavy hitters, one is usually focused on a small number (few hundreds?) of flows. This means that one is typically interested in a sparse representation of the traffic matrix.

Even benchmarks of database transactions [115] are inadequate. There are ways to build workable systems around these TCS challenges. TCS systems are sophisticated and have developed high-level principles that still apply. Make things parallel. A lot of the data stream algorithms are inherently parallel.

Besides being Art, ambient information displays like the ones above are typically seen as Calming Technology [2]; they are also an attempt to transcode streaming data into a processible multi-sensory flow.

8.2 Short Data Stream History

Data stream algorithms as an active research agenda has emerged only over the past few years.

Honestly, the fishing motif is silly: the total number of fish species in the sea is estimated to be roughly 22000 and anyone can afford an array of as many bits. In the reality of data streams which we will discuss, the number of distinct items can be in the billions, and the problem becomes interesting.
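The "array of as many bits" remark corresponds to the trivial exact solution sketched below. It assumes items are integers in the range 0..universe_size-1 (for instance, species numbered 0..21999); the function name `count_distinct` is hypothetical. The space is one bit per possible item, which is exactly what becomes unaffordable when the universe runs into the billions.

```python
def count_distinct(stream, universe_size):
    """Exact distinct count using one bit per possible item.

    Assumption: every item is an integer in range(universe_size).
    """
    bits = bytearray((universe_size + 7) // 8)
    distinct = 0
    for item in stream:
        byte_index, mask = item >> 3, 1 << (item & 7)
        # Count the item only the first time its bit is set.
        if not bits[byte_index] & mask:
            bits[byte_index] |= mask
            distinct += 1
    return distinct
```

With 22000 species the bit array occupies 2750 bytes.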

Then diameter can be estimated from the arcs given by these points. One gets an ε-approximation to the diameter with O(1/ε) space and O(log(1/ε)) compute time per inserted point [45]. I know of other results in progress, so more computational geometry problems will get solved in the data stream model in the near future.

Peter Winkler gives an interesting talk on the result in [53] which is a delightful read. Paul and Carole each have a secret name in mind, and the problem is for them to determine if their secrets are the same. If not, neither should learn the other's secret.

This is by now a well researched topic with positive results in very general settings [56]. However, these protocols have high complexity. But there is a demand for efficient solutions, perhaps with provable approximations, in practice. In [55] authors formalized the notion of approximate privacy preserving protocols.

The Computer Science community has traditionally focused on scaling with respect to size: how to efficiently manipulate large disk-bound data via suitable data structures [15], how to scale to databases of petabytes [106], synthesize massive datasets [7], etc. However, far less attention has been given to benchmarking, studying performance of systems under rapid updates with near-real time analyses.

The Internet is a general purpose network system that has distributed both the data sources as well as the data consumers over millions of users. It has scaled up the rate of transactions tremendously, with continuous large scale astronomical surveys in optical, infrared and radio wavelengths [117], atmospheric radiation measurements [108] etc.

Other dedicated network systems now provide massive data streams: satellite based, high resolution measurement of earth geodetics [118, 113], radar derived meteorological data [119], continuous large scale astronomical surveys in optical, infrared and radio wavelengths [117], atmospheric radiation measurements [108] etc. The Internet is a general purpose network system that has distributed both the data sources as well as the data consumers over millions of users.

