Results 1–10 of 120
Data Streams: Algorithms and Applications, 2005
Cited by 533 (22 self)
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or a few passes over the data, space less than linear in the input size, or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].
Model-Driven Data Acquisition in Sensor Networks
In VLDB, 2004
Cited by 449 (36 self)
Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that "the sensornet is a database" is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a model of that reality is required to complement the readings. In this paper, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential-time algorithm for finding the optimal solution to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice. We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques.
Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs
Cited by 149 (5 self)
We introduce reductions in the streaming model as a tool in the design of streaming algorithms. We develop …
Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs, 2004
Tracking join and self-join sizes in limited storage, 2002
Cited by 123 (0 self)
This paper presents algorithms for tracking (approximate) join and self-join sizes in limited storage, in the presence of insertions and deletions to the data set(s). Such algorithms detect changes in join and self-join sizes without an expensive recomputation from the base data, and without the large space overhead required to maintain such sizes exactly. Query optimizers rely on fast, high-quality estimates of join sizes in order to select between various join plans, and estimates of self-join sizes are used to indicate the degree of skew in the data. For self-joins, we consider two approaches proposed in [Alon, Matias, and Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS, vol. 58, 1999, pp. 137–147], which we denote tug-of-war and sample-count. We present fast algorithms for implementing these approaches, and extensions to handle deletions as well as insertions. We also report on the first experimental study of the two approaches, on a range of synthetic and real-world data sets. Our study shows that tug-of-war provides more accurate estimates for a given storage limit than sample-count, which in turn is far more accurate than a standard sampling-based approach. For example, tug-of-war needed only 4–256 memory words, depending on the data set, in order to estimate the self-join size …
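The tug-of-war estimator mentioned in the abstract is simple enough to sketch. Below is a minimal Python illustration of the idea (class and function names are mine, and the cached random signs stand in for the 4-wise independent hash family the analysis actually requires): each counter z_k is a signed sum of item counts, so z_k² is an unbiased estimate of the self-join size F2 = Σ c_i², and a deletion is just an update with delta = -1.

```python
import random
from statistics import median

def make_sign_hash(seed):
    """Random ±1 sign per item, cached; a stand-in for the 4-wise
    independent hash family the AMS analysis assumes."""
    rnd = random.Random(seed)
    signs = {}
    def sign(item):
        if item not in signs:
            signs[item] = rnd.choice((-1, 1))
        return signs[item]
    return sign

class TugOfWar:
    """Tug-of-war (AMS) sketch for the self-join size F2 = sum_i c_i^2."""
    def __init__(self, copies=9, seed=0):
        self.hashes = [make_sign_hash(seed + k) for k in range(copies)]
        self.z = [0] * copies                 # z_k = sum_i sign_k(i) * c_i
    def update(self, item, delta=1):
        # delta = -1 handles a deletion; nothing else changes
        for k, sign in enumerate(self.hashes):
            self.z[k] += sign(item) * delta
    def estimate(self):
        return median(z * z for z in self.z)  # each z_k^2 is unbiased for F2
```

Because a stream with a single distinct item of count c gives z_k = ±c in every copy, the estimate is exact in that case; for general streams one takes medians of means over many independent copies to control the variance.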
How to Summarize the Universe: Dynamic Maintenance of Quantiles
In VLDB, 2002
Cited by 112 (15 self)
Order statistics, i.e., quantiles, are frequently used in databases both at the database server and at the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data mining.
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
In SIGMOD, 2005
Cited by 103 (24 self)
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting; our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., "heavy hitters" queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
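The "local tracking plus prediction" recipe the abstract describes can be conveyed with a toy example. The sketch below is not the paper's quantile algorithm, only the underlying local-filter idea, with illustrative names and the simplest possible prediction model ("each site stays where it last reported"): a site contacts the coordinator only when its local count drifts more than a threshold theta from its last report, so the coordinator's view is always correct to within theta per site.

```python
class Coordinator:
    """Keeps the last value each site shipped; its view is correct
    to within each site's local threshold."""
    def __init__(self):
        self.last = {}
    def receive(self, site_id, value):
        self.last[site_id] = value
    def estimate_total(self):
        return sum(self.last.values())

class Site:
    """Ships its local count only when it drifts more than `theta`
    from the last shipped value."""
    def __init__(self, coordinator, site_id, theta):
        self.coord, self.id, self.theta = coordinator, site_id, theta
        self.count = 0
        self.shipped = 0
        self.messages = 0
        coordinator.receive(site_id, 0)   # register the initial state
    def observe(self, delta=1):
        self.count += delta
        if abs(self.count - self.shipped) > self.theta:
            self.coord.receive(self.id, self.count)
            self.shipped = self.count
            self.messages += 1
```

With theta = 5 and 20 unit updates at one site, only three messages are sent and the coordinator's total stays within theta of the truth; richer prediction models (e.g., assuming linear growth at each site) cut communication further at the same accuracy, which is the tradeoff the paper's experiments explore.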
Comparing data streams using Hamming norms (how to zero in), 2003
Cited by 81 (7 self)
Massive data streams are now fundamental to many data processing applications. For example, Internet routers produce large scale diagnostic data streams. Such streams are rarely stored in traditional databases and instead must be processed "on the fly" as they are produced. Similarly, sensor networks produce multiple data streams of observations from their sensors. There is growing focus on manipulating data streams and, hence, there is a need to identify basic operations of interest in managing data streams, and to support them efficiently. We propose computation of the Hamming norm as a basic operation of interest. The Hamming norm formalizes ideas that are used throughout data processing. When applied to a single stream, the Hamming norm gives the number of distinct items that are present in that data stream, which is a statistic of great interest in databases. When applied to a pair of streams, the Hamming norm gives an important measure of (dis)similarity: the number of unequal item counts in the two streams. Hamming norms have many uses in comparing data streams. We present a novel approximation technique for estimating the Hamming norm for massive data streams; this relies on what we call the "l0 sketch" and we prove its accuracy. We test our approximation method on a large quantity of synthetic and real stream data, and show that the estimation is accurate to within a few percentage points.
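Concretely, the Hamming norm counts the items at which two count vectors disagree, and with a single stream it reduces to the number of distinct items. The exact, memory-heavy baseline below (the function name is mine) computes the quantity that the paper's small-space l0 sketch approximates:

```python
from collections import Counter

def hamming_norm(stream_a, stream_b=()):
    """Number of items whose counts differ between the two streams.
    With one stream this is the number of distinct items.
    Exact but requires a counter per item, which is what the
    l0 sketch avoids."""
    ca, cb = Counter(stream_a), Counter(stream_b)
    # Counter[k] returns 0 for missing keys, so items present in only
    # one stream are counted as differing.
    return sum(1 for k in ca.keys() | cb.keys() if ca[k] != cb[k])
```

The point of the l0 sketch is to get within a few percentage points of this value without storing a counter for every item.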
Sketching streams through the net: Distributed approximate query tracking
In VLDB, 2005
Cited by 78 (20 self)
While traditional database systems optimize for performance on one-shot query processing, emerging large-scale monitoring applications require continuous tracking of complex data-analysis queries over collections of physically distributed streams. Thus, effective solutions have to be simultaneously space/time efficient (at each remote monitor site), communication efficient (across the underlying communication network), and provide continuous, guaranteed-quality approximate query answers. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking a broad class of complex aggregate queries in such a distributed-streams setting. Our tracking schemes maintain approximate query answers with provable error guarantees, while simultaneously optimizing the storage space and processing time at each remote site, as well as the communication cost across the network. In a nutshell, our algorithms rely on tracking general-purpose randomized sketch summaries of local streams at remote sites along with concise prediction models of local site behavior in order to produce highly communication- and space/time-efficient solutions. The end result is a powerful approximate query tracking framework that readily incorporates several complex analysis queries (including distributed join and multi-join aggregates, and approximate wavelet representations), thus giving the first known low-overhead tracking solution for such queries in the distributed-streams model. Experiments with real data validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
An Optimal Algorithm for the Distinct Elements Problem
Cited by 67 (7 self)
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, ..., n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε⁻² + log n) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
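The optimal algorithm itself is intricate, but the hashing intuition it refines is easy to demonstrate. Below is a simple k-minimum-values (KMV) sketch in Python (illustrative only, not the paper's algorithm): hash each item to a pseudo-uniform value in [0, 1) and keep only the k smallest hashes; if t is the k-th smallest hash seen, the stream has roughly (k − 1)/t distinct items, and with fewer than k distinct items the count is exact.

```python
import hashlib
import heapq

class KMV:
    """k-minimum-values distinct-count sketch (illustrative)."""
    def __init__(self, k=64):
        self.k = k
        self.mins = []      # max-heap (via negation) of the k smallest hashes
        self.seen = set()   # hash values currently stored
    def _h(self, item):
        # deterministic pseudo-uniform hash into [0, 1)
        d = hashlib.sha256(str(item).encode()).digest()
        return int.from_bytes(d[:8], 'big') / 2.0**64
    def add(self, item):
        x = self._h(item)
        if x in self.seen:
            return                      # duplicate of a stored hash
        if len(self.mins) < self.k:
            heapq.heappush(self.mins, -x)
            self.seen.add(x)
        elif x < -self.mins[0]:         # beats the current k-th smallest
            evicted = -heapq.heappushpop(self.mins, -x)
            self.seen.discard(evicted)
            self.seen.add(x)
    def estimate(self):
        if len(self.mins) < self.k:
            return len(self.mins)       # fewer than k distinct seen: exact
        return (self.k - 1) / (-self.mins[0])
```

KMV needs O(k) stored hash values for relative error on the order of 1/√k; the contribution of the paper above is matching the optimal O(ε⁻² + log n) bits with O(1) update time.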