Results 1–10 of 63
Approximate aggregation techniques for sensor databases
In ICDE, 2004
Cited by 300 (6 self)
Abstract:
In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, exemplified by TinyDB and Cougar, which allow users to perform aggregation queries such as MIN, COUNT and AVG on a sensor network. Due to power and range constraints, centralized approaches are generally impractical, so most systems use in-network aggregation to reduce network traffic. Also, aggregation strategies must provide fault-tolerance to address the issues of packet loss and node failures inherent in such a system. An unfortunate consequence of standard methods is that they typically introduce duplicate values, which must be accounted for to compute aggregates correctly. Another consequence of loss in the network is that exact aggregation is not possible in general. With this in mind, we investigate the use of approximate in-network aggregation using small sketches. Our contributions are as follows: 1) we generalize well-known duplicate-insensitive sketches for approximating COUNT to handle SUM (and by extension, AVG and other aggregates), 2) we present and analyze methods for using sketches to produce accurate results with low communication and computation overhead (even on low-powered CPUs with little storage and no floating point operations), and 3) we present an extensive experimental validation of our methods.
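The core idea in this abstract — a duplicate-insensitive COUNT sketch generalized to SUM — can be illustrated with a minimal Flajolet-Martin-style bitmap. This is my own simplified sketch, not the paper's exact construction; the class names and the trick of expanding a value into distinct sub-items are assumptions based on the abstract's description.

```python
import hashlib

class FMSketch:
    """Flajolet-Martin bitmap: duplicate-insensitive distinct counting."""
    PHI = 0.77351  # standard FM correction constant

    def __init__(self):
        self.bits = 0

    @staticmethod
    def _rho(item):
        # Position of the lowest set bit of a uniform hash (geometric distribution).
        h = int.from_bytes(hashlib.md5(repr(item).encode()).digest()[:8], "big")
        r = 0
        while h & 1 == 0 and r < 63:
            h >>= 1
            r += 1
        return r

    def insert(self, item):
        self.bits |= 1 << self._rho(item)

    def merge(self, other):
        # Bitwise OR is order- and duplicate-insensitive, so re-sent or
        # multi-path packets introduce no bias.
        self.bits |= other.bits

    def estimate(self):
        r = 0
        while self.bits >> r & 1:
            r += 1
        return (2 ** r) / self.PHI

def insert_sum(sketch, node_id, value):
    # SUM encoded as a set of distinct sub-items (node_id, 1..value),
    # so SUM inherits the sketch's duplicate-insensitivity.
    for i in range(value):
        sketch.insert((node_id, i))
```

Because merging is an OR, any aggregation tree (or DAG, with duplicated paths) yields the same sketch, which is exactly why duplicates need no special accounting.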
Streaming Pattern Discovery in Multiple Time-Series
In VLDB, 2005
Cited by 105 (18 self)
Abstract:
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection.
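The incremental hidden-variable tracking this abstract describes can be sketched with a single-component subspace-tracking update. This is a heavily simplified illustration in the spirit of SPIRIT's per-tick rule, not the authors' multi-component algorithm; the function name, initialization, and forgetting factor are my own assumptions.

```python
import math

def track_w(streams, lam=0.96):
    """One-pass tracking of the principal direction of n co-evolving streams.

    streams: list of time ticks, each a length-n vector x_t.
    Returns the unit-length direction estimate w, so y_t = w . x_t is the
    single hidden variable summarizing the collection.
    """
    n = len(streams[0])
    w = [1.0] + [0.0] * (n - 1)   # current direction estimate
    d = 0.01                      # accumulated energy of the hidden variable
    for x in streams:
        y = sum(wi * xi for wi, xi in zip(w, x))    # hidden variable at this tick
        e = [xi - y * wi for wi, xi in zip(w, x)]   # reconstruction error
        d = lam * d + y * y                          # exponentially forgotten energy
        w = [wi + (y / d) * ei for wi, ei in zip(w, e)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]                  # keep w unit-length
    return w
```

On perfectly correlated streams (e.g. the second always twice the first), w converges to the shared direction after a few ticks, which is what lets one hidden variable stand in for many raw streams.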
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
In SIGMOD, 2005
Cited by 99 (22 self)
Abstract:
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space-efficient (at each remote site), communication-efficient (across the underlying communication network), and provide continuous, guaranteed-quality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting — our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., "heavy-hitters" queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees.
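The "local tracking at remote sites" ingredient can be illustrated with the simplest possible filter protocol: each site stays silent while its value remains within a slack of its last report, so the coordinator's view is always within a bounded error. This is a generic sketch of the idea, not the paper's quantile-tracking scheme; class names and the sum aggregate are my own choices.

```python
class Coordinator:
    """Central site: holds the last reported value from each remote site."""
    def __init__(self):
        self.last = {}
        self.messages = 0

    def report(self, sid, v):
        self.last[sid] = v
        self.messages += 1

    def approx_sum(self):
        # Each cached term is within delta of the true site value,
        # so the sum is off by at most (number of sites) * delta.
        return sum(self.last.values())

class Site:
    """Remote site: reports only when its value escapes its filter bound."""
    def __init__(self, coordinator, site_id, delta):
        self.coord, self.sid, self.delta = coordinator, site_id, delta
        self.center = None  # value at the time of the last report

    def update(self, v):
        if self.center is None or abs(v - self.center) > self.delta:
            self.center = v
            self.coord.report(self.sid, v)  # communication happens only here
```

Wider filters mean fewer messages but looser estimates; the paper's contribution is, in effect, doing this with prediction models and with holistic aggregates like quantiles, where per-site slack does not compose as simply as for sums.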
Improving Collection Selection with Overlap Awareness in P2P Search Engines
In SIGIR, 2005
Cited by 66 (22 self)
Abstract:
Collection selection has been a research issue for years. Typically, in related work, precomputed statistics are employed in order to estimate the expected result quality of each collection, and subsequently the collections are ranked accordingly. Our thesis is that this simple approach is insufficient for several applications in which the collections typically overlap. This is the case, for example, for the collections built by autonomous peers crawling the web. We argue for the extension of existing quality measures using estimators of mutual overlap among collections and present experiments in which this combination outperforms CORI, a popular approach based on quality estimation. We outline our prototype implementation of a P2P web search engine, coined MINERVA, that allows handling large amounts of data in a distributed and self-organizing manner. We conduct experiments which show that taking overlap into account during collection selection can drastically decrease the number of collections that have to be contacted in order to reach a satisfactory level of recall, which is a great step toward the feasibility of distributed web search.
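A standard way to estimate mutual overlap between collections from compact statistics is MinHash resemblance estimation. The abstract does not name the estimator, so this is an illustrative assumption: two collections exchange small signatures of their document-id sets and estimate their Jaccard overlap from signature agreement.

```python
import hashlib

def minhash_signature(doc_ids, k=200):
    """k-hash MinHash signature of a collection's document-id set."""
    def h(seed, doc):
        # Deterministic 64-bit hash, seeded per position.
        return int.from_bytes(hashlib.md5(f"{seed}:{doc}".encode()).digest()[:8], "big")
    return [min(h(seed, d) for d in doc_ids) for seed in range(k)]

def estimated_overlap(sig_a, sig_b):
    """Estimate of the Jaccard resemblance |A ∩ B| / |A ∪ B|.

    Each position agrees with probability exactly equal to the Jaccard
    coefficient, so the fraction of agreeing positions is an unbiased estimate.
    """
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A peer can then discount a collection whose estimated overlap with already-selected collections is high, which is the intuition behind combining overlap estimates with quality scores like CORI's.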
Maintaining Sliding Window Skylines on Data Streams
In IEEE Transactions on Knowledge and Data Engineering, 2006
Cited by 55 (7 self)
Abstract:
The skyline of a multidimensional data set contains the "best" tuples according to any preference function that is monotonic on each dimension. Although skyline computation has received considerable attention in conventional databases, the existing algorithms are inapplicable to stream applications because 1) they assume static data that are stored on disk (rather than continuously arriving/expiring), 2) they focus on "one-time" execution that returns a single skyline (in contrast to constantly tracking skyline changes), and 3) they aim at reducing the I/O overhead (as opposed to minimizing the CPU cost and main-memory consumption). This paper studies skyline computation in stream environments, where query processing takes into account only a "sliding window" covering the most recent tuples. We propose algorithms that continuously monitor the incoming data and maintain the skyline incrementally. Our techniques utilize several interesting properties of stream skylines to improve space/time efficiency by expunging data from the system as early as possible (i.e., before their expiration). Furthermore, we analyze the asymptotic performance of the proposed solutions, and evaluate their efficiency with extensive experiments. Index Terms—Skyline, stream, database, algorithm.
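The early-expunging property is easy to demonstrate: a stored tuple dominated by a newer arrival can never rejoin the skyline (the dominator outlives it), so it can be dropped before its window expiration. The sketch below is a naive illustration of this invariant, not the paper's optimized algorithms; the class and its minimize-every-dimension convention are my own assumptions.

```python
class WindowSkyline:
    """Skyline (minimize every dimension) over the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.buf = []  # list of (timestamp, point), oldest first

    @staticmethod
    def dominates(p, q):
        # p dominates q if p is no worse in every dimension and differs somewhere.
        return all(a <= b for a, b in zip(p, q)) and p != q

    def insert(self, t, p):
        self.buf = [(ts, q) for ts, q in self.buf
                    if t - ts < self.window          # still inside the window
                    and not self.dominates(p, q)]    # early expunge: q is older
                                                     # and dominated, so it can
                                                     # never re-enter the skyline
        self.buf.append((t, p))

    def skyline(self):
        pts = [q for _, q in self.buf]
        return {p for p in pts if not any(self.dominates(q, p) for q in pts)}
```

Note that dominated-but-newer tuples must be kept: once their dominator expires, they may surface in the skyline, which is why only older dominated tuples are expunged.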
Space efficient mining of multigraph streams
In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS)
Cited by 48 (8 self)
Abstract:
The challenge of monitoring massive amounts of data generated by communication networks has led to the interest in data stream processing. We study streams of edges in massive communication multigraphs, defined by (source, destination) pairs. The goal is to compute properties of the underlying graph while using small space (much smaller than the number of communicants), and to avoid bias introduced because some edges may appear many times, while others are seen only once. We give results for three fundamental problems on multigraph degree sequences: estimating frequency moments of degrees, finding the heavy hitter degrees, and computing range sums of degree values. In all cases we are able to show space bounds for our summarizing algorithms that are significantly smaller than storing complete information. We use a variety of data stream methods: sketches, sampling, hashing and distinct counting, but a common feature is that we use cascaded summaries: nesting multiple estimation techniques within one another. In our experimental study, we see that such summaries are highly effective, enabling massive multigraph streams to be effectively summarized to answer queries of interest with high accuracy using only a small amount of space.
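The "cascaded summary" idea — nesting one estimator inside another — can be sketched by putting a distinct-counting bitmap inside each cell of a Count-Min-style table: the outer hash routes a source to cells, while the inner bitmap counts distinct destinations, so repeated edges add no bias. This is a simplified illustration of the cascading pattern with parameters of my own choosing, not the paper's analyzed construction.

```python
import hashlib

def _h(seed, x):
    # Deterministic 64-bit hash.
    return int.from_bytes(hashlib.md5(f"{seed}|{x}".encode()).digest()[:8], "big")

class DegreeSketch:
    """Count-Min table whose cells are Flajolet-Martin bitmaps.

    The inner bitmap makes each cell a *distinct* counter of (src, dst)
    items, so the degree estimate is insensitive to edge repetitions.
    """

    def __init__(self, rows=4, cols=64):
        self.rows, self.cols = rows, cols
        self.cells = [[0] * cols for _ in range(rows)]

    @staticmethod
    def _rho(x):
        h = _h("rho", x)
        r = 0
        while h & 1 == 0 and r < 63:
            h >>= 1
            r += 1
        return r

    def add_edge(self, src, dst):
        for r in range(self.rows):
            c = _h(r, src) % self.cols
            self.cells[r][c] |= 1 << self._rho((src, dst))

    def degree_estimate(self, src):
        best = None
        for r in range(self.rows):
            bits = self.cells[r][_h(r, src) % self.cols]
            k = 0
            while bits >> k & 1:
                k += 1
            est = (2 ** k) / 0.77351
            best = est if best is None else min(best, est)  # Count-Min: take the min row
        return best
```

Collisions in the outer table only inflate (never deflate) a cell's bitmap, so the minimum over rows is the usual Count-Min bias-limiting step.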
Distributed Set-Expression Cardinality Estimation
2004
Cited by 33 (9 self)
Abstract:
We consider the problem of estimating set-expression cardinality in a distributed streaming environment where rapid update streams originating at remote sites are continually transmitted to a central processing system.
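One classic way to estimate the cardinality of set expressions over distributed streams is coordinated hash-threshold sampling: every site keeps an element iff its hash falls below the same threshold, so the samples can be combined with ordinary set operations and scaled up. The abstract does not specify the paper's technique, so this is an illustrative stand-in.

```python
import hashlib

def _u(x):
    """Deterministic hash of an element to [0, 1)."""
    h = int.from_bytes(hashlib.md5(repr(x).encode()).digest()[:8], "big")
    return h / 2 ** 64

def sample(stream, p):
    """Coordinated sample: keep elements whose hash falls below p.

    Because the decision depends only on the element, every site keeps or
    drops the *same* elements, so unions/intersections/differences of the
    samples mirror those of the full sets.
    """
    return {x for x in stream if _u(x) < p}

def estimate_cardinality(expr_result, p):
    """Scale the sampled expression result back up."""
    return len(expr_result) / p
```

For example, `estimate_cardinality(sample(A, p) & sample(B, p), p)` estimates |A ∩ B|; the same pattern works for any composition of ∪, ∩, and set difference.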
Coresets in Dynamic Geometric Data Streams
2005
Cited by 32 (4 self)
Abstract:
A dynamic geometric data stream consists of a sequence of m insert/delete operations of points from the discrete space {1, ..., ∆}^d [26]. We develop streaming (1 + ɛ)-approximation algorithms for k-median, k-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that approximates with probability 2/3 the current point set with respect to the considered problem during the m insert/delete operations of the data stream. They use poly(ɛ⁻¹, log m, log ∆) space and update time per insert/delete operation for constant k and dimension d. Having a coreset one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible as its running time may still be polynomial in n. For example one can compute in poly(log n, exp(O((1 + log(1/ɛ)/ɛ)^(d−1)))) time a solution to k-median and k-means [21] where n is the size of the current point set and k and d are constants. Finding an implicit solution to MaxCut can be done in poly(log n, exp((1/ɛ)^O(1))) time. For MaxST and average distance we require poly(log n, ɛ⁻¹) time and for MaxWM we require O(n³) time to do this.
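The coreset concept itself is simple to demonstrate in a static setting: snap points to a grid and keep one weighted representative per non-empty cell. Each point moves by at most half a cell diagonal, so clustering-type costs are preserved up to a term controlled by the cell size. This is only a static illustration of the concept; the paper's contribution is maintaining such summaries under inserts and deletes in small space.

```python
from collections import Counter

def grid_coreset(points, cell):
    """Weighted grid coreset of a static d-dimensional point set.

    Each point is snapped to the grid cell of side `cell` containing it;
    the representative of a cell is its center, weighted by the number of
    points that landed there. Every point moves by at most cell*sqrt(d)/2.
    """
    cells = Counter()
    for p in points:
        key = tuple(int(c // cell) for c in p)
        cells[key] += 1
    return [(tuple((k + 0.5) * cell for k in key), w)
            for key, w in cells.items()]
```

Any weighted clustering algorithm can then run on the (much smaller) coreset; choosing `cell` proportional to ɛ times the relevant scale gives the (1 + ɛ) trade-off between coreset size and accuracy.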
Proof sketches: Verifiable in-network aggregation
In IEEE International Conference on Data Engineering (ICDE), 2007
Cited by 31 (6 self)
Abstract:
Recent work on distributed, in-network aggregation assumes a benign population of participants. Unfortunately, modern distributed systems are plagued by malicious participants. In this paper we present a first step towards verifiable yet efficient distributed, in-network aggregation in adversarial settings. We describe a general framework and threat model for the problem and then present proof sketches, a compact verification mechanism that combines cryptographic signatures and Flajolet-Martin sketches to guarantee acceptable aggregation error bounds with high probability. We derive proof sketches for count aggregates and extend them for random sampling, which can be used to provide verifiable approximations for a broad class of data-analysis queries, e.g., quantiles and heavy hitters. Finally, we evaluate the practical use of proof sketches, and observe that adversaries can often be reduced to much smaller violations in practice than our worst-case bounds suggest.
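The combination of authentication with Flajolet-Martin sketches can be sketched as follows: a node's contribution to the count is a bit position derived from a hash of its identity, plus an unforgeable tag over that (identity, position) pair, so an aggregator cannot inflate the estimate by asserting high bits it has no witness for. This illustration uses a shared-key HMAC as a stand-in for the paper's cryptographic signatures; the key and helper names are assumptions.

```python
import hashlib
import hmac

KEY = b"shared-verification-key"  # placeholder; the paper uses public-key signatures

def rho(node_id):
    """FM bit position a node may legitimately claim (lowest set bit of its hash)."""
    h = int.from_bytes(hashlib.md5(node_id.encode()).digest()[:8], "big")
    r = 0
    while h & 1 == 0 and r < 63:
        h >>= 1
        r += 1
    return r

def witness(node_id):
    """A node's contribution: its bit position plus an authentication tag."""
    pos = rho(node_id)
    tag = hmac.new(KEY, f"{node_id}:{pos}".encode(), hashlib.sha256).digest()
    return node_id, pos, tag

def verify(node_id, pos, tag):
    """Accept a claimed bit only if it is the position the node's hash dictates
    and the tag authenticates it. A count claim backed by verified bits cannot
    be inflated by an adversary without forging a tag."""
    expect = hmac.new(KEY, f"{node_id}:{pos}".encode(), hashlib.sha256).digest()
    return pos == rho(node_id) and hmac.compare_digest(tag, expect)
```

The verifier only needs one valid witness per set bit, so the proof stays as compact as the sketch itself while bounding how far an adversary can shift the estimate.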
Adaptive stream filters for entity-based queries with non-value tolerance
In VLDB, 2005
Cited by 31 (5 self)
Abstract:
We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works focus on value-based queries (e.g., average) with numerical error tolerance. In this paper, we cover entity-based queries (e.g., a nearest-neighbor query returns object names rather than a single value). In particular, we study non-value-based tolerance (e.g., the answer to the nearest-neighbor query should rank third or above). We investigate different non-value-based error tolerance definitions and discuss how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries. Extensive experiments show that our protocols achieve significant savings in both communication overhead and server computation.
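The rank-based tolerance condition can be made concrete with a simple validity check: if each object's distance to the query is only known to lie inside its filter interval, a cached answer still satisfies "rank r or above" as long as fewer than r other objects are *certainly* closer. This check is my own illustration of the tolerance semantics; the paper's protocols additionally decide how to assign and adapt the filter bounds.

```python
def rank_valid(intervals, answer, rank_tol):
    """Check whether a cached nearest-neighbor answer meets a rank tolerance.

    intervals: dict mapping object name -> (lo, hi) filter bound on its
               distance to the query point.
    answer:    name of the currently reported nearest neighbor.
    rank_tol:  the answer is acceptable if its true rank is <= rank_tol.

    An object is certainly closer than the answer only if its entire
    interval lies below the answer's lower bound; while fewer than
    rank_tol objects satisfy that, no site needs to send an update.
    """
    lo_a = intervals[answer][0]
    certainly_closer = sum(1 for o, (lo, hi) in intervals.items()
                           if o != answer and hi < lo_a)
    return certainly_closer < rank_tol
```

The communication saving comes from exactly this slack: objects may drift anywhere within their intervals, and only a movement that creates a new "certainly closer" object (or breaks the answer's own bound) forces a report.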