Results 1–10 of 53
Optimal tracking of distributed heavy hitters and quantiles
 In PODS
, 2009
"... We consider the the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, ..."
Abstract

Cited by 22 (9 self)
We consider the problem of tracking heavy hitters and quantiles in the distributed streaming model. The heavy hitters and quantiles are two important statistics for characterizing a data distribution. Let A be a multiset of elements, drawn from the universe U = {1,..., u}. For a given 0 ≤ φ ≤ 1, the φ-heavy hitters are those elements of A whose frequency in A is at least φ|A|; the φ-quantile of A is an element x of U such that at most φ|A| elements of A are smaller than x and at most (1 − φ)|A| elements of A are greater than x. Suppose the elements of A are received at k remote sites over time, and each of the sites has a two-way communication channel to a designated coordinator, whose goal is to track the set of φ-heavy hitters and the φ-quantile of A approximately at all times with minimum communication. We give tracking algorithms with worst-case communication cost O(k/ε · log n) for both problems, where n is the total number of items in A and ε is the approximation error. This substantially improves upon the previously known algorithms. We also give matching lower bounds on the communication costs for both problems, showing that our algorithms are optimal. We also consider a more general version of the problem where we simultaneously track the φ-quantiles for all 0 ≤ φ ≤ 1.
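A minimal sketch of the basic communication-saving idea behind such tracking protocols (an illustrative simplification, not the paper's optimal algorithm; the class names and the delta parameter are assumptions): each remote site batches its arrivals and notifies the coordinator only after a local item count has grown by delta, so the coordinator's view of any frequency is stale by at most k * delta.

from collections import defaultdict

class Coordinator:
    def __init__(self):
        self.counts = defaultdict(int)        # coordinator's approximate view

    def report(self, item, increment):
        self.counts[item] += increment

    def heavy_hitters(self, phi, total):
        # items whose reported frequency reaches the phi threshold
        return {x for x, c in self.counts.items() if c >= phi * total}

class Site:
    def __init__(self, coordinator, delta):
        self.coordinator = coordinator
        self.delta = delta                    # report granularity
        self.pending = defaultdict(int)       # arrivals not yet reported

    def receive(self, item):
        self.pending[item] += 1
        if self.pending[item] >= self.delta:  # one message per delta arrivals
            self.coordinator.report(item, self.pending[item])
            self.pending[item] = 0

Choosing delta on the order of ε·n/k keeps every coordinator estimate within ε·n of the true count while sending roughly k/ε messages in total, which is the flavor of trade-off the paper optimizes.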
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract

Cited by 21 (7 self)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be combined like other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
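As a concrete illustration of the kind of merge operation studied, here is a sketch of merging two Misra-Gries (MG) heavy-hitter summaries (a simplified rendering, not the paper's full analysis): add the counters item-wise, then subtract the (k+1)-st largest count from every counter and drop the ones that fall to zero or below.

def mg_update(counters, item, k):
    # standard MG update keeping at most k counters
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        for key in list(counters):            # decrement all, evict zeros
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]
    return counters

def mg_merge(c1, c2, k):
    # merge two MG summaries built with the same k
    merged = dict(c1)
    for item, cnt in c2.items():
        merged[item] = merged.get(item, 0) + cnt
    if len(merged) <= k:
        return merged
    kth = sorted(merged.values(), reverse=True)[k]   # (k+1)-st largest count
    return {item: cnt - kth for item, cnt in merged.items() if cnt > kth}

The merged summary again has at most k counters, which is the property that lets the summary be composed like sum or max.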
Space-optimal Heavy Hitters with Strong Error Bounds
, 2009
"... The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worstcase guarantees on real data. This leads to the question of whethe ..."
Abstract

Cited by 20 (4 self)
The problem of finding heavy hitters and approximating the frequencies of items is at the heart of many problems in data stream analysis. It has been observed that several proposed solutions to this problem can outperform their worst-case guarantees on real data. This leads to the question of whether some stronger bounds can be guaranteed. We answer this in the positive by showing that a class of “counter-based algorithms” (including the popular and very space-efficient FREQUENT and SPACESAVING algorithms) provides much stronger approximation guarantees than previously known. Specifically, we show that errors in the approximation of individual elements do not depend on the frequencies of the most frequent elements, but only on the frequency of the remaining “tail.” This shows that counter-based methods are the most space-efficient (in fact, space-optimal) algorithms having this strong error bound. This tail guarantee allows these algorithms to solve the “sparse recovery” problem. Here, the goal is to recover a faithful representation of the vector of frequencies, f. We prove that using space O(k), the algorithms construct an approximation f* to the frequency vector f so that the L1 error ‖f − f*‖1 is close to the best possible error min_{f'} ‖f' − f‖1, where f' ranges over all vectors with at most k nonzero entries. This improves the previously best known space bound of about O(k log n) for streams without element deletions (where n is the size of the domain from which stream elements are drawn). Other consequences of the tail guarantee are results for skewed (Zipfian) data and guarantees for the accuracy of merging multiple summarized streams.
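For reference, a minimal version of SPACESAVING, one of the counter-based algorithms whose tail guarantee the paper establishes (an illustrative sketch only): keep k (item, count) pairs; a new, unmonitored item evicts the pair with the minimum count and inherits that count plus one, so every stored count overestimates the true frequency by at most the evicted minimum.

def space_saving(stream, k):
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)    # smallest counter
            counters[item] = counters.pop(victim) + 1   # inherit its count + 1
    return counters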
Mining Hot Calling Contexts in Small Space
 In ACM Conference on Programming Language Design and Implementation
, 2011
"... Calling context trees (CCTs) associate performance metrics with paths through a program’s call graph, providing valuable information for program understanding and performance analysis. Although CCTs are typically much smaller than call trees, in real applications they might easily consist of tens of ..."
Abstract

Cited by 13 (2 self)
Calling context trees (CCTs) associate performance metrics with paths through a program’s call graph, providing valuable information for program understanding and performance analysis. Although CCTs are typically much smaller than call trees, in real applications they might easily consist of tens of millions of distinct calling contexts: this sheer size makes them difficult to analyze and might hurt execution times due to poor access locality. For performance analysis, accurately collecting information about hot calling contexts may be more useful than constructing an entire CCT that includes millions of uninteresting paths. As we show for a variety of prominent Linux applications, the distribution of calling context frequencies is typically very skewed. In this paper we show how to exploit this property to reduce the CCT size considerably. We introduce a novel run-time data structure, called the Hot Calling Context Tree.
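A toy illustration of the general approach (an assumed simplification, not the paper's data structure): treat each distinct calling context, here the tuple of function names on the current call stack, as one stream item and keep it in a bounded, SpaceSaving-style set of counters, so only the hot contexts are retained.

import traceback

MAX_CONTEXTS = 10_000            # assumed space budget
hot_contexts = {}

def record_context():
    frames = tuple(f.name for f in traceback.extract_stack()[:-1])
    if frames in hot_contexts:
        hot_contexts[frames] += 1
    elif len(hot_contexts) < MAX_CONTEXTS:
        hot_contexts[frames] = 1
    else:                        # evict the coldest context, inherit its count
        victim = min(hot_contexts, key=hot_contexts.get)
        hot_contexts[frames] = hot_contexts.pop(victim) + 1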
Zero-One Frequency Laws
"... Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS ” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AM ..."
Abstract

Cited by 13 (4 self)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give a precise characterization of the functions G on frequency vectors (m_i), 1 ≤ i ≤ n, for which ∑_{i∈[n]} G(m_i) can be approximated efficiently, where “efficiently” means by a single pass over the data stream and with polylogarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the main question of AMS and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotone functions G: R → R such that G(0) = 0 and G can be computed in polylogarithmic time and space, and we ask: for which G in this class is there a (1±ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) for any polylogarithmic ε? We give an algebraic characterization for all such G so that: • for all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ε)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) with polylogarithmic memory and a single pass over the data stream; while • for all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound.
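The classical example of a function in this class is G(x) = x^2, the second frequency moment, which admits a small-space (1±ε)-approximation. A minimal AMS-style estimator (illustrative only; it uses truly random signs rather than the limited-independence hashing of the original construction): maintain Z = ∑_i s(i)·m_i for a random ±1 sign s(i); then Z^2 is an unbiased estimate of ∑_i m_i^2, and averaging independent copies reduces the variance.

import random

def ams_f2(stream, num_estimators=64, seed=0):
    rng = random.Random(seed)
    signs = [dict() for _ in range(num_estimators)]   # lazily drawn +/-1 per item
    z = [0] * num_estimators
    for item in stream:
        for j in range(num_estimators):
            if item not in signs[j]:
                signs[j][item] = rng.choice((-1, 1))
            z[j] += signs[j][item]
    return sum(zj * zj for zj in z) / num_estimators  # average of unbiased estimates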
Finding the Frequent Items in Streams of Data
"... doi:10.1145/1562764.1562789 The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indi ..."
Abstract

Cited by 12 (1 self)
doi:10.1145/1562764.1562789
The frequent items problem is to process a stream of items and find all those which occur more than a given fraction of the time. It is one of the most heavily studied problems in mining data streams, dating back to the 1980s. Many other applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. In this paper, we describe the most important algorithms for this problem in a common framework. We place the different solutions in their historical context and describe the connections between them, with the aim of clarifying some of the confusion that has surrounded their properties. To further illustrate the different properties of the algorithms, we provide baseline implementations. This allows us to give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
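One representative of the sketch-based family covered by the survey, alongside the counter-based methods, is the Count-Min sketch; a minimal version follows (an illustrative sketch that uses Python's built-in hash with per-row salts rather than a formal pairwise-independent family): each item is hashed into one counter per row, and the estimated frequency is the minimum over rows, which never underestimates the true count.

import random

class CountMin:
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))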
Frequent items in streaming data: an experimental evaluation of the state-of-the-art
 Data and Knowledge Engineering
"... The problem of detecting frequent items in streaming data is relevant to many different applications across many domains. Several algorithms, diverse in nature, have been proposed in the literature for the solution of the above problem. In this paper, we review these algorithms, and we present the r ..."
Abstract

Cited by 12 (4 self)
The problem of detecting frequent items in streaming data is relevant to many different applications across many domains. Several algorithms, diverse in nature, have been proposed in the literature for the solution of this problem. In this paper, we review these algorithms, and we present the results of the first extensive comparative experimental study of the most prominent algorithms in the literature. The algorithms were comprehensively tested using a common test framework on several real and synthetic datasets. Their performance with respect to the different parameters (i.e., parameters intrinsic to the algorithms, and data-related parameters) was studied. We report the results and the insights gained through these experiments.
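A small harness in the spirit of such an evaluation (an assumed setup, not the paper's framework): generate a Zipf-distributed synthetic stream, run a candidate algorithm, and score the items it reports against the exact answer.

import random
from collections import Counter

def zipf_stream(n_items, universe, skew, seed=0):
    rng = random.Random(seed)
    weights = [1.0 / (rank ** skew) for rank in range(1, universe + 1)]
    return rng.choices(range(universe), weights=weights, k=n_items)

def precision_recall(reported, stream, phi):
    exact = Counter(stream)
    truth = {x for x, c in exact.items() if c >= phi * len(stream)}
    reported = set(reported)
    precision = len(reported & truth) / max(len(reported), 1)
    recall = len(reported & truth) / max(len(truth), 1)
    return precision, recall

Sweeping the skew parameter and the per-algorithm space budget over such streams gives the kind of accuracy comparison this sort of study reports.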
FPGA Acceleration for the Frequent Item Problem
 In ICDE
, 2010
"... Abstract — Fieldprogrammable gate arrays (FPGAs) can provide performance advantages with a lower resource consumption (e.g., energy) than conventional CPUs. In this paper, we show how to employ FPGAs to provide an efficient and highperformance solution for the frequent item problem. We discuss thr ..."
Abstract

Cited by 11 (7 self)
Field-programmable gate arrays (FPGAs) can provide performance advantages with lower resource consumption (e.g., energy) than conventional CPUs. In this paper, we show how to employ FPGAs to provide an efficient and high-performance solution for the frequent item problem. We discuss three design alternatives, each of them exploiting different FPGA features, and we provide an exhaustive evaluation of their performance characteristics. The first design is a one-to-one mapping of the SpaceSaving algorithm (shown to be the best approach in software [1]), built on special features of FPGAs: content-addressable memory and dual-ported BRAM. The two other implementations exploit the flexibility of digital circuits to implement parallel lookups and pipelining strategies, resulting in significant improvements in performance. On low-cost FPGA hardware, the fastest of our designs can process 80 million items per second, three times as much as the best known result. Moreover, unlike in software approaches, where performance is directly related to the skew factor of the Zipf distribution, the high throughput is independent of the skew of the input distribution. In the paper we also discuss several design trade-offs that are relevant when implementing database functionality on FPGAs. In particular, we look at the resource consumption and the levels of data and task parallelism of the three different designs.
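A toy software model of the CAM-style lookup the first design is built on (a hypothetical simplification with no hardware detail): the incoming item is compared against every monitored slot at once; a hit increments that slot, a miss evicts the minimum slot, exactly as in SpaceSaving, but with all comparisons conceptually done in one cycle.

def cam_space_saving_step(slots, item):
    # slots: fixed-length list of [value, count] pairs, as in a hardware table
    matches = [slot for slot in slots if slot[0] == item]   # "parallel" compare
    if matches:
        matches[0][1] += 1
    else:
        victim = min(slots, key=lambda slot: slot[1])       # "parallel" minimum
        victim[0], victim[1] = item, victim[1] + 1
    return slots

# usage: slots = [[None, 0] for _ in range(k)]; feed one item per "cycle"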
Optimizing Data Partitioning for Data-Parallel Computing
"... Performance of dataparallel computing (e.g., MapReduce, DryadLINQ) heavily depends on its data partitions. Solutions implemented by the current state of the art systems are far from optimal. Techniques proposed by the database community to find optimal data partitions are not directly applicable wh ..."
Abstract

Cited by 10 (1 self)
Performance of data-parallel computing (e.g., MapReduce, DryadLINQ) heavily depends on its data partitions. Solutions implemented by the current state-of-the-art systems are far from optimal. Techniques proposed by the database community to find optimal data partitions are not directly applicable when complex user-defined functions and data models are involved. We outline our solution, which draws on expertise from various fields such as programming languages and optimization, and present our preliminary results.
gSketch: On Query Estimation in Graph Streams
"... Many dynamic applications are built upon large network infrastructures, such as social networks, communication networks, biological networks and the Web. Such applications create data that can be naturally modeled as graph streams, in which edges of the underlying graph are received and updated sequ ..."
Abstract

Cited by 10 (2 self)
Many dynamic applications are built upon large network infrastructures, such as social networks, communication networks, biological networks and the Web. Such applications create data that can be naturally modeled as graph streams, in which edges of the underlying graph are received and updated sequentially in the form of a stream. It is often necessary and important to summarize the behavior of graph streams in order to enable effective query processing. However, the sheer size and dynamic nature of graph streams present an enormous challenge to existing graph management techniques. In this paper, we propose a new graph sketch method, gSketch, which combines well-studied synopses for traditional data streams with a sketch partitioning technique, to estimate and optimize the responses to basic queries on graph streams. We consider two different scenarios for query estimation: (1) a graph stream sample is available; (2) both a graph stream sample and a query workload sample are available. Algorithms for the two scenarios are designed by partitioning a global sketch into a group of localized sketches in order to optimize the query estimation accuracy. We perform extensive experimental studies on both real and synthetic data sets and demonstrate the power and robustness of gSketch in comparison with the state-of-the-art global sketch method.
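A toy sketch of the partitioning idea (an assumed simplification, not the paper's optimization procedure): use a sample of the graph stream to split source vertices into groups of similar edge frequency, give each group its own local summary, and route every arriving edge to its group's summary; a plain counter dictionary stands in for a real local sketch here.

from collections import Counter, defaultdict

def build_groups(edge_sample, num_groups):
    # rank source vertices by sampled edge frequency, then cut the ranking into groups
    freq = Counter(src for src, _dst in edge_sample)
    ranked = [v for v, _ in freq.most_common()]
    return {v: (i * num_groups) // max(len(ranked), 1) for i, v in enumerate(ranked)}

def route(edge, groups, sketches, num_groups):
    src, _dst = edge
    gid = groups.get(src, num_groups - 1)   # unseen vertices go to the last group
    sketches[gid][edge] += 1                # stand-in for a real local sketch update
    return sketches

# usage: sketches = [defaultdict(int) for _ in range(num_groups)]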