Results 1–10 of 97
A Geometric Approach to Monitoring Threshold Functions Over Distributed Data Streams
 In ACM SIGMOD
Cited by 92 (20 self)
Monitoring data streams in a distributed system has been the focus of much research in recent years. Most of the proposed schemes, however, deal with monitoring simple aggregated values, such as the frequency of appearance of items in the streams. More involved challenges, such as the important task of feature selection (e.g., by monitoring the information gain of various features), still require very high communication overhead using naive, centralized algorithms. We present a novel geometric approach by which an arbitrary global monitoring task can be split into a set of constraints applied locally on each of the streams. The constraints are used to locally filter out data increments that do not affect the monitoring outcome, thus avoiding unnecessary communication. As a result, our approach enables monitoring of arbitrary threshold functions over distributed data streams in an efficient manner. We present experimental results on real-world data which demonstrate that our algorithms are highly scalable, and considerably reduce communication load in comparison to centralized algorithms.
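The local-filtering idea can be made concrete with a small sketch: a node stays silent as long as an arbitrary function f keeps to one side of the threshold over the ball spanned by the last-reported vector e and the node's current vector v. The paper bounds the ball analytically; random sampling here is an illustrative stand-in, and all names are hypothetical.

```python
import math
import random

def ball_is_monochromatic(f, threshold, e, v, samples=200, seed=0):
    """Sampling-based check that f stays on one side of the threshold
    over the ball whose diameter is the segment [e, v]. A violation
    would trigger communication in the geometric scheme.
    (Illustrative: the paper uses an analytic test, not sampling.)"""
    rng = random.Random(seed)
    dim = len(e)
    center = [(a + b) / 2.0 for a, b in zip(e, v)]
    radius = math.dist(e, v) / 2.0
    below = f(e) <= threshold  # which side of the threshold we start on
    for _ in range(samples):
        # draw a uniform point in the ball: random direction, scaled radius
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.hypot(*d) or 1.0
        r = radius * rng.random() ** (1.0 / dim)
        p = [c + di / norm * r for c, di in zip(center, d)]
        if (f(p) <= threshold) != below:
            return False  # ball crosses the threshold: report to coordinator
    return True  # safe to stay silent
```

A node whose drift keeps the ball on one side of the threshold filters its updates locally; only violations cost communication.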
A near-optimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 73 (20 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε^-2 log(δ^-1) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε^-2 / log(ε^-1)), meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
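For reference, the quantity being approximated is the empirical entropy H = Σ_i (m_i/m) log2(m/m_i), where m_i is the count of item i. A trivial exact one-pass computation (linear space, unlike the paper's sublinear-space algorithm) looks like:

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Exact empirical entropy H = sum_i (m_i/m) * log2(m/m_i).
    Uses space linear in the number of distinct items; the paper's
    algorithm approximates H in O(eps^-2 log(delta^-1) log m) words."""
    counts = Counter(stream)
    if not counts:
        return 0.0
    m = sum(counts.values())
    return sum((c / m) * math.log2(m / c) for c in counts.values())
```

A stream with two items appearing equally often has entropy 1 bit; a constant stream has entropy 0.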
Resource Sharing in Continuous Sliding-Window Aggregates
2004
Cited by 63 (1 self)
We consider the problem of resource sharing when processing large numbers of continuous queries. We specifically address sliding-window aggregates over data streams, an important class of continuous operators for which sharing has not been addressed. We present a suite of sharing techniques that cover a wide range of possible scenarios: different classes of aggregation functions (algebraic, distributive, holistic), different window types (time-based, tuple-based, suffix, historical), and different input models (single stream, multiple substreams). We provide precise theoretical performance guarantees for our techniques, and show their practical effectiveness through an experimental study.
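The basic sharing idea for a distributive aggregate such as SUM can be illustrated with one shared prefix-sum array answering sliding-window queries of many sizes at once. This is a generic sketch, not the paper's technique, which covers far broader function and window classes:

```python
from itertools import accumulate

def shared_window_sums(values, window_sizes):
    """Answer many sliding-window SUM queries from one shared
    prefix-sum array, instead of keeping one running state per query.
    Each window sum is a difference of two prefix sums."""
    prefix = [0] + list(accumulate(values))
    return {
        w: [prefix[i] - prefix[i - w] for i in range(w, len(prefix))]
        for w in window_sizes
    }
```

Adding a query with a new window size costs nothing beyond two array lookups per answer, which is the essence of sharing across overlapping windows.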
Fast and Approximate Stream Mining of Quantiles and Frequencies Using Graphics Processors
2005
Cited by 56 (11 self)
We present algorithms for fast quantile and frequency estimation in large data streams using graphics processing units (GPUs). We exploit the high computational power and memory bandwidth of graphics processors and present a novel sorting algorithm that performs rasterization operations on the GPU. We use sorting as the main computational component for histogram approximation and the construction of ε-approximate quantile and frequency summaries. Our overall algorithms for computing numerical statistics on data streams are deterministic, applicable to fixed- or variable-sized sliding windows, and use a limited memory footprint. We use the GPU as a co-processor and minimize the data transmission between the CPU and GPU by taking into account the low bus bandwidth. We have implemented our algorithms on a PC with an NVIDIA GeForce FX 6800 Ultra GPU and a 3.4 GHz Pentium IV CPU and applied them to large data streams consisting of more than 100 million values. We have compared their performance against optimized implementations of prior CPU-based algorithms. Overall, our results demonstrate that the graphics processor available on a commodity computer system is an efficient stream processor and a useful co-processor for mining data streams.
Finding Frequent Items in Data Streams
In PVLDB, 2008
Cited by 55 (6 self)
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large-scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
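One of the classic counter-based algorithms such a survey covers is Misra-Gries (the "Frequent" algorithm). A minimal sketch, with the standard guarantee that any item occurring more than m/k times in a stream of length m survives in the counter set:

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items summary with at most k-1 counters.
    Any item occurring more than m/k times is guaranteed to remain,
    with estimated count within m/k of its true count."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping those that hit zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

A second pass over the stream (when available) can filter the surviving candidates down to the true frequent items.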
Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Data
In Proceedings of the 23rd ACM SIGMOD International Conference on Management of Data, 2004
Cited by 44 (2 self)
Data items archived in data warehouses or those that arrive online as streams typically have attributes which take values from multiple hierarchies (e.g., time and geographic location; source and destination IP addresses). Providing an aggregate view of such data is important to summarize, visualize, and analyze. We develop the aggregate view based on certain hierarchically organized sets of large-valued regions ("heavy hitters"). Such Hierarchical Heavy Hitters (HHHs) were previously introduced as a crucial aggregation technique in one dimension. In order to analyze the wider range of data warehousing applications and realistic IP data streams, we generalize this problem to multiple dimensions. We identify and study two variants of HHHs for multidimensional data, namely the "overlap" and "split" cases, depending on how an aggregate computed for a child node in the multidimensional hierarchy is propagated to its parent element(s). For data warehousing applications, we present offline algorithms that take multiple passes over the data and produce the exact HHHs. For data stream applications, we present online algorithms that find approximate HHHs in one pass, with proven accuracy guarantees. We show experimentally, using real and synthetic data, that our proposed online algorithms yield outputs which are very similar (virtually identical, in many cases) to their offline counterparts. The lattice property of the product of hierarchical dimensions ("diamond") is crucially exploited in our online algorithms to track approximate HHHs using only a small, fixed number of statistics per candidate node, regardless of the number of dimensions.
A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows
In Proc. ACM SIGMOD-SIGACT-SIGART Symp. on Princ. of Database Sys. (PODS), 2006
Cited by 31 (3 self)
In this paper, we give a simple scheme for identifying ε-approximate frequent items over a sliding window of size n. Our scheme is deterministic and does not make any assumptions about the distribution of the item frequencies. It supports O(1/ε) update and query time, and uses O(1/ε) space. It is very simple; its main data structures are just a few short queues whose entries store the positions of some items in the sliding window. We also extend our scheme to variable-size windows. This extended scheme uses O(1/ε log(εn)) space.
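To make the problem concrete, here is a naive exact sliding-window tracker using O(n) space, i.e., the baseline the paper's O(1/ε)-space scheme improves on. The class and method names are illustrative, not from the paper:

```python
from collections import deque, Counter

class SlidingWindowCounter:
    """Naive exact frequent-items tracker over a sliding window of
    size n. Stores every item in the window: O(n) space, versus the
    paper's O(1/eps) deterministic scheme."""

    def __init__(self, n):
        self.n = n
        self.window = deque()
        self.counts = Counter()

    def update(self, x):
        self.window.append(x)
        self.counts[x] += 1
        if len(self.window) > self.n:       # expire the oldest item
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def frequent(self, eps):
        """Items occurring more than eps * n times in the current window."""
        return {x for x, c in self.counts.items() if c > eps * self.n}
```

The paper's contribution is achieving an ε-approximate version of `frequent` without storing the window at all.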
Shape sensitive geometric monitoring
In Proc. ACM Symposium on Principles of Database Systems, 2008
Cited by 31 (15 self)
A fundamental problem in distributed computation is the distributed evaluation of functions. The goal is to determine the value of a function over a set of distributed inputs, in a communication-efficient manner. Specifically, we assume that each node holds a time-varying input vector, and we are interested in determining, at any given time, whether the value of an arbitrary function on the average of these vectors crosses a predetermined threshold. In this paper, we introduce a new method for monitoring distributed data, which we term shape sensitive geometric monitoring. It is based on a geometric interpretation of the problem, which makes it possible to define local constraints on the data received at the nodes. It is guaranteed that as long as none of these constraints is violated, the value of the function does not cross the threshold. We generalize previous work on geometric monitoring, and solve two problems which seriously hampered its performance: as opposed to the constraints used so far, which depend only on the current values of the local input vectors, here we incorporate their temporal behavior into the constraints. Also, the new constraints are tailored to the geometric properties of the specific function which is being monitored, while the previous constraints were generic. Experimental results on real-world data reveal that using the new geometric constraints reduces communication by up to three orders of magnitude in comparison to existing approaches, and considerably narrows the gap between existing results and a newly defined lower bound on the communication complexity.
Theory and Practice of Bloom Filters for Distributed Systems
Cited by 30 (0 self)
Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research, and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
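A minimal Bloom filter sketch captures the core trade-off surveyed above: membership tests may return false positives but never false negatives. Deriving the k hash functions from salted SHA-256 is an illustrative choice, not prescribed by the survey:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Queries may yield false positives, never false negatives."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _positions(self, item):
        # k hash positions from salted SHA-256 (illustrative construction)
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

Sizing m and k against the expected number of insertions controls the false-positive rate, which is what most of the surveyed variants refine.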
Mining frequent itemsets from data streams with a time-sensitive sliding window
In SDM, 2005
Cited by 23 (0 self)
Mining frequent itemsets has been widely studied over the last decade. Past research focused on mining frequent itemsets from static databases. In many new applications, data flow through the Internet or sensor networks. It is challenging to extend the mining techniques to such a dynamic environment. The main challenges include a quick response to continuous requests, a compact summary of the data stream, and a mechanism that adapts to the limited resources. In this paper, we develop a novel approach for mining frequent itemsets from data streams based on a time-sensitive sliding-window model. Our approach consists of a storage structure that captures all possible frequent itemsets and a table providing approximate counts of the expired data items, whose size can be adjusted according to the available storage space. Experimental results show that in our approach both the execution time and the storage space remain small under various parameter settings. In addition, our approach guarantees no false alarms and no false dismissals in the results yielded.