### A Hybrid Local and Distributed Sketching Design for Accurate and Scalable Heavy Key Detection in Network Data Streams

"... Abstract Real-time characterization of network traffic anomalies, such as heavy hitters and heavy changers, is critical for the robustness of operational networks, but its accuracy and scalability are challenged by the ever-increasing volume and diversity of network traffic. We address this problem ..."

Abstract
- Add to MetaCart

(Show Context)
Abstract Real-time characterization of network traffic anomalies, such as heavy hitters and heavy changers, is critical for the robustness of operational networks, but its accuracy and scalability are challenged by the ever-increasing volume and diversity of network traffic. We address this problem by leveraging parallelization. We propose LD-Sketch, a data structure designed for accurate and scalable traffic anomaly detection using distributed architectures. LD-Sketch combines the classical counter-based and sketch-based techniques, and performs detection in two phases: local detection, which guarantees zero false negatives, and distributed detection, which reduces false positives by aggregating multiple detection results. We derive the error bounds and the space and time complexity of LD-Sketch. We further analyze the impact of the ordering of data items on the memory usage and accuracy of LD-Sketch. We compare LD-Sketch with state-of-the-art sketch-based techniques by conducting experiments on traffic traces from a real-life 3G cellular data network. Our results demonstrate the accuracy and scalability of LD-Sketch over prior approaches. Note: an earlier conference version of this paper appeared in IEEE INFOCOM 2014 [15]. We extend our proposed design with the consideration of the ordering of data items (Section 5). We also correct flaws in the lemmas and theorems of the conference version.
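The two-phase idea described in the abstract can be illustrated with a toy sketch: each worker flags every key whose count-min estimate reaches a local threshold (count-min only overestimates, so no true heavy key is missed), and a coordinator then sums estimates across workers to prune false positives. This is a minimal illustrative simplification, not the LD-Sketch data structure itself; the thresholds, hashing via `zlib.crc32`, and sketch sizes are assumptions for the demo.

```python
import zlib

class CountMinSketch:
    """Toy count-min sketch with deterministic hashing (zlib.crc32)."""

    def __init__(self, depth=4, width=512):
        self.rows = [[0] * width for _ in range(depth)]
        self.width = width

    def _cells(self, key):
        for r in range(len(self.rows)):
            yield r, zlib.crc32(f"{r}:{key}".encode()) % self.width

    def update(self, key, count=1):
        for r, c in self._cells(key):
            self.rows[r][c] += count

    def estimate(self, key):
        # count-min never underestimates a key's true count
        return min(self.rows[r][c] for r, c in self._cells(key))

def local_detect(stream, local_thresh):
    """Phase 1: flag every key whose estimate reaches the local threshold.

    Overestimation means no true heavy key is missed (zero false negatives)."""
    sk = CountMinSketch()
    candidates = set()
    for key in stream:
        sk.update(key)
        if sk.estimate(key) >= local_thresh:
            candidates.add(key)
    return sk, candidates

def distributed_detect(results, global_thresh):
    """Phase 2: aggregate estimates across workers to prune false positives."""
    candidates = set().union(*(cand for _, cand in results))
    return {k for k in candidates
            if sum(sk.estimate(k) for sk, _ in results) >= global_thresh}
```

For example, with two workers each seeing a key `H` 300 times, `distributed_detect` with a global threshold of 500 (local threshold 250 per worker) reports `H` but not the light keys.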

### Nonnumerical Algorithms and Problems

"... We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k players, each holding a counter ni that gets incremented over time, and the goal is to track an ε-approximation of their ..."

Abstract
- Add to MetaCart

(Show Context)
We show that randomization can lead to significant improvements for a few fundamental problems in distributed tracking. Our basis is the count-tracking problem, where there are k players, each holding a counter n_i that gets incremented over time, and the goal is to track an ε-approximation of their sum n = Σ_i n_i continuously at all times, using minimum communication. While the deterministic communication complexity of the problem is Θ((k/ε) log N), where N is the final value of n when the tracking finishes, we show that with randomization, the communication cost can be reduced to Θ((√k/ε) log N). Our algorithm is simple and uses only O(1) space at each player, while the lower bound holds even assuming each player has infinite computing power. We then extend our techniques to two related distributed tracking problems: frequency-tracking and rank-tracking, and obtain similar improvements over previous deterministic algorithms. Both problems are of central importance in large data monitoring and analysis, and have been extensively studied in the literature.
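The deterministic baseline that the Θ((k/ε) log N) bound refers to is easy to simulate: each player reports its counter to the coordinator only when it has drifted by roughly an ε/k fraction of the tracked sum, which keeps the coordinator's estimate within a relative error of ε while sending far fewer than n messages. The following is an illustrative sketch of that baseline, not the paper's randomized algorithm; the slack rule is an assumption chosen to make the error bound easy to check.

```python
def track_sum(events, k, eps):
    """Simulate deterministic distributed count-tracking.

    events: sequence of player indices, one per counter increment.
    A player re-reports its counter whenever it exceeds the last reported
    value by max(1, eps * estimate / k); since the estimate only grows,
    every player's unreported drift stays below the final slack, so the
    total error is at most max(k, eps * n).
    """
    counts = [0] * k      # true local counters n_i
    reported = [0] * k    # last value each player sent to the coordinator
    messages = 0
    for p in events:
        counts[p] += 1
        est = sum(reported)               # coordinator's current estimate
        slack = max(1.0, eps * est / k)   # allowed unreported drift
        if counts[p] - reported[p] >= slack:
            reported[p] = counts[p]
            messages += 1
    return sum(reported), sum(counts), messages
```

Running this on 10,000 round-robin increments over k = 10 players with ε = 0.1 keeps the estimate within ε of the true sum while using far fewer than one message per increment.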

### Communication-Efficient Computation on Distributed Noisy Datasets

"... This paper gives a first attempt to answer the following general question: Given a set of machines connected by a point-to-point communication network, each having a noisy dataset, how can we perform communication-efficient statistical estimations on the union of these datasets? Here ‘noisy ’ means ..."

Abstract
- Add to MetaCart

(Show Context)
This paper gives a first attempt to answer the following general question: given a set of machines connected by a point-to-point communication network, each having a noisy dataset, how can we perform communication-efficient statistical estimations on the union of these datasets? Here ‘noisy’ means that a real-world entity may appear in different forms in different datasets, but those variants should be considered as the same universe element when performing statistical estimations. We give a first set of communication-efficient solutions for statistical estimations on distributed noisy datasets, including algorithms for distinct elements, L0-sampling, heavy hitters, frequency moments and empirical entropy.
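Setting aside the noise/entity-resolution aspect (which is the paper's actual contribution), the distributed heavy hitters sub-problem can be illustrated with mergeable Misra-Gries summaries: each machine builds a k-counter summary of its local stream, and summaries are merged with at most n/(k+1) total undercount. This is a textbook baseline sketched here for context, not the paper's algorithm.

```python
def misra_gries(stream, k):
    """Misra-Gries summary with k counters: any item with frequency
    greater than n/(k+1) survives; counts underestimate by at most n/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # decrement all counters; drop those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def merge(a, b, k):
    """Merge two summaries: add counts, then subtract the (k+1)-st largest
    count and drop non-positive entries, keeping at most k counters while
    preserving the n/(k+1) total error guarantee."""
    merged = dict(a)
    for key, c in b.items():
        merged[key] = merged.get(key, 0) + c
    if len(merged) > k:
        cut = sorted(merged.values(), reverse=True)[k]
        merged = {key: c - cut for key, c in merged.items() if c - cut > 0}
    return merged
```

For example, an item appearing 180 times across two machines (250 items total) is retained after merging with k = 3 counters, with its count off by at most 250/4.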

### Vertex Clustering of Augmented Graph Streams

"... In this paper we propose a graph stream clustering algorithm with a unied similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike others, our approach does not require an input parameter for the number of clusters, instead, it ..."

Abstract
- Add to MetaCart

(Show Context)
In this paper we propose a graph stream clustering algorithm with a unified similarity measure on both structural and attribute properties of vertices, with each attribute being treated as a vertex. Unlike other approaches, ours does not require an input parameter for the number of clusters; instead, it dynamically creates new sketch-based clusters and periodically merges existing similar clusters. Experiments on two publicly available datasets reveal the advantages of our approach in detecting vertex clusters in the graph stream. We provide a detailed investigation into how parameters affect the algorithm's performance. We also provide a quantitative evaluation and comparison with a well-known offline community detection algorithm, which shows that our streaming algorithm can achieve comparable or better average cluster purity.
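The "no preset number of clusters" mechanism can be sketched in a few lines: represent each vertex by a sparse feature map (neighbours plus attributes), assign it to the most similar cluster summary if the similarity clears a threshold τ, and otherwise open a new cluster. This is a minimal illustration under assumed cosine similarity and a hypothetical threshold, not the paper's sketch-based algorithm (and it omits the periodic merge step).

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse feature maps (dicts)."""
    num = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    da = math.sqrt(sum(v * v for v in a.values()))
    db = math.sqrt(sum(v * v for v in b.values()))
    return num / (da * db) if da and db else 0.0

def stream_cluster(vertices, tau):
    """Assign each (vertex_id, features) pair to the most similar cluster
    summary if similarity >= tau; otherwise create a new cluster.
    Returns a list of (summary, members) pairs."""
    clusters = []
    for vid, feats in vertices:
        best, best_sim = None, 0.0
        for c in clusters:
            s = cosine(feats, c[0])
            if s > best_sim:
                best, best_sim = c, s
        if best is not None and best_sim >= tau:
            for k, v in feats.items():           # fold vertex into summary
                best[0][k] = best[0].get(k, 0) + v
            best[1].append(vid)
        else:
            clusters.append((dict(feats), [vid]))
    return clusters
```

On a stream with two clearly separated vertex groups, the algorithm discovers exactly two clusters without being told the count in advance.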

### Improved Practical Matrix Sketching with Guarantees

"... Abstract. Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the er-ror/size tradeoff under various sketching paradigms, we find a simple heuristic iSVD, w ..."

Abstract
- Add to MetaCart

(Show Context)
Abstract. Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, we find that a simple heuristic, iSVD, with no guarantees, tends to outperform all known approaches. In this paper we adapt the best-performing guaranteed algorithm, FrequentDirections, in a way that preserves the guarantees, and nearly matches iSVD in practice. We also demonstrate an adversarial dataset for which iSVD performs quite poorly, but our new technique has almost no error. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.
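The guaranteed algorithm the paper builds on, FrequentDirections, is short enough to state in full: keep an ℓ×d sketch B; whenever it fills, rotate to its singular basis and shrink all squared singular values by the smallest one, freeing at least one row. The textbook form below (in numpy, an assumed dependency) satisfies ‖AᵀA − BᵀB‖₂ ≤ ‖A‖²_F/ℓ; the paper's contribution is a faster practical variant that preserves such guarantees, not this baseline.

```python
import numpy as np

def frequent_directions(A, ell):
    """FrequentDirections sketch: returns an ell x d matrix B with
    ||A^T A - B^T B||_2 <= ||A||_F^2 / ell."""
    n, d = A.shape
    B = np.zeros((ell, d))
    filled = 0                       # rows [filled:] of B are zero
    for row in A:
        if filled == ell:
            # sketch is full: rotate to the singular basis and shrink
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[-1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt
            filled = ell - 1         # the last row is now all zeros
        B[filled] = row
        filled += 1
    return B
```

A quick check on random data confirms the covariance error bound while using only ℓ rows instead of n.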

### Communication-Efficient and Exact Clustering Distributed Streaming Data

"... A widely used approach to clustering a single data stream is the two-phased approach in which the online phase creates and maintains micro-clusters while the off-line phase generates the macro-clustering from the micro-clusters. We use this approach to propose a distributed framework for clustering ..."

Abstract
- Add to MetaCart

(Show Context)
A widely used approach to clustering a single data stream is the two-phase approach, in which the online phase creates and maintains micro-clusters while the offline phase generates the macro-clustering from the micro-clusters. We use this approach to propose a distributed framework for clustering streaming data. Our proposed framework consists of fundamental processes: one coordinator-site process and many remote-site processes. Remote-site processes can communicate directly with the coordinator process but cannot communicate with the other remote-site processes. Every remote-site process generates and maintains, from its local data stream, micro-clusters that summarize cluster information. Remote sites send the local micro-clusterings to the coordinator by serialization, or the coordinator invokes remote methods in order to get the local micro-clusterings from the remote sites. After the coordinator receives all the local micro-clusterings from the remote sites, it generates the global clustering by the macro-clustering method. Our theoretical and empirical results show that the global clustering generated by our distributed framework is similar to the clustering generated by the underlying centralized algorithm on the same data set. By using the local micro-clustering approach, our framework achieves high scalability and communication efficiency.
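The reason remote-site summaries can be combined exactly at the coordinator is that the standard micro-cluster representation, a cluster-feature vector of (count, linear sum, squared sum), is additive. A minimal sketch of that representation (illustrative; field names are assumptions, and the macro-clustering phase is omitted):

```python
class MicroCluster:
    """Cluster-feature summary (n, linear sum LS, squared sum SS).

    All three fields are additive, so summaries built independently at
    remote sites can be merged at the coordinator without information loss
    about centroids or per-dimension variance."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]

    def centroid(self):
        return [s / self.n for s in self.ls]
```

Merging two sites' summaries yields exactly the centroid (and squared sums) of the union of their points, which is what makes the distributed clustering match the centralized one.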

### MADALGO, Aarhus University

"... Data summarization is an effective approach to dealing with the “big data ” problem. While data summarization problems traditionally have been studied is the streaming model, the focus is starting to shift to distributed models, as distributed/parallel computation seems to be the only viable way to ..."

Abstract
- Add to MetaCart

(Show Context)
Data summarization is an effective approach to dealing with the “big data” problem. While data summarization problems have traditionally been studied in the streaming model, the focus is starting to shift to distributed models, as distributed/parallel computation seems to be the only viable way to handle today's massive data sets. In this paper, we study ε-approximations, a classical data summary that, intuitively speaking, approximately preserves the density of the underlying data set over a certain range space. We consider the problem of computing ε-approximations for a data set which is held jointly by k players, and give general communication upper and lower bounds that hold for any range space whose discrepancy is known.
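The density-preservation property of an ε-approximation is concrete for the simplest range space, one-dimensional prefix intervals (−∞, t]: a subset S of P is an ε-approximation if, for every t, the fraction of S in the interval differs from the fraction of P by at most ε. A uniform random sample of the right size achieves this with high probability. The sketch below only illustrates the definition via sampling; it is not the paper's communication protocol, and the sample size is an assumption for the demo.

```python
import bisect
import random

def sample_summary(points, m, seed=0):
    """A uniform random sample of size m, used as a candidate
    eps-approximation for 1-D prefix ranges."""
    return random.Random(seed).sample(points, m)

def discrepancy(points, sample):
    """Largest density gap between points and sample over all prefix
    ranges (-inf, t]; the sample is an eps-approximation iff this <= eps."""
    pts, smp = sorted(points), sorted(sample)
    worst = 0.0
    for t in pts:
        dp = bisect.bisect_right(pts, t) / len(pts)
        ds = bisect.bisect_right(smp, t) / len(smp)
        worst = max(worst, abs(dp - ds))
    return worst
```

For 10,000 evenly spaced points, a sample of 2,000 already keeps the prefix-range density gap well under ε = 0.05.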

### Summary Data Structures for Massive Data

"... Abstract. Prompted by the need to compute holistic properties of in-creasingly large data sets, the notion of the “summary ” data structure has emerged in recent years as an important concept. Summary struc-tures can be built over large, distributed data, and provide guaranteed performance for a var ..."

Abstract
- Add to MetaCart

(Show Context)
Abstract. Prompted by the need to compute holistic properties of increasingly large data sets, the notion of the “summary” data structure has emerged in recent years as an important concept. Summary structures can be built over large, distributed data, and provide guaranteed performance for a variety of data summarization tasks. Various types of summaries are known: summaries based on random sampling; summaries formed as linear sketches of the input data; and other summaries designed for a specific problem at hand.

### Quantiles over Data Streams: An Experimental Study

"... A fundamental problem in data management and analysis is to gen-erate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to ..."

Abstract
- Add to MetaCart

A fundamental problem in data management and analysis is to generate descriptions of the distribution of data. It is most common to give such descriptions in terms of the cumulative distribution, which is characterized by the quantiles of the data. The design and engineering of efficient methods to find these quantiles has attracted much study, especially in the case where the data is described incrementally, and we must compute the quantiles in an online, streaming fashion. Yet while such algorithms have proved to be tremendously useful in practice, there has been limited formal comparison of the competing methods, and no comprehensive study of their performance. In this paper, we remedy this deficit by providing a taxonomy of different methods, and describe efficient implementations. In doing so, we propose and analyze variations that have not been explicitly studied before, yet which turn out to perform the best. To illustrate this, we provide detailed experimental comparisons demonstrating the tradeoffs between space, time, and accuracy for quantile computation.
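The simplest member of the taxonomy the paper surveys is the random-sampling summary: maintain a uniform reservoir sample of the stream and answer quantile queries from the sorted sample, trading accuracy (rank error roughly 1/√m for a sample of size m) for minimal bookkeeping. A minimal sketch of that baseline (illustrative only; the class name and parameters are assumptions, and the paper's stronger deterministic summaries are not shown):

```python
import random

class ReservoirQuantiles:
    """Streaming quantile estimator: keep a uniform reservoir sample of
    size m (Vitter's Algorithm R) and read quantiles off the sorted sample."""

    def __init__(self, m, seed=0):
        self.m = m
        self.rng = random.Random(seed)
        self.sample = []
        self.n = 0

    def update(self, x):
        self.n += 1
        if len(self.sample) < self.m:
            self.sample.append(x)
        else:
            # replace a random slot with probability m/n
            j = self.rng.randrange(self.n)
            if j < self.m:
                self.sample[j] = x

    def quantile(self, q):
        s = sorted(self.sample)
        return s[min(len(s) - 1, int(q * len(s)))]
```

On a shuffled stream of 10,000 distinct values, a reservoir of 1,000 items estimates the median to within a few percent of the true rank.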