Results 1 - 10
of
74
A near-optimal algorithm for computing the entropy of a stream
- In ACM-SIAM Symposium on Discrete Algorithms
, 2007
"... We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε −2 log(δ −1) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε −2 ..."
Abstract
-
Cited by 73 (20 self)
- Add to MetaCart
(Show Context)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε −2 log(δ −1) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε −2 / log(ε −1)), meaning that our algorithm is near optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph. 1
CSAMP: A System for Network-Wide Flow Monitoring
"... Critical network management applications increasingly demand fine-grained flow level measurements. However, current flow monitoring solutions are inadequate for many of these applications. In this paper, we present the design, implementation, and evaluation of CSAMP, a system-wide approach for flow ..."
Abstract
-
Cited by 46 (11 self)
- Add to MetaCart
(Show Context)
Critical network management applications increasingly demand fine-grained flow level measurements. However, current flow monitoring solutions are inadequate for many of these applications. In this paper, we present the design, implementation, and evaluation of CSAMP, a system-wide approach for flow monitoring. The design of CSAMP derives from three key ideas: flow sampling as a router primitive instead of uniform packet sampling; hash-based packet selection to achieve coordination without explicit communication; and a framework for distributing responsibilities across routers to achieve network-wide monitoring goals while respecting router resource constraints. We show that CSAMP achieves much greater monitoring coverage, better use of router resources, and enhanced ability to satisfy network-wide flow monitoring goals compared to existing solutions. 1
An Empirical Evaluation of Entropy-Based Traffic Anomaly Detection
, 2008
"... Entropy-based approaches for anomaly detection are appealing since they provide more fine-grained insights than traditional traffic volume analysis. While previous work has demonstrated the benefits of entropy-based anomaly detection, there has been little effort to comprehensively understand the de ..."
Abstract
-
Cited by 41 (0 self)
- Add to MetaCart
(Show Context)
Entropy-based approaches for anomaly detection are appealing since they provide more fine-grained insights than traditional traffic volume analysis. While previous work has demonstrated the benefits of entropy-based anomaly detection, there has been little effort to comprehensively understand the detection power of using entropy-based analysis of multiple traffic distributions in conjunction with each other. We consider two classes of distributions: flow-header features (IP addresses, ports, and flow-sizes), and behavioral features (degree distributions measuring the number of distinct destination/source IPs that each host communicates with). We observe that the timeseries of entropy values of the address and port distributions are strongly correlated with each other and provide very similar anomaly detection capabilities. The behavioral and flow size distributions are less correlated and detect incidents that do not show up as anomalies in the port and address distributions. Further analysis using synthetically generated anomalies also suggests that the port and address distributions have limited utility in detecting scan and bandwidth flood anomalies. Based on our analysis, we discuss important implications for entropy-based anomaly detection.
Every Microsecond Counts: Tracking Fine-Grain Latencies with a Lossy Difference Aggregator
"... Many network applications have stringent end-to-end latency requirements, including VoIP and interactive video conferencing, automated trading, and high-performance computing—where even microsecond variations may be intolerable. The resulting fine-grain measurement demands cannot be met effectively ..."
Abstract
-
Cited by 33 (11 self)
- Add to MetaCart
(Show Context)
Many network applications have stringent end-to-end latency requirements, including VoIP and interactive video conferencing, automated trading, and high-performance computing—where even microsecond variations may be intolerable. The resulting fine-grain measurement demands cannot be met effectively by existing technologies, such as SNMP, NetFlow, or active probing. We propose instrumenting routers with a hash-based primitive that we call a Lossy Difference Aggregator (LDA) to measure latencies down to tens of microseconds and losses as infrequent as one in a million. Such measurement can be viewed abstractly as what we refer to as a coordinated streaming problem, which is fundamentally harder than standard streaming problems due to the need to coordinate values between nodes. We describe a compact data structure that efficiently computes the average and standard deviation of latency and loss rate in a coordinated streaming environment. Our theoretical results translate to an efficient hardware implementation at 40 Gbps using less than 1 % of a typical 65-nm 400-MHz networking ASIC. When compared to Poisson-spaced active probing with similar overheads, our LDA mechanism delivers orders of magnitude smaller relative error; active probing requires 50–60 times as much bandwidth to deliver similar levels of accuracy.
A data streaming algorithm for estimating entropies of od flows
- In IMC ’07: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement
, 2007
"... Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an inte ..."
Abstract
-
Cited by 32 (7 self)
- Add to MetaCart
Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an interesting open problem is to measure the entropy of the traffic between every origin-destination pair. In this paper, we propose the first solution to this challenging problem. Our sketch builds upon and extends the Lp sketch of Indyk with significant additional innovations. We present calculations showing that our data streaming algorithm is feasible for high link speeds using commodity CPU/memory at a reasonable cost. Our algorithm is shown to be very accurate in practice via simulations, using traffic traces collected at a tier-1 ISP backbone link.
Compressed counting
- CoRR
"... We propose Compressed Counting (CC) for approximating the αth frequency moments (0 < α ≤ 2) of data streams under a relaxed strict-Turnstile model, using maximallyskewed stable random projections. Estimators based on the geometric mean and the harmonic mean are developed. When α = 1, a simple cou ..."
Abstract
-
Cited by 21 (13 self)
- Add to MetaCart
We propose Compressed Counting (CC) for approximating the αth frequency moments (0 < α ≤ 2) of data streams under a relaxed strict-Turnstile model, using maximallyskewed stable random projections. Estimators based on the geometric mean and the harmonic mean are developed. When α = 1, a simple counter suffices for counting the first moment (i.e., sum). The geometric mean estimator of CC has asymptotic variance ∝ ∆ = |α − 1|, capturing the intuition that the complexity should decrease as ∆ = |α−1 | → 0. However, the previous classical algorithms based on symmetric stable random projections[12, 15] required O ( 1/ɛ 2) space, in order to approximate the αth moments within a 1 + ɛ factor, for any 0 < α ≤ 2 including α = 1. We show ( that using the geometric mean estimator, CC 1 requires O log(1+ɛ) + 2 √ ∆ log3/2 ( √∆)) + o space, as ∆ → (1+ɛ) 0. Therefore, in the neighborhood of α = 1, the complexity of CC is essentially O (1/ɛ) instead of O ( 1/ɛ 2). CC may be useful for estimating Shannon entropy, which can be approximated by certain functions of the αth moments with α → 1. [10, 9] suggested using α = 1 + ∆ with (e.g.,) ∆ < 0.0001 and ɛ < 10 −7, to rigorously ensure reasonable approximations. Thus, unfortunately, CC is “theoretically impractical ” for estimating Shannon entropy, despite its empirical success reported in [16]. 1
1 Linear-time Modeling of Program Working Set in Shared Cache
"... Abstract—Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n 2) windows in an n-element trace. Two recent techniques have significantl ..."
Abstract
-
Cited by 17 (9 self)
- Add to MetaCart
Abstract—Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n 2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.
Zero-One Frequency Laws
"... Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS ” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AM ..."
Abstract
-
Cited by 12 (4 self)
- Add to MetaCart
(Show Context)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS ” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give the precise characterization for which functions G on frequency vectors mi (1 ≤ i ≤ n) can ∑ i∈[n] G(mi) be approximated efficiently, where “efficiently ” means by a single pass over data stream and poly-logarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R ↦ → R such that G(0) = 0 and G can be computed in poly-logarithmic time and space and ask, for which G in this class is there an (1±ɛ)-approximation algorithm for computing ∑ i∈[n] G(mi) for any polylogarithmic ɛ? We give an algebraic characterization for all such G so that: • For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ɛ)-approximation algorithm for computing ∑ i∈[n] G(mi) with polylogarithmic memory and a single pass over data stream; while • For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound
Revisiting the Case for a Minimalist Approach for Network Flow Monitoring
"... Network management applications require accurate estimates of a wide range of flow-level traffic metrics. Given the inadequacy of current packet-sampling-based solutions, several application-specific monitoring algorithms have emerged. While these provide better accuracy for the specific application ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
Network management applications require accurate estimates of a wide range of flow-level traffic metrics. Given the inadequacy of current packet-sampling-based solutions, several application-specific monitoring algorithms have emerged. While these provide better accuracy for the specific applications they target, they increase router complexity and require vendors to commit to hardware primitives without knowing how useful they will be to meet the needs of future applications. In this paper, we show using trace-driven evaluations that such complexity and early commitment may not be necessary. We revisit the case for a “minimalist ” approach in which a small number of simple yet generic router primitives collect flow-level data from which different traffic metrics can be estimated. We demonstrate the feasibility and promise of such a minimalist approach using flow sampling and sample-and-hold as sampling primitives and configuring these in a network-wide coordinated fashion using cSamp. We show that this proposal yields better accuracy across a collection of application-level metrics than dividing the same memory resources across metric-specific algorithms. Moreover, because a minimalist approach enables late binding to what applicationlevel metrics are important, it better insulates router implementations and deployments from changing monitoring needs.
Distributed Monitoring of Conditional Entropy for Anomaly Detection in Streams
"... Abstract—In this work we consider the problem of monitoring information streams for anomalies in a scalable and efficient manner. We study the problem in the context of network streams where the problem has received significant attention. Monitoring the empirical Shannon entropy of a feature in a ne ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
(Show Context)
Abstract—In this work we consider the problem of monitoring information streams for anomalies in a scalable and efficient manner. We study the problem in the context of network streams where the problem has received significant attention. Monitoring the empirical Shannon entropy of a feature in a network packet stream has previously been shown to be useful in detecting anomalies in the network traffic. Entropy is an information-theoretic statistic that measures the variability of the feature under consideration. Anomalous activity in network traffic can be captured by detecting changes in this variability. There are several challenges, however, in monitoring this statistic. Computing the statistic efficiently is non-trivial. Further, when monitoring multiple features, the streaming algorithms proposed previously would likely fail to keep up with the everincreasing channel bandwidth of network traffic streams. There is also the concern that an adversary could attempt to mask the effect of his attacks on variability by a mimicry attack disguising his traffic to mimic the distribution of normal traffic in the network, thus avoiding detection by an entropy monitoring sensor. Also, the high rate of false positives is a big problem with Intrusion Detection Systems, and the case of entropy monitoring is no different. In this work we propose a way to address the above challenges. First, we leverage recent progress in sketching algorithms to develop a distributed approach for computing entropic statistics accurately, at reasonable memory costs. Secondly, we propose monitoring not only regular entropy, but the related statistic of conditional entropy, as a more reliable measure in detecting anomalies. We implement our approach and evaluate it with real data collected at the link layer of an 802.11 wireless network. I.