Results 1–10 of 70
A near-optimal algorithm for computing the entropy of a stream
 In ACM-SIAM Symposium on Discrete Algorithms, 2007
Abstract

Cited by 74 (20 self)
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε^-2 log(δ^-1) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε^-2 / log(ε^-1)), meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
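The AMS-style primitive the algorithm extends — sample a uniformly random stream position, count how often the sampled value recurs in the suffix, and turn that count into an unbiased estimator — can be sketched as follows. This is a minimal illustration of the basic estimator, not the paper's near-optimal construction, and the function names are mine:

```python
import math
import random

def f(r, m):
    # f(r) = (r/m) * log2(m/r), with f(0) = 0; note sum_i f(m_i) equals
    # the empirical entropy H of the stream
    return 0.0 if r == 0 else (r / m) * math.log2(m / r)

def entropy_estimate(stream, trials=2000, rng=None):
    """Average many independent basic estimators; each has expectation H."""
    rng = rng or random.Random()
    m = len(stream)
    total = 0.0
    for _ in range(trials):
        sample, r = None, 0
        for t, x in enumerate(stream, 1):
            if rng.random() < 1.0 / t:   # reservoir-sample a uniform position
                sample, r = x, 1
            elif x == sample:
                r += 1                    # count the sampled value in the suffix
        total += m * (f(r, m) - f(r - 1, m))
    return total / trials
```

Each trial is one pass; a real implementation runs all estimator copies in parallel within a single pass.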
CSAMP: A System for Network-Wide Flow Monitoring
Abstract

Cited by 46 (11 self)
Critical network management applications increasingly demand fine-grained flow-level measurements. However, current flow monitoring solutions are inadequate for many of these applications. In this paper, we present the design, implementation, and evaluation of CSAMP, a system-wide approach for flow monitoring. The design of CSAMP derives from three key ideas: flow sampling as a router primitive instead of uniform packet sampling; hash-based packet selection to achieve coordination without explicit communication; and a framework for distributing responsibilities across routers to achieve network-wide monitoring goals while respecting router resource constraints. We show that CSAMP achieves much greater monitoring coverage, better use of router resources, and enhanced ability to satisfy network-wide flow monitoring goals compared to existing solutions.
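The hash-based coordination idea — every router hashes flow keys with the same function and samples only the flows landing in its assigned slice of hash space — can be illustrated in a few lines. A sketch under stated assumptions: the range assignments below are hypothetical, and cSamp additionally solves an optimization problem to choose them per path:

```python
import hashlib

def flow_hash(flowkey):
    """Deterministically map a flow key to [0, 1); identical on every router."""
    digest = hashlib.sha1(repr(flowkey).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def make_sampler(lo, hi):
    """A router logs a flow iff its hash lands in the assigned range [lo, hi)."""
    return lambda key: lo <= flow_hash(key) < hi

# Hypothetical assignment: two routers on a path get disjoint ranges,
# so together they cover half the flow space with no duplicated work.
r1 = make_sampler(0.0, 0.2)
r2 = make_sampler(0.2, 0.5)
```

Because the hash is deterministic, the routers coordinate with no communication at all: a flow is either in a range or it is not, on every router that sees it.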
An Empirical Evaluation of Entropy-Based Traffic Anomaly Detection
, 2008
Abstract

Cited by 38 (0 self)
Entropy-based approaches for anomaly detection are appealing since they provide more fine-grained insights than traditional traffic volume analysis. While previous work has demonstrated the benefits of entropy-based anomaly detection, there has been little effort to comprehensively understand the detection power of using entropy-based analysis of multiple traffic distributions in conjunction with each other. We consider two classes of distributions: flow-header features (IP addresses, ports, and flow sizes), and behavioral features (degree distributions measuring the number of distinct destination/source IPs that each host communicates with). We observe that the time series of entropy values of the address and port distributions are strongly correlated with each other and provide very similar anomaly detection capabilities. The behavioral and flow-size distributions are less correlated and detect incidents that do not show up as anomalies in the port and address distributions. Further analysis using synthetically generated anomalies also suggests that the port and address distributions have limited utility in detecting scan and bandwidth flood anomalies. Based on our analysis, we discuss important implications for entropy-based anomaly detection.
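The per-time-bin quantity such detectors track — the Shannon entropy of an empirical feature distribution, often normalized by the log of the number of distinct values so that bins of different sizes are comparable — can be computed directly. A minimal sketch; the function name is mine:

```python
import math
from collections import Counter

def normalized_entropy(values):
    """Entropy of the empirical distribution, scaled by log2(#distinct) into [0, 1]."""
    counts = Counter(values)
    n = sum(counts.values())
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(len(counts)) if len(counts) > 1 else 0.0
```

Applied per time bin to, say, source ports or destination IPs, this yields the entropy time series whose correlations the paper studies.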
Every Microsecond Counts: Tracking Fine-Grain Latencies with a Lossy Difference Aggregator
Abstract

Cited by 31 (11 self)
Many network applications have stringent end-to-end latency requirements, including VoIP and interactive video conferencing, automated trading, and high-performance computing, where even microsecond variations may be intolerable. The resulting fine-grain measurement demands cannot be met effectively by existing technologies, such as SNMP, NetFlow, or active probing. We propose instrumenting routers with a hash-based primitive that we call a Lossy Difference Aggregator (LDA) to measure latencies down to tens of microseconds and losses as infrequent as one in a million. Such measurement can be viewed abstractly as what we refer to as a coordinated streaming problem, which is fundamentally harder than standard streaming problems due to the need to coordinate values between nodes. We describe a compact data structure that efficiently computes the average and standard deviation of latency and loss rate in a coordinated streaming environment. Our theoretical results translate to an efficient hardware implementation at 40 Gbps using less than 1% of a typical 65nm 400MHz networking ASIC. When compared to Poisson-spaced active probing with similar overheads, our LDA mechanism delivers orders of magnitude smaller relative error; active probing requires 50–60 times as much bandwidth to deliver similar levels of accuracy.
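The core LDA mechanism admits a compact sketch: sender and receiver hash each packet into a bucket and keep per-bucket timestamp sums and packet counts; buckets whose counts agree on both sides contained no loss, so their timestamp-sum difference yields exact total latency for those packets. A simplified single-bank version under my own naming, omitting the multi-bank sampling the paper uses to tolerate higher loss rates:

```python
import hashlib

class LDA:
    """One side (sender or receiver) of a Lossy Difference Aggregator."""

    def __init__(self, n_buckets=64):
        self.n = n_buckets
        self.ts_sum = [0.0] * n_buckets   # sum of timestamps per bucket
        self.count = [0] * n_buckets      # packets hashed into each bucket

    def record(self, pkt_id, timestamp):
        b = int(hashlib.md5(str(pkt_id).encode()).hexdigest(), 16) % self.n
        self.ts_sum[b] += timestamp
        self.count[b] += 1

def average_latency(sender, receiver):
    """Use only buckets whose counts agree on both sides (no loss in that bucket)."""
    diff, pkts = 0.0, 0
    for b in range(sender.n):
        if sender.count[b] == receiver.count[b] and sender.count[b] > 0:
            diff += receiver.ts_sum[b] - sender.ts_sum[b]
            pkts += sender.count[b]
    return diff / pkts if pkts else None
```

A lost packet invalidates only its own bucket; the remaining buckets still give an exact average over the packets they hold, which is what makes the structure "lossy" yet useful.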
A data streaming algorithm for estimating entropies of OD flows
 In IMC ’07: Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, 2007
Abstract

Cited by 31 (7 self)
Entropy has recently gained considerable significance as an important metric for network measurement. Previous research has shown its utility in clustering traffic and detecting traffic anomalies. While measuring the entropy of the traffic observed at a single point has already been studied, an interesting open problem is to measure the entropy of the traffic between every origin-destination pair. In this paper, we propose the first solution to this challenging problem. Our sketch builds upon and extends the L_p sketch of Indyk with significant additional innovations. We present calculations showing that our data streaming algorithm is feasible for high link speeds using commodity CPU/memory at a reasonable cost. Our algorithm is shown to be very accurate in practice via simulations, using traffic traces collected at a tier-1 ISP backbone link.
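The Indyk L_p sketch that this work builds upon can be illustrated for p = 1, where the p-stable distribution is Cauchy: project the frequency vector onto random Cauchy vectors, and the median absolute projection estimates the L1 norm. A toy dense-vector version (a real sketch updates the projections incrementally per stream item, and the paper's extension to entropy estimation is not reproduced here):

```python
import math
import random

def std_cauchy(rng):
    # Cauchy is the 1-stable distribution; sample via the inverse CDF
    return math.tan(math.pi * (rng.random() - 0.5))

def l1_sketch_estimate(x, k=1001, rng=None):
    """Project x onto k Cauchy vectors; the median |projection| estimates ||x||_1."""
    rng = rng or random.Random()
    y = [sum(xi * std_cauchy(rng) for xi in x) for _ in range(k)]
    return sorted(abs(v) for v in y)[k // 2]
```

The key property is 1-stability: a Cauchy combination of the x_i is itself Cauchy with scale ||x||_1, so the median of |y_j| concentrates around that scale.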
Compressed Counting
 CoRR
Abstract

Cited by 21 (12 self)
We propose Compressed Counting (CC) for approximating the αth frequency moments (0 < α ≤ 2) of data streams under a relaxed strict-Turnstile model, using maximally-skewed stable random projections. Estimators based on the geometric mean and the harmonic mean are developed. When α = 1, a simple counter suffices for counting the first moment (i.e., the sum). The geometric mean estimator of CC has asymptotic variance ∝ ∆ = |α − 1|, capturing the intuition that the complexity should decrease as ∆ = |α − 1| → 0. However, the previous classical algorithms based on symmetric stable random projections [12, 15] required O(1/ɛ^2) space in order to approximate the αth moments within a 1 + ɛ factor, for any 0 < α ≤ 2 including α = 1. We show that, using the geometric mean estimator, CC requires O( 1/log(1+ɛ) + 2√∆ log^{3/2}(1/√∆) / log(1+ɛ) ) space, plus lower-order terms, as ∆ → 0. Therefore, in the neighborhood of α = 1, the complexity of CC is essentially O(1/ɛ) instead of O(1/ɛ^2). CC may be useful for estimating Shannon entropy, which can be approximated by certain functions of the αth moments with α → 1. [10, 9] suggested using α = 1 + ∆ with, e.g., ∆ < 0.0001 and ɛ < 10^-7, to rigorously ensure reasonable approximations. Thus, unfortunately, CC is “theoretically impractical” for estimating Shannon entropy, despite its empirical success reported in [16].
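The entropy connection mentioned at the end — Shannon entropy as the α → 1 limit of quantities computable from the αth moment — is easy to see numerically. A sketch using exact frequencies in place of a sketch-based moment estimate; the Tsallis form below is one such "certain function of the αth moments", and the function names are mine:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Exact Shannon entropy (in nats) of the empirical distribution."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def entropy_from_moment(values, alpha):
    """Tsallis form (1 - sum_i p_i^alpha) / (alpha - 1), which -> Shannon as alpha -> 1."""
    n = len(values)
    moment = sum((c / n) ** alpha for c in Counter(values).values())
    return (1.0 - moment) / (alpha - 1.0)
```

The approximation error shrinks linearly in α − 1, which is exactly why the references cited above need ∆ as small as 0.0001, and why the moment estimate itself must then be extremely accurate.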
Linear-time Modeling of Program Working Set in Shared Cache
Abstract

Cited by 17 (9 self)
Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.
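To make the target quantity concrete: the average footprint for window length w is the mean number of distinct data elements over all length-w windows of the trace. The naive computation below is O(n·w) per window length; the paper's contribution is obtaining this average in linear time from reuse intervals, which this sketch does not reproduce:

```python
def avg_footprint(trace, w):
    """Mean number of distinct elements over all length-w windows; naive O(n*w)."""
    n = len(trace)
    sizes = [len(set(trace[i:i + w])) for i in range(n - w + 1)]
    return sum(sizes) / len(sizes)
```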
Zero-One Frequency Laws
Abstract

Cited by 13 (4 self)
Data streams emerged as a critical model for multiple applications that handle vast amounts of data. One of the most influential and celebrated papers in streaming is the “AMS” paper on computing frequency moments by Alon, Matias and Szegedy. The main question left open (and explicitly asked) by AMS in 1996 is to give the precise characterization of the functions G on frequency vectors m_i (1 ≤ i ≤ n) for which ∑_{i∈[n]} G(m_i) can be approximated efficiently, where “efficiently” means by a single pass over the data stream and polylogarithmic memory. No such characterization was known despite a tremendous amount of research on frequency-based functions in the streaming literature. In this paper we finally resolve the AMS main question and give a precise characterization (in fact, a zero-one law) for all monotonically increasing functions on frequencies that are zero at the origin. That is, we consider all monotonic functions G: R → R such that G(0) = 0 and G can be computed in polylogarithmic time and space, and ask: for which G in this class is there a (1±ɛ)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) for any polylogarithmic ɛ? We give an algebraic characterization for all such G so that:
• For all functions G in our class that satisfy our algebraic condition, we provide a very general and constructive way to derive an efficient (1±ɛ)-approximation algorithm for computing ∑_{i∈[n]} G(m_i) with polylogarithmic memory and a single pass over the data stream; while
• For all functions G in our class that do not satisfy our algebraic characterization, we show a lower bound.
Understanding and Exploiting Network Traffic Redundancy
, 2007
Abstract

Cited by 8 (4 self)
The Internet carries a vast amount and a wide range of content. Some of this content is more popular, and accessed more frequently, than others. The popularity of content can be quite ephemeral (e.g., a Web flash crowd) or much more permanent (e.g., google.com’s banner). A direct consequence of the skew in popularity is that, at any time, a fraction of the information carried over the Internet is redundant. We make two contributions in this paper. First, we study the fundamental properties of the redundancy in the information carried over the Internet, with a focus on network edges. We collect traffic traces at two network edge locations: a large university’s access link serving roughly 50,000 users, and a tier-1 ISP network link connected to a large data center. We conduct several analyses over this data: What fraction of bytes are redundant? What is the frequency at which strings of bytes repeat across different packets? What is the overlap in the information accessed by distinct groups of end-users? Second, we leverage our measurement observations in the design of a family of mechanisms for eliminating redundancy in network traffic and improving overall network performance. The mechanisms we propose can improve the available capacity of single network links as well as balance load across multiple network links.
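The second measurement question — how often strings of bytes repeat across packets — is typically answered with content fingerprints over short byte strings. A brute-force version conveys the idea; production redundancy-elimination systems use Rabin fingerprints with sampled anchors rather than enumerating every substring, and the names here are mine:

```python
def shingles(payload, w=8):
    """All length-w byte strings occurring in a payload."""
    return {payload[i:i + w] for i in range(len(payload) - w + 1)}

def redundant_fraction(old_packet, new_packet, w=8):
    """Fraction of new_packet's w-byte strings already seen in old_packet."""
    seen = shingles(old_packet, w)
    cur = shingles(new_packet, w)
    return len(cur & seen) / len(cur) if cur else 0.0
```

Aggregating this measure over a trace, against a cache of recently seen fingerprints, gives the kind of redundancy fractions the paper reports.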
Revisiting the Case for a Minimalist Approach for Network Flow Monitoring
Abstract

Cited by 8 (1 self)
Network management applications require accurate estimates of a wide range of flow-level traffic metrics. Given the inadequacy of current packet-sampling-based solutions, several application-specific monitoring algorithms have emerged. While these provide better accuracy for the specific applications they target, they increase router complexity and require vendors to commit to hardware primitives without knowing how useful they will be to meet the needs of future applications. In this paper, we show using trace-driven evaluations that such complexity and early commitment may not be necessary. We revisit the case for a “minimalist” approach in which a small number of simple yet generic router primitives collect flow-level data from which different traffic metrics can be estimated. We demonstrate the feasibility and promise of such a minimalist approach using flow sampling and sample-and-hold as sampling primitives and configuring these in a network-wide coordinated fashion using cSamp. We show that this proposal yields better accuracy across a collection of application-level metrics than dividing the same memory resources across metric-specific algorithms. Moreover, because a minimalist approach enables late binding to what application-level metrics are important, it better insulates router implementations and deployments from changing monitoring needs.
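One of the two primitives, sample-and-hold, admits a compact sketch: each packet of an as-yet-untracked flow is sampled with probability p, and once a flow enters the table every subsequent packet is counted exactly, so heavy flows are counted almost perfectly. A simplified per-packet version (the original samples per byte, and table eviction is omitted):

```python
import random

def sample_and_hold(packets, p, rng=None):
    """Sample each packet of an untracked flow with prob. p; once held, count exactly."""
    rng = rng or random.Random()
    counts = {}
    for key in packets:
        if key in counts:
            counts[key] += 1           # flow already held: exact counting
        elif rng.random() < p:
            counts[key] = 1            # flow newly sampled into the flow table
    return counts
```

A flow of size s is caught with probability 1 − (1 − p)^s, so large flows are tracked almost surely while small flows rarely consume table entries.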