Results 1 - 10
of
323
The CQL Continuous Query Language: Semantic Foundations and Query Execution
- VLDB Journal
, 2003
"... CQL, a Continuous Query Language, is supported by the STREAM prototype Data Stream Management System at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and updatable relations. We begin by presenting an abstract semantics that relie ..."
Abstract
-
Cited by 354 (4 self)
- Add to MetaCart
(Show Context)
CQL, a Continuous Query Language, is supported by the STREAM prototype Data Stream Management System at Stanford. CQL is an expressive SQL-based declarative language for registering continuous queries against streams and updatable relations. We begin by presenting an abstract semantics that relies only on "black box" mappings among streams and relations.
Building a Better NetFlow
, 2004
"... Network operators need to determine the composition of the traffic mix on links when looking for dominant applications, users, or estimating traffic matrices. Cisco's NetFlow has evolved into a solution that satisfies this need by reporting flow records that summarize a sample of the traffic tr ..."
Abstract
-
Cited by 167 (5 self)
- Add to MetaCart
Network operators need to determine the composition of the traffic mix on links when looking for dominant applications, users, or estimating traffic matrices. Cisco's NetFlow has evolved into a solution that satisfies this need by reporting flow records that summarize a sample of the traffic traversing the link. But sampled NetFlow has shortcomings that hinder the collection and analysis of traffic data. First, during flooding attacks router memory and network bandwidth consumed by flow records can increase beyond what is available; second, selecting the right static sampling rate is difficult because no single rate gives the right tradeoff of memory use versus accuracy for all traffic mixes; third, the heuristics routers use to decide when a flow is reported are a poor match to most applications that work with time bins; finally, it is impossible to estimate without bias the number of active flows for aggregates with non-TCP traffic. In thi paper we propose...
Frenetic: A Network Programming Language
"... Modern networks provide a variety of interrelated services including routing, traffic monitoring, load balancing, and access control. Unfortunately, the languages used to program today’s networks lack modern features—they are usually defined at the low level of abstraction supplied by the underlying ..."
Abstract
-
Cited by 128 (23 self)
- Add to MetaCart
(Show Context)
Modern networks provide a variety of interrelated services including routing, traffic monitoring, load balancing, and access control. Unfortunately, the languages used to program today’s networks lack modern features—they are usually defined at the low level of abstraction supplied by the underlying hardware and they fail to provide even rudimentary support for modular programming. As a result, network programs tend to be complicated, error-prone, and difficult to maintain. This paper presents Frenetic, a high-level language for programming distributed collections of network switches. Frenetic provides a declarative query language for classifying and aggregating network traffic as well as a functional reactive combinator library for describing high-level packet-forwarding policies. Unlike prior work in this domain, these constructs are—by design—fully compositional, which facilitates modular reasoning and enables code reuse. This important property is enabled by Frenetic’s novel runtime system which manages all of the details related to installing, uninstalling, and querying low-level packet-processing rules on physical switches. Overall, this paper makes three main contributions: (1) We analyze the state-of-the art in languages for programming networks and identify the key limitations; (2) We present a language design that addresses these limitations, using a series of examples to motivate and validate our choices; (3) We describe an implementation of the language and evaluate its performance on several benchmarks.
Streaming Pattern Discovery in Multiple Time-Series
- In VLDB
, 2005
"... In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream col ..."
Abstract
-
Cited by 106 (19 self)
- Add to MetaCart
(Show Context)
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Timeseries) . Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection.
Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles
- In SIGMOD
, 2005
"... While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be sim ..."
Abstract
-
Cited by 103 (24 self)
- Add to MetaCart
(Show Context)
While traditional database systems optimize for performance on one-shot queries, emerging large-scale monitoring applications require continuous tracking of complex aggregates and data-distribution summaries over collections of physically-distributed streams. Thus, effective solutions have to be simultaneously space efficient (at each remote site), communication efficient (across the underlying communication network), and provide continuous, guaranteedquality estimates. In this paper, we propose novel algorithmic solutions for the problem of continuously tracking complex holistic aggregates in such a distributed-streams setting — our primary focus is on approximate quantile summaries, but our approach is more broadly applicable and can handle other holistic-aggregate functions (e.g., “heavy-hitters ” queries). We present the first known distributed-tracking schemes for maintaining accurate quantile estimates with provable approximation guarantees, while simultaneously optimizing the storage space at each remote site as well as the communication cost across the network. In a nutshell, our algorithms employ a combination of local tracking at remote sites and simple prediction models for local site behavior in order to produce highly communication- and space-efficient solutions. We perform extensive experiments with real and synthetic data to explore the various tradeoffs and understand the role of prediction models in our schemes. The results clearly validate our approach, revealing significant savings over naive solutions as well as our analytical worst-case guarantees. 1.
Characterizing Memory Requirements for Queries over Continuous Data
- In PODS
, 2002
"... This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data stre ..."
Abstract
-
Cited by 99 (10 self)
- Add to MetaCart
This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constant-sized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams
Fault-tolerance in the Borealis distributed stream processing system
- In Proc. of the 2005 ACM SIGMOD International Conference on Management of Data
, 2005
"... Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitorin ..."
Abstract
-
Cited by 97 (9 self)
- Add to MetaCart
(Show Context)
Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this paper, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE. Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any
Measurements, Analysis, and Modeling of BitTorrent-like Systems
- In Proceedings of the ACM/SIGCOMM Internet Measurement Conference (IMC-05
, 2005
"... Existing studies on BitTorrent systems are single-torrent based, while more than 85 % of all peers participate in multiple torrents according to our trace analysis. In addition, these studies are not sufficiently insightful and accurate even for single-torrent models, due to some unrealistic assumpt ..."
Abstract
-
Cited by 84 (2 self)
- Add to MetaCart
Existing studies on BitTorrent systems are single-torrent based, while more than 85 % of all peers participate in multiple torrents according to our trace analysis. In addition, these studies are not sufficiently insightful and accurate even for single-torrent models, due to some unrealistic assumptions. Our analysis of representative Bit-Torrent traffic provides several new findings regarding the limitations of BitTorrent systems: (1) Due to the exponentially decreasing peer arrival rate in reality, service availability in such systems becomes poor quickly, after which it is difficult for the file to be located and downloaded. (2) Client performance in the BitTorrentlike systems is unstable, and fluctuates widely with the peer population. (3) Existing systems could provide unfair services to peers, where peers with high downloading speed tend to download more and upload less. In this paper, we study these limitations on torrent evolution in realistic environments. Motivated by the analysis and modeling results, we further build a graph based multi-torrent model to study inter-torrent collaboration. Our model quantitatively provides strong motivation for inter-torrent collaboration instead of directly stimulating seeds to stay longer. We also discuss a system design to show the feasibility of multi-torrent collaboration. 1
Adaptive Ordering of Pipelined Stream Filters
, 2004
"... We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimi ..."
Abstract
-
Cited by 83 (13 self)
- Add to MetaCart
(Show Context)
We consider the problem of pipelined filters, where a continuous stream of tuples is processed by a set of commutative filters. Pipelined filters are common in stream applications and capture a large class of multiway stream joins. We focus on the problem of ordering the filters adaptively to minimize processing cost in an environment where stream and filter characteristics vary unpredictably over time. Our core algorithm, A-Greedy (for Adaptive Greedy), has strong theoretical guarantees: If stream and filter characteristics were to stabilize, A-Greedy would converge to an ordering within a small constant factor of optimal. (In experiments A-Greedy usually converges to the optimal ordering.) One very important feature of A-Greedy is that it monitors and responds to selectivities that are correlated across filters (i.e., that are nonindependent), which provides the strong quality guarantee but incurs run-time overhead. We identify a three-way tradeoff among provable convergence to good orderings, run-time overhead, and speed of adaptivity. We develop a suite of variants of A-Greedy that lie at different points on this tradeoff spectrum. We have implemented all our algorithms in the STREAM prototype Data Stream Management System and a thorough performance evaluation is presented.
Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Size Distribution
, 2004
"... Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating ..."
Abstract
-
Cited by 79 (6 self)
- Add to MetaCart
Knowing the distribution of the sizes of traffic flows passing through a network link helps a network operator to characterize network resource usage, infer traffic demands, detect traffic anomalies, and accommodate new traffic demands through better traffic engineering. Previous work on estimating the flow size distribution has been focused on making inferences from sampled network traffic. Its accuracy is limited by the (typically) low sampling rate required to make the sampling operation affordable. In this paper we present a novel data streaming algorithm to provide much more accurate estimates of flow distribution, using a "lossy data structure" which consists of an array of counters fitted well into SRAM. For each incoming packet, our algorithm only needs to increment one underlying counter, making the algorithm fast enough even for 40 Gbps (OC-768) links. The data structure is lossy in the sense that sizes of multiple flows may collide into the same counter. Our algorithm uses Bayesian statistical methods such as Expectation Maximization to infer the most likely flow size distribution that results in the observed counter values after collision. Evaluations of this algorithm on large Internet traces obtained from several sources (including a tier-1 ISP) demonstrate that it has very high measurement accuracy (within 2%). Our algorithm not only dramatically improves the accuracy of flow distribution measurement, but also contributes to the field of data streaming by formalizing an existing methodology and applying it to the context of estimating the flow-distribution.